Discussion:
[linux-mm-cc] [PATCH 0/3] compcache: in-memory compressed swapping v4
Nitin Gupta
2009-09-22 04:56:51 UTC
Permalink
Project home: http://compcache.googlecode.com/

* Changelog: v4 vs v3
- Remove swap notify callback and related bits. This makes ramzswap
contained entirely within drivers/staging/.
- The above change can cause ramzswap to work poorly, since it cannot
clean up stale data from memory unless it is overwritten by other data.
(This will be fixed when the swap notifier patches are accepted.)
- Some cleanups suggested by Marcin.

* Changelog: v3 vs v2
- All cleanups as suggested by Pekka.
- Move to staging (drivers/block/ramzswap/ -> drivers/staging/ramzswap/).
- Remove swap discard hooks -- swap notify support makes these redundant.
- Unify duplicate code between init_device() fail path and reset_device().
- Fix zero-page accounting.
- Do not accept backing swap with bad pages.

* Changelog: v2 vs initial revision
- Use 'struct page' instead of 32-bit PFNs in ramzswap driver and xvmalloc.
This is to make these 64-bit safe.
- xvmalloc is no longer a separate module and does not export any symbols.
It is compiled directly into the ramzswap block driver. This is to avoid
any last bit of confusion with any other allocator.
- set_swap_free_notify() now accepts block_device as parameter instead of
swp_entry_t (interface cleanup).
- Fix: Make sure ramzswap disksize matches usable pages in backing swap file.
This caused an initialization error when the backing swap file had
intra-page fragmentation.

It creates RAM-based block devices which can be used (only) as swap disks.
Pages swapped to these disks are compressed and stored in memory itself. This
is a big win over swapping to slow hard disks, which are typically used as
swap disks. Flash suffers from wear-leveling issues when used as a swap disk,
so ramzswap is helpful there as well. For swapless systems, it allows more
apps to run within a given amount of memory.

It can create multiple ramzswap devices (/dev/ramzswapX, X = 0, 1, 2, ...).
Each of these devices can have a separate backing swap (file or disk
partition), which is used when an incompressible page is found or the
memory limit for the device is reached.

A separate userspace utility called rzscontrol is used to manage individual
ramzswap devices.

* Testing notes

Tested on x86, x64, ARM
ARM:
- Cortex-A8 (Beagleboard)
- ARM11 (Android G1)
- OMAP2420 (Nokia N810)

* Performance

All performance numbers/plots can be found at:
http://code.google.com/p/compcache/wiki/Performance

Below is a summary of this data:

General:
- Swap R/W times are reduced from milliseconds (in the case of hard disks)
down to microseconds.

Positive cases:
- Shows a 33% improvement in the 'scan' benchmark, which allocates a given
amount of memory and linearly reads/writes this region. The benchmark also
exposes a bottleneck in the ramzswap code (a global mutex), which is why
the gain is not larger.
- On Linux thin clients, it gives the effect of nearly doubling the amount
of memory.

Negative cases:
Any workload whose active filesystem-cache working set is nearly equal to
the amount of RAM, while having minimal anonymous memory requirements, is
expected to suffer the maximum performance loss with ramzswap enabled.

The Iozone filesystem benchmark simulates exactly this kind of workload.
As expected, this test shows a performance loss of ~25% with ramzswap.

drivers/staging/Kconfig | 2 +
drivers/staging/Makefile | 1 +
drivers/staging/ramzswap/Kconfig | 21 +
drivers/staging/ramzswap/Makefile | 3 +
drivers/staging/ramzswap/ramzswap.txt | 51 +
drivers/staging/ramzswap/ramzswap_drv.c | 1462 +++++++++++++++++++++++++++++
drivers/staging/ramzswap/ramzswap_drv.h | 173 ++++
drivers/staging/ramzswap/ramzswap_ioctl.h | 50 +
drivers/staging/ramzswap/xvmalloc.c | 533 +++++++++++
drivers/staging/ramzswap/xvmalloc.h | 30 +
drivers/staging/ramzswap/xvmalloc_int.h | 86 ++
include/linux/swap.h | 5 +
mm/swapfile.c | 34 +
13 files changed, 2451 insertions(+), 0 deletions(-)
Nitin Gupta
2009-09-22 04:56:52 UTC
Permalink
* Features:
- Low metadata overhead (just 4 bytes per object)
- O(1) alloc/free, except when we have to call the system page allocator
to get additional memory.
- Very low fragmentation: in all tests, xvmalloc memory usage is within 12%
of "Ideal".
- Pool-based allocator: each pool can grow and shrink.
- It maps pages only when required, so it does not hog the vmalloc area,
which is very small on 32-bit systems.
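
For orientation, a minimal usage sketch of the xvmalloc API, based on the
declarations in xvmalloc.h from this patch (illustrative only; this is not
code taken from the ramzswap driver):

#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/mm.h>

#include "xvmalloc.h"

static int xv_example(void)
{
	struct xv_pool *pool;
	struct page *page;
	u32 offset;

	pool = xv_create_pool();
	if (!pool)
		return -ENOMEM;

	/* Allocate a 200-byte object; on success <page, offset> identify it */
	if (xv_malloc(pool, 200, &page, &offset, GFP_NOIO)) {
		xv_destroy_pool(pool);
		return -ENOMEM;
	}

	/* ... access the object by mapping <page> and adding <offset> ... */

	xv_free(pool, page, offset);
	xv_destroy_pool(pool);
	return 0;
}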

The SLUB allocator could not be used due to fragmentation issues:
http://code.google.com/p/compcache/wiki/AllocatorsComparison
Data there shows kmalloc using ~43% more memory than TLSF, while xvmalloc
showed ~2% better space efficiency than TLSF (due to its smaller metadata).
Creating various kmem_caches can reduce the space-efficiency gap, but the
problem of being limited to low memory remains. SLUB also depends on
allocating higher-order pages to reduce fragmentation, which is not
acceptable for ramzswap since it is used under memory crunch (it's a swap
device!).

The SLOB allocator could not be used due to reasons mentioned here:
http://lkml.org/lkml/2009/3/18/210

* Implementation:
xvmalloc uses a two-level bitmap search to find a free list containing a
block of the correct size. This idea is taken from the TLSF (Two-Level
Segregated Fit) allocator and is well explained in its paper (see [Links]
below).
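
To make the mapping concrete, here is a small self-contained sketch of the
index math, mirroring get_index() from the patch below (constants copied
from xvmalloc_int.h; with 4 KB pages they give NUM_FREE_LISTS =
(4092 - 32)/8 + 1 = 508 freelists, i.e. 8 first-level bitmap words on
a 64-bit build):

#include <stdio.h>

#define XV_MIN_ALLOC_SIZE	32
#define FL_DELTA_SHIFT		3
#define FL_DELTA		(1 << FL_DELTA_SHIFT)
#define BITS_PER_LONG		64	/* assuming a 64-bit build */

/* Round size up to FL_DELTA and map it to a second-level freelist index */
static unsigned int get_index(unsigned int size)
{
	if (size < XV_MIN_ALLOC_SIZE)
		size = XV_MIN_ALLOC_SIZE;
	size = (size + FL_DELTA - 1) & ~(FL_DELTA - 1);	/* ALIGN(size, FL_DELTA) */
	return (size - XV_MIN_ALLOC_SIZE) >> FL_DELTA_SHIFT;
}

int main(void)
{
	unsigned int slindex = get_index(100);	/* 100 rounds up to 104 */

	/* prints: slindex = 9, flindex = 0 */
	printf("slindex = %u, flindex = %u\n", slindex, slindex / BITS_PER_LONG);
	return 0;
}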

* Limitations:
- Poor scalability: No per-cpu data structures (work in progress).

[Links]
1. Details and Performance data:
http://code.google.com/p/compcache/wiki/xvMalloc
http://code.google.com/p/compcache/wiki/xvMallocPerformance

2. TLSF memory allocator:
home: http://rtportal.upv.es/rtmalloc/
paper: http://rtportal.upv.es/rtmalloc/files/MRBC_2008.pdf

Signed-off-by: Nitin Gupta <ngupta at vflare.org>

---
drivers/staging/ramzswap/xvmalloc.c | 507 +++++++++++++++++++++++++++++++
drivers/staging/ramzswap/xvmalloc.h | 30 ++
drivers/staging/ramzswap/xvmalloc_int.h | 86 ++++++
3 files changed, 623 insertions(+), 0 deletions(-)

diff --git a/drivers/staging/ramzswap/xvmalloc.c b/drivers/staging/ramzswap/xvmalloc.c
new file mode 100644
index 0000000..b3e986c
--- /dev/null
+++ b/drivers/staging/ramzswap/xvmalloc.c
@@ -0,0 +1,507 @@
+/*
+ * xvmalloc memory allocator
+ *
+ * Copyright (C) 2008, 2009 Nitin Gupta
+ *
+ * This code is released using a dual license strategy: BSD/GPL
+ * You can choose the licence that better fits your requirements.
+ *
+ * Released under the terms of 3-clause BSD License
+ * Released under the terms of GNU General Public License Version 2.0
+ */
+
+#include <linux/bitops.h>
+#include <linux/errno.h>
+#include <linux/highmem.h>
+#include <linux/init.h>
+#include <linux/string.h>
+#include <linux/slab.h>
+
+#include "xvmalloc.h"
+#include "xvmalloc_int.h"
+
+static void stat_inc(u64 *value)
+{
+ *value = *value + 1;
+}
+
+static void stat_dec(u64 *value)
+{
+ *value = *value - 1;
+}
+
+static int test_flag(struct block_header *block, enum blockflags flag)
+{
+ return block->prev & BIT(flag);
+}
+
+static void set_flag(struct block_header *block, enum blockflags flag)
+{
+ block->prev |= BIT(flag);
+}
+
+static void clear_flag(struct block_header *block, enum blockflags flag)
+{
+ block->prev &= ~BIT(flag);
+}
+
+/*
+ * Given <page, offset> pair, provide a dereferenceable pointer.
+ * This is called from xv_malloc/xv_free path, so it
+ * needs to be fast.
+ */
+static void *get_ptr_atomic(struct page *page, u16 offset, enum km_type type)
+{
+ unsigned char *base;
+
+ base = kmap_atomic(page, type);
+ return base + offset;
+}
+
+static void put_ptr_atomic(void *ptr, enum km_type type)
+{
+ kunmap_atomic(ptr, type);
+}
+
+static u32 get_blockprev(struct block_header *block)
+{
+ return block->prev & PREV_MASK;
+}
+
+static void set_blockprev(struct block_header *block, u16 new_offset)
+{
+ block->prev = new_offset | (block->prev & FLAGS_MASK);
+}
+
+static struct block_header *BLOCK_NEXT(struct block_header *block)
+{
+ return (struct block_header *)
+ ((char *)block + block->size + XV_ALIGN);
+}
+
+/*
+ * Get index of free list containing blocks of maximum size
+ * which is less than or equal to given size.
+ */
+static u32 get_index_for_insert(u32 size)
+{
+ if (unlikely(size > XV_MAX_ALLOC_SIZE))
+ size = XV_MAX_ALLOC_SIZE;
+ size &= ~FL_DELTA_MASK;
+ return (size - XV_MIN_ALLOC_SIZE) >> FL_DELTA_SHIFT;
+}
+
+/*
+ * Get index of free list having blocks of size greater than
+ * or equal to requested size.
+ */
+static u32 get_index(u32 size)
+{
+ if (unlikely(size < XV_MIN_ALLOC_SIZE))
+ size = XV_MIN_ALLOC_SIZE;
+ size = ALIGN(size, FL_DELTA);
+ return (size - XV_MIN_ALLOC_SIZE) >> FL_DELTA_SHIFT;
+}
+
+/**
+ * find_block - find block of at least given size
+ * @pool: memory pool to search from
+ * @size: size of block required
+ * @page: page containing required block
+ * @offset: offset within the page where block is located.
+ *
+ * Searches the two-level bitmap to locate a block of at least
+ * the given size. If such a block is found, it provides
+ * <page, offset> to identify this block and returns the index
+ * in the freelist where the block was found.
+ * Otherwise, returns 0 and the <page, offset> params are not touched.
+ */
+static u32 find_block(struct xv_pool *pool, u32 size,
+ struct page **page, u32 *offset)
+{
+ ulong flbitmap, slbitmap;
+ u32 flindex, slindex, slbitstart;
+
+ /* There are no free blocks in this pool */
+ if (!pool->flbitmap)
+ return 0;
+
+ /* Get freelist index corresponding to this size */
+ slindex = get_index(size);
+ slbitmap = pool->slbitmap[slindex / BITS_PER_LONG];
+ slbitstart = slindex % BITS_PER_LONG;
+
+ /*
+ * If freelist is not empty at this index, we found the
+ * block - head of this list. This is approximate best-fit match.
+ */
+ if (test_bit(slbitstart, &slbitmap)) {
+ *page = pool->freelist[slindex].page;
+ *offset = pool->freelist[slindex].offset;
+ return slindex;
+ }
+
+ /*
+ * No best-fit found. Search a bit further in bitmap for a free block.
+ * Second level bitmap consists of a series of BITS_PER_LONG-bit chunks. Search
+ * further in the chunk where we expected a best-fit, starting from
+ * index location found above.
+ */
+ slbitstart++;
+ slbitmap >>= slbitstart;
+
+ /* Skip this search if we were already at end of this bitmap chunk */
+ if ((slbitstart != BITS_PER_LONG) && slbitmap) {
+ slindex += __ffs(slbitmap) + 1;
+ *page = pool->freelist[slindex].page;
+ *offset = pool->freelist[slindex].offset;
+ return slindex;
+ }
+
+ /* Now do a full two-level bitmap search to find next nearest fit */
+ flindex = slindex / BITS_PER_LONG;
+
+ flbitmap = (pool->flbitmap) >> (flindex + 1);
+ if (!flbitmap)
+ return 0;
+
+ flindex += __ffs(flbitmap) + 1;
+ slbitmap = pool->slbitmap[flindex];
+ slindex = (flindex * BITS_PER_LONG) + __ffs(slbitmap);
+ *page = pool->freelist[slindex].page;
+ *offset = pool->freelist[slindex].offset;
+
+ return slindex;
+}
+
+/*
+ * Insert block at <page, offset> in the freelist of the given pool.
+ * The freelist used depends on the block size.
+ */
+static void insert_block(struct xv_pool *pool, struct page *page, u32 offset,
+ struct block_header *block)
+{
+ u32 flindex, slindex;
+ struct block_header *nextblock;
+
+ slindex = get_index_for_insert(block->size);
+ flindex = slindex / BITS_PER_LONG;
+
+ block->link.prev_page = 0;
+ block->link.prev_offset = 0;
+ block->link.next_page = pool->freelist[slindex].page;
+ block->link.next_offset = pool->freelist[slindex].offset;
+ pool->freelist[slindex].page = page;
+ pool->freelist[slindex].offset = offset;
+
+ if (block->link.next_page) {
+ nextblock = get_ptr_atomic(block->link.next_page,
+ block->link.next_offset, KM_USER1);
+ nextblock->link.prev_page = page;
+ nextblock->link.prev_offset = offset;
+ put_ptr_atomic(nextblock, KM_USER1);
+ }
+
+ __set_bit(slindex % BITS_PER_LONG, &pool->slbitmap[flindex]);
+ __set_bit(flindex, &pool->flbitmap);
+}
+
+/*
+ * Remove block from head of freelist. Index 'slindex' identifies the freelist.
+ */
+static void remove_block_head(struct xv_pool *pool,
+ struct block_header *block, u32 slindex)
+{
+ struct block_header *tmpblock;
+ u32 flindex = slindex / BITS_PER_LONG;
+
+ pool->freelist[slindex].page = block->link.next_page;
+ pool->freelist[slindex].offset = block->link.next_offset;
+ block->link.prev_page = 0;
+ block->link.prev_offset = 0;
+
+ if (!pool->freelist[slindex].page) {
+ __clear_bit(slindex % BITS_PER_LONG, &pool->slbitmap[flindex]);
+ if (!pool->slbitmap[flindex])
+ __clear_bit(flindex, &pool->flbitmap);
+ } else {
+ /*
+ * DEBUG ONLY: We need not reinitialize freelist head previous
+ * pointer to 0 - we never depend on its value. But just for
+ * sanity, let's do it.
+ */
+ tmpblock = get_ptr_atomic(pool->freelist[slindex].page,
+ pool->freelist[slindex].offset, KM_USER1);
+ tmpblock->link.prev_page = 0;
+ tmpblock->link.prev_offset = 0;
+ put_ptr_atomic(tmpblock, KM_USER1);
+ }
+}
+
+/*
+ * Remove block from freelist. Index 'slindex' identifies the freelist.
+ */
+static void remove_block(struct xv_pool *pool, struct page *page, u32 offset,
+ struct block_header *block, u32 slindex)
+{
+ u32 flindex;
+ struct block_header *tmpblock;
+
+ if (pool->freelist[slindex].page == page
+ && pool->freelist[slindex].offset == offset) {
+ remove_block_head(pool, block, slindex);
+ return;
+ }
+
+ flindex = slindex / BITS_PER_LONG;
+
+ if (block->link.prev_page) {
+ tmpblock = get_ptr_atomic(block->link.prev_page,
+ block->link.prev_offset, KM_USER1);
+ tmpblock->link.next_page = block->link.next_page;
+ tmpblock->link.next_offset = block->link.next_offset;
+ put_ptr_atomic(tmpblock, KM_USER1);
+ }
+
+ if (block->link.next_page) {
+ tmpblock = get_ptr_atomic(block->link.next_page,
+ block->link.next_offset, KM_USER1);
+ tmpblock->link.prev_page = block->link.prev_page;
+ tmpblock->link.prev_offset = block->link.prev_offset;
+ put_ptr_atomic(tmpblock, KM_USER1);
+ }
+}
+
+/*
+ * Allocate a page and add it to the freelist of the given pool.
+ */
+static int grow_pool(struct xv_pool *pool, gfp_t flags)
+{
+ struct page *page;
+ struct block_header *block;
+
+ page = alloc_page(flags);
+ if (unlikely(!page))
+ return -ENOMEM;
+
+ stat_inc(&pool->total_pages);
+
+ spin_lock(&pool->lock);
+ block = get_ptr_atomic(page, 0, KM_USER0);
+
+ block->size = PAGE_SIZE - XV_ALIGN;
+ set_flag(block, BLOCK_FREE);
+ clear_flag(block, PREV_FREE);
+ set_blockprev(block, 0);
+
+ insert_block(pool, page, 0, block);
+
+ put_ptr_atomic(block, KM_USER0);
+ spin_unlock(&pool->lock);
+
+ return 0;
+}
+
+/*
+ * Create a memory pool. Allocates freelist, bitmaps and other
+ * per-pool metadata.
+ */
+struct xv_pool *xv_create_pool(void)
+{
+ u32 ovhd_size;
+ struct xv_pool *pool;
+
+ ovhd_size = roundup(sizeof(*pool), PAGE_SIZE);
+ pool = kzalloc(ovhd_size, GFP_KERNEL);
+ if (!pool)
+ return NULL;
+
+ spin_lock_init(&pool->lock);
+
+ return pool;
+}
+
+void xv_destroy_pool(struct xv_pool *pool)
+{
+ kfree(pool);
+}
+
+/**
+ * xv_malloc - Allocate block of given size from pool.
+ * @pool: pool to allocate from
+ * @size: size of block to allocate
+ * @page: page that holds the allocated object
+ * @offset: location of object within page
+ *
+ * On success, <page, offset> identifies block allocated
+ * and 0 is returned. On failure, <page, offset> is set to
+ * 0 and -ENOMEM is returned.
+ *
+ * Allocation requests with size > XV_MAX_ALLOC_SIZE will fail.
+ */
+int xv_malloc(struct xv_pool *pool, u32 size, struct page **page,
+ u32 *offset, gfp_t flags)
+{
+ int error;
+ u32 index, tmpsize, origsize, tmpoffset;
+ struct block_header *block, *tmpblock;
+
+ *page = NULL;
+ *offset = 0;
+ origsize = size;
+
+ if (unlikely(!size || size > XV_MAX_ALLOC_SIZE))
+ return -ENOMEM;
+
+ size = ALIGN(size, XV_ALIGN);
+
+ spin_lock(&pool->lock);
+
+ index = find_block(pool, size, page, offset);
+
+ if (!*page) {
+ spin_unlock(&pool->lock);
+ if (flags & GFP_NOWAIT)
+ return -ENOMEM;
+ error = grow_pool(pool, flags);
+ if (unlikely(error))
+ return error;
+
+ spin_lock(&pool->lock);
+ index = find_block(pool, size, page, offset);
+ }
+
+ if (!*page) {
+ spin_unlock(&pool->lock);
+ return -ENOMEM;
+ }
+
+ block = get_ptr_atomic(*page, *offset, KM_USER0);
+
+ remove_block_head(pool, block, index);
+
+ /* Split the block if required */
+ tmpoffset = *offset + size + XV_ALIGN;
+ tmpsize = block->size - size;
+ tmpblock = (struct block_header *)((char *)block + size + XV_ALIGN);
+ if (tmpsize) {
+ tmpblock->size = tmpsize - XV_ALIGN;
+ set_flag(tmpblock, BLOCK_FREE);
+ clear_flag(tmpblock, PREV_FREE);
+
+ set_blockprev(tmpblock, *offset);
+ if (tmpblock->size >= XV_MIN_ALLOC_SIZE)
+ insert_block(pool, *page, tmpoffset, tmpblock);
+
+ if (tmpoffset + XV_ALIGN + tmpblock->size != PAGE_SIZE) {
+ tmpblock = BLOCK_NEXT(tmpblock);
+ set_blockprev(tmpblock, tmpoffset);
+ }
+ } else {
+ /* This block is exact fit */
+ if (tmpoffset != PAGE_SIZE)
+ clear_flag(tmpblock, PREV_FREE);
+ }
+
+ block->size = origsize;
+ clear_flag(block, BLOCK_FREE);
+
+ put_ptr_atomic(block, KM_USER0);
+ spin_unlock(&pool->lock);
+
+ *offset += XV_ALIGN;
+
+ return 0;
+}
+
+/*
+ * Free block identified with <page, offset>
+ */
+void xv_free(struct xv_pool *pool, struct page *page, u32 offset)
+{
+ void *page_start;
+ struct block_header *block, *tmpblock;
+
+ offset -= XV_ALIGN;
+
+ spin_lock(&pool->lock);
+
+ page_start = get_ptr_atomic(page, 0, KM_USER0);
+ block = (struct block_header *)((char *)page_start + offset);
+
+ /* Catch double free bugs */
+ BUG_ON(test_flag(block, BLOCK_FREE));
+
+ block->size = ALIGN(block->size, XV_ALIGN);
+
+ tmpblock = BLOCK_NEXT(block);
+ if (offset + block->size + XV_ALIGN == PAGE_SIZE)
+ tmpblock = NULL;
+
+ /* Merge next block if it's free */
+ if (tmpblock && test_flag(tmpblock, BLOCK_FREE)) {
+ /*
+ * Blocks smaller than XV_MIN_ALLOC_SIZE
+ * are not inserted in any free list.
+ */
+ if (tmpblock->size >= XV_MIN_ALLOC_SIZE) {
+ remove_block(pool, page,
+ offset + block->size + XV_ALIGN, tmpblock,
+ get_index_for_insert(tmpblock->size));
+ }
+ block->size += tmpblock->size + XV_ALIGN;
+ }
+
+ /* Merge previous block if it's free */
+ if (test_flag(block, PREV_FREE)) {
+ tmpblock = (struct block_header *)((char *)(page_start) +
+ get_blockprev(block));
+ offset = offset - tmpblock->size - XV_ALIGN;
+
+ if (tmpblock->size >= XV_MIN_ALLOC_SIZE)
+ remove_block(pool, page, offset, tmpblock,
+ get_index_for_insert(tmpblock->size));
+
+ tmpblock->size += block->size + XV_ALIGN;
+ block = tmpblock;
+ }
+
+ /* No used objects in this page. Free it. */
+ if (block->size == PAGE_SIZE - XV_ALIGN) {
+ put_ptr_atomic(page_start, KM_USER0);
+ spin_unlock(&pool->lock);
+
+ __free_page(page);
+ stat_dec(&pool->total_pages);
+ return;
+ }
+
+ set_flag(block, BLOCK_FREE);
+ if (block->size >= XV_MIN_ALLOC_SIZE)
+ insert_block(pool, page, offset, block);
+
+ if (offset + block->size + XV_ALIGN != PAGE_SIZE) {
+ tmpblock = BLOCK_NEXT(block);
+ set_flag(tmpblock, PREV_FREE);
+ set_blockprev(tmpblock, offset);
+ }
+
+ put_ptr_atomic(page_start, KM_USER0);
+ spin_unlock(&pool->lock);
+}
+
+u32 xv_get_object_size(void *obj)
+{
+ struct block_header *blk;
+
+ blk = (struct block_header *)((char *)(obj) - XV_ALIGN);
+ return blk->size;
+}
+
+/*
+ * Returns total memory used by allocator (userdata + metadata)
+ */
+u64 xv_get_total_size_bytes(struct xv_pool *pool)
+{
+ return pool->total_pages << PAGE_SHIFT;
+}
diff --git a/drivers/staging/ramzswap/xvmalloc.h b/drivers/staging/ramzswap/xvmalloc.h
new file mode 100644
index 0000000..010c6fe
--- /dev/null
+++ b/drivers/staging/ramzswap/xvmalloc.h
@@ -0,0 +1,30 @@
+/*
+ * xvmalloc memory allocator
+ *
+ * Copyright (C) 2008, 2009 Nitin Gupta
+ *
+ * This code is released using a dual license strategy: BSD/GPL
+ * You can choose the licence that better fits your requirements.
+ *
+ * Released under the terms of 3-clause BSD License
+ * Released under the terms of GNU General Public License Version 2.0
+ */
+
+#ifndef _XV_MALLOC_H_
+#define _XV_MALLOC_H_
+
+#include <linux/types.h>
+
+struct xv_pool;
+
+struct xv_pool *xv_create_pool(void);
+void xv_destroy_pool(struct xv_pool *pool);
+
+int xv_malloc(struct xv_pool *pool, u32 size, struct page **page,
+ u32 *offset, gfp_t flags);
+void xv_free(struct xv_pool *pool, struct page *page, u32 offset);
+
+u32 xv_get_object_size(void *obj);
+u64 xv_get_total_size_bytes(struct xv_pool *pool);
+
+#endif
diff --git a/drivers/staging/ramzswap/xvmalloc_int.h b/drivers/staging/ramzswap/xvmalloc_int.h
new file mode 100644
index 0000000..03c1a65
--- /dev/null
+++ b/drivers/staging/ramzswap/xvmalloc_int.h
@@ -0,0 +1,86 @@
+/*
+ * xvmalloc memory allocator
+ *
+ * Copyright (C) 2008, 2009 Nitin Gupta
+ *
+ * This code is released using a dual license strategy: BSD/GPL
+ * You can choose the licence that better fits your requirements.
+ *
+ * Released under the terms of 3-clause BSD License
+ * Released under the terms of GNU General Public License Version 2.0
+ */
+
+#ifndef _XV_MALLOC_INT_H_
+#define _XV_MALLOC_INT_H_
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+
+/* User configurable params */
+
+/* Must be power of two */
+#define XV_ALIGN_SHIFT 2
+#define XV_ALIGN (1 << XV_ALIGN_SHIFT)
+#define XV_ALIGN_MASK (XV_ALIGN - 1)
+
+/* This must be greater than sizeof(link_free) */
+#define XV_MIN_ALLOC_SIZE 32
+#define XV_MAX_ALLOC_SIZE (PAGE_SIZE - XV_ALIGN)
+
+/* Free lists are separated by FL_DELTA bytes */
+#define FL_DELTA_SHIFT 3
+#define FL_DELTA (1 << FL_DELTA_SHIFT)
+#define FL_DELTA_MASK (FL_DELTA - 1)
+#define NUM_FREE_LISTS ((XV_MAX_ALLOC_SIZE - XV_MIN_ALLOC_SIZE) \
+ / FL_DELTA + 1)
+
+#define MAX_FLI DIV_ROUND_UP(NUM_FREE_LISTS, BITS_PER_LONG)
+
+/* End of user params */
+
+enum blockflags {
+ BLOCK_FREE,
+ PREV_FREE,
+ __NR_BLOCKFLAGS,
+};
+
+#define FLAGS_MASK XV_ALIGN_MASK
+#define PREV_MASK (~FLAGS_MASK)
+
+struct freelist_entry {
+ struct page *page;
+ u16 offset;
+ u16 pad;
+};
+
+struct link_free {
+ struct page *prev_page;
+ struct page *next_page;
+ u16 prev_offset;
+ u16 next_offset;
+};
+
+struct block_header {
+ union {
+ /* This common header must be XV_ALIGN bytes */
+ u8 common[XV_ALIGN];
+ struct {
+ u16 size;
+ u16 prev;
+ };
+ };
+ struct link_free link;
+};
+
+struct xv_pool {
+ ulong flbitmap;
+ ulong slbitmap[MAX_FLI];
+ spinlock_t lock;
+
+ struct freelist_entry freelist[NUM_FREE_LISTS];
+
+ /* stats */
+ u64 total_pages;
+};
+
+#endif
Nitin Gupta
2009-09-22 04:56:54 UTC
Permalink
Short guide on how to set up and use ramzswap.

Signed-off-by: Nitin Gupta <ngupta at vflare.org>

---
drivers/staging/ramzswap/ramzswap.txt | 51 +++++++++++++++++++++++++++++++++
1 files changed, 51 insertions(+), 0 deletions(-)

diff --git a/drivers/staging/ramzswap/ramzswap.txt b/drivers/staging/ramzswap/ramzswap.txt
new file mode 100644
index 0000000..e9f1619
--- /dev/null
+++ b/drivers/staging/ramzswap/ramzswap.txt
@@ -0,0 +1,51 @@
+ramzswap: Compressed RAM-based swap device
+-------------------------------------------
+
+Project home: http://compcache.googlecode.com/
+
+* Introduction
+
+It creates RAM-based block devices which can be used (only) as swap disks.
+Pages swapped to these devices are compressed and stored in memory itself.
+See project home for use cases, performance numbers and a lot more.
+
+Individual ramzswap devices are configured and initialized using the
+rzscontrol userspace utility, as shown in the examples below. See the
+rzscontrol man page for more details.
+
+* Usage
+
+The following shows a typical sequence of steps for using ramzswap.
+
+1) Load Modules:
+ modprobe ramzswap num_devices=4
+ This creates 4 (uninitialized) devices: /dev/ramzswap{0,1,2,3}
+ (num_devices parameter is optional. Default: 1)
+
+2) Initialize:
+ Use rzscontrol utility to configure and initialize individual
+ ramzswap devices. Example:
+ rzscontrol /dev/ramzswap2 --init # uses default value of disksize_kb
+
+ *See rzscontrol man page for more details and examples*
+
+3) Activate:
+ swapon /dev/ramzswap2 # or any other initialized ramzswap device
+
+4) Stats:
+ rzscontrol /dev/ramzswap2 --stats
+
+5) Deactivate:
+ swapoff /dev/ramzswap2
+
+6) Reset:
+ rzscontrol /dev/ramzswap2 --reset
+ (This frees all the memory allocated for this device).
+
+
+Please report any problems at:
+ - Mailing list: linux-mm-cc at laptop dot org
+ - Issue tracker: http://code.google.com/p/compcache/issues/list
+
+Nitin Gupta
+ngupta at vflare.org
Greg KH
2009-10-08 22:58:38 UTC
Permalink
Post by Nitin Gupta
Short guide on how to set up and use ramzswap.
Can you also send a patch that adds a drivers/staging/ramzswap/TODO file
that follows the format of the other drivers/staging/*/TODO files,
describing what needs to be done to get the code merged into the
main kernel tree?
Doh, nevermind, you already did that.

stupid me...

greg k-h
Greg KH
2009-10-08 22:57:24 UTC
Permalink
Post by Nitin Gupta
Short guide on how to set up and use ramzswap.
Can you also send a patch that adds a drivers/staging/ramzswap/TODO file
that follows the format of the other drivers/staging/*/TODO files,
describing what needs to be done to get the code merged into the
main kernel tree?

thanks,

greg k-h

Nitin Gupta
2009-09-22 04:56:53 UTC
Permalink
Nitin Gupta
2009-09-24 16:54:50 UTC
Permalink
On Tue, 22 Sep 2009 10:26:53 +0530
<snip>
Post by Nitin Gupta
+ if (unlikely(clen > max_zpage_size)) {
+ if (rzs->backing_swap) {
+ mutex_unlock(&rzs->lock);
+ fwd_write_request = 1;
+ goto out;
+ }
+
+ clen = PAGE_SIZE;
+ page_store = alloc_page(GFP_NOIO | __GFP_HIGHMEM);
Here, and...
Post by Nitin Gupta
+ if (unlikely(!page_store)) {
+ mutex_unlock(&rzs->lock);
+ pr_info("Error allocating memory for incompressible "
+ "page: %u\n", index);
+ stat_inc(rzs->stats.failed_writes);
+ goto out;
+ }
+
+ offset = 0;
+ rzs_set_flag(rzs, index, RZS_UNCOMPRESSED);
+ stat_inc(rzs->stats.pages_expand);
+ rzs->table[index].page = page_store;
+ src = kmap_atomic(page, KM_USER0);
+ goto memstore;
+ }
+
+ if (xv_malloc(rzs->mem_pool, clen + sizeof(*zheader),
+ &rzs->table[index].page, &offset,
+ GFP_NOIO | __GFP_HIGHMEM)) {
Here.
Do we need to wait until here to detect page-allocation failure?
Detecting it here means -EIO for end_swap_bio_write()... unhappy
ALERT messages, etc.
Can't we add a hook to get_swap_page() for preparing this ("do we have
enough pool?") and use only GFP_ATOMIC throughout the code?
(The memory pool for this swap should be big to some extent.)
Yes, we do need to wait until this step to detect the allocation failure,
since we don't really know in advance when growing the pool will fail.
What we can probably do is hook into the OOM notify chain (oom_notify_list);
whenever we get this callback, we can start sending pages directly to the
backing swap and not even attempt any allocation.
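
A minimal sketch of that idea (register_oom_notifier() is the kernel's
existing entry point into oom_notify_list; the bypass flag here is purely
hypothetical, not actual ramzswap code):

#include <linux/notifier.h>
#include <linux/oom.h>

/* Hypothetical flag: when set, writes would bypass the compressed pool
 * and go straight to the backing swap.
 */
static int rzs_bypass;

static int rzs_oom_notify(struct notifier_block *nb,
			  unsigned long action, void *freed)
{
	/* Under OOM, stop allocating from the pool */
	rzs_bypass = 1;
	return NOTIFY_OK;
}

static struct notifier_block rzs_oom_nb = {
	.notifier_call = rzs_oom_notify,
};

/* During driver init: register_oom_notifier(&rzs_oom_nb); */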
Post by Nitin Gupta
From my user-support experience with heavy-swap customers, extra memory allocation for swapping out is just bad... in many cases.
(*) I know GFP_IO works well to some extent.
We cannot use GFP_IO here as it can cause a deadlock:
ramzswap alloc() --> not enough memory, try to reclaim some --> swap out
some pages to ramzswap --> ramzswap alloc()

Thanks,
Nitin
Pekka Enberg
2009-09-22 06:13:50 UTC
Permalink
Post by Nitin Gupta
drivers/staging/Kconfig | 2 +
drivers/staging/Makefile | 1 +
drivers/staging/ramzswap/Kconfig | 21 +
drivers/staging/ramzswap/Makefile | 3 +
drivers/staging/ramzswap/ramzswap.txt | 51 +
drivers/staging/ramzswap/ramzswap_drv.c | 1462 +++++++++++++++++++++++++++++
drivers/staging/ramzswap/ramzswap_drv.h | 173 ++++
drivers/staging/ramzswap/ramzswap_ioctl.h | 50 +
drivers/staging/ramzswap/xvmalloc.c | 533 +++++++++++
drivers/staging/ramzswap/xvmalloc.h | 30 +
drivers/staging/ramzswap/xvmalloc_int.h | 86 ++
include/linux/swap.h | 5 +
mm/swapfile.c | 34 +
13 files changed, 2451 insertions(+), 0 deletions(-)
This diffstat is not up to date, I think.

Greg, would you mind taking this driver into staging? There are some
issues that need to be ironed out for it to be merged to kernel proper
but I think it would benefit from being exposed to mainline.

Nitin, you probably should also submit a patch that adds a TODO file
similar to other staging drivers to remind us that swap notifiers and
the CONFIG_ARM thing need to be resolved.

Pekka
Nitin Gupta
2009-09-22 08:36:12 UTC
Permalink
Post by Pekka Enberg
drivers/staging/Kconfig | 2 +
drivers/staging/Makefile | 1 +
drivers/staging/ramzswap/Kconfig | 21 +
drivers/staging/ramzswap/Makefile | 3 +
drivers/staging/ramzswap/ramzswap.txt | 51 +
drivers/staging/ramzswap/ramzswap_drv.c | 1462 +++++++++++++++++++++++++++++
drivers/staging/ramzswap/ramzswap_drv.h | 173 ++++
drivers/staging/ramzswap/ramzswap_ioctl.h | 50 +
drivers/staging/ramzswap/xvmalloc.c | 533 +++++++++++
drivers/staging/ramzswap/xvmalloc.h | 30 +
drivers/staging/ramzswap/xvmalloc_int.h | 86 ++
include/linux/swap.h | 5 +
mm/swapfile.c | 34 +
13 files changed, 2451 insertions(+), 0 deletions(-)
This diffstat is not up to date, I think.
Oh! This is from v3. I forgot to update it. I don't have access to my git
tree here in the office, so I cannot send an updated version right now.
Post by Pekka Enberg
Greg, would you mind taking this driver into staging? There are some
issues that need to be ironed out for it to be merged to kernel proper
but I think it would benefit from being exposed to mainline.
Nitin, you probably should also submit a patch that adds a TODO file
similar to other staging drivers to remind us that swap notifiers and
the CONFIG_ARM thing need to be resolved.
ok, I will send a patch for the TODO file.

Thanks,
Nitin
Greg KH
2009-09-22 15:37:52 UTC
Permalink
Post by Pekka Enberg
Post by Nitin Gupta
drivers/staging/Kconfig | 2 +
drivers/staging/Makefile | 1 +
drivers/staging/ramzswap/Kconfig | 21 +
drivers/staging/ramzswap/Makefile | 3 +
drivers/staging/ramzswap/ramzswap.txt | 51 +
drivers/staging/ramzswap/ramzswap_drv.c | 1462 +++++++++++++++++++++++++++++
drivers/staging/ramzswap/ramzswap_drv.h | 173 ++++
drivers/staging/ramzswap/ramzswap_ioctl.h | 50 +
drivers/staging/ramzswap/xvmalloc.c | 533 +++++++++++
drivers/staging/ramzswap/xvmalloc.h | 30 +
drivers/staging/ramzswap/xvmalloc_int.h | 86 ++
include/linux/swap.h | 5 +
mm/swapfile.c | 34 +
13 files changed, 2451 insertions(+), 0 deletions(-)
This diffstat is not up to date, I think.
Greg, would you mind taking this driver into staging? There are some
issues that need to be ironed out for it to be merged to kernel proper
but I think it would benefit from being exposed to mainline.
That would be fine, will there be a new set of patches for me to apply,
or is this the correct series?

thanks,

greg k-h
Nitin Gupta
2009-09-22 15:52:30 UTC
Permalink
Post by Greg KH
Post by Pekka Enberg
Post by Nitin Gupta
drivers/staging/Kconfig | 2 +
drivers/staging/Makefile | 1 +
drivers/staging/ramzswap/Kconfig | 21 +
drivers/staging/ramzswap/Makefile | 3 +
drivers/staging/ramzswap/ramzswap.txt | 51 +
drivers/staging/ramzswap/ramzswap_drv.c | 1462 +++++++++++++++++++++++++++++
drivers/staging/ramzswap/ramzswap_drv.h | 173 ++++
drivers/staging/ramzswap/ramzswap_ioctl.h | 50 +
drivers/staging/ramzswap/xvmalloc.c | 533 +++++++++++
drivers/staging/ramzswap/xvmalloc.h | 30 +
drivers/staging/ramzswap/xvmalloc_int.h | 86 ++
include/linux/swap.h | 5 +
mm/swapfile.c | 34 +
13 files changed, 2451 insertions(+), 0 deletions(-)
This diffstat is not up to date, I think.
Greg, would you mind taking this driver into staging? There are some
issues that need to be ironed out for it to be merged to kernel proper
but I think it would benefit from being exposed to mainline.
That would be fine, will there be a new set of patches for me to apply,
or is this the correct series?
This is the correct series.

Thanks,
Nitin
Greg KH
2009-09-23 00:40:20 UTC
Permalink
Post by Nitin Gupta
Post by Greg KH
Post by Pekka Enberg
Post by Nitin Gupta
drivers/staging/Kconfig | 2 +
drivers/staging/Makefile | 1 +
drivers/staging/ramzswap/Kconfig | 21 +
drivers/staging/ramzswap/Makefile | 3 +
drivers/staging/ramzswap/ramzswap.txt | 51 +
drivers/staging/ramzswap/ramzswap_drv.c | 1462 +++++++++++++++++++++++++++++
drivers/staging/ramzswap/ramzswap_drv.h | 173 ++++
drivers/staging/ramzswap/ramzswap_ioctl.h | 50 +
drivers/staging/ramzswap/xvmalloc.c | 533 +++++++++++
drivers/staging/ramzswap/xvmalloc.h | 30 +
drivers/staging/ramzswap/xvmalloc_int.h | 86 ++
include/linux/swap.h | 5 +
mm/swapfile.c | 34 +
13 files changed, 2451 insertions(+), 0 deletions(-)
This diffstat is not up to date, I think.
Greg, would you mind taking this driver into staging? There are some
issues that need to be ironed out for it to be merged to kernel proper
but I think it would benefit from being exposed to mainline.
That would be fine, will there be a new set of patches for me to apply,
or is this the correct series?
This is the correct series.
Ok, thanks, I'll queue it up when I get back from the Plumbers
conference.

greg k-h