[original file is ccache_celf_2007-08-31_jp.pdf by Ikeda, NEC]
[translated by ikoma]
CELF Technical Jamboree #16
Aug. 31, 2007
[See original file for figures] (c) NEC Corporation 2006
Overview and Issues of Compressed Cache
Aug 31st, 2007[BR] IKEDA, Munehiro[BR]
m-ikeda at ds dot jp dot nec dot com
Page 2 (c) NEC Corporation 2006
Contents
* What is Compressed Cache?
* Overview of Implementation
* Evaluation of Effectiveness
* Issues
* Personal View for the Future
* Appendix : Resources
Page 3 (c) NEC Corporation 2006
What is Compressed Cache?
* Overview
- Project to improve performance by compressing and retaining page cache and/or anonymous memory on memory
* History
- For 2.4 kernel
- Rodrigo S. de Castro started the project
- For 2.6 kernel
- Nitin Guputa succeeded the project and completely reimplemented for 2.6 kernel. Now develpment is ongoing.
* Status
- The latest patch is for 2.6.21
- Still a lot of issues remain (see below for details)
Page 4 (c) NEC Corporation 2006
Overview of Implementation
[see original file for figure]
* Relation of data on memory and block device (vanilla)
+-----------+ +-------------+
| swap | | file system |
+-----------+ +-------------+
^ ^ block device
--------|----------------|---------------------------
v v memory
+----------+ +------------+
| anon mem | | page cache |
+----------+ +------------+Page 5 (c) NEC Corporation 2006
Overview of Implementation
[see original file for figure]
* Relation of data on memory and block device (compressed cache)
+-----------+ +-------------+
| swap | | file system |
+-----------+ +-------------+
^ ^ block device
--------|----------------|---------------------------
v v memory
+----------+ +------------+
| anon mem | | page cache |
+----------+ +------------+
^ ^
| |
v |
+ - - - - - - - - + |
+------------+ | |
| |virtual swap|<-------|---------------virtual swap devce
+------------+ | | swap_info_struct.prio==100
| ^ | (default : <0)
| + - - -|- - - - - +
| v v
+-----------------------------+ |
| | ccache |<------compressed data storage area
+-----------------------------+ |
+ - - - - - - - - - - - - - - - - - +
Compressed Cache implements this scopePage 6 (c) NEC Corporation 2006
Overview of Implementation
* State transition of page (vanilla)
+--------+ +----------+
| active | <----- | inactive |
| | -----> | |
+--------+ +----------+
* struct page is maintained with LRU
* data is "raw"Page 7 (c) NEC Corporation 2006
Overview of Implementation
* State transition of page (compressed cache)
+--------+ +----------+
| active | <----- | inactive |
| | -----> | |
+--------+ +----------+
^ /
\ / * struct page is maintained with LRU
\ / * data is "raw"
---------------------------------------------------------
\ / * data is managed with struct chunk_head and struct chunk
find_*_page*() \ / * data is compressed and stored in ccache area
--> handle_cache_fault() / pageout()
--> cc_readpage() \ V --> cc_writepage()
+------------+
| compressed |
+------------+* Pages stored in cache
- anon mem --- pages swapped out
- page cache (fs backed mem) ... clean pages
Page 8 (c) NEC Corporation 2006
Overview of Implementation
* Data management of ccache region
- In ccache, page data is compressed and stored
- Compressed data becomes of variable length. Divided as necessary, and each is managed as a "chunk"
- struct chunk_head and struch chunk are major data structures to manage chunks
/ +-----------------+ +-----------------+
| |struct chunk_head| |struct chunk_head|
(in use)| +-----------------+ +-----------------+
| | +------------+ | +------------+ +------------+ +------------+
| +--|struct chunk|-| +--|struct chunk|--|struct chunk|--|struct chunk|-|
\ +------------+ +------------+ +------------+ +------------+
| | | |
v v v v
------+------------------------------------------------------------------------------------
+ | + | | + | + | +
+ | + | | + | +<--physical page--->+
+ | + | | + | + | +
------+-------------------------------------------------------------------------------------
^ ^ ^ ^ ^
| | | | |
+-------------+ | | +------+ |
/ | | | | |
| +------------+ +------------+ +------------+ +------------+ +------------+
(vacant) | |struct chunk|-->|struct chunk|-->|struct chunk|-->|struct chunk|-->|struct chunk|--|
| +------------+ +------------+ +------------+ +------------+ +------------+
\ ^
|
struct chunk *free_headPage 9 (c) NEC Corporation 2006
Overview of Implementation
* Data storage into ccache
- In radix-tree node in address_space object, a pointer to chunk_head struct, instead of page struct, is stored
- anon page is left in swap cache even when stored in ccache (i.e. leave pointer to chunk_head in radix-tree in swapper_space)
* Data read from ccache
When PG_compressed flaged page is returned for find_*_page*() => radix_tree_*lookup*(), the pointer actually points to chunk_head. The data for the page is in ccache
- At page fault handling, page is first searched from swap cache with find_get_page(). Therefore, the rest of read out processing is done with same path both for page cache/anon page.
Page 10 (c) NEC Corporation 2006
Overview of Implementation
* Compression algorithms
- Three algorithms available at the moment (WKdm, WK4x4, LZO)
- Choice of algorithm is just cyclic at compression
- A patch available to fix an algorithm in proc
* Tuning Knobs
- /proc/sys/vm/max_fs_backed_cc_size
- specifies ccache size for page cache (in pages)
- /proc/sys/vm/max_anon_cc_size
- specifies ccache size for anon mem (in pages)
- /proc/ccache_stats
- statistic information
Page 11 (c) NEC Corporation 2006
Overview of Implementation
* Patch size
- 21[files] / 5700[lines]
- 10[files]/approx. 4300[lines] of wihch are compression algorithms
include/linux/WK4x4.h | 91 +++ include/linux/WKdm.h | 81 +++ include/linux/ccache.h | 155 +++++ include/linux/errno.h | 3 + include/linux/lzoconf.h | 499 ++++++++++++++ include/linux/minilzo.h | 97 +++ include/linux/page-flags.h | 17 + include/linux/swap.h | 1 + include/linux/sysctl.h | 2 + kernel/sysctl.c | 22 + lib/LZO/README.LZO | 123 ++++ lib/LZO/minilzo.c | 1574 ++++++++++++++++++++++++++++++++++++++++++++ lib/Makefile | 3 +- lib/WK4x4/README | 70 ++ lib/WK4x4/WK4x4.c | 999 ++++++++++++++++++++++++++++ lib/WKdm/WKdm.c | 779 ++++++++++++++++++++++ mm/Makefile | 2 +- mm/ccache.c | 889 +++++++++++++++++++++++++ mm/filemap.c | 133 ++++- mm/swapfile.c | 143 ++++- mm/vmscan.c | 39 +- 21 files changed, 5711 insertions(+), 11 deletions(-)
Page 12 (c) NEC Corporation 2006
Evaluation of Effects(1): Effect to Increase Maximum Use of Memory
* Evaluation method
- Repeat memory allocation and read/write of file, till COM Killer runs
- File used
- a) text file
- "I am a cat"(Novel) (794KB) (Aozora Bunko version)
- b) JPEG file
- CELF Tux (1.4MB)
- a) text file
- Compression algorithm is fixed to LZO
Page 13 (c) NEC Corporation 2006
Evaluation of Effects(1): Effect to Increase Maximum Use of Memory
* Evaluation results
[see original file for graphs]
Read size/times till OOM (text file)
Read size/times till OOM (JPEG file)- For compressible data (e.g.text file), effect proved to increase usable memory
- For incompressible data (e.g. JPEG file), no effect observed
- test environment
- Celeron(Mendocino) 400MHz, RAM 96MB(boot with “mem=32m”)
- Linux 2.6.21
- + patch-ccache-alpha-008-2.6.21
- + custom debug-patch
Test command: read_file <text/jpeg file>
- test environment
Page 14 (c) NEC Corporation 2006
Evaluation of Effects(2): Performance
* Evaluation method
- Time to read a file repeatedly for specified times, and to repeat reverse and re-reverse all bits of data for ten times
- Measured with repetition of reading a file and expanding onto memory
- 29 times (memory usage without using swap/ccache)
- 36 times (maximum value when ccache used)
- These values indicate load for memory
- Text file ("I am a cat", 794KB) is used to read
- Compression algorithm is fixed to LZO
Page 15 (c) NEC Corporation 2006
Evaluation of Effects(2): Performance
* Evaluation results (sys time)
[see original file for graphs]
Spent time to read and invert bits
mem=32MB + swap : sys
Spent time to read and invert bits
mem=32MB / ccache=16MB : sys- With ccache, sys time largely increase due to data compression/expansion
- sys time for ccache does not correlate clealy with memory load, and measured values vary widely
- test environment
- Celeron(Mendocino) 400MHz, RAM 96MB(boot with “mem=32m”)
- Linux 2.6.21
- + patch-ccache-alpha-008-2.6.21
- + custom debug-patch
Test command : reverse_file -d <N> -r 10 <text file>
- test environment
Page 16 (c) NEC Corporation 2006
Evaluation of Effects(2): Performance
* Evaluation results (user time)
[see original file for graphs]
Spent time to read and invert bits
mem=32MB + swap : user
Spent time to read and invert bits
mem=32MB / ccache=16MB : user- No significant difference observed for user time
- test environment
- Celeron(Mendocino) 400MHz, RAM 96MB(boot with “mem=32m”)
- Linux 2.6.21
- + patch-ccache-alpha-008-2.6.21
- + custom debug-patch
Test command : reverse_file -d <N> -r 10 <text file>
- test environment
Page 17 (c) NEC Corporation 2006
Evaluation of Effects(2): Performance
* Evaluation results (real time)
[see original file for graphs]
Spent time to read and invert bits
mem=32MB + swap : real
Spent time to read and invert bits
mem=32MB / ccache=16MB : real- For most conditions, the process with real swap completes in shorter time
- For ccache, shorter processing time observed for marginal load (see next slide)
- test environment
- Celeron(Mendocino) 400MHz, RAM 96MB(boot with “mem=32m”)
- Linux 2.6.21
- + patch-ccache-alpha-008-2.6.21
- + custom debug-patch
Test command : reverse_file -d <N> -r 10 <text file>
- test environment
Page 18 (c) NEC Corporation 2006
Evaluation of Effects(2): Performance
* Evaluation resuls (supplement: wait time)
[see original file for graphs]
Spent time to read and invert bits
mem=32MB + swap : wait = real - (user + sys)
Spent time to read and invert bits
mem=32MB / ccache=16MB : wait = real - (user + sys)- For real swap, wait time increases as memory load grows, while for ccache a peak observed in wait time
- Note: The kernel used for this evaluation, our original modification has made to avoid COM Killer. This fix calls schedule(), so scheduler may be executed more than necessary
- test environment
- Celeron(Mendocino) 400MHz, RAM 96MB(boot with “mem=32m”)
- Linux 2.6.21
- + patch-ccache-alpha-008-2.6.21
- + custom debug-patch
Test command : reverse_file -d <N> -r 10 <text file>
- test environment
- Note: The kernel used for this evaluation, our original modification has made to avoid COM Killer. This fix calls schedule(), so scheduler may be executed more than necessary
Page 19 (c) NEC Corporation 2006
Evaluation of Effects(2): Performance
* Evaluation results(supplement: result when abundant memory available)
[see original file for graphs]
Spent time to read and invert bits
mem=96MB : sys
Spent time to read and invert bits
mem=96MB : user
Spent time to read and invert bits
mem=96MB : real- test environment
- Celeron(Mendocino) 400MHz, RAM 96MB(boot with “mem=32m”)
- Linux 2.6.21
Test command : reverse_file -d <N> -r 10 <text file>
Page 20 (c) NEC Corporation 2006
Issues
* Issue on design and configuration
- Page cache stored in ccache is not released unless acccessed
- Already created and posted a patch to make it FIFO (LRU)
- Flagmentation of ccache region
- Nitin Guputa's "VStore" project
- Choise of compression algorithm
- LZO has been merged; LZO is the first choice at this moment
- Ccache size can not be changed
- Can not turn off once started
- Tricky relation with swap
- (perpetual swap-cache, vitual swap)
- etc., etc., etc....
Page 21 (c) NEC Corporation 2006
Issues
* Issues on implementation
- Config option required
- Need to break out patch
- A big monolithic patch now
- Need to clean up
Page 22 (c) NEC Corporation 2006
Personal View for the Future
* VSTore
- Effectiveness of VStore should be verified
- To begin with, test implementation in user space seems reasonable
- "Once this allocator is implemented, the ccaching project will reduce to simple use of the exported API from VStore and LZO modules." (Nitin Guputa, Jul 25th, 2007)
- Mainlining of VStore by itself is unlikely. The function needs users; VStore and compressed cache should be considered together.
* Layer structure
- It would be better to separate completely from swap layer for the compressed data to be swapped out
=> May be utilized as a path to swap on Flash
Page 23 (c) NEC Corporation 2006
Appendix : resources
* Project page / Wiki
* Mailing list
* OLS Papers / Presentations
- "Evaluating effects of cache memory compression on embedded systems"
(Anderson Farias Briglia, Allan Bezerra, Leonid Moiseichuk, Nitin Gupta)
Thank you
