Sunday, March 22, 2015

Optimizing Golang applications for memory-intensive workloads

For server-side applications that handle complex business logic, we usually replace the default glibc memory allocator with a more efficient one such as tcmalloc or jemalloc. For example, the redis key-value store uses jemalloc as its default allocator and bundles jemalloc's source code in its own source tree.

As I mentioned in this redis pull request, we have an in-house redis fork that is optimized for intensive workloads. One trick here is to tune the value of


LG_CHUNK_DEFAULT


The default value of LG_CHUNK_DEFAULT is 22, which means that jemalloc uses mmap/brk to claim 4M (2^22 bytes) of address space from the operating system at a time and uses that 4M to fill the slots of the different size classes. The value of LG_CHUNK_DEFAULT has changed from time to time, ranging from 19 to 22 (512K to 4M), because it is difficult to find a single value that fits every use case. For example, Firefox uses jemalloc as its default allocator. If we link this version of jemalloc into Firefox Mobile running on an old Android device (Moto G) that has only 512M of memory, the UI will hang very often because of:


1) the OOM killer


2) the existence of other memory-consuming apps or services


3) an mmap/brk granularity that is too large: it is difficult for the kernel to find a 4M chunk for you when there is only 512M of RAM
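
To make the numbers concrete, here is a minimal sketch that converts an LG_CHUNK_DEFAULT-style exponent into a chunk size. The helper names chunkBytes and human are made up for this illustration and are not part of jemalloc.

```go
package main

import "fmt"

// chunkBytes converts a log2 chunk setting (such as LG_CHUNK_DEFAULT)
// into a size in bytes. Hypothetical helper, not a jemalloc API.
func chunkBytes(lg uint) uint64 {
	return uint64(1) << lg
}

// human formats a power-of-two byte count as K or M.
func human(b uint64) string {
	if b >= 1<<20 {
		return fmt.Sprintf("%dM", b>>20)
	}
	return fmt.Sprintf("%dK", b>>10)
}

func main() {
	// The range of defaults mentioned above: 19 to 22.
	for lg := uint(19); lg <= 22; lg++ {
		fmt.Printf("LG_CHUNK_DEFAULT=%d -> %s per chunk\n", lg, human(chunkBytes(lg)))
	}
}
```

With a setting of 22, every chunk is 4M, which is 1/128 of the RAM on a 512M device.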




I am not sure how jemalloc was configured in the Firefox builds for mobile devices. But I am pretty sure that 4M is not enough for either redis or our heavily loaded server application, so we tune this value to something like 25 or 26 (32M or 64M) in our in-house fork of jemalloc. This optimizes the allocator mainly by giving it a more contiguous address space, which improves cache locality, and by reducing the overhead of the syscall interruptions caused by mmap calls and of initializing data structures after each mmap. However, this hack does not work for Golang applications, because Go uses its own allocator, written in Go, whose data structures and logic are similar to tcmalloc's.
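
As a rough back-of-the-envelope sketch of the syscall argument, the snippet below counts how many chunk-sized mappings are needed to cover a heap at different chunk settings. The 8G heap size is an assumption picked for illustration, and mmapCalls is a hypothetical helper; real allocators also reuse and coalesce chunks, so treat this as an upper-level estimate only.

```go
package main

import "fmt"

// mmapCalls returns how many chunk-sized mappings are needed to cover
// heapBytes when each mapping is 1<<lgChunk bytes, rounding up.
func mmapCalls(heapBytes uint64, lgChunk uint) uint64 {
	chunk := uint64(1) << lgChunk
	return (heapBytes + chunk - 1) / chunk
}

func main() {
	const heap = 8 << 30 // hypothetical 8G working set
	for _, lg := range []uint{22, 25, 26} {
		fmt.Printf("lg=%d: %d mappings to cover 8G\n", lg, mmapCalls(heap, lg))
	}
}
```

Going from 22 to 25 cuts the number of mappings (and the per-chunk setup work) by a factor of 8, and 26 cuts it by a factor of 16.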




After analyzing the source code of the Golang runtime allocator, we found that similar constants exist there:


1) HeapAllocChunk
2) the chunk constant inside the function persistentalloc

HeapAllocChunk determines how much memory to "ask" for (a term used in the source code) when the runtime calls mmap. The default value is 1M, which is 4x smaller than jemalloc's 4M. For our own workload specifically, we modified this value to 25 (as a power-of-two exponent, that is, 32M) for the reasons listed above.
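
The effect of this constant can be sketched as follows. This is a simplified model of the heap-growth sizing, not the runtime's actual code: the 8K page size, the rounding to 64K, and the names askBytes and roundUp are assumptions made for this illustration.

```go
package main

import "fmt"

const (
	pageShift = 13 // assumed 8K runtime pages
	pageSize  = 1 << pageShift
)

// roundUp rounds n up to a multiple of a; a must be a power of two.
func roundUp(n, a uintptr) uintptr {
	return (n + a - 1) &^ (a - 1)
}

// askBytes models how many bytes one heap-grow step requests from the OS
// for a request of npage pages, given a minimum chunk size.
func askBytes(npage, heapAllocChunk uintptr) uintptr {
	npage = roundUp(npage, (64<<10)/pageSize) // assumed: multiple of 64K
	ask := npage << pageShift
	if ask < heapAllocChunk {
		ask = heapAllocChunk // never ask for less than one chunk
	}
	return ask
}

func main() {
	small := uintptr(16) // 16 pages = 128K
	for _, chunk := range []uintptr{1 << 20, 1 << 25} {
		fmt.Printf("chunk=%dM: a 128K request asks the OS for %dM\n",
			chunk>>20, askBytes(small, chunk)>>20)
	}
}
```

In this model the constant only acts as a floor: a request that is already larger than the chunk size is passed through unchanged, so raising the value mainly batches many small growth steps into fewer large mmap calls.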

Like Golang's own HTTP server implementation, we use one goroutine to handle each incoming request, which spares us the event loop and epoll plumbing. In a busy web server, we discovered that persistentalloc was called very often to allocate space for metadata related to each goroutine. In the definition of persistentalloc, the Go authors chose to allocate only 256K more memory each time the existing memory is used up. In a busy web server with more than 200k concurrent connections, expanding the metadata store by only 256K at a time while running 200k goroutines or more is simply not reasonable. So we tuned this value to 1M to better fit our own workload.
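
To illustrate why the chunk size matters here, below is a toy bump allocator in the spirit of persistentalloc. It is a simplified sketch, not the runtime's implementation: it ignores alignment and large-block handling, and the 512 bytes of metadata per goroutine is a made-up figure chosen only to make the arithmetic easy to follow.

```go
package main

import "fmt"

// persistent is a toy bump allocator: it hands out space from the current
// chunk and requests a fresh chunk when the current one is exhausted.
type persistent struct {
	chunk   int // bytes obtained per refill
	off     int // bytes used in the current chunk
	refills int // number of chunks requested so far
}

// alloc reserves size bytes; it assumes size <= p.chunk.
func (p *persistent) alloc(size int) {
	if p.refills == 0 || p.off+size > p.chunk {
		p.refills++ // in a real allocator this is where memory is requested
		p.off = 0
	}
	p.off += size
}

// refillsFor counts the chunk refills needed for n allocations of size bytes.
func refillsFor(chunk, n, size int) int {
	p := &persistent{chunk: chunk}
	for i := 0; i < n; i++ {
		p.alloc(size)
	}
	return p.refills
}

func main() {
	const goroutines = 200000
	const metaBytes = 512 // hypothetical per-goroutine metadata size
	for _, chunk := range []int{256 << 10, 1 << 20} {
		fmt.Printf("chunk=%dK: %d refills\n", chunk>>10,
			refillsFor(chunk, goroutines, metaBytes))
	}
}
```

Under these assumptions, 200k goroutines trigger 391 refills with a 256K chunk and 98 with a 1M chunk, which is the 4x reduction we were after.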