There is a long-established debate (lwn.net/Articles/327994, lwn.net/Articles/328363/) on whether data=writeback should be the default mount option for the ext4 filesystem. Even though the upstream kernel has used writeback as a default, RHEL 6.4 does not. The code snippet that controls the default journal options is as follows:
switch (test_opt(sb, DATA_FLAGS)) {
case 0:
	/* No mode set, assume a default based on the journal
	 * capabilities: ORDERED_DATA if the journal can
	 * cope, else JOURNAL_DATA
	 */
	if (jbd2_journal_check_available_features
	    (sbi->s_journal, 0, 0, JBD2_FEATURE_INCOMPAT_REVOKE))
		set_opt(sb, ORDERED_DATA);
	else
		set_opt(sb, JOURNAL_DATA);
	break;
So what is the point of using ORDERED as the default? RHEL targets high-end enterprise systems, so data consistency is critical. When we look into the write_end function in the ext4 source code, we find:
if (ext4_test_inode_state(inode, EXT4_STATE_ORDERED_MODE)) {
	ret = ext4_jbd2_file_inode(handle, inode);
	if (ret) {
		unlock_page(page);
		page_cache_release(page);
		goto errout;
	}
}
If ORDERED_MODE is set, every write(2) call files the inode with the jbd2 (journaling block device) layer, and the resulting transaction is written to the journal, which sits in the middle of the disk (see get_midpoint_journal_block in mke2fs). In any case, this causes disk seeks and increases IO latency.
In our R720 servers we use a battery-backed RAID card with a 1G write cache. This means that when we hit a power failure, the IO subsystem has another 40 minutes to survive and flush the write cache. The implication of this hardware spec for ext4 and jbd is that we can enable async_commit and disable barriers, which reduces the number of SYNC commands sent to the SCSI device and cuts down on unnecessary seeking.
Wednesday, August 12, 2015
Saturday, July 11, 2015
Implementing a read-write lock with upgrade capability in Golang
To increase the performance of our server system, we began switching to Golang for web server development a few months ago. One benefit Golang offers is "concurrency". This kind of concurrency differs from the traditional threading paradigm in the sense that it can evenly distribute your workload across your SMP system with little overhead.
However, when we need to access globally shared objects inside goroutines, concurrency can become a bad thing, for the following reasons: release/acquire semantics, visibility of changes, memory fences and others. We deal with this in two ways:
1) for operations that are atomic under assembly language semantics, ensure that no stale data is read by using the atomic package
2) for other operations that are not atomic, proper locking is needed.
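For the first case, here is a minimal sketch of a shared counter updated through sync/atomic. The function name countWithAtomic is made up for illustration and is not part of our server code:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// countWithAtomic (hypothetical name) starts n goroutines that each
// increment a shared counter once, then returns the final value.
func countWithAtomic(n int) uint32 {
	var counter uint32
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// One atomic read-modify-write, so no increment is lost.
			atomic.AddUint32(&counter, 1)
		}()
	}
	wg.Wait()
	// Atomic load, so we never observe a stale value.
	return atomic.LoadUint32(&counter)
}

func main() {
	fmt.Println(countWithAtomic(1000)) // prints 1000
}
```

With a plain counter++ instead of atomic.AddUint32, the result could come out lower than n, and the race detector (go run -race) would report a data race.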
Here is the code for a simple RW lock with upgrade (mixed in with other logic):
func (srv *Server) mapStringToInt(app_version string) uint32 {
	srv.sti_mu.RLock()
	// fast path: the mapping already exists, a read lock is enough
	if element, exist := srv.string_to_int[app_version]; exist {
		srv.sti_mu.RUnlock()
		return element
	}
	srv.sti_mu.RUnlock()
	// slow path: reserve a new value, then "upgrade" by taking the write lock
	temp := atomic.AddUint32(&srv.sti_cnt, 1)
	srv.sti_mu.Lock()
	// re-check: another goroutine may have inserted the key in the meantime
	if element, exist := srv.string_to_int[app_version]; exist {
		srv.sti_mu.Unlock()
		return element
	} else {
		srv.string_to_int[app_version] = temp
		srv.sti_mu.Unlock()
		return temp
	}
}
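For readers who want to try the pattern in isolation, the following is a self-contained sketch of the same idea. The type registry and the names newRegistry and idFor are hypothetical stand-ins for the Server fields above. One deliberate difference from our code: the counter is incremented under the write lock rather than with the atomic package, so no value is skipped when a goroutine loses the race.

```go
package main

import (
	"fmt"
	"sync"
)

// registry is a hypothetical stand-in for the Server fields used above.
type registry struct {
	mu   sync.RWMutex
	ids  map[string]uint32
	next uint32 // guarded by mu
}

func newRegistry() *registry {
	return &registry{ids: make(map[string]uint32)}
}

// idFor returns the ID for key, assigning the next free one on first use.
func (r *registry) idFor(key string) uint32 {
	// Fast path: a shared (read) lock is enough for a lookup.
	r.mu.RLock()
	id, ok := r.ids[key]
	r.mu.RUnlock()
	if ok {
		return id
	}

	// Slow path: sync.RWMutex cannot be upgraded in place, so we drop
	// the read lock, take the write lock, and check again in case
	// another goroutine inserted the key in between.
	r.mu.Lock()
	defer r.mu.Unlock()
	if id, ok := r.ids[key]; ok {
		return id
	}
	r.next++
	r.ids[key] = r.next
	return r.next
}

func main() {
	r := newRegistry()
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			r.idFor("1.0.0")
		}()
	}
	wg.Wait()
	fmt.Println(r.idFor("1.0.0"), r.idFor("1.0.1")) // prints "1 2"
}
```

The second lookup under the write lock is what makes the "upgrade" safe: between RUnlock and Lock any number of other goroutines may have run.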
Sunday, March 22, 2015
Optimizing Golang applications for memory-intensive workloads
For server-side applications that handle complex business logic, we usually replace the default glibc memory allocator with a more efficient one such as tcmalloc or jemalloc. For example, the redis key-value store uses jemalloc as its default allocator and merges jemalloc's source code into its own source tree.
As I mentioned in this redis pull request, we have an in-house redis fork that is optimized for intensive workloads. One trick here is to tune the value of
LG_CHUNK_DEFAULT
The default value of LG_CHUNK_DEFAULT is 22, which means that jemalloc uses mmap/brk to claim 4M of address space from the operating system at a time and uses this 4M to fill the slots of the different size classes. The value of LG_CHUNK_DEFAULT has changed from time to time, ranging from 19 to 22 (512K - 4M), because it is difficult to find a value that fits every use case. For example, Firefox uses jemalloc as its default allocator. If we linked this version of jemalloc into Firefox Mobile running on an old Android device (Moto G) with only 512M of memory, the UI would hang very often because of:
1) the OOM killer
2) the existence of other memory-consuming apps or services
3) the granularity of each mmap/brk being too large: it is difficult for the kernel to find a 4M chunk when there is only 512M of RAM
I am not sure how jemalloc is configured in Firefox builds for mobile devices. But I am pretty sure that 4M is not enough for either redis or our heavily loaded server application, so we tune this value to something like 25 or 26 in our in-house fork of jemalloc. The aim is mainly a more contiguous address space, which improves cache locality, and less overhead from the syscall interruptions caused by mmap calls and from the data structure initialization after each mmap. However, this hack does not work for Golang applications, because Golang uses its own allocator, written in Golang, whose data structures and logic are similar to tcmalloc's.
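To put the LG_CHUNK_DEFAULT values mentioned above in concrete terms: the setting is a base-2 logarithm, so the chunk size is 1 shifted left by that value. A small sketch (the helper name chunkBytes is ours, not jemalloc's):

```go
package main

import "fmt"

// chunkBytes (hypothetical helper) converts a log2 chunk setting such as
// LG_CHUNK_DEFAULT into a size in bytes.
func chunkBytes(lg uint) uint64 {
	return uint64(1) << lg
}

func main() {
	// 19 and 22 are the historical bounds, 25 and 26 our in-house values.
	for _, lg := range []uint{19, 22, 25, 26} {
		b := chunkBytes(lg)
		fmt.Printf("lg=%d -> %d bytes (%d KiB)\n", lg, b, b>>10)
	}
}
```

So 19 corresponds to 512K, 22 to 4M, 25 to 32M and 26 to 64M per chunk.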
After analyzing the source code of the Golang runtime allocator, we found that similar constants exist there:
1) HeapAllocChunk
2) the chunk constant inside the function persistentalloc
HeapAllocChunk determines how much memory to "ask" for (a term used in the source code) when calling mmap. The default value is 1M, which is 4X smaller than jemalloc's. For our own workload specifically, we raised this value to the equivalent of 25 (32M) for the reasons listed above.
Like Golang's own http server implementation, we use one goroutine per incoming request to simplify the event loop and epoll handling. In a busy web server, we discovered that persistentalloc was called very often to allocate space for the metadata related to each goroutine. Inside persistentalloc, the Golang authors chose to allocate only 256K more memory each time the existing memory is used up. In a busy web server with more than 200k concurrent connections, expanding the metadata store by 256K at a time for 200k goroutines or more is simply not reasonable. So we tuned this value to 1M to better fit our own workload.
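A rough way to see the effect of the chunk size is to count how many refills are needed to grow the store by a given amount. The sketch below is plain arithmetic, not runtime code; the helper name requests and the 64M total are illustrative assumptions, not measurements from our servers:

```go
package main

import "fmt"

// requests (hypothetical helper) returns how many fixed-size chunk
// requests are needed to cover total bytes, using ceiling division.
func requests(total, chunk uint64) uint64 {
	return (total + chunk - 1) / chunk
}

func main() {
	// Illustrative amount of metadata growth; an assumed figure.
	const total = 64 << 20

	fmt.Println(requests(total, 256<<10)) // 256K chunks: prints 256
	fmt.Println(requests(total, 1<<20))   // 1M chunks: prints 64
}
```

Going from 256K to 1M cuts the number of refills by a factor of four for the same amount of growth.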