Tuesday, August 16, 2016

Amazon Aurora is very expensive!

A few months ago, we migrated one of our databases, built on top of an EC2 m3.xlarge instance with GP2 disks in us-west-1, to Aurora in the us-west-2 region.

Some observations:


  • Aurora is only available in regions with at least 3 availability zones. In us-west-1 (Northern California), us-west-1a is already full, so we could not use Aurora in that region.

  • IP traffic between us-west-1 and us-west-2 goes through a dedicated tunnel, so latency and bandwidth are guaranteed.

  • But Aurora is expensive, and it will not save you the 90% of cost claimed on the AWS login screen. Specifically, Aurora charges you for every I/O request you make, which is horrible.

We have approximately 50GB of data and a QPS of about 300-500; it runs smoothly on a Dell R430 with a PERC H310 (no RAID buffer) in RAID 0 mode. $1000 per month is expensive for this kind of workload.

How to build a gateway to Amazon S3 with the AWS Golang SDK and IAM role functionality

Previously, we had an in-house RESTful API gateway to the Amazon S3 backend that signed every client request with the AWS V4 signing standard. It was built with nginx plus the Lua module, which is claimed to be able to handle 10k requests per second. But we discovered a few problems after it had been running in production for half a year.


  1. About 10% of requests got a 403 Forbidden status code back from the S3 backend
  2. nginx is not able to follow the 60s TTL announced by the S3 DNS record (the nginx community version does not support dynamic DNS resolution)
  3. nginx+lua is not able to handle the 100-continue return code from the S3 backend
  4. nginx+lua sometimes generates unnecessary disk I/O
  5. The S3 gateway's performance is bound by the S3 API request limit, not by the nginx QPS limit
  6. The in-house S3 gateway is not able to use an IAM role to eliminate the risk of key loss
  7. The in-house S3 gateway is subject to future changes in the AWS signing method.
So we planned to refactor this service using a native AWS SDK. Two candidates were selected: boto3 and the AWS Golang SDK. We picked aws-golang-sdk in the end because:

  1. boto3 does not support the Python 2.6 that ships with our CentOS 6 environment
  2. adding an additional HTTP server framework on top of boto3 makes the deployment process more complex
  3. Golang has a better memory footprint than Python
Below are some code snippets; a short sketch that ties them together follows at the end:

  • Use static credentials for an instance outside EC2, or an existing EC2 instance without an IAM role

   // requires "github.com/aws/aws-sdk-go/aws" and "github.com/aws/aws-sdk-go/aws/credentials"
   var conf *aws.Config
   conf = &aws.Config{
        Region:      aws.String(REGION),
        Credentials: credentials.NewStaticCredentials("", "", ""), // access key id, secret key, session token
   }
  • Use IAM role credentials

    // requires "github.com/aws/aws-sdk-go/aws/credentials/ec2rolecreds",
    // "github.com/aws/aws-sdk-go/aws/ec2metadata" and "github.com/aws/aws-sdk-go/aws/session"
    conf = &aws.Config{
        Region:      aws.String(REGION),
        Credentials: credentials.NewCredentials(&ec2rolecreds.EC2RoleProvider{Client: ec2metadata.New(session.New())}),
    }

    OR (the AWS SDK, like the aws-cli, falls back to the instance IAM role by default)

    conf = &aws.Config{
        Region: aws.String(REGION),
    }

  • Get the HTTP status code from the S3 backend
   // requires "github.com/aws/aws-sdk-go/aws/awserr"
   func get_http_status_code_from_error(err error) (int, string) {
        // S3 failures surface as awserr.RequestFailure, which carries the backend status code
        if awsErr, ok := err.(awserr.RequestFailure); ok {
            return awsErr.StatusCode(), awsErr.Code()
        } else {
            return 500, err.Error()
        }
   }
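
Putting these snippets together: below is a minimal sketch of how a GET handler could proxy a request to S3 through the SDK client, reusing the conf value and the error helper above. The handler name, the bucket parameter, and the path-to-key mapping are placeholders for illustration, not our production gateway code.

    // assumed imports:
    //     "io"
    //     "net/http"
    //     "github.com/aws/aws-sdk-go/aws"
    //     "github.com/aws/aws-sdk-go/aws/session"
    //     "github.com/aws/aws-sdk-go/service/s3"

    func newS3Service(conf *aws.Config) *s3.S3 {
        // one shared S3 client built from the config constructed above
        return s3.New(session.New(conf))
    }

    func s3GetHandler(svc *s3.S3, bucket string) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            out, err := svc.GetObject(&s3.GetObjectInput{
                Bucket: aws.String(bucket),         // placeholder bucket name
                Key:    aws.String(r.URL.Path[1:]), // map the request path to the object key
            })
            if err != nil {
                code, msg := get_http_status_code_from_error(err)
                http.Error(w, msg, code)
                return
            }
            defer out.Body.Close()
            if out.ContentType != nil {
                w.Header().Set("Content-Type", *out.ContentType)
            }
            io.Copy(w, out.Body) // stream the object body back to the client
        }
    }

    // example wiring (placeholder bucket name):
    //     http.Handle("/", s3GetHandler(newS3Service(conf), "my-bucket"))
    //     http.ListenAndServe(":8080", nil)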

Tuesday, July 26, 2016

Hit a kernel bug related to Intel® I/O Acceleration Technology on a RHEL 7.1 system

Recently we tweaked some of the BIOS settings on our Dell R720 systems; specifically, we enabled the "I/OAT DMA Engine" listed on the Integrated Devices page, which is supposed to improve I/O performance when used with a recent kernel (not 2.6.x).



However, in our production systems we received many alarms about CPU usage and slow I/O, all from VM instances running on bare-metal servers with I/OAT enabled. After we logged into one of those machines, we found that cpu0 was busy handling softirqs.

At the beginning we suspected this was an interrupt balancing problem, so we enabled irqbalance, but it didn't work, because, unlike I/O interrupts and network interrupts, softirqs have only a single queue and cannot be redistributed among multiple cores in the same host.
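
For reference, /proc/softirqs exposes the per-CPU softirq counters, which is how you can confirm that only cpu0's counters are climbing. A throwaway sketch (not part of our monitoring tooling) that dumps it twice, one second apart:

    package main

    import (
        "fmt"
        "io/ioutil"
        "time"
    )

    func dump() {
        data, err := ioutil.ReadFile("/proc/softirqs") // per-CPU softirq counters
        if err != nil {
            fmt.Println(err)
            return
        }
        fmt.Print(string(data))
    }

    func main() {
        dump()
        time.Sleep(time.Second)
        fmt.Println("---- one second later ----")
        dump()
    }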


We compared every aspect of the system configuration extracted from bare-metal hosts with and without this problem, and found in the dmesg log that I/OAT was the root cause. After reverting that change, the problem was resolved.

Related links:
https://bugs.centos.org/view.php?id=8778
https://access.redhat.com/solutions/1409393
https://access.redhat.com/articles/879293

Wednesday, August 12, 2015

On the effect of the two major journaling modes of the ext4 file system

There is a long-established debate (lwn.net/Articles/327994, lwn.net/Articles/328363/) on whether to use data=writeback as the default mount option for the ext4 filesystem. Even though the upstream kernel uses writeback as the default, RHEL 6.4 does not; the code snippet that controls the default journal options is as follows:

switch (test_opt(sb, DATA_FLAGS)) {
    case 0:
        /* No mode set, assume a default based on the journal
         * capabilities: ORDERED_DATA if the journal can
         * cope, else JOURNAL_DATA
         */
        if (jbd2_journal_check_available_features
            (sbi->s_journal, 0, 0, JBD2_FEATURE_INCOMPAT_REVOKE))
            set_opt(sb, ORDERED_DATA);
        else
            set_opt(sb, JOURNAL_DATA);
        break;

So, what's the point of using ORDERED as the default? RHEL targets high-end enterprise systems, so data consistency is critical. When we looked into the write_end function in the ext4 source code, we found the following:

  if(ext4_test_inode_state(inode, EXT4_STATE_ORDERED_MODE)) {
        ret = ext4_jbd2_file_inode(handle, inode);
        if (ret) {
            unlock_page(page);
            page_cache_release(page);
            goto errout;
        }
    } 


If ORDERED_MODE is set, every write(2) call sends an inode attribute change request to the jbd2 (journaling block device) layer, and the transaction is filed in the journal, which sits in the middle of the disk (see get_midpoint_journal_block in mke2fs). In any case, this causes disk seek operations and increases I/O latency.


In our R720 servers, we use a battery-backed RAID card with 1GB of write cache. This means that when we hit a power failure, our I/O subsystem has roughly another 40 minutes to survive and flush the write cache. The implication of this hardware spec for ext4 and jbd2 is that we can enable async_commit and disable barriers to reduce the SYNC commands sent to the SCSI device, and avoid unnecessary seeking altogether.

Saturday, July 11, 2015

Implementing a read-write lock with upgrade capability in Golang

In order to increase the performance of our server system, we began to switch to Golang for web server development a few months ago. One benefit that Golang offers is "concurrency". This kind of concurrency is different from the traditional threading paradigm in the sense that it can evenly distribute your workload across your SMP system with little overhead.


However, when we need to access globally shared objects inside goroutines, concurrency can become a bad thing, for the following reasons: release/acquire semantics, visibility of changes, memory fences, and so on.


1) For operations that are atomic at the assembly-language level, use the atomic package to ensure no stale data is read (a tiny example follows point 2).


2) For other operations that are not atomic, proper locking is needed.
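
As a tiny example of point 1), the atomic package can publish and read a flag without locking; the variable and function names below are made up for illustration:

    import "sync/atomic"

    var ready uint32 // 0 = not ready, 1 = ready

    func markReady() { atomic.StoreUint32(&ready, 1) }            // write is visible to all goroutines
    func isReady() bool { return atomic.LoadUint32(&ready) == 1 } // never reads a stale cached value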


Here is the code for a simple RW lock with upgrade (mixed with other logic):


func (srv *Server) mapStringToInt(app_version string) uint32 {
        // fast path: take the read lock and return the id if it already exists
        srv.sti_mu.RLock()
        if element, exist := srv.string_to_int[app_version]; exist {
                srv.sti_mu.RUnlock()
                return element
        }
        srv.sti_mu.RUnlock()
        // "upgrade": release the read lock, reserve a new id, then take the write lock
        temp := atomic.AddUint32(&srv.sti_cnt, 1)
        srv.sti_mu.Lock()
        // re-check under the write lock: another goroutine may have inserted
        // the key between RUnlock and Lock
        if element, exist := srv.string_to_int[app_version]; exist {
                srv.sti_mu.Unlock()
                return element
        } else {
                srv.string_to_int[app_version] = temp
                srv.sti_mu.Unlock()
                return temp
        }
}
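
For context, here is a minimal sketch of the Server fields that mapStringToInt assumes; the field names come from the snippet above, everything else is an assumption:

    import "sync" // mapStringToInt additionally needs "sync/atomic"

    type Server struct {
        sti_mu        sync.RWMutex      // guards string_to_int
        string_to_int map[string]uint32 // app_version -> numeric id
        sti_cnt       uint32            // id counter, bumped with atomic.AddUint32
    }

    func NewServer() *Server {
        // the map must be initialized before mapStringToInt is called
        return &Server{string_to_int: make(map[string]uint32)}
    }

Note that because the read lock is dropped before the write lock is taken, two goroutines can both miss on the fast path; the second lookup under the write lock keeps the map consistent, at the cost of occasionally burning an id from sti_cnt.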

Sunday, March 22, 2015

Optimizing Golang applications for memory-intensive workloads

For server-side applications that handle complex business logic, we usually replace the default glibc memory allocator with a more efficient one like tcmalloc or jemalloc. For example, the redis key-value store uses jemalloc as its default allocator and merges jemalloc's source code into its source tree.

As I mentioned in this redis pull request, we have an in-house redis fork that is optimized for intensive workloads; one trick here is to tune the value of


LG_CHUNK_DEFAULT


The default value of LG_CHUNK_DEFAULT is 22, which means jemalloc uses mmap/brk to claim 4M worth of address space from the operating system at a time and uses that 4M to fill the slots of the different size classes. The value of LG_CHUNK_DEFAULT has changed from time to time, ranging from 19 to 22 (512K - 4M), because it is difficult to find a value that fits every use case. For example, Firefox uses jemalloc as its default allocator; if we link this version of jemalloc to Firefox Mobile running on an old Android device (a Moto G) that only has 512M of memory, the UI will hang very often because of:


1) the OOM killer


2) the existence of other memory-consuming apps or services


3) the granularity of each mmap/brk being too large; it is difficult for the kernel to find a free 4M chunk when the device only has 512M of RAM




I am not sure how jemalloc is configured in the Firefox builds for mobile devices. But I am pretty sure that 4M is not enough for either redis or our heavily loaded server applications, so we tuned this value to something like 25 or 26 (32M - 64M) in our in-house fork of jemalloc. The main goals are a more contiguous address space, to enhance cache locality, and less overhead from the syscall interruptions caused by mmap calls and from initializing data structures after each mmap. However, this hack does not work for Golang applications, because Go uses its own allocator, written in Golang, whose data structures and logic are similar to tcmalloc.




After analyzing the source code of the Golang runtime allocator, we found that similar constants exist there:


1) HeapAllocChunk 
2) the chunk constant inside the function persistentalloc

For HeapAllocChunk, it determines how much memory to "ask" for (a term used in the source code) when the runtime calls mmap. The default value is 1M, which is 4X smaller than that of jemalloc. For our own workload specifically, we modified this value to 2^25 (32M), for the reasons listed above.
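
You can watch the effect of this from user space without patching the runtime. Below is a hedged sketch using runtime.ReadMemStats (standard library, nothing specific to our fork) that prints how much heap address space the runtime has asked the OS for (HeapSys) versus what the program is actually using (HeapAlloc) while the heap grows:

    package main

    import (
        "fmt"
        "runtime"
    )

    func main() {
        var m runtime.MemStats
        var keep [][]byte
        for i := 0; i < 10; i++ {
            runtime.ReadMemStats(&m)
            // HeapSys: heap address space obtained from the OS
            // HeapAlloc: bytes currently allocated and still in use
            fmt.Printf("HeapSys=%dK HeapAlloc=%dK\n", m.HeapSys/1024, m.HeapAlloc/1024)
            keep = append(keep, make([]byte, 1<<20)) // keep 1M live per iteration to force heap growth
        }
        _ = keep
    }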

Like Golang's own http server implementation, we use one goroutine to handle every incoming request to simplified eventloop and epoll stuff. In a busy web server, we discover that persistentalloc was called very often to allocate space for storing metadata related to each goroutine. Inside the definition of persistentalloc the author of golang define that, we will only allocate  256K more memory when we eat up all the existing memory. In a busy web server that have more than 200k concurrent connection. Expanding the metadata store 256k every time when we have 200k goroutine or more is simply not reasonable. So we tune this value to 1M and make it more fit to our own workload.