Tuesday, August 16, 2016

Amazon Aurora is very expensive!

A few months ago, we migrated one of our databases, built on EC2 m3.xlarge instances with GP2 disks in us-west-1, to Aurora in the us-west-2 region.

Some observations:


  • Aurora is only available in regions with at least 3 AZs. In us-west-1 (Northern California), us-west-1a is already full, so we cannot use Aurora in that region.

  • IP traffic between us-west-1 and us-west-2 goes through a dedicated tunnel, so latency and bandwidth are guaranteed.

  • But Aurora is expensive, and it will not save you 90% of the cost, as stated on the AWS login screen. Specifically, Aurora charges you for every I/O request you make, which is horrible.




We have approximately 50 GB of data and a QPS of about 300-500. It runs smoothly on a Dell R430 with a PERC H310 (no RAID buffer) in RAID 0 mode. $1000 per month is expensive for this kind of workload.
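
To get a feel for how per-request I/O billing adds up, here is a rough sketch in Go. The price of $0.20 per million I/O requests and the number of billed I/Os per query are assumptions for illustration, not figures from our bill; 400 QPS is simply the midpoint of the range above.

```go
package main

import "fmt"

// monthlyIOCost estimates a monthly I/O charge in US dollars.
// qps is the average queries per second, iosPerQuery is the ASSUMED average
// number of billed I/O requests per query, and pricePerMillion is the
// ASSUMED price per one million I/O requests.
func monthlyIOCost(qps, iosPerQuery, pricePerMillion float64) float64 {
	const secondsPerMonth = 30 * 24 * 60 * 60 // 30-day month
	requests := qps * iosPerQuery * secondsPerMonth
	return requests / 1e6 * pricePerMillion
}

func main() {
	// Assumed inputs: 400 QPS and $0.20 per million I/O requests.
	for _, ios := range []float64{1, 2, 4} {
		fmt.Printf("%.0f I/O per query: $%.2f per month\n",
			ios, monthlyIOCost(400, ios, 0.20))
	}
}
```

Under these assumptions, a single billed I/O per query already comes to roughly $207 per month before instance and storage charges, and the figure scales linearly with the number of I/Os each query triggers.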

How to build a gateway to Amazon S3 with the AWS Golang SDK and IAM role functionality

Previously, we had an in-house RESTful API gateway in front of the Amazon S3 backend, signing every client request with AWS Signature Version 4. It was built with nginx and the Lua module, which is claimed to be able to handle 10k requests per second. But we discovered a few problems after it had been running in production for half a year.


  1. About 10% of requests get a 403 Forbidden status code back from the S3 backend
  2. nginx is not able to honor the 60s TTL announced by the S3 DNS records (the nginx community version does not support dynamic DNS resolution)
  3. nginx+lua is not able to handle the 100-continue status code returned by the S3 backend
  4. nginx+lua sometimes generates unnecessary disk I/O
  5. S3 gateway performance is bound by the S3 API request limit, not by the nginx QPS limit
  6. The in-house S3 gateway is not able to use an IAM role to eliminate the risk of key loss
  7. The in-house S3 gateway is exposed to future changes in the AWS signing method

So we planned to rewrite this service using a native AWS SDK. Two candidates were selected: boto3 and the AWS Golang SDK. We picked aws-golang-sdk in the end because:

  1. boto3 did not support Python 2.6 in the CentOS 6 environment
  2. adding an additional HTTP server framework on top of boto3 makes the deployment process more complex
  3. Go has a smaller memory footprint than Python

Below are some code snippets:

  • Use static credentials for instances outside EC2, or for existing EC2 instances without an IAM role

    var conf *aws.Config
    conf = &aws.Config{
        Region: aws.String(REGION),
        // Arguments: access key ID, secret access key, session token.
        Credentials: credentials.NewStaticCredentials("", "", ""),
    }

  • Use IAM role credentials

    conf = &aws.Config{
        Region:      aws.String(REGION),
        Credentials: credentials.NewCredentials(&ec2rolecreds.EC2RoleProvider{Client: ec2metadata.New(session.New())}),
    }

    Or leave Credentials unset (the AWS SDKs, including aws-cli, fall back to the instance IAM role by default when no other credentials are configured):

    conf = &aws.Config{
        Region: aws.String(REGION),
    }

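  • Get the HTTP status code returned by the S3 backend

    func get_http_status_code_from_error(err error) (int, string) {
        // awserr.RequestFailure carries the HTTP status and the service error code.
        if awsErr, ok := err.(awserr.RequestFailure); ok {
            return awsErr.StatusCode(), awsErr.Code()
        }
        // Not a service response error: report it as an internal error.
        return 500, err.Error()
    }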
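
As an illustration of how the gateway might use such a helper, here is a minimal sketch that relays the backend status to the client. It uses only the standard library so it runs on its own: `requestFailure`, `fakeFailure`, `statusFromError` and `relay` are hypothetical names, and the small interface merely mirrors the two `awserr.RequestFailure` methods used above. It is not our production handler.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// requestFailure is a hypothetical local stand-in listing only the two
// methods of awserr.RequestFailure that the helper above relies on.
type requestFailure interface {
	error
	StatusCode() int
	Code() string
}

// fakeFailure is a made-up error type used only to exercise the sketch.
type fakeFailure struct {
	status int
	code   string
}

func (f fakeFailure) Error() string   { return f.code }
func (f fakeFailure) StatusCode() int { return f.status }
func (f fakeFailure) Code() string    { return f.code }

// statusFromError returns the backend status and error code when err carries
// them, and falls back to 500 with the error text otherwise.
func statusFromError(err error) (int, string) {
	if rf, ok := err.(requestFailure); ok {
		return rf.StatusCode(), rf.Code()
	}
	return http.StatusInternalServerError, err.Error()
}

// relay wraps a fetch function (hypothetical; it stands in for the S3 call)
// and passes any backend failure to the client with the same status code.
func relay(fetch func(key string) ([]byte, error)) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		body, err := fetch(r.URL.Path)
		if err != nil {
			status, code := statusFromError(err)
			http.Error(w, code, status)
			return
		}
		w.Write(body)
	}
}

func main() {
	// A backend that always fails with a service-style error.
	h := relay(func(string) ([]byte, error) {
		return nil, fakeFailure{status: 404, code: "NoSuchKey"}
	})
	rec := httptest.NewRecorder()
	h(rec, httptest.NewRequest("GET", "/missing", nil))
	fmt.Println("relayed status:", rec.Code)

	// A plain error with no status attached falls back to 500.
	fmt.Println(statusFromError(errors.New("connection reset")))
}
```

Passing the backend status through unchanged lets clients tell a missing object apart from a permission problem, instead of seeing a generic gateway error.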

Tuesday, July 26, 2016

Hit a kernel bug related to Intel® I/O Acceleration Technology on a RHEL 7.1 system

Recently we tweaked some of the BIOS settings on our Dell R720 systems. Specifically, we enabled the "I/OAT DMA Engine" option listed on the Integrated Devices page, which is supposed to improve I/O performance when used with a recent kernel (not 2.6.x).



However, in our production system we received many alarms about CPU usage and slow I/O, all from VM instances running on bare-metal servers with I/OAT enabled. After logging into one of those machines, we found that cpu0 was busy handling softirqs.
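
One way to confirm this kind of imbalance is to look at the per-CPU counters in /proc/softirqs. The sketch below parses text in that layout and reports how much of the softirq work landed on one CPU. The function names and the sample numbers are made up for illustration; they are not readings from our hosts.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseSoftirqs parses text in the /proc/softirqs layout and returns, for
// each softirq type, its counters in CPU order.
func parseSoftirqs(text string) (map[string][]uint64, error) {
	out := make(map[string][]uint64)
	for i, line := range strings.Split(text, "\n") {
		fields := strings.Fields(line)
		if i == 0 || len(fields) < 2 {
			continue // skip the CPU header row and blank lines
		}
		name := strings.TrimSuffix(fields[0], ":")
		for _, f := range fields[1:] {
			n, err := strconv.ParseUint(f, 10, 64)
			if err != nil {
				return nil, fmt.Errorf("softirq %s: %v", name, err)
			}
			out[name] = append(out[name], n)
		}
	}
	return out, nil
}

// cpuShare returns the fraction of all softirq counts that landed on cpu.
func cpuShare(counts map[string][]uint64, cpu int) float64 {
	var onCPU, total uint64
	for _, perCPU := range counts {
		for i, n := range perCPU {
			total += n
			if i == cpu {
				onCPU += n
			}
		}
	}
	if total == 0 {
		return 0
	}
	return float64(onCPU) / float64(total)
}

func main() {
	// Made-up sample in the /proc/softirqs layout, not data from our hosts.
	sample := `                    CPU0       CPU1
      NET_RX:        900        100
     TASKLET:        700        300
`
	counts, err := parseSoftirqs(sample)
	if err != nil {
		panic(err)
	}
	fmt.Printf("CPU0 share of softirqs: %.0f%%\n", 100*cpuShare(counts, 0))
}
```

The counters are cumulative since boot, so on a live machine it is more telling to take two readings a few seconds apart and compare the differences.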

At first we suspected an interrupt balancing problem, so we enabled irqbalance, but it didn't help. Unlike disk and network interrupts, the soft interrupt has only one queue and cannot be redistributed among the cores of the same host.


We compared every aspect of the system configuration between bare-metal hosts with and without this problem, and found from the dmesg log that ioat was the root cause. After reverting that change, the problem was resolved.
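
The comparison itself is simple to automate once each host's settings are collected as key/value pairs. Here is a minimal sketch; `diffSettings` and the setting names in the sample are hypothetical and only illustrate the approach.

```go
package main

import (
	"fmt"
	"sort"
)

// diffSettings returns the sorted keys whose values differ between two
// hosts, including keys that are present on only one of them.
func diffSettings(good, bad map[string]string) []string {
	var keys []string
	for k, v := range good {
		if bv, ok := bad[k]; !ok || bv != v {
			keys = append(keys, k)
		}
	}
	for k := range bad {
		if _, ok := good[k]; !ok {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	return keys
}

func main() {
	// Hypothetical settings gathered from a healthy and an affected host.
	healthy := map[string]string{"irqbalance": "on", "bios.ioat": "disabled"}
	affected := map[string]string{"irqbalance": "on", "bios.ioat": "enabled"}
	for _, k := range diffSettings(healthy, affected) {
		fmt.Printf("%s: %q vs %q\n", k, healthy[k], affected[k])
	}
}
```

With a short list of differing keys, each one can then be checked against the dmesg log of an affected host.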

Related links:
https://bugs.centos.org/view.php?id=8778
https://access.redhat.com/solutions/1409393
https://access.redhat.com/articles/879293