Tuesday, July 26, 2016

Hit a kernel bug related to Intel® I/O Acceleration Technology on a RHEL 7.1 system

Recently we tweaked some BIOS settings on our Dell R720 systems. Specifically, we enabled the "I/OAT DMA Engine" option listed on the Integrated Devices page, which is supposed to improve I/O performance when used with a recent kernel (not 2.6.x).



However, in our production environment we then received many alarms about CPU usage and slow I/O, all from VM instances running on bare-metal servers with I/OAT enabled. When we logged into one of those machines, we found that cpu0 was busy handling softirqs.
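
One way to confirm this kind of symptom is to sample `/proc/stat` twice and compare how much of each CPU's time went to softirq handling. The following is a minimal sketch, not the tool we actually used; the function names are hypothetical, and it assumes the standard per-CPU line layout of `/proc/stat` (user, nice, system, idle, iowait, irq, softirq, steal, ...).

```python
import time


def parse_cpu_times(stat_text):
    """Return {cpu_name: (softirq_jiffies, total_jiffies)} from /proc/stat text."""
    times = {}
    for line in stat_text.splitlines():
        fields = line.split()
        # Keep only per-CPU lines such as "cpu0 ..."; skip the aggregate "cpu" line.
        if len(fields) < 8 or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        values = [int(v) for v in fields[1:]]
        # values[6] is softirq time. Sum only the first eight columns, because
        # guest and guest_nice are already counted inside user and nice.
        times[fields[0]] = (values[6], sum(values[:8]))
    return times


def softirq_share(before, after):
    """Fraction of each CPU's elapsed time spent in softirq between two samples."""
    shares = {}
    for cpu, (soft_after, total_after) in after.items():
        if cpu not in before:
            continue
        soft_before, total_before = before[cpu]
        elapsed = total_after - total_before
        shares[cpu] = (soft_after - soft_before) / elapsed if elapsed > 0 else 0.0
    return shares


def sample_host(interval=1.0):
    """Take two samples of /proc/stat on the local machine (Linux only)."""
    with open("/proc/stat") as f:
        first = parse_cpu_times(f.read())
    time.sleep(interval)
    with open("/proc/stat") as f:
        second = parse_cpu_times(f.read())
    return softirq_share(first, second)
```

Calling `sample_host()` inside an affected VM and printing the result should show one CPU with a share far above the others if the load pattern matches what we saw.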

At first we suspected an interrupt-balancing problem, so we enabled irqbalance, but it didn't help. irqbalance only redistributes hardware interrupts such as disk and network IRQs. Softirq work, as far as we could tell, had only one queue here and was processed on the CPU where it was raised, so it could not be redistributed among the other cores of the same host.
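
To see whether softirq work really is concentrated on one core, the per-CPU counters in `/proc/softirqs` can be summarized. This is a rough sketch with hypothetical function names; it assumes the usual layout of that file, a header row of CPU names followed by one `NAME: count count ...` row per softirq type.

```python
def parse_softirqs(text):
    """Parse /proc/softirqs text into (cpu_names, {softirq_name: [count per CPU]})."""
    lines = [line for line in text.splitlines() if line.strip()]
    cpus = lines[0].split()
    counts = {}
    for line in lines[1:]:
        name, _, rest = line.partition(":")
        counts[name.strip()] = [int(v) for v in rest.split()]
    return cpus, counts


def cpu_share(counts, cpu_index):
    """Fraction of all softirqs (every type combined) handled by one CPU."""
    total = sum(sum(row) for row in counts.values())
    on_cpu = sum(row[cpu_index] for row in counts.values() if cpu_index < len(row))
    return on_cpu / total if total else 0.0


def read_local_softirqs():
    """Read and parse /proc/softirqs on the local machine (Linux only)."""
    with open("/proc/softirqs") as f:
        return parse_softirqs(f.read())
```

The counters are cumulative since boot, so comparing two readings taken a few seconds apart gives a better picture of the current distribution than a single reading.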


We then compared every aspect of the system configuration between bare-metal hosts with and without the problem, and the dmesg log pointed to ioat as the root cause. After we reverted the BIOS change, the problem was resolved.
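
The dmesg comparison can be automated along these lines. This is only an illustrative sketch with hypothetical function names, not the exact procedure we followed. It assumes `dmesg` output from each host has been saved as text, that lines may carry a `[seconds.micros]` timestamp prefix, and that the relevant messages contain the string "ioat" (the Linux driver is named `ioatdma`).

```python
import re

# Matches the "[    1.234567] " timestamp that dmesg puts at the start of a line.
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def normalize(dmesg_text):
    """Strip timestamps and blank lines so logs from two hosts are comparable."""
    lines = (TIMESTAMP.sub("", line).strip() for line in dmesg_text.splitlines())
    return [line for line in lines if line]


def only_in_second(first_text, second_text):
    """Messages present in the second log but absent from the first, in order."""
    seen = set(normalize(first_text))
    return [line for line in normalize(second_text) if line not in seen]


def ioat_lines(dmesg_text):
    """Messages that mention ioat, matched case-insensitively."""
    return [line for line in normalize(dmesg_text) if "ioat" in line.lower()]
```

Passing the log of a healthy host as `first_text` and the log of an affected host as `second_text` narrows the output to messages unique to the affected host, and `ioat_lines` filters those further.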

Related links:
https://bugs.centos.org/view.php?id=8778
https://access.redhat.com/solutions/1409393
https://access.redhat.com/articles/879293