Wednesday, August 12, 2015

On the effect of the two major journaling modes of the ext4 file system

There is a long-standing debate (lwn.net/Articles/327994, lwn.net/Articles/328363/) on whether data=writeback should be the default mount option for the ext4 filesystem. Even though the upstream kernel uses writeback as the default, RHEL 6.4 does not. The code snippet that controls the default journaling option is as follows:

switch (test_opt(sb, DATA_FLAGS)) {
    case 0:
        /* No mode set, assume a default based on the journal
         * capabilities: ORDERED_DATA if the journal can
         * cope, else JOURNAL_DATA
         */
        if (jbd2_journal_check_available_features
            (sbi->s_journal, 0, 0, JBD2_FEATURE_INCOMPAT_REVOKE))
            set_opt(sb, ORDERED_DATA);
        else
            set_opt(sb, JOURNAL_DATA);
        break;
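
The fallback in the excerpt above can be modelled in user space. The following is only a minimal sketch, not kernel code: the enum, the function names, and the idea of parsing a comma-separated option string are assumptions made for illustration. It shows that an explicit data= option wins, and that otherwise ordered mode is chosen when the journal supports revoke records.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel's mount flags (illustrative names). */
enum data_mode { DATA_JOURNAL, DATA_ORDERED, DATA_WRITEBACK };

/* Return non-zero if the token of length len starting at tok equals want. */
static int token_is(const char *tok, size_t len, const char *want)
{
    return len == strlen(want) && strncmp(tok, want, len) == 0;
}

/*
 * Pick the journaling mode for a comma-separated option string.
 * The first explicit data=... token wins. With no such token, fall
 * back as the excerpt does: ordered if the journal supports revoke,
 * otherwise full data journaling.
 */
enum data_mode pick_data_mode(const char *opts, int journal_supports_revoke)
{
    assert(opts != NULL);

    const char *p = opts;
    while (*p != '\0') {
        size_t len = strcspn(p, ",");   /* length of the current token */

        if (token_is(p, len, "data=journal"))
            return DATA_JOURNAL;
        if (token_is(p, len, "data=ordered"))
            return DATA_ORDERED;
        if (token_is(p, len, "data=writeback"))
            return DATA_WRITEBACK;

        p += len;
        if (*p == ',')
            p++;                        /* skip the separator */
    }
    return journal_supports_revoke ? DATA_ORDERED : DATA_JOURNAL;
}
```

For example, pick_data_mode("rw,noatime", 1) yields DATA_ORDERED, which matches the behaviour seen on RHEL when no data= option is given.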

So what is the point of using ORDERED as the default? RHEL targets high-end enterprise systems, where data consistency is critical. Looking at the write_end function in the ext4 source code, we find:

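/* In ordered mode, file the inode with the running jbd2 transaction
 * so that its data is written out before the transaction commits.
 */
if (ext4_test_inode_state(inode, EXT4_STATE_ORDERED_MODE)) {
    ret = ext4_jbd2_file_inode(handle, inode);
    if (ret) {
        unlock_page(page);
        page_cache_release(page);
        goto errout;
    }
}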


If ORDERED_MODE is set, every write(2) call files the inode with the jbd2 (journaling block device) layer, so that the inode's data blocks are written out before the transaction commits. The journal itself is placed in the middle of the disk (see get_midpoint_journal_block in mke2fs), so committing the transaction causes disk seeks in any case and increases IO latency.
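
One way to observe this cost is to time write(2) followed by fsync(2) on the same file under each journaling mode. The helper below is a rough sketch under stated assumptions: the function name and the probe file are hypothetical, and calling fsync(2) after every write forces a commit each time, so it measures a worst case rather than the cost of a plain buffered write.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/*
 * Write `iterations` buffers of `nbytes` each to `path`, calling fsync(2)
 * after every write(2), and return the mean latency of one write+fsync
 * pair in microseconds. Returns -1.0 on any error. The probe file is
 * removed before returning.
 */
double mean_write_fsync_usec(const char *path, size_t nbytes, int iterations)
{
    assert(path != NULL && nbytes > 0 && iterations > 0);

    char *buf = malloc(nbytes);
    if (buf == NULL)
        return -1.0;
    memset(buf, 'x', nbytes);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) {
        free(buf);
        return -1.0;
    }

    struct timespec start, end;
    double result = -1.0;
    int ok = 1;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < iterations && ok; i++) {
        /* A short write or a failed fsync aborts the measurement. */
        if (write(fd, buf, nbytes) != (ssize_t)nbytes || fsync(fd) != 0)
            ok = 0;
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    if (ok) {
        double elapsed_usec = (end.tv_sec - start.tv_sec) * 1e6
                            + (end.tv_nsec - start.tv_nsec) / 1e3;
        result = elapsed_usec / iterations;
    }

    close(fd);
    unlink(path);
    free(buf);
    return result;
}
```

Running it against a file on the same filesystem mounted once with data=ordered and once with data=writeback gives a rough comparison. The numbers depend heavily on the storage stack, so they are only meaningful relative to each other on the same machine.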


On our R720 server we use a battery-backed RAID card with 1 GB of write cache. This means that when a power failure occurs, the IO subsystem can survive for another 40 minutes and flush the write cache. The implication of this hardware spec for ext4 and jbd is that we can enable async_commit (the journal_async_commit mount option) and disable barriers to reduce the number of SYNC commands sent to the SCSI device, and so avoid unnecessary seeking altogether.
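
The mount options discussed here can be assembled as a string. The helper below is a small sketch: the function name and its parameters are hypothetical, while data=, journal_async_commit and barrier= are existing ext4 mount options. It assumes, as above, that the controller's write cache really does survive a power failure.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write an ext4 mount option string into buf.
 *   writeback    - non-zero selects data=writeback, zero data=ordered
 *   async_commit - non-zero adds journal_async_commit
 *   barrier      - zero emits barrier=0, non-zero barrier=1
 * Returns 0 on success, -1 if buf is too small.
 */
int build_ext4_opts(char *buf, size_t size,
                    int writeback, int async_commit, int barrier)
{
    assert(buf != NULL);

    int n = snprintf(buf, size, "data=%s%s,barrier=%d",
                     writeback ? "writeback" : "ordered",
                     async_commit ? ",journal_async_commit" : "",
                     barrier ? 1 : 0);

    /* snprintf reports the length it wanted; treat truncation as failure. */
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

With writeback and async_commit enabled and barriers disabled, the result is "data=writeback,journal_async_commit,barrier=0". The string can be given to mount -o, put in the options field of /etc/fstab, or passed as the data argument of mount(2).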