While running RHEL 6.4, the system hung, and the OS messages log recorded the call trace "WARNING: at lib/list_debug.c:26 __list_add+0x6d/0xa0()".
Feb 11 14:54:20 thorhead kernel: ------------[ cut here ]------------
Feb 11 14:54:20 thorhead kernel: WARNING: at lib/list_debug.c:26 __list_add+0x6d/0xa0() (Not tainted)
Feb 11 14:54:20 thorhead kernel: Hardware name: ProLiant DL380p Gen8
Feb 11 14:54:20 thorhead kernel: list_add corruption. next->prev should be prev (ffff88081acc9df0), but was ffff88081eff8218. (next=ffff88081eff8218).
Feb 11 14:54:20 thorhead kernel: Modules linked in: vfat fat usb_storage fuse mptctl mptbase nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 8021q garp stp llc sunrpc rdma_ucm(U) ib_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ipv6 ib_uverbs(U) ib_umad(U) mlx5_ib(U) mlx5_core(U) mlx4_ib(U) ib_sa(U) ib_mad(U) ib_core(U) mlx4_core(U) compat(U) knem(U) uinput hpwdt hpilo e1000e sg microcode serio_raw iTCO_wdt iTCO_vendor_support power_meter shpchp ext4 mbcache jbd2 sd_mod crc_t10dif hpsa pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Feb 11 14:54:20 thorhead kernel: Pid: 3728, comm: nfsd4 Not tainted 2.6.32-358.el6.x86_64 #1
Feb 11 14:54:20 thorhead kernel: Call Trace:
Feb 11 14:54:20 thorhead kernel: [<ffffffff8106e2e7>] ? warn_slowpath_common+0x87/0xc0
Feb 11 14:54:20 thorhead kernel: [<ffffffff8106e3d6>] ? warn_slowpath_fmt+0x46/0x50
Feb 11 14:54:20 thorhead kernel: [<ffffffff81288e7d>] ? __list_add+0x6d/0xa0
Feb 11 14:54:20 thorhead kernel: [<ffffffffa0379eca>] ? laundromat_main+0x23a/0x3f0 [nfsd]
Feb 11 14:54:20 thorhead kernel: [<ffffffffa0379c90>] ? laundromat_main+0x0/0x3f0 [nfsd]
Feb 11 14:54:20 thorhead kernel: [<ffffffff81090ac0>] ? worker_thread+0x170/0x2a0
Feb 11 14:54:20 thorhead kernel: [<ffffffff81096c80>] ? autoremove_wake_function+0x0/0x40
Feb 11 14:54:20 thorhead kernel: [<ffffffff81090950>] ? worker_thread+0x0/0x2a0
Feb 11 14:54:20 thorhead kernel: [<ffffffff81096916>] ? kthread+0x96/0xa0
Feb 11 14:54:20 thorhead kernel: [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
Feb 11 14:54:20 thorhead kernel: [<ffffffff81096880>] ? kthread+0x0/0xa0
Feb 11 14:54:20 thorhead kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Feb 11 14:54:20 thorhead kernel: ---[ end trace 8b6c6154af96cb5d ]---
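To check whether a host has logged the same warning, the messages log can be searched for the corruption line. The following is a minimal sketch, assuming syslog-format input with a 15-character timestamp as in the trace above. The function name find_list_corruption is hypothetical and not part of any Red Hat tool.

```shell
#!/bin/sh
# Hypothetical helper: read syslog text on stdin and, for every RHEL 6 style
# "list_add corruption" line, print the timestamp together with the pointer
# the kernel expected in next->prev, the pointer it found, and next itself.
find_list_corruption() {
    sed -n 's/^\(.\{15\}\) .*list_add corruption\. next->prev should be prev (\([0-9a-f]*\)), but was \([0-9a-f]*\)\. (next=\([0-9a-f]*\)).*$/\1 expected=\2 actual=\3 next=\4/p'
}
```

It could be run as find_list_corruption < /var/log/messages (the path is an assumption about where the kernel log is kept). For the trace above it reports actual and next as the same address, ffff88081eff8218, which means next->prev pointed back at next itself.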
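This symptom is related to NFSv4 operation (the trace shows the nfsd4 thread in laundromat_main). Red Hat solution 166583, quoted below, describes it, and its workaround can be applied.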
RHEL NFS server crashes due to corruption in the del_recall_lru list
Updated November 28 2013 at 1:26 AM
https://access.redhat.com/site/solutions/166583
Issue
•A RHEL 6 NFS server may log a message similar to:
list_add corruption. next->prev should be prev (ffff880818ab3df0), but was ffff88078a1a48d0. (next=ffff88078a1a48d0).
•Panic soon thereafter with a message similar to:
BUG: soft lockup - CPU#9 stuck for 67s! [nfsd4:5319]
•The process triggering the panic will be 'nfsd4' (also known as the 'laundromat thread').
•A RHEL 5 NFS server may log a message similar to:
list_add corruption. prev->next should be ffffffff88593e10, but was ffff811362f1b648
and immediately panic. The process triggering the panic will either be 'nfsd4' (also known as the 'laundromat thread'), or it will be one of the main 'nfsd' threads (in which case the nfsd_break_deleg_cb() function will probably be in the backtrace).
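The soft lockup message described above can be looked for in the same way. This is a rough sketch, assuming the message format quoted in the Issue section; the function name nfsd_soft_lockups is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print only the
# soft lockup lines whose stuck process is nfsd or nfsd4.
nfsd_soft_lockups() {
    grep -E 'BUG: soft lockup - CPU#[0-9]+ stuck for [0-9]+s! \[nfsd4?:[0-9]+\]'
}
```

A soft lockup reported for any other process is filtered out, so a match only indicates that the lockup happened in an nfsd thread, not that this particular bug is the cause.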
Root Cause
•Corruption of the del_recall_lru list, which is used to track NFSv4 delegation recalls.
On RHEL 6, this appears to put the code into an infinite loop from the corruption causing the list to never appear empty.
On RHEL 5, the machine panics as soon as the list corruption is uncovered (this is due to different logic in lib/list_debug.c).
Environment
•Red Hat Enterprise Linux 6
◦Seen on RHEL 6.1 and RHEL 6.3 kernels. Other kernel versions may be impacted as well.
•Red Hat Enterprise Linux 5
◦Seen on RHEL 5.10 kernel. Other kernel versions may be impacted as well.
Resolution
Workaround: Disable delegations on the RHEL NFS Server via:
# echo 0 >/proc/sys/fs/leases-enable
•RHEL 6 - Tracked by internal Red Hat Bugzilla 914772.
•RHEL 5 - Tracked by internal Red Hat Bugzilla 1028559.
•Contact your support representative for more information.
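A value written to /proc/sys does not survive a reboot, so it is worth confirming the current setting and deciding whether to make it persistent. The sketch below assumes the standard sysctl key fs.leases-enable, which corresponds to the /proc path in the workaround; the function name leases_status is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: report whether file leases (and therefore NFSv4
# delegations, per the workaround above) are enabled.
# $1 is optional and defaults to the real procfs path; it exists so the
# function can also be tried against an ordinary file.
leases_status() {
    f=${1:-/proc/sys/fs/leases-enable}
    case "$(cat "$f")" in
        0) echo disabled ;;
        1) echo enabled ;;
        *) echo unknown ;;
    esac
}

# To keep the setting across reboots (run as root, shown as comments so
# that nothing is changed by sourcing this file):
#   echo 'fs.leases-enable = 0' >> /etc/sysctl.conf
#   sysctl -p
```

Calling leases_status with no argument reads the live value. The setting applies to file leases in general, so its effect on other services using leases on the same host should be checked before it is made permanent.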
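The document below suggests that the issue may have been fixed in RHEL 6.5, but its details differ from this case (it lists xfs and ixgbe as possibly involved modules, neither of which appears in the trace above), so this is not certain.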
System became unresponsive after "WARNING: at lib/list_debug.c:26 __list_add+0x6d/0xa0()".
Updated February 21 2014 at 2:39 AM
https://access.redhat.com/site/solutions/408873
Environment
•Red Hat Enterprise Linux 6
•kernel-2.6.32-358.*
•possibly involved kernel modules: xfs, ixgbe
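Whether a given module was loaded at the time of a warning can be read from the "Modules linked in:" line of the trace. The following sketch assumes that line format as it appears in the log at the top of this note; module_in_trace is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if module $1 appears in the
# "Modules linked in:" line of the kernel log text read on stdin.
module_in_trace() {
    grep 'Modules linked in:' |
        # keep only the module list, drop the "[last unloaded: ...]" tail
        # and the "(U)"-style flag suffixes
        sed 's/.*Modules linked in: //; s/ \[last unloaded.*//; s/([A-Z]*)//g' |
        tr ' ' '\n' |
        grep -qxF -- "$1"
}
```

For example, module_in_trace xfs < /var/log/messages returns a non-zero status when xfs is not in the list (the log path is an assumption). Applied to the trace in this note, nfsd is found while xfs and ixgbe are not.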
Resolution
In the environment where this occurred, upgrading the kernel to 2.6.32-431.el6 was reported to fix the issue.
This kernel is already part of RHEL 6.5 GA.
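To see whether a host already runs the 2.6.32-431 kernel or later, the release string from uname -r can be compared against that build number. This is a minimal sketch, assuming RHEL 6 release strings of the form 2.6.32-N.*; both function names are hypothetical. It only tests the version and says nothing about whether that kernel resolves the nfsd case described in this note.

```shell
#!/bin/sh
# Hypothetical helper: print the build number of a RHEL 6 kernel release
# string, e.g. 2.6.32-358.el6.x86_64 -> 358. Prints nothing for strings
# that do not start with "2.6.32-<number>.".
rhel6_kernel_build() {
    printf '%s\n' "$1" | sed -n 's/^2\.6\.32-\([0-9][0-9]*\)\..*$/\1/p'
}

# Succeed (exit 0) if the release string is 2.6.32-431 or later.
is_431_or_later() {
    build=$(rhel6_kernel_build "$1")
    [ -n "$build" ] && [ "$build" -ge 431 ]
}
```

On a live system it could be called as is_431_or_later "$(uname -r)" && echo "431 or later". The kernel in the trace above, 2.6.32-358.el6.x86_64, does not pass this check.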