OS hang caused by TSC overflow when running RHEL on systems with SandyBridge or later CPUs
Request / Error / Symptom :
Once roughly 200 days have passed since the server's last cold boot (a reboot after a full power-off), servers running a kernel older than the fixed versions can hang
The following message is found in syslog, and the system hangs
[sched] x86: Avoid unnecessary overflow in sched_clock (...) [765720]
The dmesg log contains a stack trace similar to the following
INFO: task bash:12543 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
bash D 0000000000000012 0 12543 12542 0x00000084
ffff880c343b3ce8 0000000000000082 ffff880c343b3d98 ffffffffffffffe9
ffff880c343b3c88 ffffffffa00c9129 ffff880c343f4aa0 0000010100000015
ffff880c343f5058 ffff880c343b3fd8 000000000000fb88 ffff880c343f5058
Call Trace:
[<ffffffffa00c9129>] ? ext4_check_acl+0x29/0x90 [ext4]
[<ffffffffa008fbf0>] ? ext4_file_open+0x0/0x130 [ext4]
[<ffffffff8150ea05>] schedule_timeout+0x215/0x2e0
[<ffffffff8117e514>] ? nameidata_to_filp+0x54/0x70
[<ffffffff81277379>] ? cpumask_next_and+0x29/0x50
[<ffffffff8150e683>] wait_for_common+0x123/0x180
[<ffffffff81063310>] ? default_wake_function+0x0/0x20
[<ffffffff8150e79d>] wait_for_completion+0x1d/0x20
[<ffffffff8106513c>] sched_exec+0xdc/0xe0
[<ffffffff8118a0a0>] do_execve+0xe0/0x2c0
[<ffffffff810095ea>] sys_execve+0x4a/0x80
[<ffffffff8100b4ca>] stub_execve+0x6a/0xc0
Root Cause
Because of a design issue in SandyBridge and later CPUs, the TSC (Time Stamp Counter) value inside the CPU is not reset by a warm boot
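The roughly 200-day figure in the symptom description can be reproduced with some arithmetic. The sketch below assumes that the pre-fix x86 sched_clock path converts TSC cycles to nanoseconds with a 10-bit fixed-point scale factor in a 64-bit intermediate, so the product overflows once the nanosecond value reaches 2^54. The constant and function names are illustrative and are not taken from the kernel source or from this note.

```python
# Assumption: the pre-fix conversion computes (cycles * scale) >> 10 in 64 bits,
# with scale ~= (10**6 << 10) / cpu_khz. The product is then roughly
# nanoseconds << 10, which stops fitting in 64 bits once ns reaches 2**54.
SCALE_SHIFT = 10   # illustrative name for the fixed-point shift
WORD_BITS = 64


def overflow_threshold_ns(word_bits=WORD_BITS, shift=SCALE_SHIFT):
    """Nanoseconds of TSC runtime at which the intermediate product overflows."""
    return 2 ** (word_bits - shift)


def ns_to_days(ns):
    """Convert nanoseconds to days."""
    return ns / 1e9 / 86400


def days_until_overflow(days_since_cold_boot):
    """Days left before the threshold; warm reboots do not reset the count."""
    return max(0.0, ns_to_days(overflow_threshold_ns()) - days_since_cold_boot)


if __name__ == "__main__":
    print(f"threshold: {ns_to_days(overflow_threshold_ns()):.1f} days")
    print(f"remaining after 150 days: {days_until_overflow(150):.1f} days")
```

Because the TSC keeps counting across warm reboots, the value to pass to `days_until_overflow` is the time since the last cold boot, not the uptime reported by the OS.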
Systems with Intel® Xeon® Processor E5, Intel® Xeon® Processor E5 v2, or Intel® Xeon® Processor E7 v2 and certain versions of Red Hat Enterprise Linux 6 kernels become unresponsive/hung or incur a kernel panic
https://access.redhat.com/solutions/433883
Root Cause
On Intel® Xeon® Processor E5 Family 6 Model 45 (also known as SandyBridge), the Time Stamp Counter (TSC) is not cleared by a warm reset. This is documented in the Intel® Xeon® Processor E5 Family Specification Update as erratum BT81.
On Intel® Xeon® Processor E5 v2 Family 6 Model 62 (also known as IvyBridge), the Time Stamp Counter (TSC) is not cleared by a warm reset. This is documented in the Intel® Xeon® Processor E5 v2 Family Specification Update as erratum CA105.
On Intel® Xeon® Processor E7 v2 Family 6 Model 62 (also known as IvyBridge-EX), the Time Stamp Counter (TSC) is not cleared by a warm reset. This is documented in the Intel® Xeon® E7-2800/4800/8800 v2 Product Family Specification Update as erratum CF101.
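To check whether a given host falls under these errata, the family and model numbers quoted above can be compared with the contents of /proc/cpuinfo. The following is a minimal sketch, assuming the usual x86 /proc/cpuinfo layout with `cpu family` and `model` fields; the helper names and the sample text are hypothetical.

```python
# CPU (family, model) pairs named in the root-cause text above.
AFFECTED_MODELS = {
    (6, 45): "SandyBridge (Xeon E5)",
    (6, 62): "IvyBridge / IvyBridge-EX (Xeon E5 v2, Xeon E7 v2)",
}


def parse_cpuinfo(text):
    """Return (family, model) from the first processor entry of cpuinfo text."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            if fields:
                break          # a blank line ends the first processor entry
            continue
        fields.setdefault(key.strip(), value.strip())
    return int(fields["cpu family"]), int(fields["model"])


def affected_cpu(text):
    """Return a description if the CPU is in the list above, else None."""
    return AFFECTED_MODELS.get(parse_cpuinfo(text))


if __name__ == "__main__":
    # Hypothetical sample; on a real host pass open("/proc/cpuinfo").read().
    sample = "processor\t: 0\ncpu family\t: 6\nmodel\t\t: 45\nmodel name\t: example\n"
    print(affected_cpu(sample) or "not in the affected list")
```

The check covers only the models listed in the quoted article; a CPU missing from the table is not thereby confirmed to be unaffected.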
Resolution
This issue is addressed in the following kernel updates:
• RHEL 6.5 - kernel-2.6.32-431.el6.
This package is available via Errata RHSA-2013:1645. The related Red Hat Private Bug is 975507.
• RHEL 6.4.z EUS - kernel-2.6.32-358.23.2.el6.
This package is available via Errata RHSA-2013:1436. The related Red Hat Private Bug is 1001954.
• RHEL 6.3.z EUS - kernel-2.6.32-279.37.2.el6.
This package is available via Errata RHSA-2013:1450. The related Red Hat Private Bug is 1004185.
• RHEL 6.2.z EUS - kernel-2.6.32-220.45.1.el6.
This package is available via Errata RHSA-2013:1519. The related Red Hat Private Bug is 1024453.
As a workaround, perform a cold reboot (full power-off) instead of a warm reboot; a cold reboot clears the TSC.
Action Plan 1. - Workaround
What: Cold Boot
Why : A cold boot resets the TSC
Action Plan 2.
What: Kernel upgrade
Why : The updated kernels contain the bug fix for the TSC overflow
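When planning the upgrade, a running kernel release string (as printed by `uname -r`) can be compared with the fixed builds listed in the Resolution section. The sketch below is a simplified comparison, not RPM's own version-comparison algorithm; the function names are hypothetical, and treating every build newer than 431 as fixed is an assumption that later RHEL 6 kernels carry the fix forward.

```python
import re

# Fixed kernel builds from the Resolution section, keyed by the first
# build number after "2.6.32-" (one key per RHEL 6 minor stream).
FIXED_BUILDS = {
    220: (220, 45, 1),   # RHEL 6.2.z EUS
    279: (279, 37, 2),   # RHEL 6.3.z EUS
    358: (358, 23, 2),   # RHEL 6.4.z EUS
    431: (431,),         # RHEL 6.5
}


def parse_release(release):
    """Split e.g. '2.6.32-358.23.2.el6.x86_64' into (358, 23, 2)."""
    m = re.match(r"2\.6\.32-([\d.]+?)\.el6", release)
    if not m:
        raise ValueError(f"not a RHEL 6 kernel release: {release!r}")
    return tuple(int(part) for part in m.group(1).split("."))


def has_fix(release):
    """True or False for listed streams; None when the stream is not listed."""
    build = parse_release(release)
    if build[0] > 431:
        return True      # assumption: later builds carry the fix forward
    fixed = FIXED_BUILDS.get(build[0])
    if fixed is None:
        return None      # stream not covered by the list above
    return build >= fixed


if __name__ == "__main__":
    for rel in ("2.6.32-358.18.1.el6.x86_64", "2.6.32-431.el6.x86_64"):
        print(rel, has_fix(rel))
```

A `None` result means the note gives no fixed build for that stream, so the errata pages should be consulted directly.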