MCE Log Entries While Running Linux on ProLiant

by 스쳐가는인연 2018. 9. 14.

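MCE log entries are recorded by EDAC, the OS-level memory monitoring facility, which is less precise than the hardware's own memory monitoring. Occasionally the sensitive EDAC engine in the OS logs an error even though no actual fault exists.

When such a message appears, cross-check the hardware status (IML, front LED). If the hardware reports no problem, either ignore the message or disable MCE detection in the OS.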
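As a rough first step before checking the IML, the number of machine-check notices the kernel has written to the system log can be counted. This is only a sketch: the log path /var/log/messages and the message text "Machine check events logged" are assumptions that vary by distribution and kernel version, and count_mce_events is a hypothetical helper name.

```shell
#!/bin/sh
# count_mce_events [LOGFILE]
# Prints how many lines of LOGFILE contain the kernel's machine-check notice.
# Assumption: the default path /var/log/messages and the message text below
# match this system; adjust both if the distribution logs elsewhere.
count_mce_events() {
    grep -c 'Machine check events logged' "${1:-/var/log/messages}"
}
```

Running `count_mce_events` with no argument reads the default log; a count above zero only means the OS logged events, so the IML and front LED still decide whether the hardware is at fault.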

 

Advisory: (Revision) ProLiant G6, G7, Gen8 and Gen9 Servers - Correctable Machine Check Errors That Do Not Require Customer Action May Erroneously Be Logged to the Operating System Error Logs

https://support.hpe.com/hpesc/public/docDisplay?docId=emr_na-c03356780

 

The System ROMs on HP ProLiant servers are designed to monitor these errors and to report to the customer through the Integrated Management Log (IML) and other means (such as the health LED) if there is an issue with any hardware component in the system.

 

Notice: (Revision) Linux - To Ensure Efficient Firmware First Handling of Memory Failures HPE Recommends Booting With the mce=ignore_ce Boot Parameter in Addition to Disabling EDAC

https://support.hpe.com/hpesc/public/docDisplay?docId=emr_na-a00016026en_us

https://support.hpe.com/hpesc/public/docDisplay?docId=emr_na-c04183538

 

HPE recommends disabling EDAC in addition to disabling the correctable error detection functionality of the Linux kernel's Machine Check Event (MCE) handling.

 

Should EDAC modules be disabled on HP Proliant hardware, as recommended by HP?

https://access.redhat.com/solutions/414723

 

Individual hardware vendors then advise customers to enable or disable this general purpose feature as appropriate, depending on compatibility with their tailored error detection offerings.

 

Erroneous MCE taint on Some CPU Processors
https://www.novell.com/support/kb/doc.php?id=7008578
 

 

Related technical documents from IBM and Dell

Interpreting /var/log/mcelog on IMM based servers - IBM System x3850 X5, x3950 X5

http://www-947.ibm.com/support/entry/portal/docdisplay?lndocid=migr-5084973

 

Workaround

Do not use the Linux MCE daemon.

 

Special consideration when using Linux error detection and correction (EDAC) for tracking memory errors - Lenovo Server
https://datacentersupport.lenovo.com/us/en/products/servers/system-x/solutions/ht107942-special-consideration-when-using-linux-error-detection-and-correction-edac-for-tracking-memory-errors-lenovo-server

Solution
Lenovo recommends disabling Linux EDAC and Linux kernel's Machine Check Event (MCE) handling functionality in order to provide accurate Dual In-line Memory Module (DIMM) error reporting which is tracked by the system's management independent of the Operating System. Hardware, including memory DIMMs, will not be replaced under warranty based on EDAC /var/log/message errors. After EDAC modules are disabled, system diagnostics such as BMC, IMM, XCC will be used to verify memory problems.

 

M620 Blade Memory issues

http://en.community.dell.com/support-forums/servers/f/956/t/19535045.aspx

It would be much better to disable EDAC and let the BMC handle error reporting and logging of the hardware

 

 

Action Plan 1.

What: Check the power settings

Why: Verify the recommended power settings for ProLiant servers

To disable C-states, perform the following steps in the RBSU during POST:

Press F9 during POST to access the RBSU.

Select Power Management Options, then select HP Power Profile and change the default value to Maximum Performance or Custom.

Return to the previous menu.

Select HP Power Regulator and change the default value to HP Static High Performance Mode or OS Control Mode.

Return to the previous menu.

Select Minimum Processor Idle Power State and change the default value to No C-states.

Select Advanced Power Management Options, then Minimum Processor Idle Power Package State, and change the setting from Package C3 State to No Package State.

Select Advanced Power Management Options, then Collaborative Power Control, and change the setting from Enabled to Disabled.
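After rebooting, the idle states that Linux actually exposes can be listed from sysfs as a loose cross-check of the RBSU settings. This is a minimal sketch, assuming the kernel exposes the cpuidle tree under /sys/devices/system/cpu/cpu0/cpuidle; list_cstates is a hypothetical helper, and which states appear depends on the idle driver in use.

```shell
#!/bin/sh
# list_cstates [SYSFS_ROOT]
# Prints the name of every idle state the kernel exposes for cpu0, one per line.
# Assumption: the cpuidle sysfs tree exists; if it does not, nothing is printed.
list_cstates() {
    root=${1:-/sys}
    for f in "$root"/devices/system/cpu/cpu0/cpuidle/state*/name; do
        # An unmatched glob stays literal, so skip anything unreadable.
        if [ -r "$f" ]; then
            cat "$f"
        fi
    done
    return 0
}
```

If deep states such as C3 or C6 still show up after the change, the kernel may be using its own idle driver instead of the BIOS setting.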

  

Action Plan 2.

What: Disable or ignore mcelog

Why: To stop the MCE log messages

To do.

1) Disable

    Disable EDAC if it is running.

    a. Search for loaded EDAC modules

       # lsmod | grep edac

    b. For each EDAC module (if any found):

       Add the following to /etc/modprobe.conf on OS releases that support /etc/modprobe.conf

       alias edac_xxx off  (edac_xxx is the module name reported by lsmod)

       Add the following to /etc/modprobe.d/blacklist.conf on OS releases that support /etc/modprobe.d/blacklist.conf

       blacklist edac_xxx  (edac_xxx is the module name reported by lsmod)
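Steps a and b can be combined into a small filter that turns lsmod output into blacklist lines. This is a sketch under assumptions: edac_blacklist is a hypothetical helper name, and it treats any module whose name contains "edac" as an EDAC module, so the output should be reviewed before it is added to a configuration file.

```shell
#!/bin/sh
# edac_blacklist
# Reads `lsmod` output on stdin and prints one "blacklist <module>" line
# for every module whose name contains "edac".
# NR > 1 skips the "Module  Size  Used by" header line.
edac_blacklist() {
    awk 'NR > 1 && $1 ~ /edac/ { print "blacklist " $1 }'
}
```

A typical call would be `lsmod | edac_blacklist`, with the reviewed result appended by hand to /etc/modprobe.d/blacklist.conf.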

 

       RHEL

       Add the following entry to the kernel line in "/boot/grub/grub.conf"

       mce=ignore_ce

 

       How do I disable MCE function?

       https://access.redhat.com/site/solutions/367773

 

       SLES

       Add the following entry to the kernel line in "/boot/grub/menu.lst"

       mce=ignore_ce
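The edit to grub.conf or menu.lst can be previewed with a filter that appends the parameter to every kernel line that lacks it. This is a minimal sketch, assuming the legacy GRUB file format named above, where boot entries contain lines starting with "kernel"; add_kernel_param is a hypothetical helper, and it writes to stdout only, so the boot configuration is never modified in place.

```shell
#!/bin/sh
# add_kernel_param PARAM
# Reads a legacy GRUB config on stdin and writes it to stdout, appending
# PARAM to each "kernel" line that does not already carry it.
# All other lines pass through unchanged.
add_kernel_param() {
    awk -v p="$1" '
        /^[ \t]*kernel[ \t]/ {
            present = 0
            for (i = 1; i <= NF; i++)
                if ($i == p) present = 1
            if (!present) $0 = $0 " " p
        }
        { print }
    '
}
```

For example, `add_kernel_param mce=ignore_ce < /boot/grub/grub.conf > /tmp/grub.conf.new` produces a candidate file that can be compared with the original using diff before it is copied into place.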

 

2) Ignore

 

cf. 

mce=ignore_ce

Prevents Linux from polling the machine check banks for correctable errors every five minutes.

 

intel_idle.max_cstate=0

prevents the kernel from overriding the BIOS C-state setting.
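After a reboot, whether the running kernel was actually started with these two parameters can be checked against /proc/cmdline. This is a sketch: has_param and report_params are hypothetical helper names, and the optional file argument exists only so that a saved command line can be checked instead of the live one.

```shell
#!/bin/sh
# has_param PARAM [CMDLINE_FILE]
# Succeeds if PARAM appears as a whole word in the kernel command line.
has_param() {
    tr -s ' \t' '\n\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}

# report_params [CMDLINE_FILE]
# Prints "present" or "missing" for each parameter discussed above.
report_params() {
    for p in mce=ignore_ce intel_idle.max_cstate=0; do
        if has_param "$p" "${1:-/proc/cmdline}"; then
            echo "$p: present"
        else
            echo "$p: missing"
        fi
    done
}
```

Calling `report_params` with no argument checks the currently running kernel.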

 

 

Related documents

HP ProLiant BL680c G7 Server Series - SUSE Enterprise Linux 11 Service Pack 1: "mcelog" Shows Corrected MCE Errors
https://support.hpe.com/hpesc/public/docDisplay?docId=emr_na-c03418028

 

Machine check
https://www.kernel.org/doc/Documentation/x86/x86_64/boot-options.txt

 

 

 

 

 

 

 
