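MCE-related log entries are recorded by EDAC, the operating system's memory-monitoring feature, which is less precise than the hardware's own memory monitoring. EDAC's sensitive detection engine occasionally logs errors even when no actual fault exists.
When such a message appears, cross-check it against the hardware information (IML, front LED). If the hardware reports no problem, either ignore the message or disable the OS's MCE detection.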
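Before cross-checking against the IML, it can help to see how often the OS is actually logging these events. The sketch below counts kernel MCE notifications in a syslog-style file. It assumes the kernel reports each batch with the text "Machine check events logged" and that the log is a plain file such as /var/log/messages; the function name count_mce_events and the log path are illustrative and not part of the original note.

```shell
# Count kernel MCE notifications in a syslog-style log file.
# Assumption: each notification contains "Machine check events logged".
count_mce_events() {
    log_file="$1"
    # grep -c prints the number of matching lines; it exits non-zero
    # when the count is 0, so mask that status for callers.
    grep -c 'Machine check events logged' "$log_file" || true
}
```

A call such as `count_mce_events /var/log/messages` prints a single number. A non-zero count with a clean IML and a normal health LED matches the situation described above.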
Advisory: (Revision) ProLiant G6, G7, Gen8 and Gen9 Servers - Correctable Machine Check Errors That Do Not Require Customer Action May Erroneously Be Logged to the Operating System Error Logs
The System ROMs on HP ProLiant servers are designed to monitor these errors and to report to the customer through the Integrated Management Log (IML) and other means (such as the health LED) if there is an issue with any hardware component in the system.
Notice: (Revision) Linux - To Ensure Efficient Firmware First Handling of Memory Failures HPE Recommends Booting With the mce=ignore_ce Boot Parameter in Addition to Disabling EDAC
HPE recommends disabling EDAC in addition to disabling the correctable error detection functionality of the Linux kernel's Machine Check Event (MCE) handling.
Should EDAC modules be disabled on HP Proliant hardware, as recommended by HP?
Individual hardware vendors then advise customers to enable or disable this general purpose feature as appropriate, depending on compatibility with their tailored error detection offerings.
Erroneous MCE taint on Some CPU Processors
Related technical documents from IBM and Dell
Interpreting /var/log/mcelog on IMM based servers - IBM System x3850 X5, x3950 X5
Do not use the Linux MCE daemon.
Special consideration when using Linux error detection and correction (EDAC) for tracking memory errors - Lenovo Server
Lenovo recommends disabling Linux EDAC and Linux kernel's Machine Check Event (MCE) handling functionality in order to provide accurate Dual In-line Memory Module (DIMM) error reporting which is tracked by the system's management independent of the Operating System. Hardware, including memory DIMMs, will not be replaced under warranty based on EDAC /var/log/message errors. After EDAC modules are disabled, system diagnostics such as BMC, IMM, XCC will be used to verify memory problems.
M620 Blade Memory issues
It would be much better to disable EDAC and let the BMC handle error reporting and logging of the hardware.
Action Plan 1.
What: Check the power settings
Why: Verify the recommended power settings for ProLiant servers
•To disable C-states, perform the following steps in the RBSU during POST:
◦Press F9 during POST to access the RBSU.
◦Select Power Management Options, then select HP Power Profile and change the default value to Maximum Performance or Custom.
◦Return to the previous menu.
◦Select HP Power Regulator and change the default value to HP Static High Performance Mode or OS Control Mode.
◦Return to the previous menu.
◦Select Minimum Processor Idle Power State and change the default value to No C-states.
◦Select Advanced Power Management Options, then select Minimum Processor Idle Power Package State and change the setting from Package C3 State to No Package State.
◦Select Advanced Power Management Options, then select Collaborative Power Control and change the setting from Enabled to Disabled.
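After changing the RBSU settings and rebooting, one way to sanity-check the result from the OS side is to list the idle states the kernel exposes for a CPU. The sketch below assumes the standard cpuidle sysfs layout (stateN/name under /sys/devices/system/cpu/cpu0/cpuidle); the function name list_cpu_idle_states is illustrative. If deep C-states still appear despite the BIOS setting, the kernel's idle driver may be overriding it.

```shell
# Print the idle states the kernel exposes for one CPU,
# one "stateN NAME" pair per line.
# Assumption: the cpuidle sysfs layout (stateN/name) exists on this kernel.
list_cpu_idle_states() {
    cpuidle_dir="${1:-/sys/devices/system/cpu/cpu0/cpuidle}"
    for state_dir in "$cpuidle_dir"/state*; do
        # Skip the unexpanded glob when no state directories exist.
        [ -f "$state_dir/name" ] || continue
        printf '%s %s\n' "${state_dir##*/}" "$(cat "$state_dir/name")"
    done
}
```

Running `list_cpu_idle_states` with no argument reads the default sysfs path; passing a directory makes it easy to try the function against a copied or mock tree.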
Action Plan 2.
What: Disable or ignore mcelog
Why: To stop the MCE log messages from being generated
Disable EDAC if it is running.
a. Search for loaded EDAC modules
# lsmod | grep edac
b. For each EDAC module (if any found):
Add the following to /etc/modprobe.conf on OS releases that support /etc/modprobe.conf:
alias edac_xxx off (edac_xxx is the module name reported by lsmod)
Add the following to /etc/modprobe.d/blacklist.conf on OS releases that support /etc/modprobe.d/blacklist.conf:
blacklist edac_xxx (edac_xxx is the module name reported by lsmod)
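Steps a and b can be combined into a small helper that turns lsmod output into ready-made blacklist lines. This is a minimal sketch: the function name edac_blacklist_lines is illustrative, and it assumes every module whose name contains "edac" should be blacklisted. Review the output before appending it to any configuration file.

```shell
# Read `lsmod` output on stdin and print one "blacklist <module>" line
# for every loaded module whose name contains "edac".
# The lsmod header line is ignored because its first field is "Module".
edac_blacklist_lines() {
    awk '$1 ~ /edac/ { print "blacklist " $1 }'
}
```

For example, `lsmod | edac_blacklist_lines` prints the candidate lines for /etc/modprobe.d/blacklist.conf; no output means no EDAC module is currently loaded.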
Add the following entry to "/boot/grub/grub.conf"
How do I disable MCE function?
Add the following entry to "/boot/grub/menu.lst"
Prevents Linux from initiating a poll every five minutes of the machine check banks for correctable errors.
Prevents the kernel from overriding the BIOS C-state setting.
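After editing the boot loader configuration and rebooting, it is worth confirming that the parameter actually reached the kernel. The exact entries are not shown in this note, so the sketch below assumes the parameter in question is mce=ignore_ce, as named in the HPE notice quoted earlier. The function name has_kernel_param is illustrative.

```shell
# Return success if the given parameter appears as a whole token in the
# kernel command line file (default: /proc/cmdline).
# Assumption: parameters are separated by spaces or tabs.
has_kernel_param() {
    param="$1"
    cmdline_file="${2:-/proc/cmdline}"
    # Split the command line into one token per line, then look for an
    # exact, fixed-string match of the whole token.
    tr -s ' \t' '\n\n' < "$cmdline_file" | grep -Fqx -- "$param"
}
```

A check such as `has_kernel_param mce=ignore_ce && echo "parameter active"` reports whether the running kernel was booted with the parameter. The whole-token match avoids false positives from parameters that merely share a prefix.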
HP ProLiant BL680c G7 Server Series - SUSE Enterprise Linux 11 Service Pack 1: "mcelog" Shows Corrected MCE Errors