Uncorrectable PCIe errors (uPCIe Err) may occur on systems configured with AMD EPYC 7xx2 (Rome) or 7xx3 (Milan) processors
HW: Apollo 6500 Gen10 Plus (XL675d Gen10 Plus) + NVIDIA A100-SXM4-80GB
Symptom: While the system is running, a uPCIe Err occurs on an I/O device, and the Error Status indicates "Completion Timeout" or "Malformed TLP", as in the examples below.
e.g.
Uncorrectable PCI Express Error Detected. Slot 3 (Segment 0x0, Bus 0x43, Device 0x0, Function 0x0). Uncorrectable Error Status: 0x40000
Uncorrectable PCI Express Error Detected. Slot 3 (Segment 0x0, Bus 0x43, Device 0x0, Function 0x0). Uncorrectable Error Status: 0x44000
Uncorrectable PCI Express Error Detected. Slot 7 (Segment 0x0, Bus 0xCB, Device 0x0, Function 0x0). Uncorrectable Error Status: 0x4000
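The status values in the messages above appear to follow the bit layout of the PCIe Advanced Error Reporting (AER) Uncorrectable Error Status register, where bit 14 is Completion Timeout and bit 18 is Malformed TLP. The sketch below decodes such a value as a reading aid. The function name decode_uncorrectable_status is hypothetical, the table covers only the commonly defined bits, and none of this comes from the HPE advisory.

```shell
#!/bin/sh
# Decode an Uncorrectable Error Status value (e.g. 0x44000) into bit names.
# Assumption: the value uses the PCIe AER Uncorrectable Error Status layout.
decode_uncorrectable_status() {
    status=$(( $1 ))    # arithmetic expansion accepts hex input
    while read -r bit name; do
        if [ $(( $status & (1 << $bit) )) -ne 0 ]; then
            printf 'bit %d: %s\n' "$bit" "$name"
        fi
    done <<'EOF'
4 Data Link Protocol Error
5 Surprise Down Error
12 Poisoned TLP
13 Flow Control Protocol Error
14 Completion Timeout
15 Completer Abort
16 Unexpected Completion
17 Receiver Overflow
18 Malformed TLP
19 ECRC Error
20 Unsupported Request Error
21 ACS Violation
EOF
}

decode_uncorrectable_status 0x44000
```

For 0x44000 this prints "bit 14: Completion Timeout" and "bit 18: Malformed TLP", which matches the symptom description: 0x4000 alone is a completion timeout, 0x40000 alone is a malformed TLP, and 0x44000 is both.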
Recommended Action:
Apply the following settings:
1) Press <F9> during boot to enter the BIOS configuration.
2) Change the Workload Profile to "Custom".
3) Disable Infinity Fabric Power Management.
a. System Utilities > System Configuration > BIOS/Platform Configuration (RBSU) > Power and Performance Options > Advanced Power Options.
b. Select "Disabled" >> Save(<F10>)
4) Set the AMD Infinity Fabric Performance State to P0.
a. System Utilities > System Configuration > BIOS/Platform Configuration (RBSU) > Power and Performance Options > Infinity Fabric Performance State.
b. Select P0 >> Save(<F10>)
5) Disable AMD C-state efficiency mode.
a. System Utilities > System Configuration > BIOS/Platform Configuration (RBSU) > Power and Performance Options > C-State Efficiency Mode.
b. Select "Disable" >> Save(<F10>)
6) Disable Data Fabric C-states.
a. System Utilities > System Configuration > BIOS/Platform Configuration (RBSU) > Power and Performance Options > Data Fabric C-State Enable.
b. Select "Disabled" >> Save(<F10>)
7) Configure NBIO LCLK DPM Level.
a. System Utilities > System Configuration > BIOS/Platform Configuration (RBSU) > Power and Performance Options > I/O Options > NBIO LCLK DPM Level.
b. Select "Static High".
c. Repeat for each NBIO LCLK.
d. Save(<F10>)
8) Disable Active State Power Management.
a. System Utilities > System Configuration > BIOS/Platform Configuration (RBSU) > PCIe Device Configuration > PCIe Power Management (ASPM)
b. Select "Disable" >> Save(<F10>)
9) Disable Access Control Service
a. System Utilities > System Configuration > BIOS/Platform Configuration (RBSU) > Virtualization Options > Access Control Service
b. Select "Disable" >> Save(<F10>)
c. In the Linux OS, disable ACS with the following command:
# for i in $(lspci | cut -f 1 -d " "); do setpci -v -s $i ecap_acs+6.w=0; done
Note. Executing the command may produce output indicating that it cannot be executed for some PCIe devices. This is expected behavior.
10) Set the minimum C-state.
a. System Utilities > System Configuration > BIOS/Platform Configuration (RBSU) > Power and Performance Options > Minimum Processor Idle Power Core C-State.
b-1. If the "cpupower" package is installed in the Linux OS, select "C6-state", save (<F10>), and then run the following command in the OS:
# cpupower idle-set -d 2
b-2. If the "cpupower" package is not used, select "No-Cstates" and save (<F10>).
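To confirm that the cpupower command in step 10) b-1 took effect, the Linux cpuidle sysfs entries can be inspected. The following is a minimal sketch, assuming the standard /sys/devices/system/cpu/cpuN/cpuidle/stateM/disable files are present. The function name check_idle_state_disabled and its two optional arguments (sysfs root, state index) are hypothetical.

```shell
#!/bin/sh
# Report every CPU whose idle state <index> is still enabled.
# Returns 0 if the state is disabled everywhere (or not present), 1 otherwise.
check_idle_state_disabled() {
    root=${1:-/sys/devices/system/cpu}
    index=${2:-2}               # matches "cpupower idle-set -d 2"
    rc=0
    for f in "$root"/cpu[0-9]*/cpuidle/state"$index"/disable; do
        [ -e "$f" ] || continue # glob did not match: no such state
        if [ "$(cat "$f")" != "1" ]; then
            echo "still enabled: $f"
            rc=1
        fi
    done
    return $rc
}

# Usage (run on the target server):
#   check_idle_state_disabled && echo "idle state 2 disabled on all CPUs"
```

The check reads the kernel's own per-CPU flag rather than parsing cpupower output, so it also works in a boot script that only needs a pass/fail result.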
Note. The OS-level commands in steps 9) and 10) are not persistent. Add them to a startup script so that they are executed again after every reboot.
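A startup script for the OS-level commands in steps 9) and 10) could look like the sketch below. It reuses the two commands from this note unchanged. The DRY_RUN variable, the "apply" argument and the function names are hypothetical, and how the script is hooked into the boot sequence (for example a systemd oneshot unit or rc.local) depends on the distribution.

```shell
#!/bin/sh
# Re-apply the OS-level settings from steps 9) and 10) at boot.
# DRY_RUN=1 prints the commands instead of executing them.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 9) c: clear the ACS control register on each given device.
# Failures on devices without an ACS capability are expected and ignored.
disable_acs() {
    for dev in "$@"; do
        run setpci -v -s "$dev" ecap_acs+6.w=0 || true
    done
}

# Step 10) b-1: disable idle state index 2 (only if cpupower is in use).
disable_idle_state() {
    run cpupower idle-set -d 2
}

# Act only when called with "apply", so the file can also be sourced.
if [ "${1:-}" = "apply" ]; then
    # Word splitting of the lspci output is intentional: one address per word.
    disable_acs $(lspci | cut -f 1 -d " ")
    disable_idle_state
fi
```

Running the script as root with the argument "apply" executes both steps. Setting DRY_RUN=1 first shows the exact setpci and cpupower invocations without touching the hardware.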
Note. For 7xx2-series processors, apply the settings listed above.
Note. For 7xx3-series processors, first update to System ROM 3.00 or later, and apply the settings listed above only if the symptom persists.
Reference:
Advisory: HPE ProLiant Gen10 Plus/Gen10 Plus V2 Servers and Apollo Gen10 Plus Servers - Uncorrectable PCIe Bus Errors May Occur On Systems Configured with an AMD EPYC 7xx2- or 7xx3-Series Processor
https://support.hpe.com/hpesc/public/docDisplay?docId=a00140808en_us