NVIDIA - Understanding RMA Conditions for GPU Memory ECC Errors

스쳐가는인연 2026. 5. 16. 15:52

Note. These are study notes on how to decide whether a GPU needs to be replaced after ECC errors occur in NVIDIA GPU memory.

Note. For NVIDIA GPUs, even when a server hardware vendor sells the part, replacement conditions follow NVIDIA's own policy.
Note. NVIDIA's policy may change, so always check the latest version of NVIDIA's guidance.

NVIDIA's official guide to memory error handling:
NVIDIA GPU Memory Error Management
https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/latest/index.html

RMA Policy
https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/latest/rma-policy-thresholds-for-row-remapping.html
https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/latest/sram-uncorrectable-errors.html

GPU DRAM Memory RMA Policy
Any of the following events will trigger a row-remapping failure flag:
- A remapping attempt for an uncorrectable memory error on a bank that already has eight uncorrectable error rows remapped.
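> Replace when the same bank has experienced 9 uncorrectable memory errors
(Can be understood as the case where the spare space is exhausted and no further address remapping is possible.)
Note. Each bank has spare (buffer) rows for 8 remappings.
Note. This is the case where None is 1 or more and the GPU experiences an ECC error while High/Partial/Low stay unchanged (at a value of 1 or more, not a '0' state).
Note. This condition cannot be confirmed from a single detection. It means the values do not change across repeated GPU ECC errors (two or more: a first detection followed by another).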

1st detection – Volatile CE/UCE increases/occurs
Remapped Rows
        Correctable Error                 : 1
        Uncorrectable Error               : 2
        Pending                           : Yes
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 2558 bank(s)
            High                          : 0 bank(s)
            Partial                       : 1 bank(s)
            Low                           : 0 bank(s)
            None                          : 1 bank(s)

2nd detection – Volatile CE/UCE increases/occurs
Remapped Rows
        Correctable Error                 : 1
        Uncorrectable Error               : 8
        Pending                           : No
        Remapping Failure Occurred        : Yes
        Bank Remap Availability Histogram
            Max                           : 2558 bank(s)
            High                          : 0 bank(s)
            Partial                       : 1 bank(s)
            Low                           : 0 bank(s)
            None                          : 1 bank(s)
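
The "Remapped Rows" block above can be turned into a dictionary so that the failure flag is checked programmatically. The following is a minimal sketch, assuming the text has the same "Key : Value" layout as the output pasted in this post; the function names `parse_remapped_rows` and `needs_rma_review` are hypothetical helpers, not part of any NVIDIA tool.

```python
import re

# Sample text copied from the 2nd detection above.
SAMPLE = """\
Remapped Rows
        Correctable Error                 : 1
        Uncorrectable Error               : 8
        Pending                           : No
        Remapping Failure Occurred        : Yes
        Bank Remap Availability Histogram
            Max                           : 2558 bank(s)
            High                          : 0 bank(s)
            Partial                       : 1 bank(s)
            Low                           : 0 bank(s)
            None                          : 1 bank(s)
"""


def parse_remapped_rows(text):
    """Parse 'Key : Value' lines; plain numbers and 'N bank(s)' become int."""
    result = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # section headers carry no value
        key, value = (part.strip() for part in line.split(":", 1))
        match = re.fullmatch(r"(\d+)(?: bank\(s\))?", value)
        result[key] = int(match.group(1)) if match else value
    return result


def needs_rma_review(rows):
    """Hypothetical check: the row-remapping failure flag is set."""
    return rows.get("Remapping Failure Occurred") == "Yes"


rows = parse_remapped_rows(SAMPLE)
print(rows["None"], rows["Remapping Failure Occurred"], needs_rma_review(rows))
```

The sketch only reads the flag; it does not decide which of the three policy conditions caused it.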

- A remapping attempt for an uncorrectable memory error on a row that was already remapped and can occur with less than eight total remaps to the same bank.
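> Replace when a row that has already been remapped is targeted for remapping again
(Can be understood as the case where the address is changed to correct the error, but the address assignment is wrong.)
Note. Remapping is performed row by row (Unmap).
Note. This is the case where, regardless of the None value, the GPU experiences an ECC error while High/Partial/Low stay unchanged (at a value of 1 or more, not a '0' state).
Note. This condition cannot be confirmed from a single detection. It means the values do not change across repeated GPU ECC errors (two or more: a first detection followed by another).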

1st detection – Volatile CE/UCE increases/occurs
Remapped Rows
        Correctable Error                 : 1
        Uncorrectable Error               : 2
        Pending                           : Yes
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 2559 bank(s)
            High                          : 0 bank(s)
            Partial                       : 1 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)

2nd detection – Volatile CE/UCE increases/occurs
Remapped Rows
        Correctable Error                 : 1
        Uncorrectable Error               : 2
        Pending                           : No
        Remapping Failure Occurred        : Yes
        Bank Remap Availability Histogram
            Max                           : 2559 bank(s)
            High                          : 0 bank(s)
            Partial                       : 1 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
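
Because this condition only shows up across two detections, one way to check it is to compare two snapshots. The following is a rough sketch using the values from the two outputs above, already converted to dictionaries. The function name `remap_failed_between` and the snapshot format are assumptions made for illustration.

```python
HISTOGRAM_KEYS = ("Max", "High", "Partial", "Low", "None")

# Values taken from the 1st and 2nd detection outputs above.
first = {"Uncorrectable Error": 2, "Pending": "Yes",
         "Remapping Failure Occurred": "No",
         "Max": 2559, "High": 0, "Partial": 1, "Low": 0, "None": 0}
second = {"Uncorrectable Error": 2, "Pending": "No",
          "Remapping Failure Occurred": "Yes",
          "Max": 2559, "High": 0, "Partial": 1, "Low": 0, "None": 0}


def remap_failed_between(before, after):
    """True when the failure flag newly appears while the histogram is unchanged."""
    unchanged = all(before[key] == after[key] for key in HISTOGRAM_KEYS)
    newly_failed = (before["Remapping Failure Occurred"] == "No"
                    and after["Remapping Failure Occurred"] == "Yes")
    return unchanged and newly_failed


print(remap_failed_between(first, second))
```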

- After 512 total remappings for an uncorrectable memory error have occurred.
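> Replace when 512 or more remappings have been performed in total
Note. This is the case where Max has dropped to 128 or below (= the sum of High/Partial/Low/None is 512 or more).
Note. 512 is the A100 figure. For later GPU generations, use roughly 80% of the total available spare area as the reference.
Note. When the total available spare area is larger than 512, use 512 as the reference.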
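
The three trigger conditions quoted above can be sketched as a small model. This is a toy illustration of the policy text only, not how the GPU or driver implements row remapping. The class `RowRemapModel` and its method names are hypothetical; the constants 8 and 512 come from the quoted policy (512 being the A100 figure).

```python
from collections import defaultdict

SPARE_ROWS_PER_BANK = 8    # "eight uncorrectable error rows remapped" per bank
TOTAL_REMAP_LIMIT = 512    # total remappings quoted in the policy (A100)


class RowRemapModel:
    """Toy model of the row-remapping failure flag."""

    def __init__(self):
        self.remapped = defaultdict(set)  # bank -> rows already remapped
        self.total = 0
        self.failure = None               # reason text once the flag is set

    def uncorrectable_error(self, bank, row):
        """Record a UCE at (bank, row); return True if a remap was performed."""
        rows = self.remapped[bank]
        if row in rows:
            self.failure = "row already remapped"
            return False
        if len(rows) >= SPARE_ROWS_PER_BANK:
            self.failure = "bank has no spare rows left"
            return False
        rows.add(row)
        self.total += 1
        if self.total >= TOTAL_REMAP_LIMIT:
            self.failure = "512 total remappings reached"
        return True


gpu = RowRemapModel()
for row in range(8):
    gpu.uncorrectable_error(bank=0, row=row)  # eight remaps succeed
gpu.uncorrectable_error(bank=0, row=100)      # ninth UCE in the same bank
print(gpu.failure)
```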

Note. Bank Remap Availability Histogram - row buffer size example
- L40: 191
- A100: 640
- H100: 2560
- H200: 3072
- B200: 3840
- ...
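
Read together, the two notes on the 512 figure suggest a reference value of "80% of the total, capped at 512". The arithmetic below is a sketch of that reading only. It assumes the sizes listed above are the "total available spare area" the notes refer to, which is this post's interpretation rather than an NVIDIA formula; `remap_review_threshold` is a hypothetical name.

```python
A100_LIMIT = 512  # figure quoted in the RMA policy for A100

# Sizes from the list above (assumed to be the total spare area per GPU).
SPARE_TOTALS = {"L40": 191, "A100": 640, "H100": 2560, "H200": 3072, "B200": 3840}


def remap_review_threshold(total_spare):
    """Assumed rule: 80% of the total (integer arithmetic), capped at 512."""
    return min(A100_LIMIT, total_spare * 80 // 100)


for gpu, total in SPARE_TOTALS.items():
    print(gpu, remap_review_threshold(total))
```

For A100 this gives 640 × 0.8 = 512, which matches the policy figure; for the larger parts the cap of 512 applies.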

GPU L2 Memory RMA Policy
Any of the following events will trigger the SRAM Threshold Exceeded flag:
- More than 4 UCE Unique Count events within an address bank for parity protected SRAMs.
- More than 2 UCE Unique Count events within an address bank for SECDED ECC protected SRAMs.

- SRAM ECC errors are classified as correctable errors and are reset on reboot
- They generally do not recur or accumulate
SRAM errors fall under the correctable category and get reset on every reboot.
We don't have to worry about these counters.
- If they do recur, collect the results of the diagnostic tool provided by NVIDIA (field diag tool) for further review.

Note. Consider replacement under the following conditions
1. Aggregate - SRAM Uncorrectable Parity is greater than 4
2. Aggregate - SRAM Uncorrectable SEC-DED is greater than 2

Note. For GPUs before H100, when the SRAM Uncorrectable count is 4 or more

Aggregate
            SRAM Correctable              : 0
            SRAM Uncorrectable Parity     : 5
            SRAM Uncorrectable SEC-DED    : 3
            DRAM Correctable              : 29
            DRAM Uncorrectable            : 2
            SRAM Threshold Exceeded       : Yes
        Aggregate Uncorrectable SRAM Sources
            SRAM L2                       : 3
            SRAM SM                       : 5
            SRAM Microcontroller          : 0
            SRAM PCIE                     : 0
            SRAM Other                    : 0
        Channel Repair Pending            : No
        TPC Repair Pending                : No
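
The aggregate counters above can be checked against the quoted thresholds in a few lines. This is a minimal sketch, assuming the counters have already been read into a dictionary. Note that the policy counts unique UCE events within an address bank, so comparing the aggregate totals is only an approximation; `sram_rma_review` and its parameters are hypothetical.

```python
# Values from the Aggregate block above.
aggregate = {
    "SRAM Uncorrectable Parity": 5,
    "SRAM Uncorrectable SEC-DED": 3,
    "SRAM Threshold Exceeded": "Yes",
}


def sram_rma_review(counters, parity_limit=4, secded_limit=2):
    """Return the reasons to consider replacement (empty list if none)."""
    reasons = []
    if counters.get("SRAM Threshold Exceeded") == "Yes":
        reasons.append("SRAM Threshold Exceeded flag is set")
    if counters.get("SRAM Uncorrectable Parity", 0) > parity_limit:
        reasons.append("parity UCE count above limit")
    if counters.get("SRAM Uncorrectable SEC-DED", 0) > secded_limit:
        reasons.append("SEC-DED UCE count above limit")
    return reasons


for reason in sram_rma_review(aggregate):
    print(reason)
```

The limits are parameters so they can be adjusted if a vendor notice states a different cutoff.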

Row Remapping
https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/latest/row-remapping.html
This feature is used to prevent known degraded memory locations from being used. The row-remapping feature is a replacement for the page retirement scheme used in prior generation GPUs. Every bank in DRAM is equipped with spare rows in hardware. As opposed to traditional page retirement, the row-remapper replaces degrading memory cells with spare ones to avoid offlining regions of memory in software. This differs from dynamic page offlining in that the memory is fixed at a hardware level and does not leave software visible holes in the address space. The process of row-remapping requires a GPU reset to take effect and will remain persistent throughout the life of the GPU.

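Note. To reach the data in a particular cell, the access first selects a specific row of a specific bank and then the column.

- When a cell is found to be faulty, it cannot be replaced at cell granularity, so the entire row containing it must be replaced with a spare row.
- A row remap can be applied temporarily while online through a GPU reset (PCIe FLR), but depending on the situation a host reboot (offline) may be required.
- Applying a row remap permanently requires a host reboot.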

RAS Repair - GPU Memory Repair
https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/latest/hbm-channel-repair.html
GPU memory repair consists of swapping spare DRAM channels or L2 cache slices.
There are multiple L2 slices that make up L2 cache in the NVIDIA GPU.

For example, if there are spare channels available and the DRAM has a bank that is trending towards failure, the channel in which the bank resides can be swapped out for a spare. After two row re-mappings in the same bank, the next occurrence of an uncorrectable ECC error in that bank will attempt to trigger repair if a spare is available. The same concept applies to failing L2 slice as well.

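Note. This feature was added in the Blackwell architecture. When UCEs keep occurring in a particular bank and a third row remapping is attempted, channel repair is performed if a spare HBM channel (an available alternate path) exists. This prevents or avoids a single bank wearing out from ECC events while many other banks are operating normally.

Note. Within Hopper, this applies to H200.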
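
The repair trigger described in the quoted paragraph ("after two row re-mappings in the same bank, the next occurrence ... will attempt to trigger repair if a spare is available") can be written as a small decision function. This is only a sketch of that sentence; `next_action_on_uce` and its return strings are hypothetical, and the real driver logic may consider more inputs.

```python
def next_action_on_uce(remaps_in_bank, spare_channels):
    """Pick the response to a new uncorrectable error in one bank.

    remaps_in_bank: row remaps already performed in that bank.
    spare_channels: spare DRAM channels still available.
    """
    if remaps_in_bank >= 2 and spare_channels > 0:
        return "channel repair"
    return "row remap"


for remaps, spares in [(0, 1), (1, 1), (2, 1), (2, 0)]:
    print(remaps, spares, next_action_on_uce(remaps, spares))
```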

HBM Architecture: image source: https://dl.acm.org/doi/10.1145/3767333#sec-2-1

Dynamic Page Offlining
https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/latest/dynamic-page-offlining.html
Dynamic Page Offlining improves resiliency and availability of NVIDIA GPUs to uncorrectable ECC errors. Once the NVIDIA driver identifies the location of an uncorrectable error in the frame buffer memory, it marks the page containing the error as unusable. Once the page is marked unusable, any of the currently executing or newly launched workloads will not be allocating this page in question.

1. The NVIDIA driver's ECC engine detects an error in HBM page / page frame 16
2. Replacement page frame 11 is selected
3. The required data is loaded/copied into replacement page frame 11
4. The page table is updated - page frame 16 > page frame 11
5. Page frame 16 is marked unusable
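
The five steps above can be mimicked with a dictionary acting as the page table. This is a toy model for illustration, not the driver's actual data structures; `offline_page`, the frame numbers 16 and 11, and the "vpage 0" label are taken from or invented for this example.

```python
def offline_page(page_table, frames, free_frames, unusable, virtual_page):
    """Move one virtual page off a faulty frame and mark that frame unusable."""
    bad = page_table[virtual_page]    # step 1: frame reported as faulty
    new = free_frames.pop(0)          # step 2: pick a replacement frame
    frames[new] = frames[bad]         # step 3: load/copy the needed data
    page_table[virtual_page] = new    # step 4: update the page table
    unusable.add(bad)                 # step 5: never allocate this frame again
    return new


page_table = {"vpage 0": 16}          # virtual page -> physical page frame
frames = {16: "workload data"}        # frame -> contents
free_frames = [11, 12]
unusable = set()

offline_page(page_table, frames, free_frames, unusable, "vpage 0")
print(page_table, unusable)
```

After the call, the virtual page points at frame 11 and frame 16 sits in the unusable set, so a later allocation can only draw from the remaining free frames.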

For reference: for generations before Ampere, see below
Dynamic Page Retirement
https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html#rma-eligibility
The Tesla board will continue to retire pages up until the page retirement table is full, at 64 dynamically retired memory pages. However, a board that generates 60 or more retired pages is eligible for an RMA. A Tesla card with 64 pages retired will fail the NVIDIA Field Diagnostic tool.

Additionally, if a board is found to exhibit 15 or more retired pages and continues to retire memory pages at a rate of one or more newly retired pages per week, it can be evaluated for an RMA before the 60-page RMA threshold has been reached. Please track the page retirement rate and provide that information with the returned board.

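Uncorrectable Error/Double Bit Error: when many errors accumulate in a short period

- NV: when 15 or more pages have already been retired and one or more new pages keep being retired per week
- HPE: consider replacement when 5 or more retired pages occur within 30 days, or when 10 or more retired pages are found

Correctable Error/Single Bit Error: replace when the dynamically retired pages for CE and UCE combined total 60 or more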
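
The page-retirement thresholds summarized above can be collected into one check. This is a minimal sketch of the numbers as stated in this post; `page_retirement_review` and its parameter names are hypothetical, and the vendor notices should be treated as the authority.

```python
def page_retirement_review(total_retired, new_per_week=0.0, new_last_30_days=0):
    """Return the reasons to consider an RMA (empty list if none apply)."""
    reasons = []
    if total_retired >= 60:
        reasons.append("NVIDIA: 60 or more retired pages")
    if total_retired >= 15 and new_per_week >= 1:
        reasons.append("NVIDIA: 15 or more retired pages and 1 or more new per week")
    if new_last_30_days >= 5 or total_retired >= 10:
        reasons.append("HPE: 5 or more in 30 days, or 10 or more in total")
    return reasons


print(page_retirement_review(total_retired=20, new_per_week=1.0))
print(page_retirement_review(total_retired=3))
```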

Related references:
NVIDIA GPU Memory Error Management
https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/latest/index.html

Notice: (Revision) NVIDIA Graphical Processing Units (GPUs) - Handling ECC Memory Errors on NVIDIA GPUs
https://support.hpe.com/hpesc/public/docDisplay?docId=a00155638en_us&docLocale=en_US

Notice: HPE Multiple Server Platforms - NVIDIA Accelerator/GPU Replacement Policy for HPE ProLiant or Apollo Servers Configured With NVIDIA PCIE/SXM Accelerator Modules
https://internal.support.hpe.com/hpesc/docDisplay?docId=emr_na-a00117925en_us

NVIDIA GPU (A100, H100, H200, B200) reports ECC or SRAM memory errors - Lenovo ThinkSystem
https://support.lenovo.com/kr/ko/solutions/tt2651-nvidia-gpu-reports-ecc-errors-lenovo-thinkagile-and-thinksystem