Redirecting the iLO Virtual Console screen of an HPE ProLiant server to the Virtual Serial Port (HPE iLO VSP configuration)

 

1. Hardware Configuration

 

HPE iLO 4 User Guide
https://h20628.www2.hp.com/km-ext/kmcsdirect/emr_na-c03334051-19.pdf#m94362
P.224. Using the iLO Virtual Serial Port

Configuring the iLO Virtual Serial Port in the UEFI System Utilities
1. Access the UEFI system utilities -> Press <F9> during POST

2. Set the Virtual Serial Port COM port.
a. Go to System Configuration -> BIOS/Platform Configuration (RBSU) -> System Options -> Serial Port Options.
b. Select Virtual Serial Port, then select COM 2 (default).

3. Set the BIOS serial console COM port.
Select BIOS Serial Console and EMS
-> Select BIOS Serial Console Port -> change "Auto" to "Virtual Serial Port"
-> Select BIOS Serial Console Emulation Mode -> VT100+ (default)
-> Select BIOS Serial Console Baud Rate -> 115200 (default)
-> EMS Console -> Disabled (default)

 

2. Software Configuration.
1) Configure GRUB on RHEL (add kernel parameters; the console listed last becomes the primary console)
# vim /boot/grub/grub.conf
------------------------------------------

# ttyS0 and unit 0 are for COM1; ttyS1 and unit 1 are for COM2.
# The console listed last is the primary console.
console=tty0 console=ttyS1,115200
------------------------------------------

e.g.)

------------------------------------------
default=0
timeout=5
splashimage=(hd0,0)/grub/splash.xpm.gz
hiddenmenu
title Red Hat Enterprise Linux 6 (2.6.32-573.el6.x86_64)
        root (hd0,0)
        kernel /vmlinuz-2.6.32-573.el6.x86_64 ro root=/dev/mapper/vg_dl380g9j7u1-lv_root rd_NO_LUKS LANG=en_US.UTF-8 rd_NO_MD rd_LVM_LV=vg_dl380g9j7u1/lv_root SYSFONT=latarcyrheb-sun16 crashkernel=128M rd_LVM_LV=vg_dl380g9j7u1/lv_swap  KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM rhgb quiet console=tty0 console=ttyS1,115200
        initrd /initramfs-2.6.32-573.el6.x86_64.img
------------------------------------------
        kernel /vmlinuz-2.6.32-573.el6.x86_64 ro root=/dev/mapper/vg_dl380g9j7u1-lv_root rd_NO_LUKS LANG=en_US.UTF-8 rd_NO_MD rd_LVM_LV=vg_dl380g9j7u1/lv_root SYSFONT=latarcyrheb-sun16 crashkernel=128M rd_LVM_LV=vg_dl380g9j7u1/lv_swap  KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM console=tty0 console=ttyS1,115200 intremap=no_x2apic_optout elevator=deadline nmi_watchdog=0 intel_idle.max_cstate=0 processor.max_cstate=0

------------------------------------------
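
To confirm which console the kernel will treat as primary, the last console= entry can be extracted from a command line. This is a minimal sketch; the function name primary_console is hypothetical and not part of any tool.

```shell
# primary_console "CMDLINE": print the value of the last console= argument.
# The kernel uses the last console= entry as the primary console.
primary_console() {
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^console=//p' | tail -n 1
}

# Sample using the parameters from the example above.
# On a live system, pass "$(cat /proc/cmdline)" instead.
primary_console "ro rhgb quiet console=tty0 console=ttyS1,115200"
# prints: ttyS1,115200
```

If the output is not ttyS1,115200 after a reboot, the console= entries are probably in the wrong order or missing from the booted kernel line.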

 

2) Check that the terminal is listed in /etc/securetty (add it if it is missing)
# grep ttyS1 /etc/securetty
------------------------------------------
ttyS1
------------------------------------------
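
The check above can be made idempotent so that ttyS1 is appended only when it is missing. A minimal sketch, assuming a hypothetical helper named ensure_securetty; the demonstration runs against a scratch file, not the real /etc/securetty.

```shell
# ensure_securetty TTY [FILE]: append TTY to FILE unless an identical line exists.
ensure_securetty() {
    tty_name="$1"
    file="${2:-/etc/securetty}"
    if grep -qxF -- "$tty_name" "$file" 2>/dev/null; then
        echo "$tty_name already listed in $file"
    else
        printf '%s\n' "$tty_name" >> "$file" && echo "$tty_name added to $file"
    fi
}

# Demonstration on a scratch copy: the second call makes no change.
demo=$(mktemp)
printf 'console\ntty1\n' > "$demo"
ensure_securetty ttyS1 "$demo"
ensure_securetty ttyS1 "$demo"
rm -f "$demo"
```

On the server itself, run it as root without the second argument to edit /etc/securetty.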

 

 

Note: Optional (an alternative way to add the settings; the configuration may differ slightly between OS versions)
# vim /boot/grub/grub.conf
------------------------------------------
# splashimage=(hd0,0)/grub/splash.xpm.gz  << Comment out (add '#')

Add the two lines below at the top (note that each option takes two '-' characters; the guide prints them as if there were only one). Unit 1 is COM2, matching the Virtual Serial Port setting above.
serial --unit=1 --speed=115200
terminal --timeout=10 serial console


# cat /boot/grub/grub.conf
------------------------------------------
serial --unit=1 --speed=115200
terminal --timeout=10 serial console

default=0
timeout=5
#splashimage=(hd0,0)/grub/splash.xpm.gz
hiddenmenu
title Red Hat Enterprise Linux 6 (2.6.32-573.el6.x86_64)
        root (hd0,0)
        kernel /vmlinuz-2.6.32-573.el6.x86_64 ro root=/dev/mapper/vg_dl360g9c8u13-lv_root rd_NO_LUKS LANG=en_US.UTF-8 rd_NO_MD rd_LVM_LV=vg_dl360g9c8u13/lv_root SYSFONT=latarcyrheb-sun16 crashkernel=128M rd_LVM_LV=vg_dl360g9c8u13/lv_swap  KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM rhgb quiet console=tty0 console=ttyS1,115200
        initrd /initramfs-2.6.32-573.el6.x86_64.img
------------------------------------------

 

References:
How to setup virtual serial console for a HP system with iLO's VSP?
https://access.redhat.com/solutions/28555

How does one set up a serial terminal and/or console in Red Hat Enterprise Linux?
https://access.redhat.com/articles/3166931

Posted by 스쳐가는인연

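Adding or removing an NVMe disk on a running system

- Note: on ProLiant, hot add (adding a new disk while the system is online) is not supported at the time of writing.

- Use this procedure as a reference when replacing or removing an existing disk.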

 

Test environment

- DL380 Gen9 (RHEL 7.5)

 

# lsblk
NAME          MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda             8:0    0 558.7G  0 disk
├─sda1          8:1    0   200M  0 part /boot/efi
├─sda2          8:2    0     1G  0 part /boot
└─sda3          8:3    0 557.5G  0 part
  ├─rhel-root 253:0    0    50G  0 lvm  /
  └─rhel-home 253:2    0   3.4T  0 lvm  /home
sr0            11:0    1   4.3G  0 rom
nvme0n1       259:0    0   1.5T  0 disk
└─nvme0n1p1   259:1    0   1.5T  0 part
  ├─rhel-swap 253:1    0  15.7G  0 lvm  [SWAP]
  └─rhel-home 253:2    0   3.4T  0 lvm  /home
nvme1n1       259:2    0   1.5T  0 disk
└─nvme1n1p1   259:3    0   1.5T  0 part
  └─rhel-home 253:2    0   3.4T  0 lvm  /home

 

RHEL was installed with two NVMe drives fitted.

 

Add a kernel parameter so that the OS can control NVMe slot power.

# vim /etc/default/grub

 

Append "pci=pcie_bus_perf" to the end of the line below:
GRUB_CMDLINE_LINUX="crashkernel=auto rd.lvm.lv=rhel/root rd.lvm.lv=rhel/swap rhgb quiet pci=pcie_bus_perf"

 

# grub2-mkconfig -o /boot/grub2/grub.cfg
# grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg
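
The manual edit of /etc/default/grub can also be scripted so that the parameter is appended only once. This is a sketch under assumptions: add_grub_param is a hypothetical helper, GRUB_CMDLINE_LINUX is assumed to sit on a single double-quoted line, and GNU sed is assumed for the -i and -E options.

```shell
# add_grub_param PARAM [FILE]: append PARAM to GRUB_CMDLINE_LINUX in FILE
# unless it is already present as a whole word.
add_grub_param() {
    param="$1"
    file="${2:-/etc/default/grub}"
    if grep -E '^GRUB_CMDLINE_LINUX=' "$file" | grep -qwF -- "$param"; then
        return 0
    fi
    # Insert the parameter just before the closing quote of the value.
    sed -i -E "s|^(GRUB_CMDLINE_LINUX=\".*)\"[[:space:]]*$|\1 ${param}\"|" "$file"
}

# Demonstration on a scratch file rather than the real /etc/default/grub.
demo=$(mktemp)
echo 'GRUB_CMDLINE_LINUX="crashkernel=auto rhgb quiet"' > "$demo"
add_grub_param pci=pcie_bus_perf "$demo"
cat "$demo"
rm -f "$demo"
```

After changing the real file, grub2-mkconfig still has to be run as shown above. On RHEL 7, grubby --update-kernel=ALL --args="pci=pcie_bus_perf" is another way to add the argument to the installed kernel entries.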

 

Reboot after regenerating the GRUB configuration.

 

Locate the NVMe devices in the system.

# find /sys/devices |egrep 'nvme[0-9][0-9]?$'
e.g.)
/sys/devices/pci0000:00/0000:00:03.0/0000:08:00.0/0000:09:09.0/0000:0b:00.0/nvme/nvme0
/sys/devices/pci0000:00/0000:00:03.0/0000:08:00.0/0000:09:0a.0/0000:0c:00.0/nvme/nvme1

 

Find the PCI slot (bus address) used for power control.

# egrep -H '.*' /sys/bus/pci/slots/*/address
/sys/bus/pci/slots/0-1/address:0000:0c:00
/sys/bus/pci/slots/0-2/address:0000:0d:00
/sys/bus/pci/slots/0-3/address:0000:0e:00
/sys/bus/pci/slots/0-4/address:0000:0f:00
/sys/bus/pci/slots/0-5/address:0000:10:00
/sys/bus/pci/slots/0/address:0000:0b:00
/sys/bus/pci/slots/1/address:0000:08:00
/sys/bus/pci/slots/2/address:0000:05:00

 

# grep '0b:00' /sys/bus/pci/slots/*/address
/sys/bus/pci/slots/0/address:0000:0b:00

 

# grep '0c:00' /sys/bus/pci/slots/*/address
/sys/bus/pci/slots/0-1/address:0000:0c:00
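
The two lookups above (sysfs device path to PCI address, then PCI address to slot) can be chained. A minimal sketch; pci_bus_addr and slot_for_addr are hypothetical helper names, and the slots directory is a parameter only so the logic can be tried outside a real server.

```shell
# pci_bus_addr PATH: from a sysfs path such as .../0000:0c:00.0/nvme/nvme1,
# print the last PCI address in the path without the function suffix.
pci_bus_addr() {
    addr=$(printf '%s\n' "$1" |
        grep -oE '[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]' | tail -n 1)
    [ -n "$addr" ] && printf '%s\n' "${addr%.*}"
}

# slot_for_addr ADDR [SLOTS_DIR]: print the slot whose address file equals ADDR.
slot_for_addr() {
    want="$1"
    dir="${2:-/sys/bus/pci/slots}"
    for f in "$dir"/*/address; do
        [ -r "$f" ] || continue
        if [ "$(cat "$f")" = "$want" ]; then
            basename "$(dirname "$f")"
            return 0
        fi
    done
    return 1
}

# Sample path taken from the find output above.
pci_bus_addr "/sys/devices/pci0000:00/0000:00:03.0/0000:08:00.0/0000:09:0a.0/0000:0c:00.0/nvme/nvme1"
# prints: 0000:0c:00
```

On the test server, slot_for_addr 0000:0c:00 would be expected to print 0-1, matching the grep result above.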

 

Flush in-flight I/O to the disk.

# blockdev --flushbufs /dev/nvme1n1

Power off the NVMe slot.

# echo 0 > /sys/bus/pci/slots/0-1/power

 

# lspci | grep -i "non-vol"
# lsblk
NAME          MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda             8:0    0 558.7G  0 disk
├─sda1          8:1    0   200M  0 part /boot/efi
├─sda2          8:2    0     1G  0 part /boot
└─sda3          8:3    0 557.5G  0 part
  ├─rhel-root 253:0    0    50G  0 lvm  /
  └─rhel-home 253:2    0   3.4T  0 lvm  /home
sr0            11:0    1   4.3G  0 rom
nvme0n1       259:0    0   1.5T  0 disk
└─nvme0n1p1   259:1    0   1.5T  0 part
  ├─rhel-swap 253:1    0  15.7G  0 lvm  [SWAP]
  └─rhel-home 253:2    0   3.4T  0 lvm  /home

 

Power the NVMe slot back on.

# echo 1 > /sys/bus/pci/slots/0-1/power

 

# lsblk
NAME          MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda             8:0    0 558.7G  0 disk
├─sda1          8:1    0   200M  0 part /boot/efi
├─sda2          8:2    0     1G  0 part /boot
└─sda3          8:3    0 557.5G  0 part
  ├─rhel-root 253:0    0    50G  0 lvm  /
  └─rhel-home 253:2    0   3.4T  0 lvm  /home
sr0            11:0    1   4.3G  0 rom
nvme0n1       259:0    0   1.5T  0 disk
└─nvme0n1p1   259:1    0   1.5T  0 part
  ├─rhel-swap 253:1    0  15.7G  0 lvm  [SWAP]
  └─rhel-home 253:2    0   3.4T  0 lvm  /home
nvme2n1       259:4    0   1.5T  0 disk
└─nvme2n1p1   259:5    0   1.5T  0 part
  └─rhel-home 253:2    0   3.4T  0 lvm  /home
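
Note that in the output above the disk came back as nvme2n1 rather than nvme1n1, so later commands should not assume the old name. The power toggle itself can be wrapped with a few sanity checks. A sketch only: set_slot_power is a hypothetical helper, and the slots directory is a parameter so it can be exercised against a scratch directory.

```shell
# set_slot_power SLOT STATE [SLOTS_DIR]: write 0 or 1 to the slot's power file
# if it is not already in that state, then print the resulting state.
set_slot_power() {
    slot="$1"
    state="$2"
    dir="${3:-/sys/bus/pci/slots}"
    file="$dir/$slot/power"
    case "$state" in
        0|1) ;;
        *) echo "state must be 0 or 1" >&2; return 2 ;;
    esac
    if [ ! -w "$file" ]; then
        echo "no writable power file for slot $slot" >&2
        return 1
    fi
    [ "$(cat "$file")" = "$state" ] || echo "$state" > "$file"
    cat "$file"
}

# Usage on the server (as root), after flushing I/O as shown above:
#   set_slot_power 0-1 0    # power off
#   set_slot_power 0-1 1    # power on
```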

 

 

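Additional reference: putting the disk into use after adding it

If the existing configuration uses software RAID, these commands remove the NVMe drive from the array and add it back: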

---------------------------------
# cat /proc/mdstat
# mdadm --manage /dev/md0 --fail /dev/nvme1n1
# mdadm --manage /dev/md0 --remove /dev/nvme1n1
# mdadm --manage /dev/md0 --add /dev/nvme1n1
---------------------------------

 

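If the new disk is used as a standalone volume (JBOD or similar):

New Disk ------------------------
# fdisk /dev/nvme1n1
a) 'n' new partition
b) 'p' primary partition
c) '1' partition number 1 (accept the default first and last sectors)
d) 'w' write the partition table and exit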

# mkfs -t ext4 /dev/nvme1n1p1
or
# mkfs.xfs /dev/nvme1n1p1

# mount /dev/nvme1n1p1 /mnt/
---------------------------------
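
For a non-interactive variant of the steps above, the commands can be generated first and reviewed before anything is run. This sketch swaps the interactive fdisk session for parted in script mode, which is not what the original steps use; partition_name and plan_new_disk are hypothetical helpers, and plan_new_disk only prints commands.

```shell
# partition_name DISK N: the kernel puts a "p" before the partition number
# when the disk name ends in a digit (nvme1n1 -> nvme1n1p1, sdb -> sdb1).
partition_name() {
    case "$1" in
        *[0-9]) printf '%sp%s\n' "$1" "$2" ;;
        *)      printf '%s%s\n' "$1" "$2" ;;
    esac
}

# plan_new_disk DISK MOUNTPOINT: print (do not run) the commands for a single
# whole-disk primary partition with an XFS filesystem.
plan_new_disk() {
    part=$(partition_name "$1" 1)
    echo "parted -s $1 mklabel msdos mkpart primary 0% 100%"
    echo "mkfs.xfs $part"
    echo "mount $part $2"
}

plan_new_disk /dev/nvme1n1 /mnt
```

Printing the plan first makes it easy to confirm the device name, which matters here because a re-powered NVMe disk may reappear under a different name.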


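Finding the default gateway of the current network

 

Depending on the kernel version, the route command does not always display the information properly (in the RHEL 7.5 output below it prints the name "gateway" instead of the address).

Regardless of version, "netstat -rn" or "ip route" shows the gateway address.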

 

RHEL 7.5

# netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
0.0.0.0         192.171.6.1     0.0.0.0         UG        0 0          0 eno1
192.171.6.0     0.0.0.0         255.255.254.0   U         0 0          0 eno1
169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 eno1
192.168.122.0   0.0.0.0         255.255.255.0   U         0 0          0 virbr0

 

# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         gateway         0.0.0.0         UG    0      0        0 eno1
192.171.6.0     0.0.0.0         255.255.254.0   U     0      0        0 eno1
link-local      0.0.0.0         255.255.0.0     U     1002   0        0 eno1
192.168.122.0   0.0.0.0         255.255.255.0   U     0      0        0 virbr0

 

# ip route
default via 192.171.6.1 dev eno1
192.171.6.0/23 dev eno1 proto kernel scope link src 192.171.6.28
169.254.0.0/16 dev eno1 scope link metric 1002
192.168.122.0/24 dev virbr0 proto kernel scope link src 192.168.122.1

 

RHEL 6.7
# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.54.72.0     *               255.255.248.0   U     0      0        0 eth0
link-local      *               255.255.0.0     U     1002   0        0 eth0
default         192.54.79.254   0.0.0.0         UG    0      0        0 eth0

# netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
192.54.72.0     0.0.0.0         255.255.248.0   U         0 0          0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 eth0
0.0.0.0         192.54.79.254   0.0.0.0         UG        0 0          0 eth0


# ip route
192.54.72.0/21 dev eth0  proto kernel  scope link  src 192.54.72.112
169.254.0.0/16 dev eth0  scope link  metric 1002
default via 192.54.79.254 dev eth0
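
Since the "ip route" format is the same on both releases shown, the gateway can be extracted from it in a script. A minimal sketch; default_gateway is a hypothetical function name, and it reads "ip route" output on standard input.

```shell
# default_gateway: print the address that follows "via" on the first
# default route found in `ip route` output read from stdin.
default_gateway() {
    awk '$1 == "default" {
        for (i = 2; i < NF; i++)
            if ($i == "via") { print $(i + 1); exit }
    }'
}

# Sample lines taken from the RHEL 6.7 output above.
# On a live system: ip route | default_gateway
printf '%s\n' \
    '192.54.72.0/21 dev eth0  proto kernel  scope link  src 192.54.72.112' \
    'default via 192.54.79.254 dev eth0' | default_gateway
# prints: 192.54.79.254
```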


 


While running Linux, events similar to the following may be logged.

 

--------------------------------------------------------------

Jun 20 10:43:56 localhost kernel: CPU57: Package temperature above threshold, cpu clock throttled (total events = 1)
Jun 20 10:43:56 localhost kernel: CPU57: Core temperature/speed normal
Jun 20 10:43:56 localhost kernel: CPU57: Package temperature/speed normal

--------------------------------------------------------------

 

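- On the hardware side, review the power configuration.

  (Maximum Performance is recommended. The event can be triggered when the CPU moves between idle and powered-up states.)
- On the software side, check whether a kernel patch is needed (known bug).

- According to Intel, the event can occur when the Thermal Control Circuit (TCC) activates under a short burst of load, and depending on the operating conditions it can be ignored.

  (This is normal behavior: when a spike in CPU utilization generates abnormal heat, the clock is throttled to bring the temperature down.)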
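
To judge whether the events are occasional bursts or a steady pattern, the log lines can be counted per CPU. A minimal sketch; throttle_summary is a hypothetical helper that reads kernel log lines on standard input.

```shell
# throttle_summary: print "CPUn count" for every CPU that logged a
# "temperature above threshold" line in the input.
throttle_summary() {
    grep -E 'temperature above threshold' |
        grep -oE 'CPU[0-9]+' | sort | uniq -c | awk '{ print $2, $1 }'
}

# Sample lines taken from the log excerpt above.
# On a live system: throttle_summary < /var/log/messages
printf '%s\n' \
    'Jun 20 10:43:56 localhost kernel: CPU57: Package temperature above threshold, cpu clock throttled (total events = 1)' \
    'Jun 20 10:43:56 localhost kernel: CPU57: Core temperature/speed normal' |
    throttle_summary
# prints: CPU57 1
```

Where the kernel exposes them, the counters under /sys/devices/system/cpu/cpu*/thermal_throttle/ give the same information without parsing logs.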

 

 

References

HPE ProLiant Gen8, HPE ProLiant Gen9, and HPE ProLiant Gen10 Servers - Short Durations of Throttling (TCC Activation) May Cause Operating Systems to Issue Machine Check Alerts, Which Is Expected Behavior
https://access.redhat.com/solutions/3401881

 

Notice: HPE ProLiant Gen8, HPE ProLiant Gen9, and HPE ProLiant Gen10 Servers - Short Durations of Throttling (TCC Activation) May Cause Operating Systems to Issue Machine Check Alerts, Which Is Expected Behavior
https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-a00020196en_us

 

This is not HPE-specific.
This behavior is expected and no action needs to be taken. Functionality of the system is not impacted by these alerts.


Seeing "Temperature above threshold" or "Core power limit notification" in /var/log/messages
https://access.redhat.com/solutions/134973

 

Resolution
•There are two different underlying issues that can trigger these messages.
◦One issue was a bug in the kernel.

 The issue was fixed in following kernel package versions (tracked via private RHBZ#908990)
◾RHEL6: kernel-2.6.32-407.el6(RHBZ#908990) or later
◾RHEL6.4.z: kernel-2.6.32-358.20.1.el6(RHBZ#999328) or later
◾RHEL6.3.z: kernel-2.6.32-279.39.1.el6(RHBZ#1020527) or later
◾RHEL6.2.z: kernel-2.6.32-220.44.1.el6(RHBZ#1020519) or later

◦The other is a hardware-side issue. These messages indicate that the system hardware is reporting temperatures above acceptable thresholds. These errors indicate a potential failure of the cooling solution on this system, and the CPUs are being throttled down to reduce the heat they generate. The system should be investigated for failing cooling and, should all be operational, hardware diagnostics should be run to ensure that the CPUs and system board are not faulty.
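
A running kernel can be compared against one of the fixed versions listed above. This sketch approximates RPM version ordering with GNU sort -V after stripping the ".elN" suffix, which is an assumption and not how rpm itself compares packages; kernel_at_least is a hypothetical helper, and the caller must pick the threshold that matches its own release stream (for example the 6.4.z value on a 6.4.z kernel).

```shell
# kernel_at_least RUNNING FIXED: succeed if RUNNING is at or after FIXED
# under GNU sort's version ordering, ignoring the ".elN..." suffix.
kernel_at_least() {
    running="${1%%.el[0-9]*}"
    fixed="${2%%.el[0-9]*}"
    [ "$(printf '%s\n%s\n' "$running" "$fixed" | sort -V | head -n 1)" = "$fixed" ]
}

# Sample using the RHEL 6 kernel from the GRUB example earlier in this document.
# On a live system, pass "$(uname -r)" as the first argument.
if kernel_at_least 2.6.32-573.el6.x86_64 2.6.32-407.el6; then
    echo "fix included"
else
    echo "fix missing"
fi
# prints: fix included
```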


syslogd reporting: Temperature above threshold, cpu clock throttled.
https://access.redhat.com/solutions/35494

 

Resolution
Disabling the "C States" in the BIOS, so that the CPU is always running at full power.


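MCE-related log entries are recorded by EDAC, the OS's memory-monitoring facility, which is less precise than the hardware's own memory monitoring. Occasionally the sensitive EDAC engine in the OS logs errors even though no real fault exists.

When such a message appears, cross-check it against the hardware information (IML, front LED). If nothing is wrong, either ignore the message or disable the OS MCE detection.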

 

Advisory: (Revision) ProLiant G6,G7, Gen8 and Gen9 Servers - Correctable Machine Check Errors That Do Not Require Customer Action May Erroneously Be Logged to the Operating System Error Logs

https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-c03356780

 

The System ROMs on HP ProLiant servers are designed to monitor these errors and to report to the customer through the Integrated Management Log (IML) and other means (such as the health LED) if there is an issue with any hardware component in the system.

 

Notice: (Revision) Linux - To Ensure Efficient Firmware First Handling of Memory Failures HPE Recommends Booting With the mce=ignore_ce Boot Parameter in Addition to Disabling EDAC

https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-a00016026en_us

https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-c04183538

 

HPE recommends disabling EDAC in addition to disabling the correctable error detection functionality of the Linux kernel's Machine Check Event (MCE) handling.

 

Should EDAC modules be disabled on HP Proliant hardware, as recommended by HP?

https://access.redhat.com/solutions/414723

 

Individual hardware vendors then advise customers to enable or disable this general purpose feature as appropriate, depending on compatibility with their tailored error detection offerings.

 

Erroneous MCE taint on Some CPU Processors
https://www.novell.com/support/kb/doc.php?id=7008578
 

 

Related technical documents from IBM and Dell

Interpreting /var/log/mcelog on IMM based servers - IBM System x3850 X5, x3950 X5

http://www-947.ibm.com/support/entry/portal/docdisplay?lndocid=migr-5084973

 

Workaround

Do not use the Linux MCE daemon.

 

M620 Blade Memory issues

http://en.community.dell.com/support-forums/servers/f/956/t/19535045.aspx

It would be much better to disable EDAC and let the BMC handle error reporting and logging of the hardware

 

 

Action Plan 1.

What: Check the power settings

Why : To verify the recommended power settings for ProLiant servers

To disable C-states, perform the following steps in RBSU during POST:

Press F9 during POST to access RBSU.

Select Power Management Options, then select HP Power Profile and change the default value to Maximum Performance or Custom.

Return to the previous menu.

Select HP Power Regulator and change the default value to HP Static High Performance Mode or OS Control Mode.

Return to the previous menu.

Select Minimum Processor Idle Power State and change the default value to No C-states.

Select Advanced Power Management Options, then Minimum Processor Idle Power Package State, and change the setting from Package C3 State to No Package State.

Select Advanced Power Management Options, then Collaborative Power Control, and change the setting from Enabled to Disabled.

  

Action Plan 2.

What: Disable or ignore mcelog

Why : To stop the MCE log messages

To do.

1) Disable

    disable EDAC if running.

    a. Search EDAC modules

       # lsmod | grep edac

    b. For each EDAC module (if any found):

       Add the following to /etc/modprobe.conf on OS releases that support /etc/modprobe.conf

       alias edac_xxx off  (edac_xxx: the module name found with lsmod)

       Add the following to /etc/modprobe.d/blacklist.conf on OS releases that support /etc/modprobe.d/blacklist.conf

       blacklist edac_xxx  (edac_xxx: the module name found with lsmod)

 

       RHEL

       Add the following to the kernel line in "/boot/grub/grub.conf"

       mce=ignore_ce

 

       How do I disable MCE function?

       https://access.redhat.com/site/solutions/367773

 

       SLES

       Add the following to the kernel line in "/boot/grub/menu.lst"

       mce=ignore_ce

 

2) Ignore

 

cf. 

mce=ignore_ce

prevents Linux from polling the machine check banks for correctable errors every five minutes

 

intel_idle.max_cstate=0

prevents the kernel from overriding the BIOS C-state setting.
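
The two parts of Action Plan 2 (blacklisting EDAC modules and adding boot parameters) can be checked with small helpers. A sketch only: edac_blacklist_lines and missing_params are hypothetical names, and both work on text passed in so they can be tried without touching the system.

```shell
# edac_blacklist_lines: read `lsmod` output on stdin and print one
# "blacklist <module>" line per module whose name contains "edac".
edac_blacklist_lines() {
    awk '$1 ~ /edac/ { print "blacklist " $1 }'
}

# missing_params "CMDLINE" PARAM...: print each PARAM that is not present
# as a whole space-separated word in CMDLINE.
missing_params() {
    cmdline=" $1 "
    shift
    for p in "$@"; do
        case "$cmdline" in
            *" $p "*) ;;
            *) printf '%s\n' "$p" ;;
        esac
    done
}

# Sample command line; on a live system pass "$(cat /proc/cmdline)".
missing_params "ro quiet mce=ignore_ce" mce=ignore_ce intel_idle.max_cstate=0
# prints: intel_idle.max_cstate=0
```

On a server, lsmod | edac_blacklist_lines prints the lines to review before adding them to /etc/modprobe.d/blacklist.conf.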

 

 

Related documents

HP ProLiant BL680c G7 Server Series - SUSE Enterprise Linux 11 Service Pack 1: "mcelog" Shows Corrected MCE Errors
https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-c03418028
