Notes on handling PCIe Bus Errors on PVE

While checking the PVE logs, I found a large number of NVMe error messages.

cat /var/log/messages
Sep  3 06:26:04 pve kernel: [171088.231004] nvme 0000:02:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Sep  3 06:26:04 pve kernel: [171088.231009] nvme 0000:02:00.0:   device [126f:2263] error status/mask=00000001/0000e000
Sep  3 06:26:04 pve kernel: [171088.231014] nvme 0000:02:00.0:    [ 0] RxErr
Sep  3 17:45:37 pve kernel: [211863.130656] pcieport 0000:00:01.1: AER: Multiple Corrected error received: 0000:00:01.1
Sep  3 17:45:37 pve kernel: [211863.132722] pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Sep  3 17:45:37 pve kernel: [211863.132727] pcieport 0000:00:01.1:   device [8086:6f03] error status/mask=00001100/00002000
Sep  3 17:45:37 pve kernel: [211863.132732] pcieport 0000:00:01.1:    [ 8] Rollover
Sep  3 17:45:37 pve kernel: [211863.132735] pcieport 0000:00:01.1:    [12] Timeout
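
To get a feel for how often each device reports errors, the log can be summarized per device. The following is a minimal sketch, not part of the original troubleshooting: the function name count_aer_errors is hypothetical, and it assumes the log lines have the same layout as the excerpt above.

```shell
# Count "PCIe Bus Error" lines per reporting device and severity.
# Usage: count_aer_errors [logfile]   (defaults to /var/log/messages)
count_aer_errors() {
    logfile="${1:-/var/log/messages}"
    # Keep only the "<driver> <pci-address> <severity>" part of each line,
    # then count identical entries, most frequent first.
    grep 'PCIe Bus Error' "$logfile" \
        | sed -E 's/.*\] ([a-z]+ [0-9a-f:.]+): PCIe Bus Error: severity=([A-Za-z]+).*/\1 \2/' \
        | sort | uniq -c | sort -rn
}
```

Running count_aer_errors with no argument reads /var/log/messages; each output line is a count followed by the driver name, the PCI address, and the severity.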

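According to forum posts, this is a hardware-related problem: the suggested fixes are updating the BIOS, trying a different PCIe/M.2 slot, or running a newer kernel.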

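None of these options is feasible for me right now. In the end I resorted to a head-in-the-sand workaround: turning off AER (Advanced Error Reporting) so that the errors are no longer logged. This hides the problem but does not fix the underlying PCIe errors.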

Edit the GRUB configuration file:

nano /etc/default/grub

Change these two parameters:

GRUB_CMDLINE_LINUX_DEFAULT="quiet pci=nommconf"
GRUB_CMDLINE_LINUX="pcie_aspm=off"

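pcie_aspm=off disables ASPM (Active State Power Management), the PCIe link power-saving mode that can trigger these errors.
pci=nommconf disables MMCONFIG, the memory-mapped method of accessing PCI configuration space. MMCONFIG is a faster, more efficient way to read a PCIe device's configuration, but it can cause instability or incompatibility on some hardware. Without it, the kernel can no longer reach the extended configuration space where the AER registers live, which is why the errors stop being reported.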

Update the GRUB configuration, then reboot.

sudo update-grub
sudo reboot
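
After the reboot, it is worth confirming that both parameters actually reached the kernel, since a typo in /etc/default/grub is silently passed through. The following is a small sketch of such a check. The helper name has_kernel_param is hypothetical, and it assumes the system was booted through GRUB so that the parameters show up in /proc/cmdline.

```shell
# Return success if the kernel command line contains the given parameter.
# An optional second argument supplies the command line text (handy for testing);
# by default the running kernel's /proc/cmdline is used.
has_kernel_param() {
    param="$1"
    cmdline="${2:-$(cat /proc/cmdline 2>/dev/null)}"
    # Pad with spaces so only whole, space-separated parameters match.
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *) return 1 ;;
    esac
}

for p in pci=nommconf pcie_aspm=off; do
    if has_kernel_param "$p"; then
        echo "$p: active"
    else
        echo "$p: missing"
    fi
done
```

If either parameter is reported as missing, recheck the spelling in /etc/default/grub and run update-grub again. If both are active, the log can be rechecked after a day or two to see whether new PCIe Bus Error lines still appear.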