Finding ECC memory errors on HP servers

A small Perl utility to help you find failing memory in HP servers. It parses hpdiags output and reports the values of the ECC memory error counters in the SPD registers, as accumulated since the last boot. It reports errors even when memory prefailure notification (which would otherwise log these errors to the IML) is disabled in the BIOS. Note that a small number of corrected errors does not necessarily indicate a problem.

At a minimum it requires Perl and the XML::Simple module. By default it runs hpdiags and parses the output; to parse an existing hpdiags XML file instead, pass the filename with the '-f' option. The output, including any memory errors found, looks like this:

[root@hpserver ~]# /tmp/hpdiags_ramcheck
    Product Number : 555555-001      
    Serial Number  : USE1234567
    Model          : HP ProLiant DL385 G6
    ROM            : A22 02/09/2010
        (1) Corrected single bit error(s) on DIMM 1
            SPS-DIMM 4GB PC2-6400 SDRAM DDR2 RDIMM  (P/N 501111-001)
        (7) Uncorrectable multibit error(s) on DIMM 2
            SPS-DIMM 4GB PC2-6400 SDRAM DDR2 RDIMM  (P/N 501111-001)
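
Since a handful of corrected errors is not necessarily a problem, it can help to post-process the report and flag only the DIMMs that deserve attention. The following is a minimal Python sketch of that idea, not part of the utility itself. It assumes the report lines keep the format shown above; the function name summarize and the corrected_threshold cutoff of 10 are hypothetical choices made for illustration, not HP guidance.

```python
import re

# Matches report lines such as "(7) Uncorrectable multibit error(s) on DIMM 2".
# The pattern is derived only from the sample output shown above.
ERROR_LINE = re.compile(
    r"^\s*\((\d+)\)\s+(Corrected|Uncorrectable)\b.*\bon DIMM (\d+)\s*$"
)


def summarize(report, corrected_threshold=10):
    """Return {dimm_number: reason} for DIMMs that look worth investigating.

    corrected_threshold is an arbitrary illustrative cutoff (an assumption),
    below which corrected errors are ignored.
    """
    flagged = {}
    for line in report.splitlines():
        match = ERROR_LINE.match(line)
        if not match:
            continue  # header lines and part descriptions are skipped
        count = int(match.group(1))
        kind = match.group(2)
        dimm = int(match.group(3))
        if kind == "Uncorrectable" and count > 0:
            # Any uncorrectable error is flagged, overriding a corrected-error entry.
            flagged[dimm] = f"{count} uncorrectable error(s)"
        elif (kind == "Corrected" and count >= corrected_threshold
              and dimm not in flagged):
            flagged[dimm] = f"{count} corrected error(s)"
    return flagged


# Error lines copied from the sample report above.
SAMPLE = """\
        (1) Corrected single bit error(s) on DIMM 1
            SPS-DIMM 4GB PC2-6400 SDRAM DDR2 RDIMM  (P/N 501111-001)
        (7) Uncorrectable multibit error(s) on DIMM 2
            SPS-DIMM 4GB PC2-6400 SDRAM DDR2 RDIMM  (P/N 501111-001)
"""

if __name__ == "__main__":
    for dimm, reason in sorted(summarize(SAMPLE).items()):
        print(f"DIMM {dimm}: {reason}")
```

With the sample report, only DIMM 2 is flagged: the single corrected error on DIMM 1 falls below the illustrative threshold.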