10 Reasons I Hate Server Name Changes

There’s been a, um, heated discussion at my work about our current server naming standards. Personally I hate our naming scheme. However, a lot of people have become accustomed to it, and a lot of applications use the standards in their configurations, often to automate management. So when the idea of changing it came up again this week, the O.C.D. part of me was anxious to “fix” our naming. Then I realized just what that meant.

For those that can’t read my handwriting (with explanations):

#10. Time consuming to “fix” existing servers (must update files, DNS, monitoring, reboot, etc.)
#9. Nobody is ever happy (or rather, for every person you please you probably tick someone else off)
#8. This would be irrelevant if everyone used CNAMEs (’nuff said)
#7. Breaks bcfg setup (our config management system bases some configs on the hostname)
#6. Adds time to RH6 upgrade process (rather than an OS upgrade being transparent, now the owners need to update all their configs)
#5. Invariably will end up repeating this again later (this is the 3rd or 4th naming “standard”)
#4. Angers end-users (they need to update their configs, and notify everyone that depends on their apps – would be irrelevant if #8 didn’t apply)
#3. DCops must relabel everything (datacenter guys must label every server)
#2. Value? Makes us $0. Saves us $0.
*DRUMROLL*
#1. Must open 700 WOs for Windows to update DNS (my team does not have DNS rights, so we must open a request and coordinate each change).

If this doesn’t sound like a “make work” project, I don’t know what does.

Finding ECC memory errors on HP servers

A little Perl utility to help you find failing memory in HP servers. It parses hpdiags output to report the value of the ECC memory error counters in the SPD registers since the last boot. It will report errors even when memory prefailure notification (which would otherwise log these errors to the IML) is disabled in the BIOS. Note that a small number of corrected errors does not necessarily indicate a problem.

At a minimum it requires Perl and the XML::Simple module. It will run hpdiags and parse the output, though you can pass it an existing hpdiags XML filename instead with the ‘-f’ option. When it finds errors, the output looks like this:

[root@hpserver ~]# /tmp/hpdiags_ramcheck 
hpserver.domain.com:
    Product Number : 555555-001      
    Serial Number  : USE1234567
    Model          : HP ProLiant DL385 G6
    ROM            : A22 02/09/2010
        (1) Corrected single bit error(s) on DIMM 1
            SPS-DIMM 4GB PC2-6400 SDRAM DDR2 RDIMM  (P/N 501111-001)
        (7) Uncorrectable multibit error(s) on DIMM 2
            SPS-DIMM 4GB PC2-6400 SDRAM DDR2 RDIMM  (P/N 501111-001)
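
If you want to collect these reports from a lot of servers, the error lines are easy to pick out with a script. What follows is only a rough sketch in Python, not part of the utility itself: it assumes the report text always looks like the sample above, and the names `parse_report`, `needs_attention` and `CORRECTED_THRESHOLD` are made up for illustration. The threshold value in particular is an arbitrary assumption, there only to reflect the point that a few corrected errors are not necessarily a problem.

```python
import re

# Matches report lines such as "(7) Uncorrectable multibit error(s) on DIMM 2".
# The pattern is inferred from the sample output only; other report layouts
# are not handled.
ERROR_LINE = re.compile(
    r"^\s*\((?P<count>\d+)\)\s+(?P<kind>.+?)\s+error\(s\)\s+on\s+DIMM\s+(?P<dimm>\d+)\s*$"
)

# Hypothetical cut-off for corrected errors; pick a value that suits your site.
CORRECTED_THRESHOLD = 10


def parse_report(text):
    """Return one dict per error line found in the report text."""
    findings = []
    for line in text.splitlines():
        match = ERROR_LINE.match(line)
        if not match:
            continue  # hostname, hardware details, DIMM part descriptions
        kind = match.group("kind")
        findings.append({
            "dimm": int(match.group("dimm")),
            "count": int(match.group("count")),
            "kind": kind,
            "uncorrectable": kind.lower().startswith("uncorrectable"),
        })
    return findings


def needs_attention(finding, corrected_threshold=CORRECTED_THRESHOLD):
    """Flag any uncorrectable error, or corrected errors at/over the threshold."""
    if finding["uncorrectable"]:
        return finding["count"] > 0
    return finding["count"] >= corrected_threshold


# Sample text copied from the report shown above.
SAMPLE = """hpserver.domain.com:
    Product Number : 555555-001
    Serial Number  : USE1234567
    Model          : HP ProLiant DL385 G6
    ROM            : A22 02/09/2010
        (1) Corrected single bit error(s) on DIMM 1
            SPS-DIMM 4GB PC2-6400 SDRAM DDR2 RDIMM  (P/N 501111-001)
        (7) Uncorrectable multibit error(s) on DIMM 2
            SPS-DIMM 4GB PC2-6400 SDRAM DDR2 RDIMM  (P/N 501111-001)
"""

if __name__ == "__main__":
    for finding in parse_report(SAMPLE):
        status = "CHECK" if needs_attention(finding) else "ok"
        print(f"DIMM {finding['dimm']}: {finding['count']} x {finding['kind']} [{status}]")
```

With the sample above, the single corrected error on DIMM 1 falls under the assumed threshold, while the uncorrectable errors on DIMM 2 are flagged regardless of count.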