So, for the last three days I've been trying to make a script work right.
Some boring background: Our product is a small server chassis. It has smart management features. The power supplies, fans, ethernet switch(es), storage controller(s), server 'blades' and disk drives can all be monitored for operational state via a web page served by the chassis management module (CMM).
Each of the things I named there can be replaced, and all can be hot-swapped (with some preparation).
The magic thing: a functional configuration, which takes up about the same space as a large-ish tower configuration, can be put together for under 20K.
One of the cool tricks is that the server blades do not have storage on them. Instead, the storage subsystem "feeds" them with virtual disk drive(s). So if a blade needs to be replaced or upgraded, you don't have to re-install everything. BUT!! The blade's BMC (baseboard management controller) firmware and BIOS may need to be updated, or a special BIOS configuration may need to be set up for some specific applications. Not to mention that other devices sometimes need firmware upgrades to fix their problems.
This means we have to test the ability to perform those updates.
So, we have two ways to do this. We can test manually by using the GUI to flip back and forth between the "release" firmware and a special version that differs from it only by reporting a different version number. This is called "ping-pong testing." It's not terribly difficult, but initially it was very tiresome (lots of human interaction required). It's less so now, but still very time consuming: a reboot is usually fast when little has changed, but here EVERYTHING changes, so each reboot takes the maximum amount of time.
OR we can test via automation. Since the CMM keeps track of which versions and files it should use, we simply tell it to use different ones and then restart the firmware update management. That goes to all the systems, asks them what they have, and tells them whether they should set up for a firmware update. A human using the GUI would normally then have to put things into the state required for updating. A script can do those fiddly bits and pieces instead, using one of the command-line-accessible control interfaces rather than the GUI to direct the changes. By using the same interface to talk to the parts I need to update, my script does what a human would do at the right times, without having to have a person chained to the thing. (Basically I replicate what I do manually, but in code.)
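For the curious, one "flip" looks roughly like this in code. This is only a sketch to show the order of operations: the command strings and the `send` callable are made up for illustration and are not the real CMM interface.

```python
from typing import Callable, List


def flip_firmware(send: Callable[[str], str], bundle: str, blades: List[str]) -> List[str]:
    """Do one firmware flip and return the commands sent, in order.

    `send` stands in for whatever command-line control interface is used;
    every command string below is a hypothetical placeholder.
    """
    sent: List[str] = []

    def issue(cmd: str) -> None:
        sent.append(cmd)
        send(cmd)

    # 1. Tell the CMM which versions and files it should use.
    issue(f"set-firmware-bundle {bundle}")
    # 2. Restart the firmware update management so it re-checks every device.
    issue("restart-update-manager")
    # 3. Do what a human would do in the GUI: put each blade into the
    #    state required for updating.
    for blade in blades:
        issue(f"prepare-for-update {blade}")
    return sent


if __name__ == "__main__":
    # Dry run with a fake `send` that just acknowledges every command.
    for line in flip_firmware(lambda cmd: "ok", "test-build", ["blade1", "blade2"]):
        print(line)
```

Passing `send` in as a parameter means the sequence can be dry-run against a fake before it ever touches a real chassis.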
I'm attempting to test the BIOS update for a new, not yet released blade server. So I automated the BIOS update, something that's been overdue for a long time.
After I got a review of my first cut of this particular script, I modified it and tested the components. I also tested the wrapper that makes things happen more than once. The wrapper worked. The doPing and doPong operations worked. The part that waits for all the servers to do their thing also worked (and was tested with the doPing/doPong script repeatedly).
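The shape of those pieces is roughly as follows. Again, this is a sketch and not my actual script: `do_ping`, `do_pong` and `all_updated` are stand-ins passed in by the caller, and the timeout and polling values are arbitrary assumptions.

```python
import time
from typing import Callable


def wait_until(done: Callable[[], bool], timeout_s: float, poll_s: float = 1.0,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `done` until it returns True or `timeout_s` seconds have passed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if done():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_s)


def ping_pong(do_ping: Callable[[], None], do_pong: Callable[[], None],
              all_updated: Callable[[], bool], cycles: int,
              timeout_s: float = 1800.0, poll_s: float = 1.0,
              sleep: Callable[[float], None] = time.sleep) -> int:
    """Run up to `cycles` ping/pong rounds; return how many fully completed.

    Stops at the first step where the servers do not finish in time, so the
    chassis is left in the failed state for a human to look at.
    """
    for cycle in range(cycles):
        for step in (do_ping, do_pong):
            step()  # kick off the flip to the other firmware version
            if not wait_until(all_updated, timeout_s, poll_s, sleep=sleep):
                return cycle
    return cycles
```

Returning the count of completed cycles, rather than raising, makes an overnight run easy to summarize the next morning.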
And once I had the kinks out, I ran it over again several times, just for grins and because I wanted to prove that the script worked repeatedly. It did.
So then I sent out the email, "Hey, it looks like this is working and the new BIOS seems to have fixed that one bug we were seeing so I will run the test until tomorrow so we can see how stable it REALLY is."
And then I started the test running.
It died. Somewhere in the first or second update, something failed and it did NOT complete what it was supposed to do. And it died in a way that looks remarkably unlike anything my script could have caused.
I should never have broken the First Rule of Frizbee.