Steve Hutchison (foomf) wrote,

Foomf and the no-good, very bad, awful day.


We have this fairly important release of firmware/software that is supposed to be going out to the factory for our product. AND I diligently tested it last week, even crocked my sleep schedule to do so. Friday, we decided that This Would Be The Good One, so I created a Twig off the most recently released Branch of our code, changed ONLY the minimum things that had to be changed, checked in the changes in the Twig... and built it.

And behold, it finished and it was, to all appearances, worthy of testing.
So over the weekend we tested it: I installed it on my test box, and other testers installed it on theirs.
And it installed, and it booted, and the firmware was where it should be.

So Monday, it is Decided, they verify everything, I vouch for the testing, it is Blessed and it Goes Forth and is put into the parts database and placed on FTP sites and all is well.

And having been given a special new bit of yet another version of this firmware which would be the Grand Unified Finally Long Overdue Working Bit, I use my special, carefully developed transplant scripts to cut the old firmware out of a copy of this Blessed Silver Release to the Factory. And I transplant in the new bits, and put it where people can test it, and broadcast to the small group of Smoke Testers that there is a new Smoke Test build of the upcoming thing. Then I begin testing the Smoke Test build, running an automated test.

And this morning, when I arrive at work, there are frantic emails. "It doesn't work! The firmware isn't there! It isn't the right one!"

And I get the same frantic message for the Smoke Test as for the Silver Thing.

I carefully copy the installation zip file to a clean and safe place. I command it to unzip into a directory. In the directory is a carefully encrypted package. I decrypt the package and it becomes another zip file. I command it to unzip into a directory. I look in the directory and behold, the firmware folder is there, and the folder for the device which is receiving new firmware is there, and it is empty, flaccid, waste, as though pick-pocketed. As I continue to investigate, the emails continue flying like a flock of bat-sized mosquitoes, searching for blood.

Horror as a thousand goat-young nickering in the dark corners! I go back to the automated email sent forth by the build system and open the attached log, unzip it, open it and search it for the name of the device, and skip over the multiple places where it exists without being what I want, and there... is the line where the "make" call directed that the file be copied from point A to point B.

And there is an error message. "Cannot find file ___; file not copied." exit 1.
But then, the make command just blithely ... keeps going to the next one. No error propagating out to the world, no failure to build. Like the child who grew to completeness in the womb except for the absence of a liver, it could never have thrived. Yet it did, in our tests.
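For the curious, here is roughly what I wish that copy step had done. This is a minimal sketch in Python, not our actual build system: the names `copy_required` and `run_copy_steps` are made up for illustration, and the real thing was a "make" rule. The point is only that a missing source file should stop the whole run and hand back a nonzero exit code, rather than being logged and forgotten.

```python
import shutil
import sys
from pathlib import Path


def copy_required(src, dst):
    """Copy src to dst, raising if the source file does not exist."""
    src, dst = Path(src), Path(dst)
    if not src.is_file():
        # Same complaint the log had, but as an exception the caller cannot ignore.
        raise FileNotFoundError(f"Cannot find file {src}; file not copied.")
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)
    return dst


def run_copy_steps(steps):
    """Run (src, dst) copy steps in order; stop at the first failure.

    Returns 0 if every copy succeeded, 1 otherwise, so the result can be
    passed straight to sys.exit() and fail the build.
    """
    for src, dst in steps:
        try:
            copy_required(src, dst)
        except FileNotFoundError as err:
            print(err, file=sys.stderr)
            return 1  # do NOT carry on to the next step
    return 0
```

A build script would end with something like `sys.exit(run_copy_steps(steps))`, so the failure reaches whoever is reading the build status and not just whoever digs through the log.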

Because neither I nor the other testers had thought to remove from our test machines the copies of the firmware which we had been testing, when the install file installed, there WAS a copy of the file (left behind in the flash memory disk drive). So its absence went ... undetected.

I looked back a bit further, and noted that when the firmware image was checked into the source control system it ... just skipped doing that one bit.

AHA! I send a note to all and sundry explaining: there was an undetected error in the build; the tests did not find it because the build was installed on test systems which had already been testing that firmware, and our system doesn't remove things which are expected to be there, it only installs them fresh; so ... we didn't catch it, and I would FIX it immediately.
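The lesson, in code form: check the package itself, not a machine that might still be carrying leftovers. Below is a minimal sketch, assuming the innermost archive is a plain zip that has already been decrypted and that we have a list of files that must be in it. The function name `missing_from_package` and the paths in the usage note are hypothetical, not our real layout.

```python
import zipfile


def missing_from_package(zip_path, required):
    """Return the required file names that are absent from the zip archive.

    Directory entries (names ending in "/") do not count as files, so an
    empty folder for a device is reported as a missing image.
    """
    with zipfile.ZipFile(zip_path) as archive:
        present = {name for name in archive.namelist() if not name.endswith("/")}
    return [name for name in required if name not in present]
```

Called as `missing_from_package("inner.zip", ["firmware/device/image.bin"])`, it returns a list of whatever is not there. Anything other than an empty list means the build is not fit to be Blessed, whatever the test boxes say.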

So, while fielding all sorts of frantic emails describing the steps everyone is taking to pull BACK the broken version, I (1) forced it to check in the firmware, (2) re-enabled building the branch, (3) built it, (4) re-disabled building the branch while I (5) uploaded it to my test system, (6) deleted the file from the flash drive that was supposed to be installed when booting this thing but which had remained, sneakily, between updates, and (7) rebooted. Then while it rebooted, I (8) took apart a copy in the test context, just as I had earlier to find the unexpected, stealthy absence of the file, and verified: YES! it was there.

And as I looked at my freshly booted system it was... Not there? INCONCEIVABLE! But this time, it was in its expected place in the firmware tree with all the other firmware and drivers for the other devices. So why was it not in the special place, the only place this particular device could find it?
And I remembered, oh yes, the MANIFEST! You see, we have a Manifest that tells us what our firmware is, and if the version of the manifest in the flash drive is the same as the version we just unpacked while booting? We know we don't have to unpack everything because we've already done this. It saves as much as 15 minutes when booting (because large files take a long time to install in a flash drive.)
So I remove the manifest in the flash drive and reboot again... AND SUCCESS! It works, it installs this particular bit of firmware image in the place(s) it needs to be, and all is ... ready to re-test.
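The manifest check boils down to something like the sketch below. I am assuming here that the manifest is a small text file holding a version string; the real format, the file locations, and the names `read_version` and `needs_unpack` are all stand-ins for illustration.

```python
from pathlib import Path


def read_version(manifest):
    """Return the version text from a manifest file, or None if it is missing."""
    manifest = Path(manifest)
    if not manifest.is_file():
        return None
    return manifest.read_text().strip()


def needs_unpack(flash_manifest, package_manifest):
    """Decide whether the boot-time install should unpack everything.

    Unpack when there is no manifest on the flash drive, or when its
    version differs from the one in the package being booted.
    """
    installed = read_version(flash_manifest)
    incoming = read_version(package_manifest)
    return installed is None or installed != incoming
```

Which is exactly the trap: the rebuilt package carried the same manifest version as the broken one, so a check like this says "nothing to do" until the manifest on the flash drive is deleted. A check that also confirmed the listed files were actually present would cost some boot time but would have caught it.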

What, you think I am NOT going to make sure this works in the system? Yeah, we made sure it worked Monday and this is _in theory_ identical bits, but after the hole going undetected, we're just a bit paranoid.

I do some more testing, ensuring that the default minimum functions do in fact function.
I send the emails, "It's ready to test, I did my testing, oh, and if you installed the broken-silver version, you will have to delete the manifest or it won't work right, because the manifest has not changed."

And while I continue to test, another round of FRANTIC emails arrives: "Do we have to tell ____ (independent testing company) to do this? They'll just try to send it back to the factory for them to do it!"

Much kerfuffling proceedeth. Manager declares, "No. They do not get to send it back to the factory, as they installed the broken bits, nor is it necessary for them to delete the file, simply to roll back to the previous version (using the rollback feature which we provided in this product about a year ago) AND once they have gotten the bad bits off, they can load the good ones."

Meanwhile, there are Little Mysterious Weird Things showing up, and each of them requires the test guys to investigate the cause, and determine that yes, these are the same MWT's that were found in the previous old-firmware build that was released for a slightly different piece of hardware.

And I am rebuilding the smoke test of the Next Big Thing using the New Silver, having blessed it Passed Through Smoke... Seems to be a LOT of smoke around ... and then I finish, and try to install it on my test system so we can find out if the Grand, Unified, Re-Merged, All One Thing Again firmware that's supposed to support the Old hardware and the NEW hardware, actually does. And of course it fails and I notice that it's actually running the old, broken build, so I replace it yet again with the new thing and it ... finally comes up. And I start the tests running, and escape to home at last.

This is not to mention the four OTHER crises that came up during that time, though none of those were my fault (fortunately, or I would have exploded like an over-ripe watermelon).

