Foomf and the no-good, very bad, awful day.


So.

We have this fairly important release of firmware/software that is supposed to be going out to the factory for our product. AND I diligently tested it last week, even crocked my sleep schedule to do so. Friday, we decided that This Would Be The Good One, so I created a Twig off the most recently released Branch of our code, changed ONLY the minimum things that had to be changed, checked in the changes in the Twig... and built it.

And behold, it finished and it was, to all appearances, worthy of testing.
So over the weekend we tested it: I installed it on my test box, and other testers installed it on theirs.
And it installed, and it booted, and the firmware was where it should be.

So Monday, it is Decided, they verify everything, I vouch for the testing, it is Blessed and it Goes Forth and is put into the parts database and placed on FTP sites and all is well.

And having been given a special new bit of yet another version of this firmware which would be the Grand Unified Finally Long Overdue Working Bit, I use my special, carefully developed transplant scripts to cut the old firmware out of a copy of this Blessed Silver Release to the Factory. And I transplant in the new bits, and put it where people can test it, and broadcast to the small group of Smoke Testers that there is a new Smoke Test build of the upcoming thing. Then I begin testing the Smoke Test build, running an automated test.

And this morning, when I arrive at work, there are frantic emails. "It doesn't work! The firmware isn't there! It isn't the right one!"

And I get the same frantic message for the Smoke Test as for the Silver Thing.

I carefully copy the installation zip file to a clean and safe place. I command it to unzip into a directory. In the directory is a carefully encrypted package. I decrypt the package and it becomes another zip file. I command it to unzip into a directory. I look in the directory and behold, the firmware folder is there, and the folder for the device which is receiving new firmware is there, and it is empty, flaccid, waste, as though pick-pocketed. As I continue to investigate, the emails continue flying like a flock of bat-sized mosquitoes, searching for blood.

Horror as a thousand goat-young nickering in the dark corners! I go back to the automated email sent forth by the build system and open the attached log, unzip it, open it and search it for the name of the device, and skip over the multiple places where it exists without being what I want, and there... is the line where the "make" call directed that the file be copied from point A to point B.

And there is an error message. "Cannot find file ___; file not copied." exit 1.
But then, the make command just blithely ... keeps going to the next one. No error propagating out to the world, no failure to build. Like the child who grew to completeness in the womb except for the absence of a liver, it could never have thrived. Yet it did, in our tests.
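For the curious, here is a minimal sketch of what I wish that copy step had done: stop dead at the first missing file instead of shrugging and moving on. It is Python rather than our actual makefiles, and the names `stage_files` and `BuildStepError` are made up for illustration; nothing here is taken from our real build system.

```python
import shutil
from pathlib import Path


class BuildStepError(Exception):
    """Raised when a required build artifact cannot be staged (hypothetical name)."""


def stage_files(pairs):
    """Copy each (source, destination) pair, stopping at the first missing source.

    Returns the list of destination paths that were written.
    """
    staged = []
    for src, dst in pairs:
        src, dst = Path(src), Path(dst)
        if not src.is_file():
            # Fail loudly so the whole build fails, rather than logging and continuing.
            raise BuildStepError(f"Cannot find file {src}; file not copied")
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        staged.append(dst)
    return staged
```

The point is only that the failure escapes the copy step; whatever calls it can then refuse to produce an install package at all.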

Because neither I nor the other testers thought to remove from our test machines the copies of the firmware which we had been testing, when the install file installed, there WAS a copy of the file (left behind in the flash memory disk drive). So its absence went ... undetected.

I looked back a bit further, and noted that when the firmware image was checked into the source control system it ... just skipped doing that one bit.

AHA! I send a note to all and sundry explaining: there was an undetected error in the build; the tests did not find it because the build was installed on test systems which still held the firmware from earlier testing, and our installer doesn't remove things which are expected to be there only to install them fresh. So ... we didn't catch it, and I would FIX it immediately.
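The manual check I did by hand (unpack everything, look for a folder that should have something in it and doesn't) is the kind of thing a script could do before anyone installs anything. A rough sketch in Python, assuming the package has already been decrypted and unzipped into a directory; the function name `empty_dirs` is my own invention, not part of our tooling.

```python
from pathlib import Path


def empty_dirs(root):
    """Return every directory under root that contains no files at any depth."""
    root = Path(root)
    return sorted(
        d
        for d in root.rglob("*")
        if d.is_dir() and not any(p.is_file() for p in d.rglob("*"))
    )
```

If this returns anything for the firmware tree, the package is suspect no matter what the test boxes say, since it looks only at the package and not at whatever is already sitting in flash.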

So, while fielding all sorts of frantic emails describing the steps everyone is taking to pull BACK the broken version, I (1) forced it to check in the firmware, (2) re-enabled building the branch, (3) built it, (4) re-disabled building the branch while I (5) uploaded it to my test system, (6) deleted the file from the flash drive that was supposed to be installed when booting this thing but which had remained, sneakily, between updates, and (7) rebooted. Then while it rebooted, I (8) took apart a copy in the test context, just as I had earlier to find the unexpected, stealthy absence of the file, and verified: YES! it was there.

And as I looked at my freshly booted system it was... Not there? INCONCEIVABLE! But this time, it was in its expected place in the firmware tree with all the other firmware and drivers for the other devices. So why was it not in the special place, the only place this particular device could find it?
And I remembered, oh yes, the MANIFEST! You see, we have a Manifest that tells us what our firmware is, and if the version of the manifest in the flash drive is the same as the version we just unpacked while booting? We know we don't have to unpack everything because we've already done this. It saves as much as 15 minutes when booting (because large files take a long time to install in a flash drive.)
So I remove the manifest in the flash drive and reboot again... AND SUCCESS! It works, it installs this particular bit of firmware image in the place(s) it needs to be, and all is ... ready to re-test.
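The Manifest shortcut is worth spelling out, since it bit me here. As described, it skips unpacking when the versions match. A slightly more suspicious version would also confirm that the files are actually where they belong. This is a sketch only: I am assuming a JSON manifest with a `version` field and a `files` list of relative paths, which is NOT our real manifest format, and `needs_unpack` is a hypothetical name.

```python
import json
from pathlib import Path


def needs_unpack(installed_manifest, package_manifest, install_root):
    """Decide whether firmware must be unpacked at boot.

    Unpack if there is no installed manifest, if the versions differ,
    or if any file the package manifest lists is missing on disk.
    """
    installed_manifest = Path(installed_manifest)
    if not installed_manifest.is_file():
        return True
    installed = json.loads(installed_manifest.read_text())
    package = json.loads(Path(package_manifest).read_text())
    if installed.get("version") != package.get("version"):
        return True
    root = Path(install_root)
    # Same version is not enough: a file may have been deleted since the last unpack.
    return any(not (root / name).is_file() for name in package.get("files", []))
```

Checking that files exist costs far less than the 15 minutes of reinstalling them, though it would not catch a file that is present but stale; that would need checksums in the manifest.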

What, you think I am NOT going to make sure this works in the system? Yeah, we made sure it worked Monday and this is _in theory_ identical bits, but after the hole going undetected, we're just a bit paranoid.

I do some more testing, ensuring that the default minimum functions do in fact function.
I send the emails, "It's ready to test, I did my testing, oh, and if you installed the broken-silver version, you will have to delete the manifest or it won't work right, because the manifest has not changed."

And while I continue to test, another round of FRANTIC emails: "Do we have to tell ____ (independent testing company) to do this? They'll just try to send it back to the factory for them to do it!"

Much kerfuffling proceedeth. Manager declares, "No. They do not get to send it back to the factory, as they installed the broken bits, nor is it necessary for them to delete the file. They simply roll back to the previous version (using the rollback feature which we provided in this product about a year ago) AND once they have gotten the bad bits off, they can load the good ones."

Meanwhile, there are Little Mysterious Weird Things showing up, and each of them requires the test guys to investigate the cause, and determine that yes, these are the same MWTs that were found in the previous old-firmware build that was released for a slightly different piece of hardware.

And I am rebuilding the smoke test of the Next Big Thing using the New Silver, having blessed it Passed Through Smoke... Seems to be a LOT of smoke around ... and then I finish, and try to install it on my test system so we can find out if the Grand, Unified, Re-Merged, All One Thing Again firmware that's supposed to support the Old hardware and the NEW hardware, actually does. And of course it fails and I notice that it's actually running the old, broken build, so I replace it yet again with the new thing and it ... finally comes up. And I start the tests running, and escape to home at last.

This is not to mention the four OTHER crises that came up during that time, though none of those were my fault (fortunately, or I would have exploded like an over-ripe watermelon).

Comments

( 8 comments — Leave a comment )
tagryn
Jun. 18th, 2008 11:51 am (UTC)
"INCONCEIVABLE!"

That word, I do not think it means what you think it means. 8)

Ah, the joys of testing, yes?
foomf
Jun. 19th, 2008 12:16 am (UTC)
Precisely the usage I was intending too.
invisiblewolf
Jun. 18th, 2008 01:34 pm (UTC)
Ah, panic. We've had many, MANY of those with our software builds. For a while, we were getting different results with what should have theoretically been the same files, just re-compiled and re-built. At the time, our safety lead was saying (sarcastically) that the lack of a deterministic build process just might keep first flight from happening. But we've supposedly fixed all the errors now.

I'm not stepping foot onto a 787 until the year 2013 or thereabouts.

-Spiritwolf.
erikred
Jun. 18th, 2008 03:44 pm (UTC)
Sending you smooth-sailing thoughts.

And please keep your over-ripe watermelon inside the vehicle at all times.
(Deleted comment)
drath
Jun. 20th, 2008 08:24 pm (UTC)
Deleted and reposted for clarity...
That is exactly why I don't have the stones to do anything like that for a living.

I remember in a college data structures class, the prof had us pull together some simple procedures to make a working program. [EDIT] The prof had prepared all of these procedures on the office computer, and we were to tie it all together by loading a saved copy of the source code onto our computers in the lab and creating the main body of the program. [/edit]

This was the final project of the term, by the way.

For reasons that were never, ever clear to me, something happened on the lab computers that should not have been possible. The environment was Borland Turbo Pascal (stop laughing, it was an introductory class), and in the process of constructing a linked list, I ran a step and trace that revealed that the lab computers weren't handling memory allocation right... specifically, declaring a new node in the linked list should have pointed to an empty cell, but the information within somehow corresponded to an existing variable declared elsewhere. This bug was not reproducible on the office computer, but happened on every lab computer, even though they appeared to be identical makes.

Around that point, something in me said "Maybe I can make a living writing SQL code..."
foomf
Jun. 21st, 2008 01:29 am (UTC)
Re: Deleted and reposted for clarity...
Sounds like a configuration problem to me. Identical "make" sure, but different version of OS, and/or libraries, and/or floating point hardware, etc. ad crashium.
drath
Jun. 21st, 2008 03:07 pm (UTC)
Re: Deleted and reposted for clarity...
I hear that if I get a Mac, everything "just works" for some reason.

Should it be easy or difficult to say that with a straight face?
drath
Jun. 21st, 2008 03:13 pm (UTC)
Although...
The thing that still stumps me is, I don't recall being given pre-compiled code. Common sense says my memory must be fuzzy and that I must be wrong about this, but I really really thought all we were working with was a text file of Pascal source code, to which we just added a main program body loop. Perhaps my recall just don't work so great. Although when I pointed it out, the instructor was able to fix it. Hmm.