Friday, April 01, 2005

paradise found

well, not quite.

now we are faced with the aftermath of the company's black thursday, when the main server got zapped, and good, by the untested interactions between meralco, an uninterruptible power supply, and the building's generator.

turns out that the systems had been tested in the following fashion: interrupt meralco and the ups keeps power; interrupt meralco and the generator activates. they had never been tried with all three in one test.

after some investigation, apparently in the haste to finish the building, the generator had been wired to the ups with three identical black wires, so there was an issue of polarity (no at-a-glance way to tell which wire had to go where). so what actually happened was that meralco power went out, the ups kept power on to the computers, the generator came on after that -- and sucked all the power out of the ups' batteries.

...and the render farm was writing data to the main server's hard disk array.

one massive ouch.

four months of work, locked in the server, inaccessible. but we're fine, we've made backups, right?

not quite (again). i'd made a preliminary backup of our working data -- and filled up a hard disk array on another server doing it. since we hadn't been able to get another server/storage solution, a second backup was made, this time by my technical co-worker, but only of scenes that had been finalized. in this way, we had backed up the entire project's most important data, or so we felt.

then that dark thursday rolls around. so we have two backups, yes? my backup covers only the first 7 sequences; the scene backup covers the whole project. we ought to be fine.

not.

turns out that the server where we had stored the scene backup had been offline since monday of that week, ostensibly to add more hard disks to its array. also turns out that the idle hands that had taken the server offline had not copied the data to some other location before adding the disks (and there was another location, with space to spare). so the impossible had happened: we lost 60% of the project scene backups.

an apple singapore troubleshooter comes in, and gives us no new hope. the system is well and truly screwed.

it is quite bleak indeed.

then, during one of the meeting-filled days that followed, where we were discussing a more redundant (hence safer) means of data management, something occurred to me: we actually had a third, unlooked-for backup. it was a natural offshoot of our 'localization' render process, where the scenes and images a render job needs are copied locally to the render machines to eliminate network-incurred penalties.
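(for the curious, harvesting those localized copies boils down to something like the sketch below. it is a rough illustration, not our actual script: the node names, the cache path, and the recovery destination are all hypothetical stand-ins for whatever a given farm actually uses.)

```python
#!/usr/bin/env python3
# sketch only: pull 'localized' scene copies off each render node's cache
# into one recovery area. all names and paths below are hypothetical.
import shutil
from pathlib import Path

NODES = [f"render{n:02d}" for n in range(1, 21)]   # hypothetical node names
CACHE = "/net/{node}/render_cache"                 # assumed localization cache path
DEST = Path("/recovery/harvest")                   # scratch space for recovered copies

for node in NODES:
    cache = Path(CACHE.format(node=node))
    if not cache.is_dir():
        continue  # node is down, or never localized anything
    for src in cache.rglob("*"):
        if not src.is_file():
            continue
        # mirror each node's cache under its own subdirectory
        dst = DEST / node / src.relative_to(cache)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 keeps mtimes, useful for picking newest versions later
```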

there was a very real possibility that in spite of some blundering hands and limited initial backup space, the great majority of the shots that had been finalized and rendered were sitting intact on those machines and could be retrieved. i took the matter up first with my technical co-worker, and he was enthused by the prospect.

later, i told the boss that i might have found a way to save the project from almost total re-working, and he was really pleased. he made a joke to the effect that if what i said was true, he'd treat me to a haircut. except that i'm bald... so i said that i'd really rather have a car...
...he thanked me for my honesty in that regard (after the thanks for saving the project), and said that that could be arranged. we'll see...

so the upshot is that prior to march 17, the project had made it to just over 95% completion -- and as of last week, after clogging the network with gigabytes of data pulled back from all the render machines, and sifting through all those gigabytes, we had reconstructed something very close to that 95% of the project data.
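(the 'sifting' was mostly collapsing duplicates: the same scene could exist on several nodes, so for each relative path you keep the newest copy. a sketch, under the same hypothetical layout as above, and assuming file modification times are a good enough tiebreaker:)

```python
#!/usr/bin/env python3
# sketch: collapse the per-node harvests into one merged tree, keeping the
# newest copy of each file. assumes the /recovery/harvest/<node>/... layout
# from the harvesting sketch above.
import shutil
from pathlib import Path

HARVEST = Path("/recovery/harvest")
MERGED = Path("/recovery/merged")

newest = {}  # relative path -> (mtime, source file)
for node_dir in HARVEST.iterdir():
    if not node_dir.is_dir():
        continue
    for src in node_dir.rglob("*"):
        if not src.is_file():
            continue
        rel = src.relative_to(node_dir)
        mtime = src.stat().st_mtime
        # keep whichever node holds the most recently written copy
        if rel not in newest or mtime > newest[rel][0]:
            newest[rel] = (mtime, src)

for rel, (_, src) in newest.items():
    dst = MERGED / rel
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)

print(f"merged {len(newest)} unique files")
```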

a good thing, to be sure.

now comes the rendering. or re-rendering, in this case. we have the project data back, but we lost all the rendered image files.

still and all, much better than zero.

from a state of near-utter disaster, this is some form of paradise indeed.
