Sunday, May 22, 2005

the way of pain

we've had problems before. to recap briefly, our main server got zapped but good by an unforeseen interaction of meralco power, an uninterruptible power supply, and the building's generator. good thing we had backups. bad thing we had a gremlin in the studio. impossible man. whenever he says that something is impossible, it's a good bet that he's done something. in that case, it was to add new hard disks to the backup server's raid box without first backing up the data to a location with space enough for the task. end result: 60% of the backup lost.

fast forward. almost two months to the day since the main server went down. came to work in a star wars frame of mind ("i have a bad feeling..." etc.). officemate says it was the result of a weekend spent watching the extended version of "return of the king" and then shooting the breeze with the barkada (the usual circle of friends) during an impromptu reunion of sorts. however that works, the foreboding was soon to be proven somewhat prescient.

11:30, main server begins to fail.

no power outage. remote access reveals that the raid is not visible to the server. first line of action: reboot. server comes back online -- still nothing. time for the good ol' physical inspection. in the cold room's antechamber (i suppose it could be called that), i find troubleshooter and his colleagues -- working on a different problem. he was unaware of the main server's behavior, impossible man still not having informed him. i turn around and notice that the main raid box has all its drive access lights steadily lit. that can't be good.

at that time, impossible man shows up. he himself is surprised at the news. i opine that this does not bode well for the future of that particular server architecture if it just ups and flakes out on a whim. my suggestion is to shut down the server and raid for a while (over the lunch break), give them a rest, and then see what happens when they're powered on again.

back to my floor, and thence to a mac i use for remote checking on the servers. i launch a raid administration utility to see which of the four raids was powered off.

three were on, one was off. the name of the one that was off gave me chills. i recalled that earlier that morning, impossible man had said he was going to format the raid of the machine whose name this powered-off raid bore. back to the cold room. oh no.

backtrack: the mac server/storage solution comprises two parts, a "head" or server, and a raid box that is the storage. the raid box is connected to the head by a high-speed link; but it is a separate computer in its own right, remotely controllable and accessible even from machines that are not its primary "head". also: four "heads" and four raid boxes stacked alternately starting at the bottom of a rack, primary server/raid combination at the bottom.
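(an aside in code: a minimal sketch of what that separateness means in practice, assuming made-up hostnames and plain icmp ping rather than any of apple's own admin tools -- each raid controller answers on the network in its own right, head or no head.)

import subprocess

RACK = [                      # (head, raid controller), bottom of the rack first
    ("head-01", "raid-01"),   # primary server/raid combination
    ("head-02", "raid-02"),
    ("head-03", "raid-03"),
    ("head-04", "raid-04"),
]

def is_up(host):
    # one ping, discard the output, keep only the verdict
    result = subprocess.run(["ping", "-c", "1", host],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0

for head, raid in RACK:
    print(head, "up" if is_up(head) else "DOWN", "|",
          raid, "up" if is_up(raid) else "DOWN")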

the bottom-most raid was powered down.

connect the dots. since the main server's raid bore the name of the machine to be formatted, this was the likeliest combination to spell disaster. i raised this likelihood with the production manager, who spoke to impossible man, who (naturally) uttered the word "impossible" and even tried to deflect any blame my way. at this point i backed off. i made the observation that, yes, the machine could indeed have crashed of its own volition. we would just have to see if the machine came back after being powered on. then we would know either way whether it was a system stability issue on the server/raid side, or idle hands at work.

...this "my word vs. his" is getting to be tiresome.

after lunch. raid powered on. i notice that the names of the raids have been changed, as reflected in the raid admin utility. how decent, and after the fact as usual. ok, let's go have a look-see. click on the now-renamed main server raid, check on disk status.

initializing 5%.

he had actually formatted it. honest mistake, or whatever. he had formatted our main server's raid. 4 terabytes of hard work: restoring from an unlooked-for backup on the render farm; checking and verifying database integrity; a month of full-on rendering. poof.

again, luckily i had a full backup (minus that week) on another server. but still...

strike two...

...third time's the charm.

ugh. methinks i shall not dwell on that possibility.

Friday, May 13, 2005

it's the destination...

...not the journey.

admittedly a twist on that old saw.

however, in this instance, it does hold true. the project is now finished (indeed, it finished at the end of last month). in relation to a previous post (april 1st), where i noted that we had two months to get to the finish line: that turned into one month after all the post-black thursday negotiations were done.

as fate would have it, i still had a spreadsheet from the very beginning of this venture that allowed some sort of projection of hardware requirements based on total frame count x projected layer count x reasonable render time per frame.
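(the arithmetic, as a sketch - every number below is a placeholder, not the project's actual figures, and it assumes the farm renders around the clock at full utilization.)

import math

total_frames = 125000          # total frame count
layers_per_frame = 4           # projected layer count
render_minutes_per_layer = 7   # reasonable render time per frame, per layer
wall_clock_days = 60           # time to the finish line

total_render_minutes = (total_frames * layers_per_frame
                        * render_minutes_per_layer)
minutes_per_machine = wall_clock_days * 24 * 60   # 24/7 rendering assumed

machines_needed = math.ceil(total_render_minutes / minutes_per_machine)
print("machines needed:", machines_needed)        # ~41 with these numbers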

initially, this spreadsheet had resulted in a farm numbering in the low 40's (not 42 - of course that would have been neat, in hindsight, in a hitch-hiker's guide to the galaxy sort of way).

we initially ended up with 25, of which one died (strangely enough, unit 13), and one was taken away to be a combination license and job server for our render management software. so that makes 23. and that number stayed until march the 17th.

two months shrank into one, and the spreadsheet came up with an additional 40 dual-processor machines to add to the current farm to make the deadline. now, during the course of the project, we had been sporadically testing sample render units from various vendors, and we had settled on either an intel or an amd solution.

we'd also tested the dual g5 solution from apple and found it a third slower, so i didn't really consider it a contender.

catch being, given the timeframe (two weeks to obtain, test, and set up), only apple's asia retailer had the numbers we needed readily available... ...a point perhaps worthy of some thought.

so we are now the largest installation of dual g5 rack-mount computers outside of biological research facilities.

think projected addition plus a third more, to make up for the per-unit shortfall.
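(spelled out, reading "a third slower" as each frame taking roughly a third longer on a g5 - the machine count then scales by the same factor to hold throughput steady.)

import math

projected_addition = 40      # dual-processor machines the spreadsheet asked for
slowdown = 1 / 3             # each frame takes ~4/3 as long on the g5

g5_units = math.ceil(projected_addition * (1 + slowdown))
print(g5_units)              # 54 - the "plus a third more"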

after all is said and done, we finished, we made it. how we got there - the hardware story, at least - is above.

the human angle, perhaps others can inscribe.

have to admit that i'm not really looking forward to administering such a host of hostile machines (a triumph of form over function in almost all its respects) - but it's a living, after all. a challenge. a potential heart attack.

(",)

wonder what other turns our hardware future may take.