we've had problems before.  to recap briefly, our main server got zapped and good by an unforeseen interaction of meralco power, an uninterruptible power supply, and the building's generator.  good thing we had backups.  bad thing we had a gremlin in the studio.  impossible man.  whenever he says that something is impossible, it's a good bet that he's done something.  in that case, it was to add new hard disks to the backup server's raid box without first backing up the data to a location that had space enough for the task.  end result: 60% of backup lost.
fast forward.  almost two months to the day since the main server went down.  came to work in a star wars frame of mind ("i have a bad feeling..." etc).  officemate says that it was the result of a weekend spent watching the extended version of "return of the king" and then shooting the breeze with barkada during an impromptu reunion of sorts.  however that works, the foreboding was soon to be proven somewhat prescient.
11:30, main server begins to fail.
no power outage.  remote access reveals that raid is not visible to server.  first line of action: reboot.  server comes back online -- still nothing.  time for the good ol' physical inspection.  in the cold room's antechamber (i suppose it could be called), i find troubleshooter and his colleagues -- working on a different problem.  he was unaware of the main server's behavior, impossible man still not having informed him.  i turn around and i notice that the main raid box has all its drive access lights steadily lit.  that can't be good.
at that time, impossible man shows up.  he himself is surprised at the news.  i opine that this does not bode well for the future of that particular server architecture if it just ups and flakes out on whim.  my suggestion is to shut down the server and raid for a while (over lunch break), give it a rest, and then see what happens when it's powered on again.
back to my floor, and thence to a mac i use for remote checking on the servers.  i launch a raid administration utility to see which of the four raids was powered off.
three were on, one was off.  the name of the one that was off gave me chills.  i recalled that earlier that morning, impossible man was going to format the raid of the machine whose name the raid bore.  back to the cold room.  oh no.
backtrack: the mac server/storage solution comprises two parts, a "head" or server, and a raid box that is the storage.  the raid box is connected to its head by high-speed link, but is a separate computer in its own right, and remotely controllable and accessible even from machines that are not its primary "head".  also: four "heads" and four raid boxes stacked alternately starting at the bottom of a rack.  primary server/raid combination at the bottom.
the bottom-most raid was powered down.
connect the dots.  since the main server's raid had the name of the machine to be formatted, this was the likeliest combination that spelled disaster.  i raised this likelihood to the production manager, who spoke to impossible man, who (naturally) uttered the word "impossible" and even tried to deflect any blame my way.  at this point i backed off.  i made the observation that, yes, the machine could indeed have crashed of its own volition.  we would just have to see if the machine comes back after being powered on.  then we would know either way if it was a system stability problem on the server/raid's part, or idle hands at work.
...this "my word vs. his" is getting to be tiresome.
after lunch.  raid powered on.  i notice that the names of the raids have been changed, as reflected in the raid admin utility.  how decent, and after the fact as usual.  ok, let's go have a look-see.  click on the now-renamed main server raid, check on disk status.
initializing 5%.
he had actually formatted it.  honest mistake, or whatever.  he had formatted our main server's raid.  4 terabytes of hard work: restoring from unlooked-for backup on render farm; checking and verifying database integrity; a month of full-on rendering.  poof.
again, luckily i had a full backup (minus that week) on another server.  but still...
strike two...
...third time's the charm.
ugh.  methinks i shall not dwell on that possibility.
Sunday, May 22, 2005