My research group has been in the field of technical computing for over thirty years now, but its level of technological sophistication has been stagnant for somewhere around twenty. Take our user authentication, for example:
- We use NIS for user authentication. Period. No Kerberos.
- Our NIS server is an SGI Indigo2 that is over fifteen years old now…
- …and fifteen years ago, single DES was considered good enough for password hashes
So our entire password database and all of its password hashes are fully exposed to the network, and the hashes themselves are in an extremely weak format. Oh, and did I mention that our sole NIS server is a fifteen-year-old machine still running on its factory-installed hard drive? One of these days it's going to blow out, and we have no drop-in replacement and no backups for when that happens.
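To put "extremely weak" in perspective: traditional DES crypt(3) silently truncates passwords to eight characters and uses only a 12-bit salt, so the whole practical password space is small enough to enumerate. A back-of-the-envelope sketch (the cracking rate is an assumption for illustration, not a benchmark):

```python
# Rough estimate of how weak DES crypt(3) hashes are.
# The assumed hash rate below is hypothetical; real hardware varies.

MAX_PASSWORD_CHARS = 8   # crypt(3) ignores everything past 8 characters
PRINTABLE_ASCII = 95     # practical per-character alphabet
SALT_VARIANTS = 2 ** 12  # only 4096 salts: trivial to precompute around

keyspace = PRINTABLE_ASCII ** MAX_PASSWORD_CHARS  # ~6.6e15 candidates
assumed_rate = 10 ** 9   # hashes/sec (assumption, not a measurement)

worst_case_days = keyspace / assumed_rate / 86400
print(f"{keyspace:.2e} candidates, ~{worst_case_days:.0f} days to exhaust")
```

And since NIS serves the map to any host that asks (`ypcat passwd`), an attacker doesn't even need local access to start grinding through those hashes.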
In fact, the matter of not having backups has been a major issue for as long as I’ve worked here. When I started here as an undergraduate researcher, the entire lab’s backup strategy was to tar up directories and sftp them to a 250GB external USB drive connected to our already-old Power Mac G4. When that filled up, we started just backing up data to whatever disks had free space. One of the researchers started copying his data into the /usr partition on the compute nodes of our cluster since they had around 30GB free per node. Another copied backups to various workstations that weren’t being used at the time. And the third full-time researcher simply didn’t back up his data at all. The cluster had automatic tape backup, after all, so why waste the effort?
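For the record, that "strategy" amounted to something like the sketch below. The directory and host names are hypothetical stand-ins, and the transfer step is shown only as a comment since it needs the actual USB drive attached:

```shell
# backup_dir <dir> <archive>: tar up a directory the way the old
# routine did. Host and volume names below are made up.
backup_dir() {
    src=$1
    archive=$2
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    # The manual part everyone eventually stopped doing:
    #   sftp powermac-g4:/Volumes/BackupUSB <<EOF
    #   put $archive
    #   EOF
    echo "wrote $archive"
}
```

The failure mode is obvious in hindsight: every step is manual, so the backups only exist when somebody remembers to run it.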
On the day of my graduation, my department threw a luncheon for the new graduates where faculty could meet the parents of the graduating students. My research advisor (under whom I am now finishing my Ph.D.) introduced himself to my parents, said a number of congratulatory and flattering things, and finished by turning to me and saying “Oh, and the cluster went down yesterday. All the data is gone. When are you going to be back in the lab?”
The following Monday I was back in the lab, and a lot of data was indeed gone. The cluster did have a tape drive with amanda installed, but nobody knew whether the tape backups had ever actually run. Nobody knew how to examine the contents of the tape, and nobody had rotated tapes recently. In fact, although the tape drive was sold with two DLT tapes, the second one had never even been unwrapped, much less rotated in. I'm fairly confident that even if amanda had been doing regular backups, that lone tape had seen more writes than a DLT cartridge is rated for.
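Had anyone wanted to check, amanda ships with tooling for exactly this. A sketch of what a periodic sanity check could have looked like (the config name "Daily" is a placeholder; the real one lives under amanda's config directory):

```shell
# Sketch: verify that amanda backups are actually happening.
# "Daily" is a hypothetical config name; substitute the real one.
check_amanda() {
    if ! command -v amcheck >/dev/null 2>&1; then
        echo "amanda tools not installed"
        return 1
    fi
    amcheck Daily        # sanity-check holding disk, tape drive, loaded tape
    amadmin Daily due    # report which filesystems are overdue for backup
}
```

Run from cron with the output mailed to a human, either command would have flagged the never-rotated tape long before the cluster died.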
This story isn’t particularly interesting; the internet is full of similar anecdotal backup horror stories. But it really sucks when it happens to you, and after returning to my group some time later to do graduate work, I took it upon myself to establish data redundancy and automated backups to make sure I never lost my data again.
Unfortunately, the technological sophistication of my group never went beyond buying an external USB hard drive, plugging it into a Mac, and letting OS X magically set it up for writing. And for some reason, purchasing decisions were continually made by the people perhaps least qualified to make them. The end result was our entire storage infrastructure hanging off the USB ports of a Power Mac G5.
Our situation remained this way for some years despite the fact that the USB to SATA bridges used in external Seagate disks seem to fail under high throughput, and OS X does not handle failed drives gracefully at all. I finally got fed up with the constant outages and failures, voided the warranties on our bigger USB disks, ripped them out of their enclosures, and installed them properly into whatever workstations had the drive cage space and SATA channels to support them.
The end result is a bit of a mess:
Some systems are automatically backed up, some are not. Some have RAID1, others do not. And none of the backup disks have any redundancy, so if one goes, the backups on it are gone. It would please me to no end to replace this mess with a single storage solution; even something simple like a dozen terabytes of NAS would be a huge improvement over the spiderweb of small disks we’re currently using.
Unfortunately, a semi-serious storage solution (on the order of a few thousand dollars) is a hard sell to my boss. We have backup disks, they can store data, and nobody has lost anything important since that cluster failure many years ago. Something must be going right, so why spend the money on storage when we can burn it on more inkjet printers to replace those whose cartridges have gone empty, or on hiring more undergraduates who aren't qualified to touch a UNIX workstation?
As backup space grows tighter, I am tempted to halt the automated backups of my coworkers' data and back up only my own. After all, they have all been told to do their own backups, and I've told them not to trust that the automatic backups are actually working. Yet none of them has done a manual backup in at least a year, and nobody but me checks whether the automated backup system is even running.
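Checking whether a backup system is "even working" doesn't take much. A small script like this one, run from cron, catches a silently dead backup job by flagging any backup directory whose newest file is too old (the paths are hypothetical placeholders for our actual backup disks):

```python
"""Flag backup directories whose newest file is older than MAX_AGE_DAYS.
Paths below are hypothetical stand-ins for the real backup disks."""
import os
import time

MAX_AGE_DAYS = 7
BACKUP_DIRS = ["/backup/cluster", "/backup/workstations"]  # placeholders

def newest_mtime(root):
    """Return the most recent file mtime under root, or None if empty."""
    newest = None
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                mtime = os.path.getmtime(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished mid-walk; backup trees churn
            if newest is None or mtime > newest:
                newest = mtime
    return newest

def stale_dirs(dirs, max_age_days=MAX_AGE_DAYS, now=None):
    """Return the subset of dirs with no file newer than the cutoff."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for d in dirs:
        newest = newest_mtime(d)
        if newest is None or newest < cutoff:
            stale.append(d)
    return stale

# Cron-driven use: print a warning per stale directory and let cron
# mail the output to whoever is supposed to care.
for d in stale_dirs([d for d in BACKUP_DIRS if os.path.isdir(d)]):
    print(f"WARNING: no recent backup in {d}")
```

Nobody else has run anything like this, which is exactly the problem.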
I struggle to convince my boss to spend more on storage to accommodate my backup needs, so maybe I should just use what I’ve got available to me and let the others worry about their own data. After all, it isn’t my job to keep their data backed up. I am not a system administrator; I don’t even have administrator privileges on any of the clusters I am automatically backing up. I just don’t want to lose any more of my data.