A Lower Maintenance Backup System for Fermilab
Overview
Our current backup scheme, fmb, has several problems. It is:
- High maintenance,
- Easily damaged by changes in OS version,
- Difficult and slow to test, and
- Confusing to users.
This proposal recommends implementing a new lab-wide backup system which
provides:
- reliable backups on our supported platforms,
- simplified configuration, and
- significantly lower maintenance.
It does so by dropping several requirements of the previous system
and adding several new requirements that would otherwise
need to be added to the current system.
These requirements can be met by extending Amanda (the University of Maryland network backup system)
in less time than we can implement a new system on our own.
The areas in which we would need to extend Amanda are:
- Tape mounts -- scripts or programs to call OCS and Juke
to mount tapes.
- Support for backing up partitions/files bigger than
one tape.
Changes in Requirements
Several requirements will change relative to our previous backup
system. The changes fall into two groups: maintenance improvements,
which generally involve dropping existing requirements, and
functionality improvements, which generally involve adding
new requirements.
Maintenance Improvements
- Changed -- Backup software to be written in "portable" /bin/sh.
  Instead, the new system is written in portable ANSI C, because:
  - Experience has shown /bin/sh isn't so portable: vendors
    (esp. SGI) have been significantly changing shell functionality,
    even in "minor" patch releases, over the last 2 years.
  - Shell scripts create a dependency on many system programs,
    and a change in any one of them adds maintenance load.
- Drop -- Must support multiple archive formats. Only GNU tar
  will be supported, and possibly 1 vendor dump tool per platform,
  if we absolutely have to.
  This reduces testing time for new releases significantly.
  Currently the main backup test run takes about 3psn hours
  for p client platforms, s server platforms, and n archive utilities
  per platform, which is of course 3p^2n when all platforms are both
  backup server and client systems.
  So reducing archive schemes from 4 to 1 per platform is a fourfold
  improvement in regression testing time. (Reducing platforms would be
  even more significant, but is probably beyond the realm of possibility.)
- Drop -- Backups of arbitrary subdirectories.
  Instead, we have a cluster-wide configuration of what is
  to be backed up.
  Users who rely on this, such as the D0 "kback" scripts, will have
  to switch to calling GNU tar directly.
- Drop -- Archive/drive error retries. This was added in
  the old scripts on the assumption that a drive was likely
  to be able to write on a given patch of tape a second time
  even if it failed the first time.
  In practice, even when it worked, it extended the runtime
  of the backup past the hours when people were willing to have
  backups running, so they turned it off.
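
To make the testing-time estimate in the archive-format item above concrete, here is a minimal Python sketch of the 3psn arithmetic. The function name and the platform count of 5 are hypothetical, chosen only for illustration; only the 3-hour factor and the reduction from 4 archive utilities to 1 come from the text.

```python
def regression_hours(clients, servers, archivers, hours_per_run=3.0):
    """Estimated hours for a full test run: 3 * p * s * n."""
    return hours_per_run * clients * servers * archivers


# Assumption: 5 platforms, each acting as both client and server (p == s).
platforms = 5
before = regression_hours(platforms, platforms, archivers=4)
after = regression_hours(platforms, platforms, archivers=1)

print(f"4 archive utilities: {before:.0f} hours")
print(f"1 archive utility:   {after:.0f} hours")
print(f"speedup:             {before / after:.0f}x")
```

Whatever platform count is plugged in, the ratio stays at four, because n enters the estimate linearly. The platform count enters quadratically when every platform is both client and server, which is why reducing platforms would help even more.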
Functionality Improvements
- Add -- Multiple tape support. More and more often we are
  encountering partitions, and even individual files, larger
  than a single tape.
  Potential solutions include
  - spill-over onto sequential tapes
  - striping several tapes
  - striping tapes with an extra XOR block (RAID for tapes -- "RAIT")
  The first two of these suffer from a significant decrease
  in reliability: with a backup spread over two tapes, we are
  roughly twice as likely to have a tape-related failure.
  In the third case, we have actually significantly reduced
  error rates, because a double-tape failure is required
  to fail a restore.
  Both of the latter two schemes require several drives
  in parallel for both backups and restores, which would
  make them impractical for some configurations.
- Add -- Kerberos authentication and encryption of
  partition backups containing host keys, etc.
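
The XOR-block idea behind "RAIT" in the multiple-tape item above can be illustrated with a short Python sketch. This is not Amanda's code: the function names are hypothetical, each byte string stands in for the block written to one tape, and the failure estimates assume every tape fails independently with the same probability p (1% here, purely for illustration).

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks plus one XOR parity block (one block per tape)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def rebuild(stripe, lost):
    """Recompute the block of a single lost tape from the surviving tapes."""
    survivors = [block for i, block in enumerate(stripe) if i != lost]
    return xor_blocks(survivors)


def striped_failure(p, tapes):
    """Plain striping or spill-over: the restore fails if any tape fails."""
    return 1 - (1 - p) ** tapes


def rait_failure(p, data_tapes):
    """XOR parity: the restore fails only if two or more tapes fail."""
    total = data_tapes + 1  # data tapes plus one parity tape
    none_fail = (1 - p) ** total
    one_fails = total * p * (1 - p) ** (total - 1)
    return 1 - none_fail - one_fails


stripe = make_stripe([b"abcd", b"efgh"])   # two data tapes, one parity tape
assert rebuild(stripe, lost=0) == b"abcd"  # first tape recovered from the rest

p = 0.01  # assumed per-tape failure probability
print(f"single tape:     {p:.4%}")
print(f"2 tapes striped: {striped_failure(p, 2):.4%}")
print(f"2 tapes + XOR:   {rait_failure(p, 2):.4%}")
```

Under these assumptions, two striped tapes fail about twice as often as one tape, while the parity arrangement fails far less often than a single tape, at the cost of one extra tape and drive.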
Other Expected Improvements
- Other sites using Amanda will generally port it to a
  given platform before we do, so we avoid "surprises" on
  new operating system releases.
- The only locally developed software will be small scripts
  to find out which tapes are wanted and load them with OCS, etc.
- Better reports and logging than our current system.
- No more "overlooked" partitions.
- Backup tapes will always be portable between systems (assuming
  users use the recommended GNU tar).
- Only one set of error messages.
- No backup command-line options to learn or get confused about
  (was that -e or -E?).
Proposed Time-line
Nov 1999: Detailed design and design meetings with sysadmins
Dec 1999: Installation of Amanda on build cluster
Jan 2000: Prototyping of locally developed parts; RAIT code tested and donated to the Amanda
project.
Feb 2000: General Release