* total backup size on tapes: ~650 TB
* total number of objects on tapes: ~5 million
* daily transfer to tape: 1-2.5 TB

## Requirements
* backup dumps: daily
* backup retention policy: 6 months
* components:
  * frontend: AFS "native" backup system (butc, backup commands)
  * tape backend: TSM backup service and tape manager
* granularity (see the dump-level sketch after this list):
  * full backup cycle to tape ~ every 30-50 days
  * 1 or 2 differential backups to tape per full backup cycle (differential = "diff to the last full")
  * daily incremental backups to tape ("diff to the last incremental")
* partitions on a server are backed up in sequence
* load-balancing the data flow to the backup service:
  * avoid all servers doing full backups at the same time
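A minimal sketch of how such a schedule could be expressed with the standard OpenAFS `backup` commands. The dump-level names (`/full`, `/full/diff`, `/full/diff/incr`), the volume-set name, and the port offset are illustrative assumptions, not the production configuration:

```sh
# Define the dump hierarchy: a full level, a differential level
# ("diff to the last full"), and a daily incremental level
# ("diff to the last incremental"). Names are illustrative.
backup adddump -dump /full -expires in 6m
backup adddump -dump /full/diff -expires in 6m
backup adddump -dump /full/diff/incr -expires in 6m

# Daily job: dump one volume set at the appropriate level.
# Staggering the start times per server avoids all servers
# doing full backups at the same time.
backup dump -volumeset fs1.vicepa -dump /full/diff/incr -portoffset 0
```

With `-expires in 6m`, a dump expires after the 6-month retention period and its tapes become eligible for recycling.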

## Problems/Issues
* scalability: a 24-hour window is needed to complete a dump of all partitions on a server (may become a problem as servers grow larger)
  * more parallelism in dumping the volumes? (see the sketch after this list)
  * using more features of TSM (e.g. more accounts) may be required in the future
* maintenance:
  * the current administration layer for backup dump/restore is overly complex and drags along a lot of historical/obsolete code
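As a hedged sketch of what more parallelism could look like: several butc tape coordinators on distinct port offsets, each dumping a different partition's volume set concurrently. The port offsets and volume-set names are assumptions for illustration:

```sh
# One butc tape coordinator per port offset (each offset maps to
# its own tape/TSM configuration in the backup database).
butc -port 0 -localauth &
butc -port 1 -localauth &

# Dump two partitions' volume sets in parallel, one per coordinator.
backup dump -volumeset fs1.vicepa -dump /full/diff/incr -portoffset 0 &
backup dump -volumeset fs1.vicepb -dump /full/diff/incr -portoffset 1 &
wait
```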
## Details
* AFS backup system frontend
  * backup volsets: one volset per partition (see the volset sketch after this list)
  * database model: a volset dump contains a list of volume dumps
  * dump hierarchy:
    * once a differential is created, the backup database keeps track of the parentage of subsequent incrementals
  * several layers of wrappers (perl/ksh/arc/...) to manage dumps and restores
  * butc is wrapped by an expect script and fired on demand (no permanent butc processes), which requires locking (see the wrapper sketch after this list)
* end-user backup restore integrated into the afs_admin interface (see the restore sketch after this list)
  * restore is done in a "synchronous" mode: an authenticated RPC to a restore server (one of the AFS file server nodes) is made using "arc"
  * the RPC connection is kept alive until the user leaves a temporary shell positioned where the recovered volumes are temporarily mounted
* TSM backend
  * TSM server (one TSM account at the moment, using "archive" mode)
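A sketch of the one-volset-per-partition layout using the standard `backup` commands; the volset naming convention, server name, and volume regular expression are illustrative assumptions:

```sh
# One volume set per partition: here the volset name encodes the
# server and the partition it covers (naming convention assumed).
backup addvolset -name fs1.vicepa
backup addvolentry -name fs1.vicepa -server fs1.example.org \
    -partition /vicepa -volumes '.*\.backup'
```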
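The on-demand butc pattern might look roughly as follows. This is a simplified sh sketch of the idea only: the real wrapper is expect-based (butc prompts interactively), and the lock path, port offset, and volset name are assumptions:

```sh
#!/bin/sh
# Fire up butc only for the duration of one dump; a lock file
# serializes wrappers so two dumps never grab the same coordinator.
LOCK=/var/run/afsbackup.port0.lock
exec 9>"$LOCK"
flock -n 9 || { echo "another dump is running" >&2; exit 1; }

butc -port 0 -localauth &    # temporary tape coordinator
BUTC_PID=$!
backup dump -volumeset fs1.vicepa -dump /full/diff/incr -portoffset 0
kill "$BUTC_PID"             # no permanent butc processes
```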
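For the end-user restore path, the underlying OpenAFS primitive is `backup volrestore`. A hedged sketch of what the restore server could run on the user's behalf (volume name, `.rest` extension, and mount point are illustrative):

```sh
# Restore the volume under a new name (user.jdoe.rest) instead of
# overwriting the live volume, then mount it in a temporary place.
backup volrestore -server fs1.example.org -partition /vicepa \
    -volume user.jdoe -extension .rest -portoffset 0
fs mkmount -dir /afs/example.org/scratch/restore.jdoe -vol user.jdoe.rest

# The user is then dropped into a shell at the mount point; when they
# leave, the temporary mount and the restored volume are cleaned up.
```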