cern_backup_setup.mdwn

# CERN backup setup ca. 2011

## Base numbers

 * ~55 file servers
 * 850 million files (50% yearly growth)
 * total backup size on tapes: ~650 TB
 * total number of objects on tapes: ~5 million
 * daily transfer to tape: 1-2.5 TB
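
The base numbers allow a few back-of-the-envelope figures. The sketch below uses only the values listed above. It assumes the 50% yearly growth compounds at a constant rate (an extrapolation, not a measurement) and uses decimal units (1 TB = 10^6 MB). All function and constant names are illustrative.

```python
# Back-of-the-envelope figures derived from the base numbers above.
FILES_2011 = 850e6      # files on the file servers
YEARLY_GROWTH = 0.50    # 50% yearly growth (assumed constant and compounding)
TAPE_TOTAL_TB = 650     # total backup size on tape, in TB
TAPE_OBJECTS = 5e6      # objects on tape
SERVERS = 55            # approximate number of file servers


def projected_files(years: int) -> float:
    """File count after `years` of compound yearly growth."""
    return FILES_2011 * (1 + YEARLY_GROWTH) ** years


def avg_object_size_mb() -> float:
    """Average size of one tape object in MB (1 TB = 1e6 MB)."""
    return TAPE_TOTAL_TB * 1e6 / TAPE_OBJECTS


def files_per_server() -> float:
    """Average number of files per file server."""
    return FILES_2011 / SERVERS


if __name__ == "__main__":
    for years in range(4):
        print(f"year +{years}: {projected_files(years) / 1e9:.2f} billion files")
    print(f"average tape object: {avg_object_size_mb():.0f} MB")
    print(f"files per server: {files_per_server() / 1e6:.1f} million")
```

Under these assumptions the file count passes 1.9 billion after two years, and the average tape object is about 130 MB.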

## Requirements
 * backup dumps: daily
 * backup retention policy: 6 months
 * restores performed directly by users (that is, without admin intervention)
 * same backup policy for all volumes: all volumes are equal and no volumes are more equal than others

## Current organization

 * components:
    * frontend: AFS "native" backup system (butc, backup commands)
    * tape backend: TSM backup service and tape manager
 * granularity:
    * full backup cycle to tape roughly every 30-50 days
    * 1 or 2 differential backups to tape per full backup cycle (differential = "diff to the last full")
    * daily incremental backups to tape (incremental = "diff to the last incremental")
    * "yesterday" also available as a “BK” clone on disk
         * for home directories these clones are already mounted and directly accessible to users
 * servers are backed up in parallel by a cron job every evening (i.e. every 24 hours)
    * partitions on a server are backed up in sequence
 * load-balancing the data flow to the backup service:
    * avoid all servers doing full backups at the same time
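
The three dump levels determine which tape objects a restore has to read. The sketch below shows one way to compute that chain. It assumes that the first incremental after a full or differential is taken relative to that dump, which the notes do not state explicitly. The `Dump` type and its fields are hypothetical and do not reflect the AFS backup database schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Dump:
    day: int    # day index within the cycle (hypothetical field)
    level: str  # "full", "diff" or "incr"


def restore_chain(history: List[Dump]) -> List[Dump]:
    """Return the dumps needed to restore the state of the newest dump.

    `history` must be ordered oldest to newest. The chain is the last
    full, the latest differential after it (if any), and every
    incremental taken after that differential (or after the full).
    """
    chain: List[Dump] = []
    skip_to_full = False
    for dump in reversed(history):
        if dump.level == "full":
            chain.append(dump)
            return list(reversed(chain))
        if skip_to_full:
            # Older than the newest differential: already covered by it.
            continue
        chain.append(dump)
        if dump.level == "diff":
            # A differential is relative to the last full, so everything
            # between that full and this differential can be skipped.
            skip_to_full = True
    raise ValueError("no full dump in history")


if __name__ == "__main__":
    history = [
        Dump(0, "full"), Dump(1, "incr"), Dump(2, "incr"),
        Dump(3, "diff"), Dump(4, "incr"), Dump(5, "incr"),
    ]
    print([f"{d.level}@{d.day}" for d in restore_chain(history)])
```

With one differential in the cycle, a restore of day 5 reads four dumps instead of six, which is the main reason to insert differentials into an otherwise long incremental chain.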

## Problems/Issues
 * scalability: a full 24-hour window is needed to complete a dump of all partitions on a server (may become a problem as servers grow larger)
   * more parallelism in dumping the volumes?
   * using more TSM features (e.g. more accounts) may be required in the future
 * maintenance:
   * the current administration layer for backup dump/restore is overly complex and carries a lot of historical/obsolete code
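
To get a feel for the scalability question, the sketch below estimates the dump window of a single server for a given number of parallel dump streams. It is a simple model, not a description of the actual setup: the per-partition durations are made-up numbers, the greedy longest-first assignment is just one possible scheduling choice, and contention on the disks, network, and TSM backend is ignored.

```python
import heapq
from typing import List


def dump_window_hours(partition_hours: List[float], workers: int = 1) -> float:
    """Estimate wall-clock hours needed to dump all partitions of one server.

    Partitions are assigned longest-first to whichever of the `workers`
    parallel streams is currently least loaded. workers=1 models dumping
    the partitions in sequence.
    """
    if workers < 1:
        raise ValueError("workers must be >= 1")
    loads = [0.0] * workers  # min-heap of per-stream busy time
    heapq.heapify(loads)
    for hours in sorted(partition_hours, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + hours)
    return max(loads)


if __name__ == "__main__":
    # Hypothetical per-partition dump durations in hours.
    partitions = [6, 5, 4, 4, 3, 3, 2]
    for n in (1, 2, 3):
        print(f"{n} stream(s): {dump_window_hours(partitions, n):.1f} h")
```

In this made-up example the sequential dump takes 27 hours and misses the 24-hour window, while two streams finish in 14 hours, provided the backend can absorb the extra concurrent load.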

## Detailed details

 * AFS backup system frontend
   * backup volsets: one volset per partition
     * database model: a volset dump contains a list of volume dumps
   * dump hierarchy:
      * simply defined as a chain: full-i1-...-i52 (no differentials)
      * differential backups are determined based on additional state information stored locally via the perl Catalog module
        * once a differential is created, the backup database keeps track of the parentage of subsequent incrementals
   * several layers of wrappers (perl/ksh/arc/...) to manage dumps and restores
      * butc is wrapped by an expect script and fired on demand (no permanent butc processes), which requires locking
      * end-user backup restore is integrated into the afs_admin interface
         * restore is done in a “synchronous” mode: an authenticated RPC to a restore server (one of the AFS file server nodes) is made using "arc"
         * the RPC connection is kept alive until the user leaves a temporary shell, which is positioned where the recovered volumes are temporarily mounted
 * TSM backend
   * TSM server (one TSM account at the moment, using "archive" mode)
   * butc linked against the XBSA/TSM library
     * each volume dump is stored as a separate object in TSM with the volset dump id prefix ("/{volsetdumpid}/{volname}.backup")
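
The TSM object naming scheme above can be illustrated with a pair of helpers that build and parse such names. This is only a sketch of the "/{volsetdumpid}/{volname}.backup" pattern: the function names are hypothetical, the dump id is treated as an opaque string, and the sample id and volume name are invented.

```python
from typing import Tuple

SUFFIX = ".backup"


def tsm_object_name(volset_dump_id: str, volname: str) -> str:
    """Build the object name for one volume dump inside a volset dump."""
    if not volset_dump_id or "/" in volset_dump_id:
        raise ValueError("invalid volset dump id")
    if not volname or "/" in volname:
        raise ValueError("invalid volume name")
    return f"/{volset_dump_id}/{volname}{SUFFIX}"


def parse_tsm_object_name(name: str) -> Tuple[str, str]:
    """Split an object name back into (volset dump id, volume name)."""
    if not name.startswith("/") or not name.endswith(SUFFIX):
        raise ValueError(f"unexpected object name: {name!r}")
    dump_id, sep, rest = name[1:].partition("/")
    volname = rest[: -len(SUFFIX)]
    if not sep or not dump_id or not volname or "/" in rest:
        raise ValueError(f"unexpected object name: {name!r}")
    return dump_id, volname


if __name__ == "__main__":
    # Invented example values.
    name = tsm_object_name("1299999999", "user.jdoe")
    print(name)
    print(parse_tsm_object_name(name))
```

Because every object name starts with the volset dump id, all volume dumps that belong to one volset dump share a common prefix.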