doc/arch/dafs-overview.txt

   1 The Demand-Attach FileServer (DAFS) changes how many things on AFS
   2 fileservers behave. The most sweeping changes are
   3 probably in the volume package, but significant changes have also been
   4 made in the SYNC protocol, the vnode package, salvaging, and a few
   5 miscellaneous bits in the various fileserver processes.
   6
   7 This document serves as an overview for developers on how to deal with
   8 these changes, and how to use the new mechanisms. For more specific
   9 details, consult the relevant doxygen documentation, the code comments,
  10 and/or the code itself.
  11
  12  - The salvageserver
  13
  14 The salvageserver (or 'salvaged') is a new OpenAFS fileserver process in
  15 DAFS. This daemon accepts salvage requests via SALVSYNC (see below), and
  16 salvages a volume group by fork()ing a child, and running the normal
  17 salvager code (it enters vol-salvage.c by calling SalvageFileSys1).
  18
  19 Salvages that are initiated from a request to the salvageserver (called
  20 'demand-salvages') occur automatically; whenever the fileserver (or
  21 other tool) discovers that a volume needs salvaging, it will schedule a
  22 salvage on the salvageserver without any intervention needed.
  23
  24 When scheduling a salvage, the vol id should be the id for the volume
  25 group (the RW vol id). If the salvaging child discovers that it was
  26 given a non-RW vol id, it will send the salvageserver a SALVSYNC LINK
  27 command, and will exit. This will tell the salvageserver that whenever
  28 it receives a salvage request for that vol id, it should schedule a
  29 salvage for the corresponding RW id instead.
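
The redirection that a LINK command sets up can be pictured as a small
lookup table. The following is only a sketch under stated assumptions:
the names link_add and salvage_target, the fixed-size table, and the
struct layout are all hypothetical and are not taken from the
salvageserver source.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the id redirection described above; the real
 * salvageserver keeps this information in its own data structures. */
#define MAX_LINKS 64

struct vol_link {
    uint32_t child_id;  /* non-RW volume id that was requested */
    uint32_t rw_id;     /* id of the volume group (the RW volume id) */
};

static struct vol_link links[MAX_LINKS];
static size_t n_links;

/* Record that salvage requests for child_id should target rw_id.
 * Returns 0 on success, -1 if the table is full. */
int link_add(uint32_t child_id, uint32_t rw_id)
{
    size_t i;

    for (i = 0; i < n_links; i++) {
        if (links[i].child_id == child_id) {
            links[i].rw_id = rw_id;   /* update an existing entry */
            return 0;
        }
    }
    if (n_links == MAX_LINKS)
        return -1;
    links[n_links].child_id = child_id;
    links[n_links].rw_id = rw_id;
    n_links++;
    return 0;
}

/* Return the id a salvage should actually be scheduled for: the linked
 * RW id if one is known, otherwise the id that was passed in. */
uint32_t salvage_target(uint32_t vol_id)
{
    size_t i;

    for (i = 0; i < n_links; i++) {
        if (links[i].child_id == vol_id)
            return links[i].rw_id;
    }
    return vol_id;
}
```

In this model, a request for an unlinked id is scheduled as-is, and any
later request for a linked id is silently retargeted at the RW id.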
  30
  31  - FSSYNC/SALVSYNC
  32
  33 The FSSYNC and SALVSYNC protocols are the protocols used for
  34 interprocess communication between the various fileserver processes.
  35 FSSYNC is used for querying the fileserver for volume metadata,
  36 'checking out' volumes from the fileserver, and a few other things.
  37 SALVSYNC is used to schedule and query salvages in the salvageserver.
  38
  39 FSSYNC existed prior to DAFS, but it encompasses a much larger set of
  40 commands with the advent of DAFS. SALVSYNC is entirely new to DAFS.
  41
  42  -- SYNC
  43
  44 FSSYNC and SALVSYNC are both layered on top of a protocol called SYNC.
  45 SYNC isn't much of a protocol in itself; it just handles some boilerplate
  46 for the messages passed back and forth, and some error codes common to
  47 both FSSYNC and SALVSYNC.
  48
  49 SYNC is layered on top of TCP/IP, though we only use it to communicate
  50 with the local host (usually via a unix domain socket). It does not
  51 handle anything like authentication, authorization, or even things like
  52 serialization. Although it uses network primitives for communication,
  53 it's only useful for communication between processes on the same
  54 machine, and that is all we use it for.
  55
  56 SYNC calls are basically RPCs, but very simple. The calls are always
  57 synchronous, and each SYNC server can only handle one request at a time.
  58 Thus, it is important for SYNC server handlers to return as quickly as
  59 possible; hitting the network or disk to service a SYNC request should
  60 be avoided to the extent that such is possible.
  61
  62 SYNC-related source files are src/vol/daemon_com.c and
  63 src/vol/daemon_com.h
  64
  65  -- FSSYNC
  66
  67  --- server
  68
  69 The FSSYNC server runs in the fileserver; source is in
  70 src/vol/fssync-server.c.
  71
  72 As mentioned above, FSSYNC handlers should finish quickly when
  73 servicing a request, so hitting the network or disk should be avoided.
  74 In particular, you absolutely cannot make a SALVSYNC call inside an
  75 FSSYNC handler; the SALVSYNC client wrapper routines actively prevent
  76 this from happening, so even if you try to do such a thing, you will not
  77 be allowed to. This prohibition is to prevent deadlock, since the
  78 salvageserver could have made the FSSYNC request that you are servicing.
  79
  80 When a client makes a FSYNC_VOL_OFF or NEEDVOLUME request, the
  81 fileserver offlines the volume if necessary, and keeps track that the
  82 volume has been 'checked out'. A volume is left online if the checkout
  83 mode indicates the volume cannot change (see VVolOpLeaveOnline_r).
  84
  85 Until the volume has been 'checked in' with the ON, LEAVE_OFFLINE, or
  86 DONE commands, no other program can check out the volume.
  87
  88 Other FSSYNC commands include abilities to query volume metadata and
  89 stats, to force volumes to be attached or offline, and to update the
  90 volume group cache. See doc/arch/fssync.txt for documentation on the
  91 individual FSSYNC commands.
  92
  93  --- clients
  94
  95 FSSYNC clients are generally any OpenAFS process that runs on a
  96 fileserver and tries to access volumes directly. The volserver,
  97 salvageserver, and bosserver all qualify, as do (sometimes) some
  98 utilities like vol-info or vol-bless. For issuing FSSYNC commands
  99 directly, there is the debugging tool fssync-debug.  FSSYNC client code
 100 is in src/vol/fssync-client.c, but it's not very interesting.
 101
 102 Any program that wishes to directly access a volume on disk must check
 103 out the volume via FSSYNC (NEEDVOLUME or OFF commands), to ensure the
 104 volume doesn't change while the program is using it. If the program
 105 determines that the volume is somehow inconsistent and should be
 106 salvaged, it should send the FSSYNC command FORCE_ERROR with reason code
 107 FSYNC_SALVAGE to the fileserver, which will take care of salvaging it.
 108
 109  -- SALVSYNC
 110
 111 The SALVSYNC server runs in the salvageserver; code is in
 112 src/vol/salvsync-server.c. SALVSYNC clients are just the fileserver, the
 113 salvageserver run with the -client switch, and the salvageserver worker
 114 children. If any other process notices that a volume needs salvaging, it
 115 should issue a FORCE_ERROR FSSYNC command to the fileserver with the
 116 FSYNC_SALVAGE reason code.
 117
 118 The SALVSYNC protocol is simpler than the FSSYNC protocol. The commands
 119 are basically just to create, cancel, change, and query salvages. The
 120 RAISEPRIO command increases the priority of a salvage job that hasn't
 121 started yet, so volumes that are accessed more frequently will get
 122 salvaged first. The LINK command is used by a salvageserver worker
 123 child to inform the salvageserver parent that it tried to salvage a
 124 readonly volume whose parent read-write volume exists (in which case we
 125 should just schedule a salvage for the parent read-write volume).
 126
 127 Note that canceling a salvage is just for salvages that haven't run
 128 yet; it only takes a salvage job off of a queue; it doesn't stop a
 129 salvageserver worker child in the middle of a salvage.
 130
 131  - The volume package
 132
 133  -- refcounts
 134
 135 Before DAFS, the Volume struct just had one reference count, vp->nUsers.
 136 With DAFS, we now have the notion of an internal/lightweight reference
 137 count, and an external/heavyweight reference count. Lightweight refs are
 138 acquired with VCreateReservation_r, and released with
 139 VCancelReservation_r. Heavyweight refs are acquired as before, normally
 140 with a GetVolume or AttachVolume variant, and released with
 141 VPutVolume.
 142
 143 Lightweight references are only acquired within the volume package; a vp
 144 should not be given to e.g. the fileserver code with an extra
 145 lightweight ref. A heavyweight ref is generally acquired for a vp that
 146 will be given to some non-volume-package code; acquiring a heavyweight
 147 ref guarantees that the volume header has been loaded.
 148
 149 Acquiring a lightweight ref just guarantees that the volume will not go
 150 away or suddenly become unavailable after dropping VOL_LOCK. Certain
 151 operations like detachment or scheduling a salvage only occur when all
 152 of the heavy and lightweight refs go away; see VCancelReservation_r.
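
As a rough illustration of how the two counts interact, here is a
simplified model in which a requested detach is deferred until both
counts reach zero. This is a sketch, not the volume package code: the
struct, the n_waiters field name, and every function name below are
assumptions made for the example.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the Volume struct. */
struct vol_model {
    int n_users;          /* heavyweight refs (vp->nUsers in the text) */
    int n_waiters;        /* lightweight refs; field name is an assumption */
    bool detach_pending;  /* an operation deferred until all refs drop */
    bool detached;
};

/* Run the deferred operation once nobody holds any kind of ref. */
static void maybe_run_deferred(struct vol_model *vp)
{
    if (vp->detach_pending && vp->n_users == 0 && vp->n_waiters == 0) {
        vp->detach_pending = false;
        vp->detached = true;
    }
}

/* Lightweight ref: only keeps the volume from going away. */
void light_ref_get(struct vol_model *vp)
{
    vp->n_waiters++;
}

void light_ref_put(struct vol_model *vp)
{
    vp->n_waiters--;
    maybe_run_deferred(vp);
}

/* Heavyweight ref: what non-volume-package callers would hold. */
void heavy_ref_get(struct vol_model *vp)
{
    vp->n_users++;
}

void heavy_ref_put(struct vol_model *vp)
{
    vp->n_users--;
    maybe_run_deferred(vp);
}

/* Ask for a detach; it happens now only if no refs are outstanding. */
void request_detach(struct vol_model *vp)
{
    vp->detach_pending = true;
    maybe_run_deferred(vp);
}
```

The point of the model is that dropping the last ref of either kind is
what triggers the deferred work, which is why the text points at
VCancelReservation_r.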
 153
 154  -- state machine
 155
 156 Instead of having a per-volume lock, each vp always has an associated
 157 'state', that says what, if anything, is occurring to a volume at any
 158 particular time; or if the volume is attached, offline, etc. To do the
 159 basic equivalent of a lock -- that is, ensure that nobody else will
 160 change the volume when we drop VOL_LOCK -- you can put the volume in
 161 what is called an 'exclusive' state (see VIsExclusiveState).
 162
 163 When a volume is in an exclusive state, no thread should modify the
 164 volume (or expect the vp data to stay the same), except the thread that
 165 put it in that state. Whenever you manipulate a volume, you should make
 166 sure it is not in an exclusive state; first call VCreateReservation_r to
 167 make sure the volume doesn't go away, and then call
 168 VWaitExclusiveState_r. When that returns, you are guaranteed to have a
 169 vp that is in a non-exclusive state, and so can be manipulated. Call
 170 VCancelReservation_r when done with it, to indicate you don't need it
 171 anymore.
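
The reserve / wait / manipulate / release pattern described above can
be sketched with a mutex and a condition variable. This is a model
under assumptions, not the volume package implementation: the state
names, which of them count as exclusive, the struct, and all function
names are invented for the example, and model_vol_lock merely stands
in for VOL_LOCK.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical states; the real list is the VolState enumeration. */
enum model_state { MS_OFFLINE, MS_ATTACHED, MS_UPDATING, MS_DETACHING };

struct vol_model {
    enum model_state state;
    int reservations;            /* lightweight refs */
    int updates;                 /* counts completed updates */
    pthread_cond_t state_changed;
};

/* Stands in for the global VOL_LOCK. */
static pthread_mutex_t model_vol_lock = PTHREAD_MUTEX_INITIALIZER;

void model_init(struct vol_model *vp, enum model_state s)
{
    vp->state = s;
    vp->reservations = 0;
    vp->updates = 0;
    pthread_cond_init(&vp->state_changed, NULL);
}

/* In this model, only these two states are exclusive. */
bool model_is_exclusive(enum model_state s)
{
    return s == MS_UPDATING || s == MS_DETACHING;
}

/* Caller must hold model_vol_lock; it is dropped while sleeping. */
static void model_wait_exclusive(struct vol_model *vp)
{
    while (model_is_exclusive(vp->state))
        pthread_cond_wait(&vp->state_changed, &model_vol_lock);
}

/* Caller must hold model_vol_lock. Wakes any waiting threads. */
static void model_change_state(struct vol_model *vp, enum model_state s)
{
    vp->state = s;
    pthread_cond_broadcast(&vp->state_changed);
}

/* Reserve, wait for a non-exclusive state, take an exclusive state,
 * do the slow work without the global lock, then release. */
void model_update(struct vol_model *vp)
{
    enum model_state prev;

    pthread_mutex_lock(&model_vol_lock);
    vp->reservations++;
    model_wait_exclusive(vp);
    prev = vp->state;
    model_change_state(vp, MS_UPDATING);
    pthread_mutex_unlock(&model_vol_lock);

    vp->updates++;   /* slow work; safe because the state is exclusive */

    pthread_mutex_lock(&model_vol_lock);
    model_change_state(vp, prev);
    vp->reservations--;
    pthread_mutex_unlock(&model_vol_lock);
}
```

Other threads that call model_update on the same volume in the
meantime sleep in model_wait_exclusive until the state is restored.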
 172
 173 Look at the definition of the VolState enumeration to see all volume
 174 states, and a brief explanation of them.
 175
 176  -- VLRU
 177
 178 See: Most functions with VLRU in their name in src/vol/volume.c.
 179
 180 The VLRU is what dictates when volumes are detached after a certain
 181 amount of inactivity. The design is pretty much a generational garbage
 182 collection mechanism. There are 5 queues on the VLRU that a volume can
 183 be on (VLRUQueueName in volume.h). 'Candidate' volumes haven't seen
 184 activity in a while, and so are candidates to be detached. 'New' volumes
 185 have seen activity only recently; 'mid' volumes have seen activity for
 186 a while, and 'old' volumes have seen activity for a long while. 'Held'
 187 volumes cannot be soft detached at all.
 188
 189 Volumes are moved from new->mid->old if they have had activity recently,
 190 and are moved from old->mid->new->candidate if they have not had any
 191 activity recently. The definition of 'recently' is configurable by the
 192 -vlruthresh fileserver parameter; see VLRU_ComputeConstants for how the
 193 thresholds are determined. Volumes start at 'new' on attachment, and if any
 194 activity occurs when a volume is on 'candidate', it's moved to 'new'
 195 immediately.
 196
 197 Volumes are generally promoted/demoted and soft-detached by
 198 VLRU_ScannerThread, which runs every so often and moves volumes between
 199 VLRU queues depending on their last access time and the various
 200 thresholds (or soft-detaches them, in the case of the 'candidate'
 201 queue). Soft-detaching just means the volume is taken offline and put
 202 into the preattached state.
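
The promotion and demotion rules can be summarized as a small state
transition function. The sketch below is a deliberate simplification:
it uses a single threshold for every queue (the text describes several
thresholds computed by VLRU_ComputeConstants), and the enum, struct,
and function names are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical queue names mirroring the five queues in the text. */
enum vlru_q { Q_NEW, Q_MID, Q_OLD, Q_CANDIDATE, Q_HELD };

struct vlru_vol {
    enum vlru_q q;
    long last_active;     /* time of last activity, arbitrary units */
    bool soft_detached;
};

/* Activity on a volume: a 'candidate' goes straight back to 'new'. */
void vlru_touch(struct vlru_vol *v, long now)
{
    v->last_active = now;
    if (v->q == Q_CANDIDATE)
        v->q = Q_NEW;
}

/* One scanner pass over one volume, using a single threshold. */
void vlru_scan_one(struct vlru_vol *v, long now, long thresh)
{
    bool recent = (now - v->last_active) < thresh;

    switch (v->q) {
    case Q_NEW:
        v->q = recent ? Q_MID : Q_CANDIDATE;
        break;
    case Q_MID:
        v->q = recent ? Q_OLD : Q_NEW;
        break;
    case Q_OLD:
        if (!recent)
            v->q = Q_MID;
        break;
    case Q_CANDIDATE:
        if (!recent)
            v->soft_detached = true;   /* offline + preattached */
        break;
    case Q_HELD:
        break;                         /* never soft detached */
    }
}
```

A volume that stays busy climbs new->mid->old over successive scans,
while an idle one walks back down and is eventually soft-detached.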
 203
 204  --- DONT_SALVAGE
 205
 206 The dontSalvage flag in volume headers can be set to DONT_SALVAGE to
 207 indicate that a volume probably doesn't need to be salvaged. Before
 208 DAFS, volumes were placed on an 'UpdateList' which was periodically
 209 scanned, and dontSalvage was set on volumes that hadn't been touched in
 210 a while.
 211
 212 With DAFS and the VLRU additions, setting dontSalvage now happens when a
 213 volume is demoted a VLRU generation, and no separate list is kept. So if
 214 a volume has been idle enough to demote, and it hasn't been accessed in
 215 SALVAGE_INTERVAL time, dontSalvage will be set automatically by the VLRU
 216 scanner.
 217
 218  -- Vnode
 219
 220 Source files: src/vol/vnode.c, src/vol/vnode.h, src/vol/vnode_inline.h
 221
 222 The changes to the vnode package are largely very similar to those in
 223 the volume package. A Vnode is put into specific states, some of which
 224 are exclusive and act like locks (see VnChangeState_r,
 225 VnIsExclusiveState). Vnodes also have refcounts, incremented and
 226 decremented with VnCreateReservation_r and VnCancelReservation_r like
 227 you would expect. I/O should be done outside of any global locks; just
 228 the vnode is 'locked' by being put in an exclusive state if necessary.
 229
 230 In addition to a state, vnodes also have a count of readers. When a
 231 caller gets a vnode with a read lock, we of course must wait for the
 232 vnode to be in a nonexclusive state (VnWaitExclusive_r), then the number
 233 of readers is incremented (VnBeginRead_r), but the vnode is kept in a
 234 non-exclusive state (VN_STATE_READ).
 235
 236 When a caller gets a vnode with a write lock, we must wait not only for
 237 the vnode to be in a nonexclusive state, but also for there to be no
 238 readers (VnWaitQuiescent_r), so we can actually change it.
 239
 240 VnLock still exists in DAFS, but it's almost a no-op. All we do for DAFS
 241 in VnLock is set vnp->writer to the current thread id for a write lock,
 242 for some consistency checks later (read locks are actually no-ops).
 243 Actual mutual exclusion in DAFS is done by the vnode state machine and
 244 the reader count.
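
To make the interplay between the vnode state and the reader count
concrete, here is a small non-blocking model. It is only a sketch: the
real code waits (VnWaitExclusive_r, VnWaitQuiescent_r) where this
model simply returns false, and the state names, struct, and function
names are assumptions made for the example.

```c
#include <stdbool.h>

/* Hypothetical states: idle, being read, and exclusively held. */
enum vnm_state { VNM_ONLINE, VNM_READ, VNM_EXCLUSIVE };

struct vnode_model {
    enum vnm_state state;
    int readers;
};

/* A reader may enter unless the vnode is in an exclusive state. */
bool vnm_try_begin_read(struct vnode_model *vn)
{
    if (vn->state == VNM_EXCLUSIVE)
        return false;            /* real code would wait here */
    vn->readers++;
    vn->state = VNM_READ;        /* still a non-exclusive state */
    return true;
}

void vnm_end_read(struct vnode_model *vn)
{
    if (--vn->readers == 0)
        vn->state = VNM_ONLINE;
}

/* A writer needs a non-exclusive state AND zero readers. */
bool vnm_try_begin_write(struct vnode_model *vn)
{
    if (vn->state == VNM_EXCLUSIVE || vn->readers > 0)
        return false;            /* real code would wait here */
    vn->state = VNM_EXCLUSIVE;
    return true;
}

void vnm_end_write(struct vnode_model *vn)
{
    vn->state = VNM_ONLINE;
}
```

Multiple readers can be active at once, a writer is admitted only when
the vnode is quiescent, and nobody is admitted while a writer is active.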
 245
 246  - viced state serialization
 247
 248 See src/tviced/serialize_state.* and ShutDownAndCore in
 249 src/viced/viced.c
 250
 251 Before DAFS, whenever a fileserver restarted, it lost all information
 252 about all clients, what callbacks they had, etc. So when a client with
 253 existing callbacks contacted the fileserver, all callback information
 254 needed to be reset, potentially causing a bunch of unnecessary traffic.
 255 And of course, if the client did not contact the fileserver again, it
 256 would never be sent callbacks that it should have been sent.
 257
 258 DAFS now has the ability to save the host and CB data to a file on
 259 shutdown, and restore it when it starts up again. So when a fileserver
 260 is restarted, the host and CB information should be effectively the same
 261 as when it shut down. So a client may not even know if a fileserver was
 262 restarted.
 263
 264 Getting this state information can be a little difficult, since the host
 265 package data structures aren't necessarily always consistent, even after
 266 H_LOCK is dropped. What we attempt to do is stop all of the background
 267 threads early in the shutdown process (set fs_state.mode =
 268 FS_MODE_SHUTDOWN), and wait for the background threads to exit (or be
 269 marked as 'tranquil'; see the fs_state struct) later on, before trying
 270 to save state. This makes it a lot less likely for anything to be
 271 modifying the host or CB structures by the time we try to save them.
 272
 273  - volume group cache
 274
 275 See: src/vol/vg_cache* and src/vol/vg_scan.c
 276
 277 The VGC is a mechanism in DAFS to speed up volume salvages. Pre-VGC,
 278 whenever the salvager code salvaged an individual volume, it would need
 279 to read all of the volume headers on the partition to learn which
 280 volumes are in the volume group it is salvaging, and thus which
 281 volumes to tell the fileserver to take offline. With demand-salvages,
 282 this can make salvaging take a very long time, since reading in
 283 all volume headers can take much more time than actually
 284 salvaging a single volume group.
 285
 286 To prevent the need to scan the partition volume headers every single
 287 time, the fileserver maintains a cache of which volumes are in what
 288 volume groups. The cache is populated by scanning a partition's volume
 289 headers; the scan is started in the background upon receiving the first
 290 salvage request for a partition (VVGCache_scanStart_r,
 291 _VVGC_scan_start).
 292
 293 After the VGC is populated, it is kept up to date with volumes being
 294 created and deleted via the FSSYNC VG_ADD and VG_DEL
 295 commands. These are called every time a volume header is created,
 296 removed, or changed when using the volume header wrappers in vutil.c
 297 (VCreateVolumeDiskHeader, VDestroyVolumeDiskHeader,
 298 VWriteVolumeDiskHeader). These wrappers should always be used to
 299 create/remove/modify vol headers, to ensure that the necessary FSSYNC
 300 commands are called.
 301
 302  -- race prevention
 303
 304 In order to prevent races between volume changes and VGC partition scans
 305 (that is, someone scans a header while it is being written and not yet
 306 valid), updates to the VGC involving adding or modifying volume headers
 307 should always be done under the 'partition header lock'. This is a
 308 per-partition lock to conceptually lock the set of volume headers on
 309 that partition. It is only read-held when something is writing to a
 310 volume header, and it is write-held for something that is scanning the
 311 partition for volume headers (the VGC or partition salvager). This is a
 312 little counterintuitive, but it is what we want.  We want multiple
 313 headers to be written to at once, but if we are the VGC scanner, we want
 314 to ensure nobody else is writing when we look at a header file.
 315
 316 Because the race described above is so rare, vol header scanners don't
 317 actually hold the lock unless a problem is detected. So, what they do is
 318 read a particular volume header without any lock, and if there is a
 319 problem with it, they grab a write lock on the partition vol headers,
 320 and try again. If it still has a problem, the header is just faulty; if
 321 it's okay, then we avoided the race.
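
The optimistic read-then-retry approach used by header scanners can be
sketched as follows. This is a control-flow illustration only: the
lock is modeled as a plain struct rather than a real lock, and the
types, function names, and the three example readers are hypothetical.

```c
#include <stdbool.h>

/* A header reader returns true if the header parsed as valid. */
typedef bool (*hdr_reader)(void *ctx);

/* Hypothetical stand-in for the partition header lock. */
struct part_hdr_lock_model {
    bool write_held;
    int write_acquisitions;   /* how often the slow path was taken */
};

static void model_lock_write(struct part_hdr_lock_model *lk)
{
    lk->write_held = true;
    lk->write_acquisitions++;
}

static void model_unlock_write(struct part_hdr_lock_model *lk)
{
    lk->write_held = false;
}

/* Read without the lock first; only on failure, take the write lock
 * (excluding header writers) and try once more. Returns true if the
 * header is usable, false if it is genuinely faulty. */
bool scan_one_header(struct part_hdr_lock_model *lk,
                     hdr_reader read_hdr, void *ctx)
{
    bool ok;

    if (read_hdr(ctx))
        return true;          /* common case: no lock needed */

    model_lock_write(lk);
    ok = read_hdr(ctx);       /* no writer can be mid-update now */
    model_unlock_write(lk);
    return ok;
}

/* Example readers for trying the sketch. ctx points to an int. */
bool reader_ok(void *ctx)         { (void)ctx; return true; }
bool reader_always_bad(void *ctx) { (void)ctx; return false; }
bool reader_torn_once(void *ctx)
{
    int *calls = ctx;
    return (*calls)++ > 0;    /* fails the first time only */
}
```

With reader_torn_once, the first unlocked read fails and the locked
retry succeeds, which corresponds to the "we avoided the race" case.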
 322
 323 Note that destroying vol headers does not require any locks, since
 324 unlink()s are atomic and don't cause any races for us here.
 325
 326  - partition and volume locking
 327
 328 Previously, whenever the volserver would attach a volume or the salvager
 329 would salvage anything, the partition would be locked
 330 (VLockPartition_r). This unnecessarily serializes part of most volserver
 331 operations. It also makes it so only one salvage can run on a partition
 332 at a time, and that a volserver operation cannot occur at the same time
 333 as a salvage. With the addition of the VGC (previous section), the
 334 salvager partition lock is unnecessary on namei, since the salvager does
 335 not need to scan all volume headers.
 336
 337 Instead of the rather heavyweight partition lock, in DAFS we now lock
 338 individual volumes. Locking an individual volume is done by locking a
 339 certain byte in the file /vicepX/.volume.lock. To lock the volume with ID
 340 1234, you lock 1 byte at offset 1234 (with VLockFile: fcntl on unix,
 341 LockFileEx on windows as of the time of this writing). To read-lock the
 342 volume, acquire a read lock; to write-lock the volume, acquire a write
 343 lock.
 344
 345 Due to the potentially very large number of volumes attached by the
 346 fileserver at once, the fileserver does not keep volumes locked the
 347 entire time they are attached (which would make volume locking
 348 potentially very slow). Rather, it locks the volume before attaching,
 349 and unlocks it when the volume has been attached. However, all other
 350 programs are expected to acquire a volume lock for the entire duration
 351 they interact with the volume. Whether a read or write lock is obtained
 352 is determined by the attachment mode, and whether or not the volume in
 353 question is an RW volume (see VVolLockType()).
 354
 355 These locks are all acquired non-blocking, so we can just fail if we
 356 fail to acquire a lock. That is, an errant process holding a file-level
 357 lock cannot cause any process to just hang, waiting for a lock.
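
On unix, locking one byte at the offset of the volume ID without
blocking can be done with POSIX fcntl() record locks. The following is
a minimal sketch of that technique, not the VLockFile implementation;
the function names vol_byte_lock and vol_byte_unlock are made up for
the example.

```c
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Try to lock the single byte at offset vol_id in the already-open
 * lock file fd. type is F_RDLCK or F_WRLCK. F_SETLK never blocks: it
 * returns -1 (errno EACCES or EAGAIN) if another process holds a
 * conflicting lock. Assumes off_t is wide enough for a 32-bit id. */
int vol_byte_lock(int fd, uint32_t vol_id, short type)
{
    struct flock fl;

    memset(&fl, 0, sizeof(fl));
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = (off_t)vol_id;
    fl.l_len = 1;
    return fcntl(fd, F_SETLK, &fl);
}

/* Release whatever lock this process holds on that byte. */
int vol_byte_unlock(int fd, uint32_t vol_id)
{
    struct flock fl;

    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_UNLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = (off_t)vol_id;
    fl.l_len = 1;
    return fcntl(fd, F_SETLK, &fl);
}
```

Note that fcntl() locks are per-process, so a second lock request from
the same process never conflicts with the first. The file does not need
to be as large as the offset being locked.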
 358
 359  -- re-reading volume headers
 360
 361 Since we cannot know whether a volume is writeable or not until the
 362 volume header is read, and we cannot atomically upgrade file-level
 363 locks, part of attachment can now occur twice (see attach2 and
 364 attach_volume_header). What occurs is we read the vol header, assuming
 365 the volume is readonly (acquiring a read or write lock as necessary).
 366 If, after reading the vol header, we discover that the volume is
 367 writable and that means we need to acquire a write lock, we read the vol
 368 header again while acquiring a write lock on the header.
 369
 370  -- verifying checkouts
 371
 372 Since the fileserver does not hold volume locks for the entire time a
 373 volume is attached, there could have been a potential race between the
 374 fileserver and other programs. Consider when a non-fileserver program
 375 checks out a volume from the fileserver via FSSYNC, then locks the
 376 volume. Before the program locked the volume, the fileserver could have
 377 restarted and attached the volume. Since the fileserver releases the
 378 volume lock after attachment, the fileserver and the other program could
 379 both think they have control over the volume, which is a problem.
 380
 381 To prevent this, non-fileserver programs are expected to verify that
 382 their volume is checked out after locking it (FSYNC_VerifyCheckout).
 383 What this does is ask the fileserver for the current volume operation on
 384 the specific volume, and verify that it matches how the program
 385 checked out the volume.
 386
 387 For example, programType X checks out volume V from the fileserver, and
 388 then locks it. We then ask the fileserver for the current volume
 389 operation on volume V. If the programType on the vol operation does not
 390 match (or the PID, or the checkout mode, or other things), we know the
 391 fileserver must have restarted or something similar, and we do not have
 392 the volume checked out like we thought we did.
 393
 394 If the program determines that the fileserver may have restarted, it
 395 then must retry checking out and locking the volume (or return an
 396 error).
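
The comparison at the heart of the verification step can be sketched
as below. This is an illustrative model, not FSYNC_VerifyCheckout
itself: the struct, its fields, and the function name are assumptions,
and the real check may compare more than these three values.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of a pending volume operation. */
struct vol_op_model {
    int program_type;    /* which kind of program checked it out */
    long pid;            /* process that checked it out */
    int checkout_mode;   /* how it was checked out */
};

/* mine: what this program believes it checked out.
 * server_view: what the fileserver reports as the current operation
 * on the volume, or NULL if it reports none (for example because it
 * restarted). Returns true only if the checkout is still ours. */
bool checkout_still_valid(const struct vol_op_model *mine,
                          const struct vol_op_model *server_view)
{
    if (server_view == NULL)
        return false;
    return mine->program_type == server_view->program_type
        && mine->pid == server_view->pid
        && mine->checkout_mode == server_view->checkout_mode;
}
```

A caller would run this after taking the volume lock, and on a false
result either retry the checkout and lock, or return an error.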