The Demand-Attach FileServer (DAFS) has resulted in many changes to how many things on AFS fileservers behave. The most sweeping changes are probably in the volume package, but significant changes have also been made in the SYNC protocol, the vnode package, salvaging, and a few miscellaneous bits in the various fileserver processes.

This document serves as an overview for developers on how to deal with these changes, and how to use the new mechanisms. For more specific details, consult the relevant doxygen documentation, the code comments, and/or the code itself.

- The salvageserver

The salvageserver (or 'salvaged') is a new OpenAFS fileserver process in DAFS. This daemon accepts salvage requests via SALVSYNC (see below), and salvages a volume group by fork()ing a child, and running the normal salvager code (it enters vol-salvage.c by calling SalvageFileSys1).

Salvages that are initiated from a request to the salvageserver (called 'demand-salvages') occur automatically; whenever the fileserver (or other tool) discovers that a volume needs salvaging, it will schedule a salvage on the salvageserver without any intervention needed.

When scheduling a salvage, the vol id should be the id for the volume group (the RW vol id). If the salvaging child discovers that it was given a non-RW vol id, it will send the salvageserver a SALVSYNC LINK command, and will exit. This will tell the salvageserver that whenever it receives a salvage request for that vol id, it should schedule a salvage for the corresponding RW id instead.

- FSSYNC/SALVSYNC

The FSSYNC and SALVSYNC protocols are the protocols used for interprocess communication between the various fileserver processes. FSSYNC is used for querying the fileserver for volume metadata, 'checking out' volumes from the fileserver, and a few other things. SALVSYNC is used to schedule and query salvages in the salvageserver.

FSSYNC existed prior to DAFS, but it encompasses a much larger set of commands with the advent of DAFS.
SALVSYNC is entirely new to DAFS.

-- SYNC

FSSYNC and SALVSYNC are both layered on top of a protocol called SYNC. SYNC isn't much of a protocol in itself; it just handles some boilerplate for the messages passed back and forth, and some error codes common to both FSSYNC and SALVSYNC.

SYNC is layered on top of TCP/IP, though we only use it to communicate with the local host (usually via a unix domain socket). It does not handle anything like authentication, authorization, or even things like serialization. Although it uses network primitives for communication, it's only useful for communication between processes on the same machine, and that is all we use it for.

SYNC calls are basically RPCs, but very simple. The calls are always synchronous, and each SYNC server can only handle one request at a time. Thus, it is important for SYNC server handlers to return as quickly as possible; hitting the network or disk to service a SYNC request should be avoided as far as possible.

SYNC-related source files are src/vol/daemon_com.c and src/vol/daemon_com.h.

-- FSSYNC

--- server

The FSSYNC server runs in the fileserver; source is in src/vol/fssync-server.c.

As mentioned above, FSSYNC handlers should finish quickly when servicing a request, so hitting the network or disk should be avoided. In particular, you absolutely cannot make a SALVSYNC call inside an FSSYNC handler; the SALVSYNC client wrapper routines actively prevent this from happening, so even if you try to do such a thing, you will not be allowed to. This prohibition is to prevent deadlock, since the salvageserver could have made the FSSYNC request that you are servicing.

When a client makes a FSYNC_VOL_OFF or NEEDVOLUME request, the fileserver offlines the volume if necessary, and keeps track that the volume has been 'checked out'. A volume is left online if the checkout mode indicates the volume cannot change (see VVolOpLeaveOnline_r).
Until the volume has been 'checked in' with the ON, LEAVE_OFFLINE, or DONE commands, no other program can check out the volume.

Other FSSYNC commands include abilities to query volume metadata and stats, to force volumes to be attached or offline, and to update the volume group cache. See doc/arch/fssync.txt for documentation on the individual FSSYNC commands.

--- clients

FSSYNC clients are generally any OpenAFS process that runs on a fileserver and tries to access volumes directly. The volserver, salvageserver, and bosserver all qualify, as do (sometimes) some utilities like vol-info or vol-bless. For issuing FSSYNC commands directly, there is the debugging tool fssync-debug. FSSYNC client code is in src/vol/fssync-client.c, but it's not very interesting.

Any program that wishes to directly access a volume on disk must check out the volume via FSSYNC (NEEDVOLUME or OFF commands), to ensure the volume doesn't change while the program is using it. If the program determines that the volume is somehow inconsistent and should be salvaged, it should send the FSSYNC command FORCE_ERROR with reason code FSYNC_SALVAGE to the fileserver, which will take care of salvaging it.

-- SALVSYNC

The SALVSYNC server runs in the salvageserver; code is in src/vol/salvsync-server.c. SALVSYNC clients are just the fileserver, the salvageserver run with the -client switch, and the salvageserver worker children. If any other process notices that a volume needs salvaging, it should issue a FORCE_ERROR FSSYNC command to the fileserver with the FSYNC_SALVAGE reason code.

The SALVSYNC protocol is simpler than the FSSYNC protocol. The commands are basically just to create, cancel, change, and query salvages. The RAISEPRIO command increases the priority of a salvage job that hasn't started yet, so volumes that are accessed more frequently will get salvaged first.
The LINK command is used by the salvageserver worker children to inform the salvageserver parent that it tried to salvage a readonly volume for which a read-write clone exists (in which case we should just schedule a salvage for the parent read-write volume).

Note that canceling a salvage is just for salvages that haven't run yet; it only takes a salvage job off of a queue; it doesn't stop a salvageserver worker child in the middle of a salvage.

- The volume package

-- refcounts

Before DAFS, the Volume struct just had one reference count, vp->nUsers. With DAFS, we now have the notion of an internal/lightweight reference count, and an external/heavyweight reference count. Lightweight refs are acquired with VCreateReservation_r, and released with VCancelReservation_r. Heavyweight refs are acquired as before, normally with a GetVolume or AttachVolume variant, and released with VPutVolume.

Lightweight references are only acquired within the volume package; a vp should not be given to e.g. the fileserver code with an extra lightweight ref. A heavyweight ref is generally acquired for a vp that will be given to some non-volume-package code; acquiring a heavyweight ref guarantees that the volume header has been loaded. Acquiring a lightweight ref just guarantees that the volume will not go away or suddenly become unavailable after dropping VOL_LOCK.

Certain operations like detachment or scheduling a salvage only occur when all of the heavy and lightweight refs go away; see VCancelReservation_r.

-- state machine

Instead of having a per-volume lock, each vp always has an associated 'state', that says what, if anything, is occurring to a volume at any particular time; or if the volume is attached, offline, etc. To do the basic equivalent of a lock -- that is, ensure that nobody else will change the volume when we drop VOL_LOCK -- you can put the volume in what is called an 'exclusive' state (see VIsExclusiveState).
When a volume is in an exclusive state, no thread should modify the volume (or expect the vp data to stay the same), except the thread that put it in that state.

Whenever you manipulate a volume, you should make sure it is not in an exclusive state; first call VCreateReservation_r to make sure the volume doesn't go away, and then call VWaitExclusiveState_r. When that returns, you are guaranteed to have a vp that is in a non-exclusive state, and so can be manipulated. Call VCancelReservation_r when done with it, to indicate you don't need it anymore.

Look at the definition of the VolState enumeration to see all volume states, and a brief explanation of them.

-- VLRU

See: Most functions with VLRU in their name in src/vol/volume.c.

The VLRU is what dictates when volumes are detached after a certain amount of inactivity. The design is pretty much a generational garbage collection mechanism. There are 5 VLRU queues that a volume can be on (VLRUQueueName in volume.h). 'Candidate' volumes haven't seen activity in a while, and so are candidates to be detached. 'New' volumes have seen activity only recently; 'mid' volumes have seen activity for a while, and 'old' volumes have seen activity for a long while. 'Held' volumes cannot be soft detached at all.

Volumes are moved from new->mid->old if they have had activity recently, and are moved from old->mid->new->candidate if they have not had any activity recently. The definition of 'recently' is configurable by the -vlruthresh fileserver parameter; see VLRU_ComputeConstants for how the thresholds are determined. Volumes start at 'new' on attachment, and if any activity occurs when a volume is on 'candidate', it's moved to 'new' immediately.

Volumes are generally promoted/demoted and soft-detached by VLRU_ScannerThread, which runs every so often and moves volumes between VLRU queues depending on their last access time and the various thresholds (or soft-detaches them, in the case of the 'candidate' queue).
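
The promotion and demotion rules for the VLRU queues can be sketched as a small transition function. This is a toy model, not the code in src/vol/volume.c: the enum, the function name toy_vlru_next, and the single 'active' flag are all made up for illustration, and the real scanner compares last access times against per-generation thresholds rather than taking a boolean.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical queue labels for this sketch only; the real names are in
 * the VLRUQueueName enum in volume.h. */
enum toy_vlru_queue {
    TOY_CANDIDATE,
    TOY_NEW,
    TOY_MID,
    TOY_OLD,
    TOY_HELD
};

/*
 * Return the queue a volume would sit on after one scanner pass, given
 * whether it has seen activity "recently". A real scanner would
 * soft-detach an idle volume on the candidate queue; this model simply
 * leaves it on TOY_CANDIDATE.
 */
enum toy_vlru_queue
toy_vlru_next(enum toy_vlru_queue q, bool active)
{
    switch (q) {
    case TOY_HELD:
        return TOY_HELD;                        /* never demoted */
    case TOY_CANDIDATE:
        return active ? TOY_NEW : TOY_CANDIDATE;
    case TOY_NEW:
        return active ? TOY_MID : TOY_CANDIDATE;
    case TOY_MID:
        return active ? TOY_OLD : TOY_NEW;
    case TOY_OLD:
        return active ? TOY_OLD : TOY_MID;
    }
    assert(!"unknown VLRU queue");
    return q;
}
```

Under this model, an idle 'old' volume needs three consecutive idle passes (old->mid->new->candidate) before it becomes eligible for detachment.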
Soft-detaching just means the volume is taken offline and put into the preattached state.

--- DONT_SALVAGE

The dontSalvage flag in volume headers can be set to DONT_SALVAGE to indicate that a volume probably doesn't need to be salvaged. Before DAFS, volumes were placed on an 'UpdateList' which was periodically scanned, and dontSalvage was set on volumes that hadn't been touched in a while.

With DAFS and the VLRU additions, setting dontSalvage now happens when a volume is demoted a VLRU generation, and no separate list is kept. So if a volume has been idle enough to demote, and it hasn't been accessed in SALVAGE_INTERVAL time, dontSalvage will be set automatically by the VLRU scanner.

-- Vnode

Source files: src/vol/vnode.c, src/vol/vnode.h, src/vol/vnode_inline.h

The changes to the vnode package are largely very similar to those in the volume package. A Vnode is put into specific states, some of which are exclusive and act like locks (see VnChangeState_r, VnIsExclusiveState). Vnodes also have refcounts, incremented and decremented with VnCreateReservation_r and VnCancelReservation_r like you would expect. I/O should be done outside of any global locks; just the vnode is 'locked' by being put in an exclusive state if necessary.

In addition to a state, vnodes also have a count of readers. When a caller gets a vnode with a read lock, we of course must wait for the vnode to be in a nonexclusive state (VnWaitExclusive_r), then the number of readers is incremented (VnBeginRead_r), but the vnode is kept in a non-exclusive state (VN_STATE_READ).

When a caller gets a vnode with a write lock, we must wait not only for the vnode to be in a nonexclusive state, but also for there to be no readers (VnWaitQuiescent_r), so we can actually change it.

VnLock still exists in DAFS, but it's almost a no-op. All we do for DAFS in VnLock is set vnp->writer to the current thread id for a write lock, for some consistency checks later (read locks are actually no-ops).
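
The interaction between a vnode's state and its reader count can be modelled roughly as follows. This is a single-threaded sketch under assumed names (struct toy_vnode, the toy_vn_* functions, and the three-value state enum are invented here); the real code waits on the state with VnWaitExclusive_r and VnWaitQuiescent_r instead of returning false, and has more states than shown.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced set of states for illustration only. */
enum toy_vn_state { TOY_VN_ONLINE, TOY_VN_READ, TOY_VN_EXCLUSIVE };

struct toy_vnode {
    enum toy_vn_state state;
    int readers;                /* number of active readers */
};

/* A reader may enter as long as the vnode is not in an exclusive state. */
bool toy_vn_try_begin_read(struct toy_vnode *vn)
{
    if (vn->state == TOY_VN_EXCLUSIVE)
        return false;           /* real code would wait here */
    vn->readers++;
    vn->state = TOY_VN_READ;    /* still a non-exclusive state */
    return true;
}

void toy_vn_end_read(struct toy_vnode *vn)
{
    assert(vn->readers > 0);
    if (--vn->readers == 0)
        vn->state = TOY_VN_ONLINE;
}

/* A writer needs a non-exclusive state AND zero readers (quiescence). */
bool toy_vn_try_begin_write(struct toy_vnode *vn)
{
    if (vn->state == TOY_VN_EXCLUSIVE || vn->readers > 0)
        return false;           /* real code would wait here */
    vn->state = TOY_VN_EXCLUSIVE;
    return true;
}

void toy_vn_end_write(struct toy_vnode *vn)
{
    assert(vn->state == TOY_VN_EXCLUSIVE);
    vn->state = TOY_VN_ONLINE;
}
```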
Actual mutual exclusion in DAFS is done by the vnode state machine and the reader count.

- viced state serialization

See src/tviced/serialize_state.* and ShutDownAndCore in src/viced/viced.c

Before DAFS, whenever a fileserver restarted, it lost all information about all clients, what callbacks they had, etc. So when a client with existing callbacks contacted the fileserver, all callback information needed to be reset, potentially causing a bunch of unnecessary traffic. And of course, if the client did not contact the fileserver again, it could not be sent the callbacks it should have been sent.

DAFS now has the ability to save the host and CB data to a file on shutdown, and restore it when it starts up again. So when a fileserver is restarted, the host and CB information should be effectively the same as when it shut down. So a client may not even know if a fileserver was restarted.

Getting this state information can be a little difficult, since the host package data structures aren't necessarily always consistent, even after H_LOCK is dropped. What we attempt to do is stop all of the background threads early in the shutdown process (set fs_state.mode to FS_MODE_SHUTDOWN), and wait for the background threads to exit (or be marked as 'tranquil'; see the fs_state struct) later on, before trying to save state. This makes it a lot less likely for anything to be modifying the host or CB structures by the time we try to save them.

- volume group cache

See: src/vol/vg_cache* and src/vol/vg_scan.c

The VGC is a mechanism in DAFS to speed up volume salvages. Pre-VGC, whenever the salvager code salvaged an individual volume, it would need to read all of the volume headers on the partition, so it knows what volumes are in the volume group it is salvaging, so it knows what volumes to tell the fileserver to take offline.
With demand-salvages, this can make salvaging take a very long time, since the time to read in all volume headers can take much more time than the time to actually salvage a single volume group.

To prevent the need to scan the partition volume headers every single time, the fileserver maintains a cache of which volumes are in what volume groups. The cache is populated by scanning a partition's volume headers, and the scan is started in the background upon receiving the first salvage request for a partition (VVGCache_scanStart_r, _VVGC_scan_start).

After the VGC is populated, it is kept up to date with volumes being created and deleted via the FSSYNC VG_ADD and VG_DEL commands. These are called every time a volume header is created, removed, or changed when using the volume header wrappers in vutil.c (VCreateVolumeDiskHeader, VDestroyVolumeDiskHeader, VWriteVolumeDiskHeader). These wrappers should always be used to create/remove/modify vol headers, to ensure that the necessary FSSYNC commands are called.

-- race prevention

In order to prevent races between volume changes and VGC partition scans (that is, someone scans a header while it is being written and not yet valid), updates to the VGC involving adding or modifying volume headers should always be done under the 'partition header lock'. This is a per-partition lock to conceptually lock the set of volume headers on that partition. It is only read-held when something is writing to a volume header, and it is write-held for something that is scanning the partition for volume headers (the VGC or partition salvager). This is a little counterintuitive, but it is what we want. We want multiple headers to be written to at once, but if we are the VGC scanner, we want to ensure nobody else is writing when we look at a header file.

Because the race described above is so rare, vol header scanners don't actually hold the lock unless a problem is detected.
So, what they do is read a particular volume header without any lock, and if there is a problem with it, they grab a write lock on the partition vol headers, and try again. If it still has a problem, the header is just faulty; if it's okay, then we avoided the race.

Note that destroying vol headers does not require any locks, since unlink()s are atomic and don't cause any races for us here.

- partition and volume locking

Previously, whenever the volserver would attach a volume or the salvager would salvage anything, the partition would be locked (VLockPartition_r). This unnecessarily serializes part of most volserver operations. It also makes it so only one salvage can run on a partition at a time, and that a volserver operation cannot occur at the same time as a salvage. With the addition of the VGC (previous section), the salvager partition lock is unnecessary on namei, since the salvager does not need to scan all volume headers.

Instead of the rather heavyweight partition lock, in DAFS we now lock individual volumes. Locking an individual volume is done by locking a certain byte in the file /vicepX/.volume.lock. To lock the volume with ID 1234, you lock 1 byte at offset 1234 (with VLockFile: fcntl on unix, LockFileEx on windows as of the time of this writing). To read-lock the volume, acquire a read lock; to write-lock the volume, acquire a write lock.

Due to the potentially very large number of volumes attached by the fileserver at once, the fileserver does not keep volumes locked the entire time they are attached (which would make volume locking potentially very slow). Rather, it locks the volume before attaching, and unlocks it when the volume has been attached. However, all other programs are expected to acquire a volume lock for the entire duration they interact with the volume. Whether a read or write lock is obtained is determined by the attachment mode, and whether or not the volume in question is an RW volume (see VVolLockType()).
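
The one-byte-per-volume scheme maps directly onto POSIX byte-range locks. The sketch below shows the idea with plain fcntl(); it is not the VLockFile implementation, the function names are invented for this example, and it assumes off_t is wide enough to hold the volume ID.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Try to lock the single byte at offset vol_id in an already-open lock
 * file. 'type' is F_RDLCK (read lock) or F_WRLCK (write lock).
 * Returns 0 on success, or -1 with errno set (typically EAGAIN or
 * EACCES) if another process holds a conflicting lock.
 */
int toy_lock_volume_byte(int fd, off_t vol_id, short type)
{
    struct flock fl;

    assert(type == F_RDLCK || type == F_WRLCK);
    memset(&fl, 0, sizeof(fl));
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = vol_id;        /* byte offset == volume ID */
    fl.l_len = 1;               /* exactly one byte */
    return fcntl(fd, F_SETLK, &fl);     /* F_SETLK does not block */
}

/* Release the byte-range lock for vol_id. */
int toy_unlock_volume_byte(int fd, off_t vol_id)
{
    struct flock fl;

    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_UNLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = vol_id;
    fl.l_len = 1;
    return fcntl(fd, F_SETLK, &fl);
}
```

Note that fcntl() locks belong to the process, so two calls from the same process never conflict with each other; contention only shows up between separate processes. The lock file does not need to be as large as the highest volume ID, since byte-range locks may extend past the end of the file.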
These locks are all acquired non-blocking, so we can just fail if we fail to acquire a lock. That is, an errant process holding a file-level lock cannot cause any process to just hang, waiting for a lock.

-- re-reading volume headers

Since we cannot know whether a volume is writable or not until the volume header is read, and we cannot atomically upgrade file-level locks, part of attachment can now occur twice (see attach2 and attach_volume_header). What occurs is we read the vol header, assuming the volume is readonly (acquiring a read or write lock as necessary). If, after reading the vol header, we discover that the volume is writable and that means we need to acquire a write lock, we read the vol header again while acquiring a write lock on the header.

-- verifying checkouts

Since the fileserver does not hold volume locks for the entire time a volume is attached, there could have been a potential race between the fileserver and other programs. Consider when a non-fileserver program checks out a volume from the fileserver via FSSYNC, then locks the volume. Before the program locked the volume, the fileserver could have restarted and attached the volume. Since the fileserver releases the volume lock after attachment, the fileserver and the other program could both think they have control over the volume, which is a problem.

To prevent this, non-fileserver programs are expected to verify that their volume is checked out after locking it (FSYNC_VerifyCheckout). What this does is ask the fileserver for the current volume operation on the specific volume, and verifies that it matches how the program checked out the volume.

For example, programType X checks out volume V from the fileserver, and then locks it. We then ask the fileserver for the current volume operation on volume V.
If the programType on the vol operation does not match (or the PID, or the checkout mode, or other things), we know the fileserver must have restarted or something similar, and we do not have the volume checked out like we thought we did. If the program determines that the fileserver may have restarted, it then must retry checking out and locking the volume (or return an error).
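
The comparison at the heart of checkout verification can be sketched as below. This is only an illustration of the idea, not FSYNC_VerifyCheckout itself: struct toy_vol_op and its fields are hypothetical stand-ins for whatever the fileserver actually reports, and the real check may compare additional fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical summary of a pending volume operation. */
struct toy_vol_op {
    int program_type;           /* which kind of program checked out */
    pid_t pid;                  /* process that checked out */
    int checkout_mode;          /* how the volume was checked out */
};

/*
 * Compare the operation we believe we hold ('mine') against what the
 * fileserver currently reports ('reported', or NULL if it reports no
 * pending operation). Returns false if anything differs, in which case
 * the caller must redo the checkout and lock, or return an error.
 */
bool toy_checkout_still_valid(const struct toy_vol_op *mine,
                              const struct toy_vol_op *reported)
{
    assert(mine != NULL);
    if (reported == NULL)
        return false;           /* fileserver forgot our checkout */
    return mine->program_type == reported->program_type
        && mine->pid == reported->pid
        && mine->checkout_mode == reported->checkout_mode;
}
```

A caller would run this check after acquiring the volume lock, and loop back to the checkout step (with some bound on retries) whenever it returns false.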