--- /dev/null
+This file contains a threading analysis of Ubik prepared by
+Jeffrey Hutzelman and sent ot the openafs-devel@openafs.org
+mailing list in February 2011, archived at:
+https://lists.openafs.org/pipermail/openafs-devel/2011-February/018329.html
+
+INTRODUCTION
+
+ This document describes the structure of Ubik, with an eye toward
+ thread-safety and concurrency. It begins by discussing the elements,
+ usage, and locking considerations of major shared data structures. This
+ is followed by a discussion of each major subsystem. The emphasis is on
+ deficiencies in thread-safety and the changes needed to correct them.
+
+ Most of this document focuses on the user-mode Ubik server code, which
+ comprises the bulk of the Ubik library code and also of the complexity.
+ A separate section near the end is dedicated to Ubik client code, which
+ itself is intended primarily for use in user-mode applications. The
+ OpenAFS cache manager includes its own mechanisms for accessing
+ replicated services, including the Ubik-managed VLDB, and for tracking
+ the servers that provide them.
+
+ Occasionally, issues related to concurrency, performance, and modularity
+ (particularly as regards support for supporting multiple independent
+ databases in a single server process) are also discussed; however, these
+ discussions are peripheral and are included only to record data obtained
+ while examining the question of thread-safety and as a basis for future
+ work. Similarly, throughout this document are detailed discussions of
+ some parts of the code and certain data structures which, while not
+ always directly relevant to thread-safety, were produced in the course
+ of analyzing the Ubik library code. These are included to provide
+ additional background and as a basis for future work in documentation of
+ Ubik library internals and interfaces.
+
+
+MAJOR DATA STRUCTURES
+
+ This section calls out the major shared data structures used in the Ubik
+ server code and details what they are used for and how shared access to
+ them is protected. Much of this text is probably suitable as a starting
+ point for expanding on the documentation found in comments in ubik.p.h.
+
+ struct ubik_dbase - database
+
+ This structure contains nearly everything Ubik knows about an open
+ database (for an explanation of "nearly", see the section below on
+ globals), including database-wide parameters and state and a list of
+ active transactions.
+
+ The following structure elements are set only when the database
+ structure is being initialized, after which they are never modified:
+
+ + pathName - The base path used to construct database filenames
+ + physical later method pointers (read, write, truncate, sync,
+ stat, open, setlabel, getlabel, getnfiles), used to manipulate
+ the physical database files on disk. These currently always
+ point to the methods defined in phys.c, and in fact other
+ modules contain knowledge of the internal implementation of that
+ module.
+
+ The primary synchronization mechanism used to protect this structure
+ is the database lock (versionLock), which is manipulated via the
+ DBHOLD() and DBRELE() macros. This lock protects the following
+ structure elements, as well as a number of globals as described
+ elsewhere in this document:
+
+ + activeTrans - list of active transactions
+ + version - database version (*)
+ + tidCounter - transaction ID of latest transaction (*)
+ + writeTidCounter - transaction ID of latest write transaction (*)
+ + flags (*)
+ + readers
+
+ (*) These fields are currently protected by the database lock, except
+ that the beacon thread does not hold that lock when examining them,
+ or the global 'ubik_epochTime', which holds the corresponding epoch.
+ The recommendation is for these fields, along with ubik_epochTime, to
+ be additionally protected by a second lock, allowing the beacon
+ thread to examine them without holding the database lock; see the
+ section on the BEACON package for details.
+
+ The condition variable 'version_cond' is used to signal to that the
+ database version may have changed; it is broadcast in udisk_commit(),
+ in SDISK_SendFile(), and from the recovery thread; it is monitored by
+ ubik_WaitVersion(), which can be called by an application to wait for
+ a database version change (this is not currently used in OpenAFS).
+ This CV is associated with the database lock. When LWP is used, this
+ condition is signalled on &ubik_dbase->version.
+
+ The condition variable 'flags_cond' is used by udisk_end() to signal
+ that DBWRITING flag has been cleared. This wakes threads waiting in
+ ubik.c:BeginTrans() to begin a new transaction. This CV is
+ associated with the database lock. When LWP is used, this condition
+ is signalled on &ubik_dbase->flags.
+
+ When LWP is used, signals are also sometimes sent on &urecovery_state
+ and on &ubik_amSyncSite. However, nothing ever waits on these, and
+ no corresponding condition variables exist when pthreads is used.
+
+ The 'cachedVersion' field holds the database version represented by
+ the application's cache; this is compared against the current
+ database version to discover if the cache is up to date. Both
+ cachedVersion and the application cache itself are protected by
+ cache_lock, an AFS reader/writer lock.
+
+ When the application wishes to use the cache, it calls
+ ubik_CacheUpdate(), which verifies that the cache is current,
+ updating it if necessary (under a write lock), and then returns with
+ the application cache read-locked. The application can then rely on
+ the cache's contents not to change until it ends the transaction by
+ calling ubik_EndTrans() or ubik_AbortTrans(). When one of these
+ functions is called, the read lock is dropped, and a write lock may
+ be acquired temporarily to update both cachedVersion and the cache
+ (in the case of committing a write transaction) or to clear
+ cachedVersion, indicating the cache is invalid (in the case of
+ aborting a transaction).
+
+ In the case of ubik_EndTrans() on a write transaction, the cache is
+ held write-locked until udisk_commit() has completed; this is also
+ done in SDISK_Commit(), which insures that the contents of the
+ on-disk database and page cache cannot change while an active
+ transaction holds the cache lock. Note that both the recovery thread
+ and SDISK_SendFile() may replace the database contents wholesale;
+ however, these operations first abort all outstanding transactions.
+
+ Prior to the introduction of ubik_BeginTransReadAnyWrite(), an
+ application could choose to manage its own cache, updating it during
+ write transactions and when, at the beginning of a read transaction,
+ it was discovered to be out of date (due to a write done by another
+ server). Under this model, the application uses Ubik transaction
+ locks to protect not only the database but also its cache. However,
+ the use of ubik_BeginTransReadAnyWrite() allows some transactions to
+ proceed and read the database without acquiring any locks, which
+ makes it unsafe to depend on Ubik transaction locks to protect the
+ application cache. Applications which make use of this mode must set
+ ubik_SyncWriterCacheProc to point to a procedure to be called to
+ update the cache during ubik_EndTrans(), after udisk_commit() has
+ succeeded and before the cache_lock is released. These applications
+ must then refrain from ever updating the cache other than during that
+ procedure or during the callback passed to ubik_CheckCache().
+
+
+ struct ubik_trans - active transaction
+
+ This structure represents the state of an active transaction,
+ including transaction metadata and state and a list of all changes
+ made as part of the transaction.
+
+ Transactions are organized into a linked list referenced by the
+ corresponding database structure, protected by the database lock.
+ Each transaction contains an upward pointer to the database,
+ a transaction type, and a transaction ID, all of which are immutable
+ for as long as the transaction exists.
+
+ A transaction created via the REMOTE interface exists until ended by
+ a call to SDISK_Abort() or SDISK_ReleaseLocks() or aborted in
+ urecovery_CheckTid() due to a transaction ID mismatch or creation of
+ a new transaction (note that in this last case, cleanup will be
+ deferred if SDISK_Lock() is blocked on this transaction). In these
+ cases, ubik_currentTrans is cleared at the same time, under the
+ database lock, and further REMOTE calls will fail gracefully until
+ a new transaction is started. Thus, REMOTE transactions may be
+ considered valid only for as long as the database lock is held.
+
+ Transactions created via ubik_BeginTrans(), ubik_BeginTransReadAny(),
+ or ubik_BeginTransReadAnyWrite() exist until ended by a call to
+ ubik_AbortTrans() or ubik_EndTrans(). They are never destroyed from
+ within the UBIK package. Code which discovers a transaction by
+ traversing the database's active transaction list may access the
+ transaction only as long as the database lock is held, since once it
+ is released, an application thread may end the transaction.
+
+ Each transaction contains the following fields, which are protected
+ by the database lock:
+
+ + locktype - type of lock held by this transaction
+ + minCommitTime - unused
+ + seekFile - file number of database file pointer
+ + seekPos - position of database file pointer
+ + flags
+
+ File writes are represented by 'iovec_info', an XDR vector structure
+ (type iovec_wrt, a vector of struct ubik_iovec) containing the file,
+ position, and length of each write, and 'iovec_data', an XDR opaque
+ containing the corresponding data. These vectors and their contents
+ are protected by the database lock.
+
+ File truncations are represented by a linked list of struct
+ ubik_trunc, each of which contains a file number and the new size of
+ that file. This list, and the records contained in it, are protected
+ by the database lock.
+
+ While transactions are currently protected by the database lock, it
+ may be desirable in the future to introduce a per-transaction lock,
+ in order to facilitate increased concurrency.
+
+
+ struct ubik_server - server status
+
+ This structure is used to track the set of peer servers, including
+ state related to elections and recovery. The global 'ubik_servers'
+ points to a linked list of these structures, which is constructed
+ during initialization. Once the list has been constructed, no
+ entries are ever added or removed; in addition, the following fields
+ are not changed after startup:
+
+ + addr[0] - server's primary address
+ + magic - true if this is the magic server
+ + isClone - true if this server is a clone
+
+ The following fields are used by the BEACON package to track beacons
+ to this server and their results. Currently, they are sometimes
+ accessed under the database lock, and sometimes without benefit of
+ any lock. They should probably be protected by the same lock as
+ globals private to the BEACON package.
+
+ + lastVoteTime - last time this server voted yes for us
+ + lastBeaconSent - last time a beacon was sent to this server
+ + lastVote - true if its late vote response was yes
+ + up - true if this server is believed to be up
+ + beaconSinceDown - true if a beacon has successfully been sent to
+ this server since the last time it was down
+
+ The following fields are used by the RECOVERY package to track the
+ state of each server. They are also examined and modified by the
+ various ContactQuorum_* functions. Currently, they are sometimes
+ accessed under the database lock, and sometimes without benefit of
+ any lock. They should be fully protected by the database lock.
+
+ + version - server's database version
+ + currentDB - true if server's database is up-to-date
+
+ The addr[] array is used to track all known addresses for the remote
+ server; this list is used when attempting to contact a down server
+ via an alternate address, and to identify the source of incoming
+ server-to-server RPCs. The first element of this array is always the
+ primary address and does not change after initialization; other
+ elements are updated during initialization and when a remote server
+ calls DISK_UpdateInterfaceAddr().
+
+ The system normally maintains an active Rx connection to the VOTE and
+ DISK services on each server; pointers to these are kept in the
+ 'vote_rxcid' and 'disk_rxcid' fields.
+
+ Currently, the addr[] array, 'vote_rxcid', and 'disk_rxcid' are not
+ protected by any lock. Because these are shared among a number of
+ subsystems and it is usually desirable to avoid blocking on the
+ database lock when making RPCs, it is recommended to introduce a new
+ lock (the "server address lock") to protect these fields and the
+ globals 'ubikSecClass' and 'ubikSecIndex', which contain the security
+ classes used for setting up such connections. The new lock would
+ need to be held in the following cases:
+ + Before beginning any RPC using 'vote_rxcid' or 'disk_rxcid', for
+ the time it takes to call rx_GetConnection() or rx_NewCall()
+ + For the duration of ubikGetPrimaryInterfaceAddr()
+ + In ubeacon_Interact(), when building the set of connections to be
+ used for multi_VOTE_Beacon().
+ + In updateUbikNetworkAddress(), first while constructing the set of
+ connections to be used for multi_DISK_UpdateInterfaceAddr(), and
+ then again each time a server's list of addresses is updated.
+ However, the lock must NOT be held while waiting for the RPC to
+ complete, to avoid deadlock when multiple servers are starting at
+ the same time.
+ + For the duration of SDISK_UpdateInterfaceAddr()
+ + For the duration of ubeacon_ReinitServer(), with the optional
+ exception of the call to afsconf_UpToDate().
+ + In src/ubik/recovery.c:DoProbe(), first while constructing the set
+ of connections to be used for multi_DISK_Probe() and, if the probe
+ is successful, again while destroying the old connections for the
+ server under test and replacing them with new ones. See the
+ section on the RECOVERY package for additional notes.
+
+
+GLOBAL VARIABLES
+
+ The library has an unfortunate number of global variables which are not
+ part of any data structure. Nonetheless, some of these are
+ database-specific; see MULTI-DATABASE SUPPORT, below.
+
+ Globals private to a particular subsystem are discussed in the
+ description of that subsystem. This section discusses global variables
+ which are shared and/or part of external interfaces.
+
+ The following globals are used to convey configuration parameters from
+ the application to the Ubik library, and are intended to be set before
+ the first call to ubik_ServerInit(). They are not modified by Ubik, and
+ are not protected by locks:
+
+ + ubik_debugFlag - print debug messages (unused)
+ + ubikPrimaryAddrOnly - use only primary interface
+ + ubik_nBuffers - number of DISK package page cache buffers
+ + ubik_SRXSecurityProc - callback to create server security class
+ + ubik_SRXSecurityRock - argument for ubik_SRXSecurityProc
+ + ubik_CRXSecurityProc - callback to create client security class
+ + ubik_CRXSecurityRock - argument for ubik_CRXSecurityProc
+ + ubik_SyncWriterCacheProc - callback to update application cache
+ + ubik_CheckRXSecurityProc - callback to check if caller is OK
+ + ubik_CheckRXSecurityRock - argument for ubik_CheckRXSecurityProc
+
+ The following globals contain values which are computed during Ubik
+ initialization and not modified thereafter. They are not protected by
+ any lock, which is safe so long as suitable memory barriers exist across
+ creation of threads and Rx services (see the notes on initialization in
+ the UBIK package section). Note that some of these values are
+ database-specific and properly belong in the ubik_dbase structure:
+
+ + ubik_servers - linked list of servers (see above)
+ + amIClone - true if this server is a clone
+ + ubik_callPortal - UDP port used for server-to-server RPCs
+ + ubik_quorum - number of servers to make a quorum
+ + ubik_dbase - pointer to the (only) database (see above)
+ + ubik_host[] - array of local addresses
+ + ubik_sc[] - security classes for Ubik RPC services
+
+ The following globals are currently protected by the database lock,
+ though in most cases there are existing references which do not hold the
+ lock. Later sections of this document contain recommendations in each
+ case to establish new locks to protect these or to correct cases where
+ they are accessed without the correct lock.
+
+ + ubik_currentTrans - pointer to REMOTE's current transaction
+ + ubik_epochTime - epoch for transaction IDs
+ + urecovery_state - recovery state bits
+ + ubik_dbVersion - sync site's database version
+ + ubik_dbTid - sync site's transaction ID
+ + ubikSecIndex - security class index for making Ubik RPCs
+ + ubikSecClass - security class for making Ubik RPCs
+
+ The global 'ubik_amSyncSite' belongs to the BEACON package, and should
+ be protected by the same lock as that package's private globals. In
+ fact, it should probably be made private; the only references outside
+ BEACON are in SVOTE_Debug() and SVOTE_DebugOld() and can easily be moved
+ into ubeacon_Debug().
+
+
+MAJOR SUBSYSTEMS
+
+ This section describes each of the major Ubik server subsystems. For
+ each subsystem, a brief overview is provided, along with a description
+ of that subsystem's handling of synchronization (or lack thereof).
+ Subsystems are described beginning with the lowest layer and working
+ toward the highest.
+
+ PHYS - Physical I/O
+
+ This package is responsible for handling I/O to database files. It
+ maintains a private, global array of MAXFDCACHE(4) fdcache
+ structures, each representing an open file descriptor on a database
+ file, either presently in use or idle and available for use. Entries
+ are indexed by database file ID, and each contain a file descriptor
+ number and a refcount. There is also a global 'inited' flag, and
+ a buffer used to construct pathnames when opening database files.
+
+ The static function uphys_open() takes a database pointer and fileid
+ and attempts to open the corresponding file. On success, it returns
+ a file descriptor; on failure, it returns the same as open(2) (with
+ errno set). This prefers to reuse an already-open fd whose cache
+ refcount is 0; failing that, it opens a new fd and attempts to enter
+ it in the cache. It prefers first to use an unused cache entry, then
+ to reclaim an idle one whose refcount is 0; if neither is possible,
+ the newly-opened fd is returned without caching. Note that the
+ refcount on an active entry is always exactly 1; the same entry
+ cannot be referenced more than once.
+
+ The function uphys_close() releases a file descriptor opened by
+ uphys_open(). If the file descriptor was cached, the cache entry is
+ dereferenced but the file is left open for future use; 0 is returned.
+ If the file descriptor is not found in the cache, then the return
+ value is that of an immediate call to close(2). This function also
+ handles cleanup when a cached fd is closed after the file to which it
+ refers is invalidated by uphys_invalidate(). This function is not
+ static, but may as well be -- it is used only by other functions in
+ the PHYS package.
+
+ Functions in this package are called via method pointers in the
+ database structure, which are set during initialization and not
+ changed thereafter. These functions must be called with the database
+ lock held; this protects the file descriptor cache and other globals
+ mentioned above, and prevents conflicting access to database files.
+
+ The requirement that functions in this package be called with the
+ database lock held means that concurrent access to database files is
+ not possible. This situation could be improved by the addition of
+ a new global lock, which would be held for most of the duration of
+ uphys_open(), uphys_close(), and uphys_invalidate(), but need _not_
+ be held during actual file accesses. This would eliminate the
+ requirement that calls to this package be made with the database lock
+ held, allowing multiple calls to be active concurrently. However, it
+ would still be necessary for higher layers to employ appropriate
+ synchronization as needed to maintain database consistency.
+
+
+ DISK - Logical Database I/O and Transaction Management
+
+ This package is responsible for logical database file I/O, logging,
+ and transaction management. It maintains a private cache of database
+ file page buffers. This entire cache, including buffer space, is
+ allocated and initialized the first time a transaction is started;
+ the global initd flag keeps track of whether or not this has been
+ done. The size of this cache defaults to 20 pages, but may be
+ overridden by setting the global 'ubik_nBuffers' (all OpenAFS Ubik
+ services do this); this global is not protected by any lock, but is
+ read exactly once, when the cache is initialized.
+
+ Except for udisk_Debug() (see below), all udisk_* interfaces require
+ a pointer either to the database or to an active transaction, and
+ must be called with the database lock held. This protects the DISK
+ package's accesses of the active transaction list and associated
+ state variables as well as its calls into the physical access layer,
+ including insuring that operations requiring multiple calls into that
+ layer are performed atomically.
+
+ In addition, the database lock protects the page cache hash table and
+ LRU queue, the buffer control structures, buffer contents, the
+ truncation record free list (freeTruncList), and global statistics.
+ However, it may be desirable in the future to introduce a separate
+ lock to protect these structures, in order to facilitate multiple
+ database support.
+
+ udisk_LogOpcode, udisk_LogEnd, udisk_LogTruncate, udisk_LogWriteData
+
+ These are internal interfaces for writing to the transaction log.
+ They each construct a log record and then directly call the
+ physical layer to append it to the log; the log file is _not_
+ cached by the page buffer cache, and these functions do not touch
+ the cache. Each of these must be atomic with respect to other
+ writes to the log file, which is accomplished by insuring they are
+ called only with the database locked.
+
+ DInit, DRead, DTrunc, DRelease, DFlush, DAbort, DSync, DNew
+
+ These are internal interfaces for accessing the page buffer cache.
+ They must be called with the database lock held.
+
+ DRead() and DNew() return a pointer to the page data buffer, not
+ to the buffer control structure; this pointer is later passed to
+ DRelease() to release the buffer. This mechanism depends on the
+ fact that buffer control structures are allocated in the same
+ order as the actual page data buffers.
+
+ DRead() guarantees that a read transaction will always see a clean
+ copy of the requested page, without any modifications made by an
+ in-progress write transaction. However, this is true only if the
+ next layer insures that once DRead() is called with write intent
+ on a page, that no further calls to DRead() are made for that page
+ until the buffer is written and DRelease() is called with the
+ dirty flag. Presently, this is the case because udisk_read() and
+ udisk_write(), which are the only callers of DRead(), are both
+ called with the database lock held.
+
+ udisk_read, udisk_truncate, udisk_write, udisk_begin, udisk_commit,
+ udisk_abort, udisk_end, udisk_Invalidate
+
+ These are external interfaces provided by the DISK package to
+ higher layers. Presently, the locking requirements of these
+ functions and the lower-layer functions they call are satisfied by
+ the simple rule that these must always be called with the database
+ lock. In the future, support for multiple databases and/or
+ a greater level of concurrency may require relaxing this rule
+ and/or acquiring additional locks within these functions.
+
+ In udisk_commit(), when committing the first write transaction
+ after becoming sync site, the database is relabelled. In this
+ case, the UBIK_RECLABELDB recovery state bit should be set before
+ propagating the label change to other servers, rather than after.
+ This is because because ContactQuorum_DISK_Setversion() will
+ release the database lock, at which point the recovery state may
+ be cleared by urecovery_ResetState() running in another thread.
+ It is important that a relabelling which occurs before recovery
+ state is cleared not result in the UBIK_RECLABELDB recovery state
+ bit being set after; otherwise, the server may fail to correctly
+ relabel the database the next time it becomes sync site.
+
+ udisk_Debug
+
+ The udisk_Debug() interface is called as part of preparing the
+ response to Ubik debugging requests. Unlike other udisk_* APIs,
+ this interface is not called on a specific database and does not
+ require the database lock to be held. It reports the version data
+ of the single database identified by the global 'ubik_dbase' and
+ traverses the buffer cache collecting statistics, both without
+ benefit of holding any locks. This may produce inconsistent data
+ but does not normally risk damaging data structures or accessing
+ invalid or uninitialized memory (but see below), and avoiding
+ locking may be of greater benefit than guaranteeing the return of
+ consistent data.
+
+ The global 'ubik_dbase' pointer is set when the first (only)
+ database is initialized, and the buffer cache data structures are
+ allocated the first time a transaction is started, either locally
+ or via the REMOTE package. Once set, the pointers used by
+ udisk_Debug() are never changed, and the data structures they
+ point to remain valid and are never freed. The buffer cache is
+ traversed by iterating over the array of buffer cache elements
+ (this is done in other places as well), not by following a linked
+ list or other data structure that may reordered, and so should not
+ be dangerous.
+
+ At a minimum, then, it should be arranged that udisk_Debug()
+ cannot be called before the database structure and page cache have
+ been initialized. This means that both of these actions must be
+ taken before any incoming RPCs are accepted. Practically, this
+ means that DInit() should be renamed udisk_init(), and called from
+ ubik_ServerInitCommon() during initialization.
+
+ Full correctness demands that udisk_Debug actually hold the
+ database lock while copying the database version and examining the
+ page cache. However, it may be desirable to sacrifice this
+ correctness in order to reduce interaction between the debugging
+ interfaces and the rest of the system.
+
+ Helper functions:
+ + unthread - remove transaction from activeTrans
+ + Dlru, Dmru - move page to front/end of LRUQ
+ + MatchBuffer - check if a buffer's dbase/file/page match
+ + FixupBucket - move page to the right bucket
+ + newslot - find and allocate an unused buffer
+ + DedupBuffer - throw away stale copies after page is flushed
+ + GetTrunc, PutTrunc - trunc record allocation
+ + FindTrunc - find active trunc record for file
+ + DoTruncs - apply truncations
+
+
+ LOCK - Database Transaction Locking
+
+ This package is responsible for managing database locks held as part
+ of a transaction. It supports a single read/write lock, which is
+ represented by the global variable 'rwlock', a struct Lock. The
+ global 'rwlockinit' records whether Lock_Init() has been called on
+ this lock, and is protected by the database lock. Properly, this
+ lock should be treated as belonging to the database.
+
+ ulock_getLock() expects to be called with the database lock held; it
+ depends on this to protect access to fields within the transaction
+ structure as well as to the global 'rwlockinit'. However, note that
+ the database lock must be temporarily released while obtaining the
+ read/write lock, and so ulock_getLock() should not be called in the
+ middle of a critical section. The 'await' parameter may in theory be
+ set to 0 to effect a non-blocking lock; however, this depends on
+ (incorrect) knowledge of the internals of struct Lock. Fortunately,
+ existing Ubik code always sets await=1. This feature should be fixed
+ or removed.
+
+ ulock_relLock() releases the read/write lock. It expects to be
+ called with the database lock held. Unlock ulock_getLock(), this
+ function does not release the database lock.
+
+ ulock_Debug() reports the state of the read/write lock, for debugging
+ purposes. It depends on knowledge of the internals of struct Lock,
+ and accesses both the Lock structure and the global rwlockinit
+ without benefit of holding any lock.
+
+ It would probably be best to eliminate 'rwlockinit' in favor of
+ always initializing the global 'rwlock' during startup; ulock_Debug()
+ would then have no need to examine the flag. Further, ulock_Debug()
+ should at a minimum use the appropriate locks when looking inside
+ struct Lock; ideally, knowledge of that structure's internals should
+ be abstracted into appropriate interfaces in src/lwp/lock.[ch].
+
+
+ VOTE - Voting Logic and VOTE_* RPCs
+
+ This package is responsible for keeping track of who is currently
+ sync site and deciding how to vote and whether this server should try
+ to become sync site. It provides the VOTE RPC package, which is
+ primarily responsible for processing vote requests from peer servers
+ (VOTE_Beacon), but also provides debugging interfaces and an
+ interface (VOTE_GetSyncSite) to allow clients to discover the current
+ sync site.
+
+ Voting decisions are made using state tracked in a set of global
+ variables, listed below. Currently there is no locking protecting
+ these variables; LWP-based Ubik depends on the cooperative nature of
+ LWP and on the fact that routines which examine and manipulate them
+ do not block. Introduction of a new "vote lock" is recommended to
+ protect the following globals:
+
+ + ubik_lastYesTime - last time VOTE_Beacon voted "yes"
+ + lastYesHost - host voted for at ubik_lastYesTime
+ + lastYesClaim - time lastYesHost started its voting cycle
+ + lastYesState - whether lastYesHost claimed to be sync site
+ + lowestHost - lowest host seen
+ + lowestTime - last time lowestHost was updated/heard from
+ + syncHost - sync host, or 0 if none known
+ + syncTime - last time syncHost was updated
+ + ubik_dbVersion - sync site's database version
+ + ubik_dbTid - sync site's transaction ID
+
+ The vote lock should be initialized in uvote_Init() and held for the
+ remainder of that function and also during of uvote_ShouldIRun(),
+ uvote_GetSyncSite(). It should also be held for the bulk of
+ SVOTE_Beacon() -- specifically, it should be acquired just before
+ beginning the lowest-host calculation, and released just before
+ calling urecovery_CheckTid(), in the event of a yes vote or else just
+ before returning.
+
+ In addition, the vote lock would protect the globals 'ubik_dbVersion'
+ and 'ubik_dbTid', which represent this server's knowledge of the
+ state of the database on the sync site. This variable is set in two
+ places in the REMOTE package, which change the local and sync site
+ database versions at the same time. It is also examined in one place
+ in the REMOTE package, and compared to the database version in the
+ RECOVERY package. It is imperative that when the local and remote
+ versions are changed at the same time, these changes become visible
+ to the RECOVERY package at the same time. Thus, both the changes and
+ the comparison must be done under both locks. For this purpose,
+ three new functions should be provided by the VOTE package:
+
+ + A function to safely set 'ubik_dbVersion'
+ + A function to compare 'ubik_dbVersion' to another value
+ + A function to check whether this is sync site and compare
+ 'ubik_dbVersion' to another value, without releasing the vote lock
+ in between (to be used in recovery).
+
+ Existing accesses to 'ubik_dbVersion' in REMOTE and RECOVERY, all of
+ which are already made under the database lock, would be replaced
+ with calls to the new accessors.
+
+ The result of these changes will require in several places that the
+ vote lock be acquired while the database lock is already held.
+ Therefore, the vote lock MUST NOT be held while acquiring the
+ database lock.
+
+ SVOTE_Beacon() currently calls urecovery_CheckTid() without benefit
+ of the database lock, which must be held when calling that function.
+ This is fairly simple to fix; however, the vote lock must be released
+ first. One simple approach would be to move the call to
+ urecovery_CheckTid() to after the point where the vote lock will be
+ released, wrapping it in DBHOLD/DBRELE, and conditionalizing the
+ whole thing on 'vote' being nonzero.
+
+ Separately, it may be worth examining whether it is appropriate to
+ call urecovery_CheckTid() only when voting yes, and not whenever
+ a beacon is received from a server claiming to be sync site.
+
+ As with the debug functions in other packages, SVOTE_Debug() and
+ SVOTE_DebugOld() access global variables belong to this and other
+ modules without benefit of the appropriate locks. There is also very
+ little abstraction here. It is probably desirable to introduce Debug
+ interfaces for the VOTE and RECOVERY packages, to be called from
+ these functions. Full correctness requires...
+
+ + Holding the vote lock while accessing the VOTE package globals
+ described above.
+ + Holding the database lock while accessing the database flags and
+ tidCounter, ubik_epochTime, ubik_currentTrans, and urecovery_state
+
+
+ BEACON - elections, tracking whether this is sync site
+
+ This package is responsible for running elections, collecting votes,
+ and deciding whether and for how long this is the sync site. It also
+ contains routines responsible for setting up the list of servers and
+ acquiring alternate addresses for other servers during startup.
+
+ For this purpose, it maintains a number of global variables and
+ ubik_server structure fields. Currently, these variables and fields
+ are sometimes accessed under the database lock, and sometimes without
+ benefit of any lock. In order to allow this package to operate with
+ a minimum of disruption due to database accesses, it is recommended
+ that a new lock be established to protect these. It is particularly
+ desirable to prevent loss of sync due to the potentially long-lived
+ RPCs the RECOVERY package must make under the database lock when
+ transferring full database copies between servers; for this reason,
+ it is important that both the BEACON thread and SVOTE_Beacon() be
+ able to operate without acquiring the database lock. The global
+ variables and ubik_server structure fields to be protected by the new
+ lock are the following:
+
+ + ubik_amSyncSite - true if this is the sync site
+ + syncSiteUntil - time until which this is the sync site
+ + ts->lastVoteTime - last time this server voted yes for us
+ + ts->lastBeaconSent - last time a beacon was sent to this server
+ + ts->lastVote - true if its late vote response was yes
+ + ts->up - true if this server is believed to be up
+ + ts->beaconSinceDown - true if a beacon has successfully been sent
+ to this server since the last time it was down
+
+ The new lock would need to be held in the following cases:
+ + In ubeacon_AmSyncSite(), from the first time 'ubik_amSyncSite' is
+ examined until after the potential call to urecovery_ResetState().
+ + In ubeacon_Interact(), when building the list of servers to which
+ to send beacons (to protect access to the 'up' flag). Note that
+ the server address lock must also be held during this loop, and
+ must be acquired _after_ the beacon lock.
+ + In ubeacon_Interact(), when updating status associated with
+ a server after receiving its beacon response.
+ + In ubeacon_Interact(), when updating globals after an election.
+ However, it must be released before calling urecovery_ResetState().
+ + In updateUbikNetworkAddress(), when marking a server down.
+ + In the various ContactQuorum_* functions, when checking whether
+ a server is up before calling it, and again when marking a server
+ down after a call fails.
+ + In ubik_EndTrans, when examining a server's 'beaconSinceDown' flag
+ and 'lastBeaconSent' timestamp to determine whether it has recently
+ gone down, requiring a brief hold until it either comes back up or
+ times out.
+
+ The following global variables are set during initialization and do
+ not change thereafter:
+
+ + nServers - number of voting servers
+ + amIMagic - true if this server gets the extra half vote
+ + ubik_singleServer - true if this is the only server
+
+ Currently, all initialization of this module is done _after_ the Rx
+ services have been created and an Rx server thread started. This is
+ done to avoid a situation in which all servers are blocked in
+ updateUbikNetworkAddress(), which is responsible for exchanging
+ addresses with peer servers, while none of them is prepared to
+ handled the DISK_UpdateInterfaceAddr call made by that function.
+
+ When using LWP, this early initialization of Rx is not a problem,
+ because the various initialization routines don't actually block until
+ updateUbikNetworkAddress() waits for RPCs to complete. However, when
+ using pthreads, it is a problem, because it means that incoming RPCs
+ may be processed when initialization is not complete.
+
+ To address this problem and insure orderly initialization, it is
+ recommended to change the initialization sequence so that most work,
+ including the bulk of BEACON package initialization, happens before
+ Ubik's Rx services have been created. The only work that should be
+ deferred until after that point is updateUbikNetworkAddress(),
+ followed by starting the long-running threads belonging to the BEACON
+ and RECOVERY packages. This requires updateUbikNetworkAddress() to
+ be exported (and perhaps to undergo a name change).
+
+ The major part of the work of this package is done in the function
+ ubeacon_Interact(), which is the body of a long-running thread known
+ as the beacon thread. This thread is responsible for periodically
+ sending "beacons" to other servers to request votes (when the sending
+ server is the sync site, these also include how long it will remain
+ so, the database version, and the active transaction ID). Beacons
+ are sent only from servers that are already sync site or believe they
+ are the best candidate, as determined by uvote_ShouldIRun().
+
+ On the sync site, the beacon thread is responsible for insuring that
+ new votes are collected frequently enough to avoid loss of quorum.
+ This means it must not block for an extended time; therefore, it must
+ not acquire the database lock, which in other threads may sometimes
+ be held during long-running operations (most notably, database file
+ transfers done under this lock by the recovery thread). Instead,
+ a new lock is proposed (the "version lock"), which is used to allow
+ the beacon thread to examine (but not modify) certain fields without
+ holding the database lock.
+
+ The version lock should be acquired _in addition to_ the database
+ lock when modifying 'ubik_epochTime' or the database 'version',
+ 'flags', 'tidCounter', and 'writeTidCounter' fields; such
+ modifications occur at the following locations:
+ + in ubik.c:BeginTrans(), when beginning a new transaction
+ + in udisk_begin(), when setting DBWRITING
+ + in udisk_commit(), when updating the database version at the end of
+ a write transaction (note this includes the case of relabelling the
+ database on the first write transaction after becoming sync site)
+ + in udisk_end(), when clearing DBWRITING
+ + in recovery.c:InitializeDB(), when setting the initial version of
+ an empty database
+ + in urecovery_Interact(), when labelling a new database the first
+ time quorum is established
+ + in urecovery_Interact(), after fetching the latest database from
+ another server (whether successful or not; there are two cases)
+ + in SDISK_SendFile(), when updating the database version both before
+ and after receiving a new database from the sync site. Note the
+ lock must _not_ be held during the transfer.
+ + in SDISK_SetVersion()
+
+ The version lock should be acquired by the beacon thread while
+ examining these fields, shortly before calling multi_VOTE_Beacon().
+ Since it must release the lock before making that call, the database
+ version will need to be explicitly copied into a local variable, as
+ is already done with the current transaction ID.
+
+ Note that if UBIK_PAUSE is defined, the DBVOTING flag poses an
+ additional problem, as it must be modified by the beacon thread. If
+ it is desirable to continue to support this variant, then it becomes
+ necessary for all accesses to the database flags to be protected by
+ the version lock. Other UBIK_PAUSE code may not have been reviewed
+ in depth and may pose additional problems.
+
+ The function ubeacon_AmSyncSite() is called both from the beacon
+ thread and from various other points to determine whether they are
+ running on the sync site, and thus whether various operations are
+ safe and appropriate. Calls to this function from outside the beacon
+ thread (often, via urecovery_AllBetter()) must be made with the
+ database lock held, and for operations modifying the database or
+ relying on its contents to remain constant, this lock must not then
+ be released until the operation is complete. The does not guarantee
+ that this will remain sync site for the duration of the operation,
+ but does insure that any operations done after or as a result of
+ losing sync will in fact occur after the operation holding the lock.
+
+ Of course, ubeacon_Interact() itself cannot call ubeacon_AmSyncSite()
+ with the database lock held, since it must not block on that lock
+ when running on the sync site (otherwise sync could be lost, as
+ described above). So, that portion of ubeacon_AmSyncSite() other
+ than the call to urecovery_ResetState() should be abstracted out into
+ a new function, which presumably returns separate flags indicating
+ both whether it is in fact running on the sync site (which becomes
+ the return value of ubeacon_AmSyncSite()) and whether a call to
+ urecovery_ResetState() is indicated. When the latter is set, the
+ beacon thread can acquire the database lock (since, if this is not
+ the sync site, blocking on this lock is not a problem) and call
+ urecovery_ResetState().
+
+ This difference in behavior is acceptable because the required
+ invariant is that when 'ubik_amSyncSite' is cleared, it cannot become
+ true again until urecovery_state has also cleared. This insures that
+ once a server discovers it is not sync site, urecovery_AllBetter()
+ cannot succeed on the basis of being sync site until sync is regained
+ and recovery has run, insuring the database is up to date. Since
+ only the beacon thread ever sets 'ubik_amSyncSite' true, that thread
+ can meet the invariant by calling urecovery_ResetState() before doing
+ anything else, while other threads must continue to hold the beacon
+ lock to prevent the beacon thread from setting 'ubik_amSyncSite'.
+
+ ubeacon_Debug() reports the value of syncSiteUntil, without benefit
+ of obtaining any locks. It should also report the value of
+ 'ubik_amSyncSite', allowing that global to be made private to the
+ BEACON package. Full correctness requires that the beacon lock be
+ held while accessing these globals.
+
+
+ RECOVERY - Consistency, Down Server Recovery, Propagation, Log Replay
+
+ This package is responsible for tracking whether the local database
+ is current and usable and for discovering when a server comes back up
+ after having been marked down. On the sync site, it is also
+ responsible for tracking the database versions on remote servers and
+ for recovering from inconsistencies.
+
+ To carry out these tasks, the RECOVERY package maintains a single
+ global, 'urecovery_state'. This is not currently effectively
+ protected by any lock; however, it should be protected by the
+ database lock. This is particularly important because high-level
+ routines that manipulate the database depend on a check against this
+ field to determine that the database is up-to-date and safe to
+ modify, and that it will remain so until the operation is complete.
+
+ The main work of this package is done by urecovery_Interact(), which
+ is the body of a long-running thread known as the recovery thread.
+ This thread wakes up periodically to discover if any down servers
+ have come back up and, on the sync site, to locate the latest
+ available database, fetch it if necessary, and make sure it has been
+ distributed to all servers. This work is done in several steps,
+ taken on in sequence every few seconds:
+
+ + The first part of urecovery_Interact() is to identify servers which
+ are believed to be down, and determine whether any of them have
+ come back up. This is done every 30 seconds, even when this is not
+ the sync site. Currently, the database lock is held for the
+ duration of this loop, released only in DoProbe() while actually
+ waiting for the DISK_Probe() RPC to complete. This is mostly
+ unnecessary; the server probe loop needs this lock only to examine
+ the database 'currentDB' flag and to clear the UBIK_RECFOUNDDB bit,
+ and DoProbe() doesn't need it at all. However, the beacon lock is
+ needed to examine and update the database 'up' flag, and DoProbe()
+ needs to hold the server address lock when examining server
+ addresses and, if the server comes back up, to replace connections.
+ See the section on struct ubik_server for more details on this
+ lock.
+
+ + The next step is to determine whether this is the sync site, and
+ update the UBIK_RECSYNCSITE bit appropriately. The database lock
+ should be held for the duration of this step, as
+ ubeacon_AmSyncSite() may end up calling urecovery_ResetState(),
+ which requires that lock be held. If this is not the sync site,
+ the cycle ends here.
+
+ + Step three is to determine the location of the latest database.
+ This is done by making DISK_GetVersion() calls to all active
+ servers, keeping track of the newest version found and the server
+ which reported it. Once this is done, the best version is compared
+ to the local version, update the UBIK_RECHAVEDB bit (so that it is
+ true if and only if the local version is the latest) set the
+ UBIK_RECFOUNDDB bit to indicate the latest database has been
+ located, and clear the UBIK_RECSENTDB bit to force any servers with
+ out-of-date databases to be updated.
+
+ In this section, it is not necessary to hold any locks while
+ collecting versions on remote servers, though as before, it is
+ necessary to hold the beacon lock while examining the server 'up'
+ flag, and the address lock to obtain a reference on each connection
+ as it is used (but this lock should not be held during the RPC!).
+ Once all remote versions have been fetched, the database lock is
+ needed while examining the local version and updating
+ 'urecovery_state'. Note that write transactions may happen while
+ versions are collected, resulting in some servers having newer
+ versions than were detected. However, if this happens, the result
+ is always consistent and ends with the sync site having the latest,
+ canonical version:
+
+ + If the actual latest version X.N before the write was in an epoch
+ created by the current sync site, then that version must
+ necessarily already be present on the sync site, since writes are
+ always committed first on the sync site. Thus, when the sync
+ site updates the version counter, it produces a new version X.N+1
+ which no other server can believe it already has; thus, there is
+ no possibility of inconsistency.
+
+ + If the actual latest version X.N before the write was in an epoch
+ not created by the sync site, then it is possible that after
+ quorum was established but before any writes happened, a new
+ server came online which had a newer version X.N+1 (that was
+ committed to less than a quorum of sites before a crash). If
+ a write transaction begins before recovery detects this fact and
+ fetches the version X.N+1 database, then the new write will start
+ with the version X.N database, essentially creating a version
+ fork. However, when this happens, the resulting database will be
+ version Y.1 (with Y>X), because the first write on a new sync
+ site always results in a new database. Thus, the version X.N+1
+ database will no longer be latest (because the local version,
+ which is examined last, is Y.1), and will end up being discarded.
+ This is no worse than if the server with the version X.N+1
+ database had not come up until completely after the transaction
+ resulting in version Y.1.
+
+ + Step four is to fetch the latest database, if the sync site does
+ not already have it. Since this step replaces the database
+ wholesale, there can be no active transactions while it runs, and
+ all other database activity must be locked out for the duration of
+ the transfer. This means the database lock must be held for this
+ entire step, starting from the check that the UBIK_RECSYNCSITE
+ and/or UBIK_RECFOUNDDB bits are still set. (Note that the existing
+ comment which claims that it is impossible for UBIK_RECFOUNDDB not
+ to be set at this point is true only if the database is not
+ released since checking or updating that bit in step three). In
+ addition, the address lock must be held while setting up the call
+ to be used for DISK_GetFile().
+
+ + Step five, again after verifying that this is still the sync site
+ and that the UBIK_RECHAVEDB bit is set, is to distribute the new
+ database to any servers which do not already have it. Before this
+ can be done, if the database was newly-initialized (with label
+ epoch 1), it is relabeled with version 2.1 before it is distributed
+ to other sites. This allows readany transactions to run on such
+ a database (though probably not successfully, since the database is
+ empty); it is deferred until this point in recovery to insure that
+ if any server in the quorum contains a valid database, that version
+ is used rather than the empty one.
+
+ For this step, the database lock should be held starting at the
+ point at which the UBIK_RECSYNCSITE and/or UBIK_RECHAVEDB bits are
+ tested. In the event the DBWRITING flag is set, the lock must be
+ released while waiting for the active write transaction to end. In
+ this case, once the wait is completed, it is necessary to again
+ check urecovery_state to determine that this is still the sync site
+ and have an up-to-date database before proceeding.
+
+ It is probably simplest for urecovery_Interact() to acquire the
+ database lock at the start of step 2 described above, release it for
+ the duration of the version-collection part of step 3 and while
+ waiting for write transactions to terminate in step 5, and otherwise
+ hold it until the end of the cycle. If no work is needed, this means
+ the lock will be held relatively briefly; if it is necessary to fetch
+ and or send full database versions, these operations themselves must
+ be done with the lock held, and there is little point in releasing it
+ between them. However, it is acceptable to release the lock between
+ any steps described above, provided that each time the database lock
+ is released and reacquired, 'urecovery_state' is checked to verify
+ that no regression has occurred. Note that outside of the recovery
+ thread, the only changes that may occur to 'urecovery_state' are that
+ it may be completely cleared by urecovery_ResetState(), and that the
+ UBIK_RECLABELDB bit may be set by udisk_commit(), on the sync site.
+
+ The function urecovery_AllBetter() is called under the database lock
+ by routines which need to know whether there is a usable database.
+ As noted in the discussion of ubeacon_AmSyncSite() above, callers
+ must then continue to hold the database lock until any database
+ operation is complete. Further, callers which intend to write to the
+ database currently follow a call to urecovery_AllBetter() with a call
+ to ubeacon_AmSyncSite(), to determine whether a write is permissible.
+ Unfortunately, this is not safe, because it is possible (though
+ admittedly unlikely) for urecovery_AllBetter() to succeed on the
+ basis of being a non-sync-site with an up-to-date database, and then
+ for a subsequent call to ubeacon_AmSyncSite() to return a true result
+ (because the latter may block on acquiring the beacon lock, during
+ which time this may become the sync site). To insure that no write
+ operation is performed without an up-to-date, writeable database,
+ urecovery_AllBetter() should be modified to indicate whether a write
+ operation is permitted, which is the case only when UBIK_RECHAVEDB is
+ tested and found to be set.
+
+ The functions urecovery_AbortAll(), urecovery_CheckTid(), and
+ urecovery_ResetState() all expect to be called with the database lock
+ held.
+
+ The function urecovery_LostServer() need not be called with any
+ locks; it is merely a wrapper for ubeacon_ReinitServer().
+
+ The RECOVERY package is also responsible at startup for replaying the
+ transaction log and initializing an empty database. These tasks are
+ handled by urecovery_Initialize() and its helpers, ReplayLog() and
+ InitializeDB(). Because the helpers use PHYS package functions and
+ manipulate database structure fields, urecovery_Initialize() should
+ acquire the database lock before calling them.
+
+
+ REMOTE - remote database I/O (DISK) RPCs
+
+ This package provides the DISK RPC package, which mediates remote
+ access to the local database copy, when another server is sync site.
+ It also includes DISK_UpdateInterfaceAddr(), which is used during
+ initialization to verify consistency of configuration across servers,
+ and DISK_Probe(), which is used by the RECOVERY module to determine
+ whether the server is up.
+
+ As the functions in this package are RPC implementations, they may be
+ called at any time and are entirely responsible for their own
+ locking. There can be at most one active remote write transaction at
+ a time; this rule is enforced by this package, which maintains the
+ global 'ubik_currentTrans' as a pointer to the active transaction.
+
+ For the most part, the RPCs in this module use a common prologue,
+ which includes acquiring the database lock, and do not release that
+ lock until just before returning. However, the common prologue
+ examines both 'ubik_currentTrans' and the transaction flags without
+ benefit of the database lock. This can be corrected fairly easily by
+ acquiring the database lock sooner; however, this requires referring
+ to the database via the ubik_dbase global rather than via the 'dbase'
+ pointer in 'ubik_currentTrans'.
+
+ Note that SDISK_GetVersion() and SDISK_SetVersion() contain a sanity
+ check to ensure they are not running on the sync site, since these
+ are called only from that portion of the recovery loop which runs
+ only on the sync site. These calls to ubeacon_AmSyncSite() must be
+ made with the database lock held, for the results to be valid. The
+ same holds true for a commented-out check in SDISK_GetFile() (which
+ really should remain that way -- there's no reason not to allow any
+ authorized caller to do this at any time).
+
+ Additionally, note that SDISK_Commit() acquires a write lock on the
+ application cache, to insure that no transactions are running which
+ depend on the application cache, page cache, or underlying database
+ to change. The locking order requires that this lock be acquired
+ before taking the database lock; thus, both locks must be acquired
+ earlier.
+
+ SDISK_SendFile() is really quite bad in its handling of the database
+ lock under error conditions. Presently, an error condition may cause
+ it to release the lock too early nor not at all.
+
+ SDISK_UpdateInterfaceAddr() is called by peer servers as part of
+ their startup process, to trade lists of additional addresses and
+ insure sufficient agreement on configuration to allow the new server
+ to start. It will need to acquire the lock protecting the addr[]
+ array attached to each host.
+
+
+ UBIK (high-level API)
+
+ This package contains the high-level APIs exported by Ubik to
+ applications, including initialization, the APIs for creating and
+ using transactions, and various utilities. It also contains a number
+ of internal utilities.
+
+ Initialization
+
+ Initialization of Ubik is handled in ubik_ServerInitCommon(),
+ which is called by two API functions, ubik_ServerInit() and
+ ubik_ServerInitByInfo(), which provide different interfaces for
+ passing configuration information. This code handles
+ initialization of the database, locks, and various globals, calls
+ the initialization routines for each subsystem, makes sure Rx has
+ been initialized, and creates Ubik's Rx services and threads.
+
+ Most of this is done before any Ubik-specific threads are created
+ and before Ubik's RPCs can be called, and so takes place without
+ holding any lock. However, since Rx may have been initialized and
+ started before Ubik, it is possible that some Rx worker threads
+ may have already been created. In practice, locks internal to Rx
+ should insure that any memory writes made before an Rx service is
+ created become visible to any worker running on behalf of that
+ service; nonetheless, this should not be depended on, and proper
+ locking should be used during all phases of initialization.
+
+ Unfortunately, the current code actually starts Rx services before
+ initialization is complete, resulting in a number of possible
+ races. The general THREAD SAFETY section below contains
+ recommendations for reordering the startup process to eliminate
+ this problem, while still avoiding potential deadlock if multiple
+ servers start at once.
+
+ ContactQuorum_*
+
+ Nearly a third of ubik.c consists of various ContactQuorum_*
+ functions, which are used to make a particular RPC to every server
+ in the quorum. These used to be a single ContactQuorum()
+ function, but have been split into multiple functions to regain
+ type-safety. These retain a common prologue and epilogue, some
+ small part of which has been abstracted out, but most of which
+ remains duplicated in each of these functions. Indeed, the only
+ part unique to each function is the actual RPC call and, in the
+ case of ContactQuorum_DISK_WriteV(), code to fall back to multiple
+ calls to DISK_Write() on a per-server basis.
+
+ The code common to all ContactQuorum_* functions will need to
+ acquire the beacon lock when checking whether a server is up (no
+ calls are made to servers which are not marked up), and again when
+ marking a server down. Note that when a server is marked down,
+ the 'up' and 'beaconSinceDown' flags must be cleared at the same
+ time, without releasing the beacon lock. In addition, the helper
+ function Quorum_StartIO() must obtain the server address lock
+ while it obtains a reference on the Rx connection that will be
+ used.
+
+ All of the ContactQuorum_* functions must be called with the
+ database lock held; this is necessary to protect their access to
+ each server's 'version' and 'currentDB' fields and to the local
+ database version. However, these functions release the lock while
+ actually making RPCs. This behavior is new -- ContactQuorum() in
+ OpenAFS 1.4 did not release the database lock during RPCs -- but
+ it is safe, because no caller of ContactQuorum_*() depends on
+ state read under the database lock before the call to remain valid
+ after.
+
+ Transactions
+
+ The main Ubik transaction API consists of ubik_BeginTrans(),
+ ubik_BeginTransReadAny(), ubik_BeginTransReadAnyWrite(),
+ ubik_AbortTrans(), ubik_EndTrans(), ubik_Read(), ubik_Write(),
+ ubik_Flush(), ubik_Seek(), ubik_Tell(), ubik_Truncate(), and
+ ubik_SetLock(). The ubik_Begin* functions require a database
+ pointer; all others require an active transaction pointer. These
+ functions are called directly by the application without any
+ locks, and are responsible for their own locking. Note that many
+ of these examine the transaction type before acquiring any locks;
+ this is permissible because the transaction type is immutable once
+ the transaction is created (but it requires that the call be made
+ from the thread that created the transaction, or that the
+ application have used some mechanism to insure an appropriate
+ memory barrier exists; the same mechanism used to protect the
+ transaction pointer should suffice).
+
+ BeginTrans() contains the common code responsible for starting
+ a new transaction; this is made available to applications as
+ ubik_BeginTrans(), ubik_BeginTransReadAny(), and
+ ubik_BeginTransReadAnyWrite(). If a write transaction is in
+ progress, this function waits for it to complete. However, since
+ doing so releases the database lock, it is necessary to call
+ urecovery_AllBetter() again once the lock has been reacquired.
+ Further, in the UBIK_PAUSE case, the DBVOTING flag may have been
+ set, so this also must be retested (see the BEACON package section
+ for more on the difficulties of UBIK_PAUSE and DBVOTING).
+
+ Both ubik_Flush() and ubik_Write() examine and manipulate the
+ transaction I/O vector without benefit of any lock. These fields
+ ('iovec_info' and 'iovec_data') should be protected by the
+ database lock. In the case of ubik_Write(), this may require
+ releasing the lock before calling ubik_Flush(), or the
+ introduction of a private ubik_Flush_r() which assumes the lock is
+ already held (but note that in either case, the lock is released,
+ because ubik_Flush() calls ContactQuorum_* functions).
+
+ Utilities
+
+ The functions ubik_GetVersion() and ubik_WaitVersion() provide the
+ application with a way to discover the current database version
+ and to wait for it to change. These interfaces are not currently
+ used in OpenAFS. ubik_GetVersion() needs to acquire the database
+ lock while copying the database version.
+
+ The internal function ubikGetPrimaryInterfaceAddr() is used by
+ Ubik RPCs to determine a peer server's primary address, given the
+ IP address from which a call arrived. This needs to hold the
+ server address lock, as described in the section on struct
+ ubik_server.
+
+ Application Cache
+
+ The section on struct ubik_dbase describes the operation of the
+ application cache. The function ubik_CheckCache() is provided to
+ allow an application to insure the cache is up to date and obtain
+ a read lock on the cache which lasts for the duration of the
+ transaction.
+
+ Note that ubik_AbortTrans() currently always invalidates the
+ application cache by clearing ubik_dbase->cachedVersion, for which
+ it requires a write lock on the cache_lock. This means that
+ aborted transactions block waiting for completion of any
+ transactions which hold the cache_lock, which may well include all
+ outstanding transactions. In the case of read transactions, which
+ cannot have modified the database, this is unnecessary and may
+ significantly reduce concurrency.
+
+
+THREAD SAFETY
+
+ This section discusses changes needed to insure thread safety.
+
+ The initialization sequence in ubik_ServerInitCommon() should be cleaned
+ up, to ensure that every subsystem is fully initialized before starting
+ any background threads or accepting RPCs. The correct initialization
+ sequence should look something like this:
+
+ + Initialize error tables
+ + Allocate and initialize the database structure
+ + Initialize Rx
+ + Initialize the various subsystems:
+ + udisk_Init() (formerly disk.c:DInit)
+ + ulock_Init() (new, pulled out of ulock_getLock)
+ + uvote_Init()
+ + urecovery_Initialize()
+ + ubeacon_InitServerList*
+ + Create Rx services
+ + Start an Rx server thread
+ + updateUbikNetworkAddress() (rename and export from beacon.c)
+ + Start the beacon and recovery threads
+
+ Initialization functions may need to be added for some subsystems which
+ do not currently have them:
+
+ + src/ubik/disk.c:DInit() should be exported and renamed
+ udisk_Init(); it can then be called from ubik_ServerInitCommon()
+ instead of from udisk_begin().
+
+ + The lock initialization currently done in ulock_getLock() should
+ instead be done in a separate ulock_Init(), which should be called
+ from ubik_ServerInitCommon()
+
+ A new lock is needed to protect state belonging to the VOTE package,
+ including the globals 'ubik_dbVersion' and 'ubik_dbTid', which represent
+ the sync site's database state. See the VOTE package section for
+ details on the use of the vote lock.
+
+ A new lock is needed to protect state belonging to the BEACON package,
+ including the globals 'ubik_amSyncSite' and 'syncSiteUntil' and several
+ fields in the ubik_server structure. This lock also protects the
+ per-server 'up' flag, which is shared among several modules. In
+ addition, some beacon-specific fields are accessed directly by other
+ modules. Thus, this lock must be visible outside the BEACON module, or
+ else accessors must be provided for these items. See the BEACON package
+ section for details on the use of the beacon lock.
+
+ A new lock is needed to provide additional protection for the database
+ version, transaction counter, and flags, so that these can safely be
+ examined (but not updated) by the beacon thread without the need to
+ acquire the database lock. See the BEACON package section for details
+ on the use of the version lock.
+
+ A new lock is needed to protect per-server lists of non-primary
+ addresses, connections to the VOTE and DISK services on each server, and
+ the client-mode security class objects used to set up these connections.
+ See the section on struct ubik_server for details on the use of the
+ server address lock.
+
+ The required locking order is described by the following list. Any of
+ these locks may be acquired singly; when acquiring multiple locks, they
+ should be acquired in the listed order:
+
+ + application cache lock (dbase->cache_lock)
+ + database lock (DBHOLD/DBRELE)
+ + beacon lock
+ + vote lock
+ + version lock
+ + server address lock
+
+ Some cleanup is required in various places where existing locking
+ conventions are not properly obeyed:
+
+ + SVOTE_Beacon() needs to hold the database lock when calling
+ urecovery_CheckTid()
+
+ + Management of the database lock in SDISK_SendFile() needs to be
+ cleaned up so that it is always released at the proper time.
+
+ + urecovery_Interact() is not consistent about holding the database
+ lock when it examines and updates 'urecovery_state'. It also
+ examines the local database version at least once without benefit
+ of this lock. See the RECOVERY package section for details on
+ which locks are needed when by urecovery_Interact().
+
+ + ubik_Flush() and ubik_Write() need to acquire the database lock
+ before accessing the 'iovec_info' and 'iovec_data' fields.
+
+ + ubik_GetVersion() needs to acquire the database lock before copying
+ the database version.
+
+ The various debugging RPCs in the VOTE packages, and the data collection
+ routines they call in other packages, generally operate without benefit
+ of any locks. This is probably desirable, as it avoids any possibility
+ of debugging operations blocking on or interfering with the rest of the
+ system, and allows debugging data to be collected even in the event of
+ some unforeseen deadlock. It is also "safe", in that it does not risk
+ damage to data structures or access to uninitialized memory, provided
+ that data structure initialization is complete before these functions
+ are called. However, it does mean that these RPCs may return data that
+ is not entirely self-consistent.
+
+ + Correctness requires various locks be held during parts of
+ SVOTE_Debug() and SVOTE_DebugOld(); see the VOTE package section.
+
+ + SVOTE_Debug() also examines 'ubik_currentTrans' without benefit of
+ the database lock. This usage is not "safe", as that transaction
+ may be freed by another thread while SVOTE_Debug() is examining it.
+
+ + SVOTE_XSDebug() should hold the server address lock while copying
+ out server addresses, and other locks while copying per-server
+ state.
+
+ + udisk_Debug() should hold the database lock while examining the
+ database version and collecting page cache statistics.
+
+ + ulock_Debug should use LOCK_LOCK/LOCK_UNLOCK when examining the
+ internals of struct Lock. Ideally, knowledge of these internals
+ should be abstracted to src/lwp/lock.h.
+
+ + ubeacon_Debug() should hold the beacon lock while accessing the
+ value of 'syncSiteUntil' and, if the access is moved here from
+ SVOTE_Debug/SVOTE_DebugOld, 'ubik_amSyncSite'.
+
+CONCURRENCY THOUGHTS
+
+ This section collections information discussed previously about changes
+ that may be required or desirable in order to improve concurrency. Note
+ that none of these changes are needed to insure thread safety, provided
+ the changes discussed in the previous section are made.
+
+ + PHYS could be made more concurrent by maintaining a separate lock for
+ the fdcache, avoiding the need to DBHOLD while calling these.
+ However, higher-layer consistency probably requires that certain calls
+ or groups of calls to this package be atomic and isolated.
+
+ + Could DISK be made more concurrent by maintaining a separate lock for
+ the page cache and or free truncation record list?
+
+ + Could concurrency be improved by maintaining separate per-transaction
+ locks? If so, what is the locking hierarchy with respect to database
+ locks and the DISK and PHYS package locks?
+
+ + Could concurrency be improved by maintaining more reader/writer state
+ and/or separate write locks to insure consistency of database file
+ access without requiring so much DBHOLD?
+
+ + Alternately, could concurrency be improved by requiring that an active
+ transaction be accessed only by the thread that created it?
+
+
+MULTI-DATABASE SUPPORT
+
+ There isn't any. Some of the Ubik interfaces and even the organization
+ of some internal data structures makes it appear as though multiple
+ databases can be supported. However, this is not the case, for several
+ reasons:
+
+ + Only a single set of peer servers is tracked, via a global linked list
+ of server structures. Each server structure includes state such as
+ the server's database version and whether it is up-to-date, so it is
+ not possible to track multiple databases even for a common set of
+ servers.
+ + The physical I/O package (phys.c) tracks cached open file descriptors
+ without regard to what database a file is part of; thus, it cannot
+ handle accessing multiple databases at the same time.
+ + There are a variety of globals containing state which needs to be
+ tracked on a per-database basis.
+ + There are a variety of globals which are effectively protected by
+ a per-database lock, or by no lock at all.
+ + The various RPCs used to communicate between servers assume a single
+ database and do not include a database identifier. Thus, in the
+ presence of multiple databases, it would be impossible to determine,
+ for example, for which database a DISK RPC was intended.
+
+ The upshot of this is that considerable work would be needed to get Ubik
+ into a condition which would allow a single process to maintain multiple
+ independent Ubik databases at the same time. Note that this is
+ orthogonal to the question of whether a single multi-file database is
+ possible.
+
+
+OTHER DATA STRUCTURES
+
+ Some other data structures are mentioned in ubik.p.h or elsewhere, but
+ are of relatively little interest from the point of view of threading.
+ They are described here primarily as an alternative to deleting
+ descriptive text that was already written.
+
+ struct ubik_hdr - on-disk database file header
+
+ This structure represents the on-disk format of the Ubik database
+ file header (label). It is used only for local variables in
+ uphys_getlabel() and uphys_setlabel(), which read and write this
+ header, respectively.
+
+ struct ubik_stat - database file status
+
+ This structure is used to convey the size and modification timestamp
+ of a database file. It is used as the type of an output parameter of
+ uphys_stat(), and for local variables in callers of that function.
+
+ struct ubik_stats - global statistics
+
+ This structure is used to contain "miscellaneous" global statistics.
+ Currently that consists of a single counter which is incremented in
+ one place (an exceptional condition in ubik_EndTrans) and is never
+ read. The one increment is performed only when holding the database
+ lock; thus, this structure may be treated the same as other globals
+ protected by non-global locks.
+
+OTHER NOTES
+
+ There appears to be nothing to insure that calls to disk.c:DRead()
+ during a write transaction won't find and use a "clean" cached page when
+ there is already a dirty page in the cache resulting from an earlier
+ write as part of the same transaction. This means that if a transaction
+ reads a page after writing it, it may read back the original data
+ instead of the new, and if a transaction writes a page more than once,
+ the later writes may corrupt the database. It seems likely that we are
+ merely lucky in this regard, and this problem should be fixed.
+
+ src/ubik/beacon.c:ubeacon_ReinitServer() assumes 'ubik_CRXSecurityRock'
+ is in fact an AFS configuration directory pointer, suitable as an
+ argument to afsconf_UpToDate(). This is inappropriate;
+ 'ubik_CRXSecurityRock' is an opaque argument to
+ (*ubik_CRXSecurityProc)(), Ubik must not make assumptions about what it
+ is.
+
+ It may be worth examining whether it is appropriate for SVOTE_Beacon()
+ to call urecovery_CheckTid() only when voting yes, and not whenever
+ a beacon is received from a server claiming to be sync site.
+
+ Recently, code has been introduced which attempts to insure that the
+ recovery process does not truncate an old database file before a new one
+ has been fully transferred, since this could, depending on the timing of
+ a server failure, deprive the new quorum of a valid database. This
+ so-called "new recovery" code violates the abstraction provided by the
+ PHYS package, encoding specific knowledge of the underlying file store
+ into the REMOTE and RECOVERY packages. This means that the backend
+ provided by phys.c cannot be replaced with an alternative, since in that
+ case recovery would do the wrong thing with the transferred database.
+
+ The DBWRITING loop in urecovery_Interact() is not portable, as it
+ depends the contents of the timeout after a call to select(2). This was
+ fine for IOMGR_Select(), which has defined behavior, but not for
+ select(2).
+