1 This file contains a threading analysis of Ubik prepared by
2 Jeffrey Hutzelman and sent ot the openafs-devel@openafs.org
3 mailing list in February 2011, archived at:
4 https://lists.openafs.org/pipermail/openafs-devel/2011-February/018329.html
8 This document describes the structure of Ubik, with an eye toward
9 thread-safety and concurrency. It begins by discussing the elements,
10 usage, and locking considerations of major shared data structures. This
11 is followed by a discussion of each major subsystem. The emphasis is on
12 deficiencies in thread-safety and the changes needed to correct them.
14 Most of this document focuses on the user-mode Ubik server code, which
15 comprises the bulk of the Ubik library code and also of the complexity.
16 A separate section near the end is dedicated to Ubik client code, which
17 itself is intended primarily for use in user-mode applications. The
18 OpenAFS cache manager includes its own mechanisms for accessing
19 replicated services, including the Ubik-managed VLDB, and for tracking
20 the servers that provide them.
22 Occasionally, issues related to concurrency, performance, and modularity
23 (particularly as regards support for supporting multiple independent
24 databases in a single server process) are also discussed; however, these
25 discussions are peripheral and are included only to record data obtained
26 while examining the question of thread-safety and as a basis for future
27 work. Similarly, throughout this document are detailed discussions of
28 some parts of the code and certain data structures which, while not
29 always directly relevant to thread-safety, were produced in the course
30 of analyzing the Ubik library code. These are included to provide
31 additional background and as a basis for future work in documentation of
32 Ubik library internals and interfaces.
37 This section calls out the major shared data structures used in the Ubik
38 server code and details what they are used for and how shared access to
39 them is protected. Much of this text is probably suitable as a starting
40 point for expanding on the documentation found in comments in ubik.p.h.
42 struct ubik_dbase - database
44 This structure contains nearly everything Ubik knows about an open
45 database (for an explanation of "nearly", see the section below on
46 globals), including database-wide parameters and state and a list of
49 The following structure elements are set only when the database
50 structure is being initialized, after which they are never modified:
52 + pathName - The base path used to construct database filenames
53 + physical later method pointers (read, write, truncate, sync,
54 stat, open, setlabel, getlabel, getnfiles), used to manipulate
55 the physical database files on disk. These currently always
56 point to the methods defined in phys.c, and in fact other
57 modules contain knowledge of the internal implementation of that
60 The primary synchronization mechanism used to protect this structure
61 is the database lock (versionLock), which is manipulated via the
62 DBHOLD() and DBRELE() macros. This lock protects the following
63 structure elements, as well as a number of globals as described
64 elsewhere in this document:
66 + activeTrans - list of active transactions
67 + version - database version (*)
68 + tidCounter - transaction ID of latest transaction (*)
69 + writeTidCounter - transaction ID of latest write transaction (*)
73 (*) These fields are currently protected by the database lock, except
74 that the beacon thread does not hold that lock when examining them,
75 or the global 'ubik_epochTime', which holds the corresponding epoch.
76 The recommendation is for these fields, along with ubik_epochTime, to
77 be additionally protected by a second lock, allowing the beacon
78 thread to examine them without holding the database lock; see the
79 section on the BEACON package for details.
81 The condition variable 'version_cond' is used to signal to that the
82 database version may have changed; it is broadcast in udisk_commit(),
83 in SDISK_SendFile(), and from the recovery thread; it is monitored by
84 ubik_WaitVersion(), which can be called by an application to wait for
85 a database version change (this is not currently used in OpenAFS).
86 This CV is associated with the database lock. When LWP is used, this
87 condition is signalled on &ubik_dbase->version.
89 The condition variable 'flags_cond' is used by udisk_end() to signal
90 that DBWRITING flag has been cleared. This wakes threads waiting in
91 ubik.c:BeginTrans() to begin a new transaction. This CV is
92 associated with the database lock. When LWP is used, this condition
93 is signalled on &ubik_dbase->flags.
95 When LWP is used, signals are also sometimes sent on &urecovery_state
96 and on &ubik_amSyncSite. However, nothing ever waits on these, and
97 no corresponding condition variables exist when pthreads is used.
99 The 'cachedVersion' field holds the database version represented by
100 the application's cache; this is compared against the current
101 database version to discover if the cache is up to date. Both
102 cachedVersion and the application cache itself are protected by
103 cache_lock, an AFS reader/writer lock.
105 When the application wishes to use the cache, it calls
106 ubik_CacheUpdate(), which verifies that the cache is current,
107 updating it if necessary (under a write lock), and then returns with
108 the application cache read-locked. The application can then rely on
109 the cache's contents not to change until it ends the transaction by
110 calling ubik_EndTrans() or ubik_AbortTrans(). When one of these
111 functions is called, the read lock is dropped, and a write lock may
112 be acquired temporarily to update both cachedVersion and the cache
113 (in the case of committing a write transaction) or to clear
114 cachedVersion, indicating the cache is invalid (in the case of
115 aborting a transaction).
117 In the case of ubik_EndTrans() on a write transaction, the cache is
118 held write-locked until udisk_commit() has completed; this is also
119 done in SDISK_Commit(), which insures that the contents of the
120 on-disk database and page cache cannot change while an active
121 transaction holds the cache lock. Note that both the recovery thread
122 and SDISK_SendFile() may replace the database contents wholesale;
123 however, these operations first abort all outstanding transactions.
125 Prior to the introduction of ubik_BeginTransReadAnyWrite(), an
126 application could choose to manage its own cache, updating it during
127 write transactions and when, at the beginning of a read transaction,
128 it was discovered to be out of date (due to a write done by another
129 server). Under this model, the application uses Ubik transaction
130 locks to protect not only the database but also its cache. However,
131 the use of ubik_BeginTransReadAnyWrite() allows some transactions to
132 proceed and read the database without acquiring any locks, which
133 makes it unsafe to depend on Ubik transaction locks to protect the
134 application cache. Applications which make use of this mode must set
135 ubik_SyncWriterCacheProc to point to a procedure to be called to
136 update the cache during ubik_EndTrans(), after udisk_commit() has
137 succeeded and before the cache_lock is released. These applications
138 must then refrain from ever updating the cache other than during that
139 procedure or during the callback passed to ubik_CheckCache().
142 struct ubik_trans - active transaction
144 This structure represents the state of an active transaction,
145 including transaction metadata and state and a list of all changes
146 made as part of the transaction.
148 Transactions are organized into a linked list referenced by the
149 corresponding database structure, protected by the database lock.
150 Each transaction contains an upward pointer to the database,
151 a transaction type, and a transaction ID, all of which are immutable
152 for as long as the transaction exists.
154 A transaction created via the REMOTE interface exists until ended by
155 a call to SDISK_Abort() or SDISK_ReleaseLocks() or aborted in
156 urecovery_CheckTid() due to a transaction ID mismatch or creation of
157 a new transaction (note that in this last case, cleanup will be
158 deferred if SDISK_Lock() is blocked on this transaction). In these
159 cases, ubik_currentTrans is cleared at the same time, under the
160 database lock, and further REMOTE calls will fail gracefully until
161 a new transaction is started. Thus, REMOTE transactions may be
162 considered valid only for as long as the database lock is held.
164 Transactions created via ubik_BeginTrans(), ubik_BeginTransReadAny(),
165 or ubik_BeginTransReadAnyWrite() exist until ended by a call to
166 ubik_AbortTrans() or ubik_EndTrans(). They are never destroyed from
167 within the UBIK package. Code which discovers a transaction by
168 traversing the database's active transaction list may access the
169 transaction only as long as the database lock is held, since once it
170 is released, an application thread may end the transaction.
172 Each transaction contains the following fields, which are protected
173 by the database lock:
175 + locktype - type of lock held by this transaction
176 + minCommitTime - unused
177 + seekFile - file number of database file pointer
178 + seekPos - position of database file pointer
181 File writes are represented by 'iovec_info', an XDR vector structure
182 (type iovec_wrt, a vector of struct ubik_iovec) containing the file,
183 position, and length of each write, and 'iovec_data', an XDR opaque
184 containing the corresponding data. These vectors and their contents
185 are protected by the database lock.
187 File truncations are represented by a linked list of struct
188 ubik_trunc, each of which contains a file number and the new size of
189 that file. This list, and the records contained in it, are protected
190 by the database lock.
192 While transactions are currently protected by the database lock, it
193 may be desirable in the future to introduce a per-transaction lock,
194 in order to facilitate increased concurrency.
197 struct ubik_server - server status
199 This structure is used to track the set of peer servers, including
200 state related to elections and recovery. The global 'ubik_servers'
201 points to a linked list of these structures, which is constructed
202 during initialization. Once the list has been constructed, no
203 entries are ever added or removed; in addition, the following fields
204 are not changed after startup:
206 + addr[0] - server's primary address
207 + magic - true if this is the magic server
208 + isClone - true if this server is a clone
210 The following fields are used by the BEACON package to track beacons
211 to this server and their results. Currently, they are sometimes
212 accessed under the database lock, and sometimes without benefit of
213 any lock. They should probably be protected by the same lock as
214 globals private to the BEACON package.
216 + lastVoteTime - last time this server voted yes for us
217 + lastBeaconSent - last time a beacon was sent to this server
218 + lastVote - true if its late vote response was yes
219 + up - true if this server is believed to be up
220 + beaconSinceDown - true if a beacon has successfully been sent to
221 this server since the last time it was down
223 The following fields are used by the RECOVERY package to track the
224 state of each server. They are also examined and modified by the
225 various ContactQuorum_* functions. Currently, they are sometimes
226 accessed under the database lock, and sometimes without benefit of
227 any lock. They should be fully protected by the database lock.
229 + version - server's database version
230 + currentDB - true if server's database is up-to-date
232 The addr[] array is used to track all known addresses for the remote
233 server; this list is used when attempting to contact a down server
234 via an alternate address, and to identify the source of incoming
235 server-to-server RPCs. The first element of this array is always the
236 primary address and does not change after initialization; other
237 elements are updated during initialization and when a remote server
238 calls DISK_UpdateInterfaceAddr().
240 The system normally maintains an active Rx connection to the VOTE and
241 DISK services on each server; pointers to these are kept in the
242 'vote_rxcid' and 'disk_rxcid' fields.
244 Currently, the addr[] array, 'vote_rxcid', and 'disk_rxcid' are not
245 protected by any lock. Because these are shared among a number of
246 subsystems and it is usually desirable to avoid blocking on the
247 database lock when making RPCs, it is recommended to introduce a new
248 lock (the "server address lock") to protect these fields and the
249 globals 'ubikSecClass' and 'ubikSecIndex', which contain the security
250 classes used for setting up such connections. The new lock would
251 need to be held in the following cases:
252 + Before beginning any RPC using 'vote_rxcid' or 'disk_rxcid', for
253 the time it takes to call rx_GetConnection() or rx_NewCall()
254 + For the duration of ubikGetPrimaryInterfaceAddr()
255 + In ubeacon_Interact(), when building the set of connections to be
256 used for multi_VOTE_Beacon().
257 + In updateUbikNetworkAddress(), first while constructing the set of
258 connections to be used for multi_DISK_UpdateInterfaceAddr(), and
259 then again each time a server's list of addresses is updated.
260 However, the lock must NOT be held while waiting for the RPC to
261 complete, to avoid deadlock when multiple servers are starting at
263 + For the duration of SDISK_UpdateInterfaceAddr()
264 + For the duration of ubeacon_ReinitServer(), with the optional
265 exception of the call to afsconf_UpToDate().
266 + In src/ubik/recovery.c:DoProbe(), first while constructing the set
267 of connections to be used for multi_DISK_Probe() and, if the probe
268 is successful, again while destroying the old connections for the
269 server under test and replacing them with new ones. See the
270 section on the RECOVERY package for additional notes.
275 The library has an unfortunate number of global variables which are not
276 part of any data structure. Nonetheless, some of these are
277 database-specific; see MULTI-DATABASE SUPPORT, below.
279 Globals private to a particular subsystem are discussed in the
280 description of that subsystem. This section discusses global variables
281 which are shared and/or part of external interfaces.
283 The following globals are used to convey configuration parameters from
284 the application to the Ubik library, and are intended to be set before
285 the first call to ubik_ServerInit(). They are not modified by Ubik, and
286 are not protected by locks:
288 + ubik_debugFlag - print debug messages (unused)
289 + ubikPrimaryAddrOnly - use only primary interface
290 + ubik_nBuffers - number of DISK package page cache buffers
291 + ubik_SRXSecurityProc - callback to create server security class
292 + ubik_SRXSecurityRock - argument for ubik_SRXSecurityProc
293 + ubik_CRXSecurityProc - callback to create client security class
294 + ubik_CRXSecurityRock - argument for ubik_CRXSecurityProc
295 + ubik_SyncWriterCacheProc - callback to update application cache
296 + ubik_CheckRXSecurityProc - callback to check if caller is OK
297 + ubik_CheckRXSecurityRock - argument for ubik_CheckRXSecurityProc
299 The following globals contain values which are computed during Ubik
300 initialization and not modified thereafter. They are not protected by
301 any lock, which is safe so long as suitable memory barriers exist across
302 creation of threads and Rx services (see the notes on initialization in
303 the UBIK package section). Note that some of these values are
304 database-specific and properly belong in the ubik_dbase structure:
306 + ubik_servers - linked list of servers (see above)
307 + amIClone - true if this server is a clone
308 + ubik_callPortal - UDP port used for server-to-server RPCs
309 + ubik_quorum - number of servers to make a quorum
310 + ubik_dbase - pointer to the (only) database (see above)
311 + ubik_host[] - array of local addresses
312 + ubik_sc[] - security classes for Ubik RPC services
314 The following globals are currently protected by the database lock,
315 though in most cases there are existing references which do not hold the
316 lock. Later sections of this document contain recommendations in each
317 case to establish new locks to protect these or to correct cases where
318 they are accessed without the correct lock.
320 + ubik_currentTrans - pointer to REMOTE's current transaction
321 + ubik_epochTime - epoch for transaction IDs
322 + urecovery_state - recovery state bits
323 + ubik_dbVersion - sync site's database version
324 + ubik_dbTid - sync site's transaction ID
325 + ubikSecIndex - security class index for making Ubik RPCs
326 + ubikSecClass - security class for making Ubik RPCs
328 The global 'ubik_amSyncSite' belongs to the BEACON package, and should
329 be protected by the same lock as that package's private globals. In
330 fact, it should probably be made private; the only references outside
331 BEACON are in SVOTE_Debug() and SVOTE_DebugOld() and can easily be moved
332 into ubeacon_Debug().
337 This section describes each of the major Ubik server subsystems. For
338 each subsystem, a brief overview is provided, along with a description
339 of that subsystem's handling of synchronization (or lack thereof).
340 Subsystems are described beginning with the lowest layer and working
345 This package is responsible for handling I/O to database files. It
346 maintains a private, global array of MAXFDCACHE(4) fdcache
347 structures, each representing an open file descriptor on a database
348 file, either presently in use or idle and available for use. Entries
349 are indexed by database file ID, and each contain a file descriptor
350 number and a refcount. There is also a global 'inited' flag, and
351 a buffer used to construct pathnames when opening database files.
353 The static function uphys_open() takes a database pointer and fileid
354 and attempts to open the corresponding file. On success, it returns
355 a file descriptor; on failure, it returns the same as open(2) (with
356 errno set). This prefers to reuse an already-open fd whose cache
357 refcount is 0; failing that, it opens a new fd and attempts to enter
358 it in the cache. It prefers first to use an unused cache entry, then
359 to reclaim an idle one whose refcount is 0; if neither is possible,
360 the newly-opened fd is returned without caching. Note that the
361 refcount on an active entry is always exactly 1; the same entry
362 cannot be referenced more than once.
364 The function uphys_close() releases a file descriptor opened by
365 uphys_open(). If the file descriptor was cached, the cache entry is
366 dereferenced but the file is left open for future use; 0 is returned.
367 If the file descriptor is not found in the cache, then the return
368 value is that of an immediate call to close(2). This function also
369 handles cleanup when a cached fd is closed after the file to which it
370 refers is invalidated by uphys_invalidate(). This function is not
371 static, but may as well be -- it is used only by other functions in
374 Functions in this package are called via method pointers in the
375 database structure, which are set during initialization and not
376 changed thereafter. These functions must be called with the database
377 lock held; this protects the file descriptor cache and other globals
378 mentioned above, and prevents conflicting access to database files.
380 The requirement that functions in this package be called with the
381 database lock held means that concurrent access to database files is
382 not possible. This situation could be improved by the addition of
383 a new global lock, which would be held for most of the duration of
384 uphys_open(), uphys_close(), and uphys_invalidate(), but need _not_
385 be held during actual file accesses. This would eliminate the
386 requirement that calls to this package be made with the database lock
387 held, allowing multiple calls to be active concurrently. However, it
388 would still be necessary for higher layers to employ appropriate
389 synchronization as needed to maintain database consistency.
392 DISK - Logical Database I/O and Transaction Management
394 This package is responsible for logical database file I/O, logging,
395 and transaction management. It maintains a private cache of database
396 file page buffers. This entire cache, including buffer space, is
397 allocated and initialized the first time a transaction is started;
398 the global initd flag keeps track of whether or not this has been
399 done. The size of this cache defaults to 20 pages, but may be
400 overridden by setting the global 'ubik_nBuffers' (all OpenAFS Ubik
401 services do this); this global is not protected by any lock, but is
402 read exactly once, when the cache is initialized.
404 Except for udisk_Debug() (see below), all udisk_* interfaces require
405 a pointer either to the database or to an active transaction, and
406 must be called with the database lock held. This protects the DISK
407 package's accesses of the active transaction list and associated
408 state variables as well as its calls into the physical access layer,
409 including insuring that operations requiring multiple calls into that
410 layer are performed atomically.
412 In addition, the database lock protects the page cache hash table and
413 LRU queue, the buffer control structures, buffer contents, the
414 truncation record free list (freeTruncList), and global statistics.
415 However, it may be desirable in the future to introduce a separate
416 lock to protect these structures, in order to facilitate multiple
419 udisk_LogOpcode, udisk_LogEnd, udisk_LogTruncate, udisk_LogWriteData
421 These are internal interfaces for writing to the transaction log.
422 They each construct a log record and then directly call the
423 physical layer to append it to the log; the log file is _not_
424 cached by the page buffer cache, and these functions do not touch
425 the cache. Each of these must be atomic with respect to other
426 writes to the log file, which is accomplished by insuring they are
427 called only with the database locked.
429 DInit, DRead, DTrunc, DRelease, DFlush, DAbort, DSync, DNew
431 These are internal interfaces for accessing the page buffer cache.
432 They must be called with the database lock held.
434 DRead() and DNew() return a pointer to the page data buffer, not
435 to the buffer control structure; this pointer is later passed to
436 DRelease() to release the buffer. This mechanism depends on the
437 fact that buffer control structures are allocated in the same
438 order as the actual page data buffers.
440 DRead() guarantees that a read transaction will always see a clean
441 copy of the requested page, without any modifications made by an
442 in-progress write transaction. However, this is true only if the
443 next layer insures that once DRead() is called with write intent
444 on a page, that no further calls to DRead() are made for that page
445 until the buffer is written and DRelease() is called with the
446 dirty flag. Presently, this is the case because udisk_read() and
447 udisk_write(), which are the only callers of DRead(), are both
448 called with the database lock held.
450 udisk_read, udisk_truncate, udisk_write, udisk_begin, udisk_commit,
451 udisk_abort, udisk_end, udisk_Invalidate
453 These are external interfaces provided by the DISK package to
454 higher layers. Presently, the locking requirements of these
455 functions and the lower-layer functions they call are satisfied by
456 the simple rule that these must always be called with the database
457 lock. In the future, support for multiple databases and/or
458 a greater level of concurrency may require relaxing this rule
459 and/or acquiring additional locks within these functions.
461 In udisk_commit(), when committing the first write transaction
462 after becoming sync site, the database is relabelled. In this
463 case, the UBIK_RECLABELDB recovery state bit should be set before
464 propagating the label change to other servers, rather than after.
465 This is because because ContactQuorum_DISK_Setversion() will
466 release the database lock, at which point the recovery state may
467 be cleared by urecovery_ResetState() running in another thread.
468 It is important that a relabelling which occurs before recovery
469 state is cleared not result in the UBIK_RECLABELDB recovery state
470 bit being set after; otherwise, the server may fail to correctly
471 relabel the database the next time it becomes sync site.
475 The udisk_Debug() interface is called as part of preparing the
476 response to Ubik debugging requests. Unlike other udisk_* APIs,
477 this interface is not called on a specific database and does not
478 require the database lock to be held. It reports the version data
479 of the single database identified by the global 'ubik_dbase' and
480 traverses the buffer cache collecting statistics, both without
481 benefit of holding any locks. This may produce inconsistent data
482 but does not normally risk damaging data structures or accessing
483 invalid or uninitialized memory (but see below), and avoiding
484 locking may be of greater benefit than guaranteeing the return of
487 The global 'ubik_dbase' pointer is set when the first (only)
488 database is initialized, and the buffer cache data structures are
489 allocated the first time a transaction is started, either locally
490 or via the REMOTE package. Once set, the pointers used by
491 udisk_Debug() are never changed, and the data structures they
492 point to remain valid and are never freed. The buffer cache is
493 traversed by iterating over the array of buffer cache elements
494 (this is done in other places as well), not by following a linked
495 list or other data structure that may reordered, and so should not
498 At a minimum, then, it should be arranged that udisk_Debug()
499 cannot be called before the database structure and page cache have
500 been initialized. This means that both of these actions must be
501 taken before any incoming RPCs are accepted. Practically, this
502 means that DInit() should be renamed udisk_init(), and called from
503 ubik_ServerInitCommon() during initialization.
505 Full correctness demands that udisk_Debug actually hold the
506 database lock while copying the database version and examining the
507 page cache. However, it may be desirable to sacrifice this
508 correctness in order to reduce interaction between the debugging
509 interfaces and the rest of the system.
512 + unthread - remove transaction from activeTrans
513 + Dlru, Dmru - move page to front/end of LRUQ
514 + MatchBuffer - check if a buffer's dbase/file/page match
515 + FixupBucket - move page to the right bucket
516 + newslot - find and allocate an unused buffer
517 + DedupBuffer - throw away stale copies after page is flushed
518 + GetTrunc, PutTrunc - trunc record allocation
519 + FindTrunc - find active trunc record for file
520 + DoTruncs - apply truncations
523 LOCK - Database Transaction Locking
525 This package is responsible for managing database locks held as part
526 of a transaction. It supports a single read/write lock, which is
527 represented by the global variable 'rwlock', a struct Lock. The
528 global 'rwlockinit' records whether Lock_Init() has been called on
529 this lock, and is protected by the database lock. Properly, this
530 lock should be treated as belonging to the database.
532 ulock_getLock() expects to be called with the database lock held; it
533 depends on this to protect access to fields within the transaction
534 structure as well as to the global 'rwlockinit'. However, note that
535 the database lock must be temporarily released while obtaining the
536 read/write lock, and so ulock_getLock() should not be called in the
537 middle of a critical section. The 'await' parameter may in theory be
538 set to 0 to effect a non-blocking lock; however, this depends on
539 (incorrect) knowledge of the internals of struct Lock. Fortunately,
540 existing Ubik code always sets await=1. This feature should be fixed
543 ulock_relLock() releases the read/write lock. It expects to be
544 called with the database lock held. Unlock ulock_getLock(), this
545 function does not release the database lock.
547 ulock_Debug() reports the state of the read/write lock, for debugging
548 purposes. It depends on knowledge of the internals of struct Lock,
549 and accesses both the Lock structure and the global rwlockinit
550 without benefit of holding any lock.
552 It would probably be best to eliminate 'rwlockinit' in favor of
553 always initializing the global 'rwlock' during startup; ulock_Debug()
554 would then have no need to examine the flag. Further, ulock_Debug()
555 should at a minimum use the appropriate locks when looking inside
556 struct Lock; ideally, knowledge of that structure's internals should
557 be abstracted into appropriate interfaces in src/lwp/lock.[ch].
560 VOTE - Voting Logic and VOTE_* RPCs
562 This package is responsible for keeping track of who is currently
563 sync site and deciding how to vote and whether this server should try
564 to become sync site. It provides the VOTE RPC package, which is
565 primarily responsible for processing vote requests from peer servers
566 (VOTE_Beacon), but also provides debugging interfaces and an
567 interface (VOTE_GetSyncSite) to allow clients to discover the current
570 Voting decisions are made using state tracked in a set of global
571 variables, listed below. Currently there is no locking protecting
572 these variables; LWP-based Ubik depends on the cooperative nature of
573 LWP and on the fact that routines which examine and manipulate them
574 do not block. Introduction of a new "vote lock" is recommended to
575 protect the following globals:
577 + ubik_lastYesTime - last time VOTE_Beacon voted "yes"
578 + lastYesHost - host voted for at ubik_lastYesTime
579 + lastYesClaim - time lastYesHost started its voting cycle
580 + lastYesState - whether lastYesHost claimed to be sync site
581 + lowestHost - lowest host seen
582 + lowestTime - last time lowestHost was updated/heard from
583 + syncHost - sync host, or 0 if none known
584 + syncTime - last time syncHost was updated
585 + ubik_dbVersion - sync site's database version
586 + ubik_dbTid - sync site's transaction ID
588 The vote lock should be initialized in uvote_Init() and held for the
589 remainder of that function and also during of uvote_ShouldIRun(),
590 uvote_GetSyncSite(). It should also be held for the bulk of
591 SVOTE_Beacon() -- specifically, it should be acquired just before
592 beginning the lowest-host calculation, and released just before
593 calling urecovery_CheckTid(), in the event of a yes vote or else just
596 In addition, the vote lock would protect the globals 'ubik_dbVersion'
597 and 'ubik_dbTid', which represent this server's knowledge of the
598 state of the database on the sync site. This variable is set in two
599 places in the REMOTE package, which change the local and sync site
600 database versions at the same time. It is also examined in one place
601 in the REMOTE package, and compared to the database version in the
602 RECOVERY package. It is imperative that when the local and remote
603 versions are changed at the same time, these changes become visible
604 to the RECOVERY package at the same time. Thus, both the changes and
605 the comparison must be done under both locks. For this purpose,
606 three new functions should be provided by the VOTE package:
608 + A function to safely set 'ubik_dbVersion'
609 + A function to compare 'ubik_dbVersion' to another value
610 + A function to check whether this is sync site and compare
611 'ubik_dbVersion' to another value, without releasing the vote lock
612 in between (to be used in recovery).
614 Existing accesses to 'ubik_dbVersion' in REMOTE and RECOVERY, all of
615 which are already made under the database lock, would be replaced
616 with calls to the new accessors.
618 The result of these changes will require in several places that the
619 vote lock be acquired while the database lock is already held.
620 Therefore, the vote lock MUST NOT be held while acquiring the
623 SVOTE_Beacon() currently calls urecovery_CheckTid() without benefit
624 of the database lock, which must be held when calling that function.
625 This is fairly simple to fix; however, the vote lock must be released
626 first. One simple approach would be to move the call to
627 urecovery_CheckTid() to after the point where the vote lock will be
628 released, wrapping it in DBHOLD/DBRELE, and conditionalizing the
629 whole thing on 'vote' being nonzero.
631 Separately, it may be worth examining whether it is appropriate to
632 call urecovery_CheckTid() only when voting yes, and not whenever
633 a beacon is received from a server claiming to be sync site.
635 As with the debug functions in other packages, SVOTE_Debug() and
636 SVOTE_DebugOld() access global variables belong to this and other
637 modules without benefit of the appropriate locks. There is also very
638 little abstraction here. It is probably desirable to introduce Debug
639 interfaces for the VOTE and RECOVERY packages, to be called from
640 these functions. Full correctness requires...
642 + Holding the vote lock while accessing the VOTE package globals
644 + Holding the database lock while accessing the database flags and
645 tidCounter, ubik_epochTime, ubik_currentTrans, and urecovery_state
648 BEACON - elections, tracking whether this is sync site
650 This package is responsible for running elections, collecting votes,
651 and deciding whether and for how long this is the sync site. It also
652 contains routines responsible for setting up the list of servers and
653 acquiring alternate addresses for other servers during startup.
655 For this purpose, it maintains a number of global variables and
656 ubik_server structure fields. Currently, these variables and fields
657 are sometimes accessed under the database lock, and sometimes without
658 benefit of any lock. In order to allow this package to operate with
659 a minimum of disruption due to database accesses, it is recommended
660 that a new lock be established to protect these. It is particularly
661 desirable to prevent loss of sync due to the potentially long-lived
662 RPCs the RECOVERY package must make under the database lock when
663 transferring full database copies between servers; for this reason,
664 it is important that both the BEACON thread and SVOTE_Beacon() be
665 able to operate without acquiring the database lock. The global
666 variables and ubik_server structure fields to be protected by the new
667 lock are the following:
669 + ubik_amSyncSite - true if this is the sync site
670 + syncSiteUntil - time until which this is the sync site
671 + ts->lastVoteTime - last time this server voted yes for us
672 + ts->lastBeaconSent - last time a beacon was sent to this server
673 + ts->lastVote - true if its late vote response was yes
674 + ts->up - true if this server is believed to be up
675 + ts->beaconSinceDown - true if a beacon has successfully been sent
676 to this server since the last time it was down
678 The new lock would need to be held in the following cases:
679 + In ubeacon_AmSyncSite(), from the first time 'ubik_amSyncSite' is
680 examined until after the potential call to urecovery_ResetState().
681 + In ubeacon_Interact(), when building the list of servers to which
682 to send beacons (to protect access to the 'up' flag). Note that
683 the server address lock must also be held during this loop, and
684 must be acquired _after_ the beacon lock.
685 + In ubeacon_Interact(), when updating status associated with
686 a server after receiving its beacon response.
687 + In ubeacon_Interact(), when updating globals after an election.
688 However, it must be released before calling urecovery_ResetState().
689 + In updateUbikNetworkAddress(), when marking a server down.
690 + In the various ContactQuorum_* functions, when checking whether
691 a server is up before calling it, and again when marking a server
692 down after a call fails.
693 + In ubik_EndTrans, when examining a server's 'beaconSinceDown' flag
694 and 'lastBeaconSent' timestamp to determine whether it has recently
695 gone down, requiring a brief hold until it either comes back up or
698 The following global variables are set during initialization and do
699 not change thereafter:
701 + nServers - number of voting servers
702 + amIMagic - true if this server gets the extra half vote
703 + ubik_singleServer - true if this is the only server
705 Currently, all initialization of this module is done _after_ the Rx
706 services have been created and an Rx server thread started. This is
707 done to avoid a situation in which all servers are blocked in
708 updateUbikNetworkAddress(), which is responsible for exchanging
709 addresses with peer servers, while none of them is prepared to
710 handled the DISK_UpdateInterfaceAddr call made by that function.
712 When using LWP, this early initialization of Rx is not a problem,
713 because the various initialization routines don't actually block until
714 updateUbikNetworkAddress() waits for RPCs to complete. However, when
715 using pthreads, it is a problem, because it means that incoming RPCs
716 may be processed when initialization is not complete.
718 To address this problem and insure orderly initialization, it is
719 recommended to change the initialization sequence so that most work,
720 including the bulk of BEACON package initialization, happens before
721 Ubik's Rx services have been created. The only work that should be
722 deferred until after that point is updateUbikNetworkAddress(),
723 followed by starting the long-running threads belonging to the BEACON
724 and RECOVERY packages. This requires updateUbikNetworkAddress() to
725 be exported (and perhaps to undergo a name change).
727 The major part of the work of this package is done in the function
728 ubeacon_Interact(), which is the body of a long-running thread known
729 as the beacon thread. This thread is responsible for periodically
730 sending "beacons" to other servers to request votes (when the sending
731 server is the sync site, these also include how long it will remain
732 so, the database version, and the active transaction ID). Beacons
733 are sent only from servers that are already sync site or believe they
734 are the best candidate, as determined by uvote_ShouldIRun().
736 On the sync site, the beacon thread is responsible for insuring that
737 new votes are collected frequently enough to avoid loss of quorum.
738 This means it must not block for an extended time; therefore, it must
739 not acquire the database lock, which in other threads may sometimes
740 be held during long-running operations (most notably, database file
741 transfers done under this lock by the recovery thread). Instead,
742 a new lock is proposed (the "version lock"), which is used to allow
743 the beacon thread to examine (but not modify) certain fields without
744 holding the database lock.
746 The version lock should be acquired _in addition to_ the database
747 lock when modifying 'ubik_epochTime' or the database 'version',
748 'flags', 'tidCounter', and 'writeTidCounter' fields; such
749 modifications occur at the following locations:
750 + in ubik.c:BeginTrans(), when beginning a new transaction
751 + in udisk_begin(), when setting DBWRITING
752 + in udisk_commit(), when updating the database version at the end of
753 a write transaction (note this includes the case of relabelling the
754 database on the first write transaction after becoming sync site)
755 + in udisk_end(), when clearing DBWRITING
756 + in recovery.c:InitializeDB(), when setting the initial version of
758 + in urecovery_Interact(), when labelling a new database the first
759 time quorum is established
760 + in urecovery_Interact(), after fetching the latest database from
761 another server (whether successful or not; there are two cases)
762 + in SDISK_SendFile(), when updating the database version both before
763 and after receiving a new database from the sync site. Note the
764 lock must _not_ be held during the transfer.
765 + in SDISK_SetVersion()
767 The version lock should be acquired by the beacon thread while
768 examining these fields, shortly before calling multi_VOTE_Beacon().
769 Since it must release the lock before making that call, the database
770 version will need to be explicitly copied into a local variable, as
771 is already done with the current transaction ID.
773 Note that if UBIK_PAUSE is defined, the DBVOTING flag poses an
774 additional problem, as it must be modified by the beacon thread. If
775 it is desirable to continue to support this variant, then it becomes
776 necessary for all accesses to the database flags to be protected by
777 the version lock. Other UBIK_PAUSE code may not have been reviewed
778 in depth and may pose additional problems.
780 The function ubeacon_AmSyncSite() is called both from the beacon
781 thread and from various other points to determine whether they are
782 running on the sync site, and thus whether various operations are
783 safe and appropriate. Calls to this function from outside the beacon
784 thread (often, via urecovery_AllBetter()) must be made with the
785 database lock held, and for operations modifying the database or
786 relying on its contents to remain constant, this lock must not then
787 be released until the operation is complete. The does not guarantee
788 that this will remain sync site for the duration of the operation,
789 but does insure that any operations done after or as a result of
790 losing sync will in fact occur after the operation holding the lock.
792 Of course, ubeacon_Interact() itself cannot call ubeacon_AmSyncSite()
793 with the database lock held, since it must not block on that lock
794 when running on the sync site (otherwise sync could be lost, as
795 described above). So, that portion of ubeacon_AmSyncSite() other
796 than the call to urecovery_ResetState() should be abstracted out into
797 a new function, which presumably returns separate flags indicating
798 both whether it is in fact running on the sync site (which becomes
799 the return value of ubeacon_AmSyncSite()) and whether a call to
800 urecovery_ResetState() is indicated. When the latter is set, the
801 beacon thread can acquire the database lock (since, if this is not
802 the sync site, blocking on this lock is not a problem) and call
803 urecovery_ResetState().
805 This difference in behavior is acceptable because the required
806 invariant is that when 'ubik_amSyncSite' is cleared, it cannot become
807 true again until urecovery_state has also cleared. This insures that
808 once a server discovers it is not sync site, urecovery_AllBetter()
809 cannot succeed on the basis of being sync site until sync is regained
810 and recovery has run, insuring the database is up to date. Since
811 only the beacon thread ever sets 'ubik_amSyncSite' true, that thread
812 can meet the invariant by calling urecovery_ResetState() before doing
813 anything else, while other threads must continue to hold the beacon
814 lock to prevent the beacon thread from setting 'ubik_amSyncSite'.
816 ubeacon_Debug() reports the value of syncSiteUntil, without benefit
817 of obtaining any locks. It should also report the value of
818 'ubik_amSyncSite', allowing that global to be made private to the
819 BEACON package. Full correctness requires that the beacon lock be
820 held while accessing these globals.
823 RECOVERY - Consistency, Down Server Recovery, Propagation, Log Replay
825 This package is responsible for tracking whether the local database
826 is current and usable and for discovering when a server comes back up
827 after having been marked down. On the sync site, it is also
828 responsible for tracking the database versions on remote servers and
829 for recovering from inconsistencies.
831 To carry out these tasks, the RECOVERY package maintains a single
832 global, 'urecovery_state'. This is not currently effectively
833 protected by any lock; however, it should be protected by the
834 database lock. This is particularly important because high-level
835 routines that manipulate the database depend on a check against this
836 field to determine that the database is up-to-date and safe to
837 modify, and that it will remain so until the operation is complete.
839 The main work of this package is done by urecovery_Interact(), which
840 is the body of a long-running thread known as the recovery thread.
841 This thread wakes up periodically to discover if any down servers
842 have come back up and, on the sync site, to locate the latest
843 available database, fetch it if necessary, and make sure it has been
844 distributed to all servers. This work is done in several steps,
845 taken on in sequence every few seconds:
847 + The first part of urecovery_Interact() is to identify servers which
848 are believed to be down, and determine whether any of them have
849 come back up. This is done every 30 seconds, even when this is not
850 the sync site. Currently, the database lock is held for the
851 duration of this loop, released only in DoProbe() while actually
852 waiting for the DISK_Probe() RPC to complete. This is mostly
853 unnecessary; the server probe loop needs this lock only to examine
854 the database 'currentDB' flag and to clear the UBIK_RECFOUNDDB bit,
855 and DoProbe() doesn't need it at all. However, the beacon lock is
856 needed to examine and update the database 'up' flag, and DoProbe()
857 needs to hold the server address lock when examining server
858 addresses and, if the server comes back up, to replace connections.
859 See the section on struct ubik_server for more details on this
862 + The next step is to determine whether this is the sync site, and
863 update the UBIK_RECSYNCSITE bit appropriately. The database lock
864 should be held for the duration of this step, as
865 ubeacon_AmSyncSite() may end up calling urecovery_ResetState(),
866 which requires that lock be held. If this is not the sync site,
869 + Step three is to determine the location of the latest database.
870 This is done by making DISK_GetVersion() calls to all active
871 servers, keeping track of the newest version found and the server
872 which reported it. Once this is done, the best version is compared
873 to the local version, update the UBIK_RECHAVEDB bit (so that it is
874 true if and only if the local version is the latest) set the
875 UBIK_RECFOUNDDB bit to indicate the latest database has been
876 located, and clear the UBIK_RECSENTDB bit to force any servers with
877 out-of-date databases to be updated.
879 In this section, it is not necessary to hold any locks while
880 collecting versions on remote servers, though as before, it is
881 necessary to hold the beacon lock while examining the server 'up'
882 flag, and the address lock to obtain a reference on each connection
883 as it is used (but this lock should not be held during the RPC!).
884 Once all remote versions have been fetched, the database lock is
885 needed while examining the local version and updating
886 'urecovery_state'. Note that write transactions may happen while
887 versions are collected, resulting in some servers having newer
888 versions than were detected. However, if this happens, the result
889 is always consistent and ends with the sync site having the latest,
892 + If the actual latest version X.N before the write was in an epoch
893 created by the current sync site, then that version must
894 necessarily already be present on the sync site, since writes are
895 always committed first on the sync site. Thus, when the sync
896 site updates the version counter, it produces a new version X.N+1
897 which no other server can believe it already has; thus, there is
898 no possibility of inconsistency.
900 + If the actual latest version X.N before the write was in an epoch
901 not created by the sync site, then it is possible that after
902 quorum was established but before any writes happened, a new
903 server came online which had a newer version X.N+1 (that was
904 committed to less than a quorum of sites before a crash). If
905 a write transaction begins before recovery detects this fact and
906 fetches the version X.N+1 database, then the new write will start
907 with the version X.N database, essentially creating a version
908 fork. However, when this happens, the resulting database will be
909 version Y.1 (with Y>X), because the first write on a new sync
910 site always results in a new database. Thus, the version X.N+1
911 database will no longer be latest (because the local version,
912 which is examined last, is Y.1), and will end up being discarded.
913 This is no worse than if the server with the version X.N+1
914 database had not come up until completely after the transaction
915 resulting in version Y.1.
917 + Step four is to fetch the latest database, if the sync site does
918 not already have it. Since this step replaces the database
919 wholesale, there can be no active transactions while it runs, and
920 all other database activity must be locked out for the duration of
921 the transfer. This means the database lock must be held for this
922 entire step, starting from the check that the UBIK_RECSYNCSITE
923 and/or UBIK_RECFOUNDDB bits are still set. (Note that the existing
924 comment which claims that it is impossible for UBIK_RECFOUNDDB not
925 to be set at this point is true only if the database is not
926 released since checking or updating that bit in step three). In
927 addition, the address lock must be held while setting up the call
928 to be used for DISK_GetFile().
930 + Step five, again after verifying that this is still the sync site
931 and that the UBIK_RECHAVEDB bit is set, is to distribute the new
932 database to any servers which do not already have it. Before this
933 can be done, if the database was newly-initialized (with label
934 epoch 1), it is relabeled with version 2.1 before it is distributed
935 to other sites. This allows readany transactions to run on such
936 a database (though probably not successfully, since the database is
937 empty); it is deferred until this point in recovery to insure that
938 if any server in the quorum contains a valid database, that version
939 is used rather than the empty one.
941 For this step, the database lock should be held starting at the
942 point at which the UBIK_RECSYNCSITE and/or UBIK_RECHAVEDB bits are
943 tested. In the event the DBWRITING flag is set, the lock must be
944 released while waiting for the active write transaction to end. In
945 this case, once the wait is completed, it is necessary to again
946 check urecovery_state to determine that this is still the sync site
947 and have an up-to-date database before proceeding.
949 It is probably simplest for urecovery_Interact() to acquire the
950 database lock at the start of step 2 described above, release it for
951 the duration of the version-collection part of step 3 and while
952 waiting for write transactions to terminate in step 5, and otherwise
953 hold it until the end of the cycle. If no work is needed, this means
954 the lock will be held relatively briefly; if it is necessary to fetch
955 and or send full database versions, these operations themselves must
956 be done with the lock held, and there is little point in releasing it
957 between them. However, it is acceptable to release the lock between
958 any steps described above, provided that each time the database lock
959 is released and reacquired, 'urecovery_state' is checked to verify
960 that no regression has occurred. Note that outside of the recovery
961 thread, the only changes that may occur to 'urecovery_state' are that
962 it may be completely cleared by urecovery_ResetState(), and that the
963 UBIK_RECLABELDB bit may be set by udisk_commit(), on the sync site.
965 The function urecovery_AllBetter() is called under the database lock
966 by routines which need to know whether there is a usable database.
967 As noted in the discussion of ubeacon_AmSyncSite() above, callers
968 must then continue to hold the database lock until any database
969 operation is complete. Further, callers which intend to write to the
970 database currently follow a call to urecovery_AllBetter() with a call
971 to ubeacon_AmSyncSite(), to determine whether a write is permissible.
972 Unfortunately, this is not safe, because it is possible (though
973 admittedly unlikely) for urecovery_AllBetter() to succeed on the
974 basis of being a non-sync-site with an up-to-date database, and then
975 for a subsequent call to ubeacon_AmSyncSite() to return a true result
976 (because the latter may block on acquiring the beacon lock, during
977 which time this may become the sync site). To insure that no write
978 operation is performed without an up-to-date, writeable database,
979 urecovery_AllBetter() should be modified to indicate whether a write
980 operation is permitted, which is the case only when UBIK_RECHAVEDB is
981 tested and found to be set.
983 The functions urecovery_AbortAll(), urecovery_CheckTid(), and
984 urecovery_ResetState() all expect to be called with the database lock
987 The function urecovery_LostServer() need not be called with any
988 locks; it is merely a wrapper for ubeacon_ReinitServer().
990 The RECOVERY package is also responsible at startup for replaying the
991 transaction log and initializing an empty database. These tasks are
992 handled by urecovery_Initialize() and its helpers, ReplayLog() and
993 InitializeDB(). Because the helpers use PHYS package functions and
994 manipulate database structure fields, urecovery_Initialize() should
995 acquire the database lock before calling them.
998 REMOTE - remote database I/O (DISK) RPCs
1000 This package provides the DISK RPC package, which mediates remote
1001 access to the local database copy, when another server is sync site.
1002 It also includes DISK_UpdateInterfaceAddr(), which is used during
1003 initialization to verify consistency of configuration across servers,
1004 and DISK_Probe(), which is used by the RECOVERY module to determine
1005 whether the server is up.
1007 As the functions in this package are RPC implementations, they may be
1008 called at any time and are entirely responsible for their own
1009 locking. There can be at most one active remote write transaction at
1010 a time; this rule is enforced by this package, which maintains the
1011 global 'ubik_currentTrans' as a pointer to the active transaction.
1013 For the most part, the RPCs in this module use a common prologue,
1014 which includes acquiring the database lock, and do not release that
1015 lock until just before returning. However, the common prologue
1016 examines both 'ubik_currentTrans' and the transaction flags without
1017 benefit of the database lock. This can be corrected fairly easily by
1018 acquiring the database lock sooner; however, this requires referring
1019 to the database via the ubik_dbase global rather than via the 'dbase'
1020 pointer in 'ubik_currentTrans'.
1022 Note that SDISK_GetVersion() and SDISK_SetVersion() contain a sanity
1023 check to ensure they are not running on the sync site, since these
1024 are called only from that portion of the recovery loop which runs
1025 only on the sync site. These calls to ubeacon_AmSyncSite() must be
1026 made with the database lock held, for the results to be valid. The
1027 same holds true for a commented-out check in SDISK_GetFile() (which
1028 really should remain that way -- there's no reason not to allow any
1029 authorized caller to do this at any time).
1031 Additionally, note that SDISK_Commit() acquires a write lock on the
1032 application cache, to insure that no transactions are running which
1033 depend on the application cache, page cache, or underlying database
1034 to change. The locking order requires that this lock be acquired
1035 before taking the database lock; thus, both locks must be acquired
1038 SDISK_SendFile() is really quite bad in its handling of the database
1039 lock under error conditions. Presently, an error condition may cause
1040 it to release the lock too early nor not at all.
1042 SDISK_UpdateInterfaceAddr() is called by peer servers as part of
1043 their startup process, to trade lists of additional addresses and
1044 insure sufficient agreement on configuration to allow the new server
1045 to start. It will need to acquire the lock protecting the addr[]
1046 array attached to each host.
1049 UBIK (high-level API)
1051 This package contains the high-level APIs exported by Ubik to
1052 applications, including initialization, the APIs for creating and
1053 using transactions, and various utilities. It also contains a number
1054 of internal utilities.
1058 Initialization of Ubik is handled in ubik_ServerInitCommon(),
1059 which is called by two API functions, ubik_ServerInit() and
1060 ubik_ServerInitByInfo(), which provide different interfaces for
1061 passing configuration information. This code handles
1062 initialization of the database, locks, and various globals, calls
1063 the initialization routines for each subsystem, makes sure Rx has
1064 been initialized, and creates Ubik's Rx services and threads.
1066 Most of this is done before any Ubik-specific threads are created
1067 and before Ubik's RPCs can be called, and so takes place without
1068 holding any lock. However, since Rx may have been initialized and
1069 started before Ubik, it is possible that some Rx worker threads
1070 may have already been created. In practice, locks internal to Rx
1071 should insure that any memory writes made before an Rx service is
1072 created become visible to any worker running on behalf of that
1073 service; nonetheless, this should not be depended on, and proper
1074 locking should be used during all phases of initialization.
1076 Unfortunately, the current code actually starts Rx services before
1077 initialization is complete, resulting in a number of possible
1078 races. The general THREAD SAFETY section below contains
1079 recommendations for reordering the startup process to eliminate
1080 this problem, while still avoiding potential deadlock if multiple
1081 servers start at once.
1085 Nearly a third of ubik.c consists of various ContactQuorum_*
1086 functions, which are used to make a particular RPC to every server
1087 in the quorum. These used to be a single ContactQuorum()
1088 function, but have been split into multiple functions to regain
1089 type-safety. These retain a common prologue and epilogue, some
1090 small part of which has been abstracted out, but most of which
1091 remains duplicated in each of these functions. Indeed, the only
1092 part unique to each function is the actual RPC call and, in the
1093 case of ContactQuorum_DISK_WriteV(), code to fall back to multiple
1094 calls to DISK_Write() on a per-server basis.
1096 The code common to all ContactQuorum_* functions will need to
1097 acquire the beacon lock when checking whether a server is up (no
1098 calls are made to servers which are not marked up), and again when
1099 marking a server down. Note that when a server is marked down,
1100 the 'up' and 'beaconSinceDown' flags must be cleared at the same
1101 time, without releasing the beacon lock. In addition, the helper
1102 function Quorum_StartIO() must obtain the server address lock
1103 while it obtains a reference on the Rx connection that will be
1106 All of the ContactQuorum_* functions must be called with the
1107 database lock held; this is necessary to protect their access to
1108 each server's 'version' and 'currentDB' fields and to the local
1109 database version. However, these functions release the lock while
1110 actually making RPCs. This behavior is new -- ContactQuorum() in
1111 OpenAFS 1.4 did not release the database lock during RPCs -- but
1112 it is safe, because no caller of ContactQuorum_*() depends on
1113 state read under the database lock before the call to remain valid
1118 The main Ubik transaction API consists of ubik_BeginTrans(),
1119 ubik_BeginTransReadAny(), ubik_BeginTransReadAnyWrite(),
1120 ubik_AbortTrans(), ubik_EndTrans(), ubik_Read(), ubik_Write(),
1121 ubik_Flush(), ubik_Seek(), ubik_Tell(), ubik_Truncate(), and
1122 ubik_SetLock(). The ubik_Begin* functions require a database
1123 pointer; all others require an active transaction pointer. These
1124 functions are called directly by the application without any
1125 locks, and are responsible for their own locking. Note that many
1126 of these examine the transaction type before acquiring any locks;
1127 this is permissible because the transaction type is immutable once
1128 the transaction is created (but it requires that the call be made
1129 from the thread that created the transaction, or that the
1130 application have used some mechanism to insure an appropriate
1131 memory barrier exists; the same mechanism used to protect the
1132 transaction pointer should suffice).
1134 BeginTrans() contains the common code responsible for starting
1135 a new transaction; this is made available to applications as
1136 ubik_BeginTrans(), ubik_BeginTransReadAny(), and
1137 ubik_BeginTransReadAnyWrite(). If a write transaction is in
1138 progress, this function waits for it to complete. However, since
1139 doing so releases the database lock, it is necessary to call
1140 urecovery_AllBetter() again once the lock has been reacquired.
1141 Further, in the UBIK_PAUSE case, the DBVOTING flag may have been
1142 set, so this also must be retested (see the BEACON package section
1143 for more on the difficulties of UBIK_PAUSE and DBVOTING).
1145 Both ubik_Flush() and ubik_Write() examine and manipulate the
1146 transaction I/O vector without benefit of any lock. These fields
1147 ('iovec_info' and 'iovec_data') should be protected by the
1148 database lock. In the case of ubik_Write(), this may require
1149 releasing the lock before calling ubik_Flush(), or the
1150 introduction of a private ubik_Flush_r() which assumes the lock is
1151 already held (but note that in either case, the lock is released,
1152 because ubik_Flush() calls ContactQuorum_* functions).
1156 The functions ubik_GetVersion() and ubik_WaitVersion() provide the
1157 application with a way to discover the current database version
1158 and to wait for it to change. These interfaces are not currently
1159 used in OpenAFS. ubik_GetVersion() needs to acquire the database
1160 lock while copying the database version.
1162 The internal function ubikGetPrimaryInterfaceAddr() is used by
1163 Ubik RPCs to determine a peer server's primary address, given the
1164 IP address from which a call arrived. This needs to hold the
1165 server address lock, as described in the section on struct
1170 The section on struct ubik_dbase describes the operation of the
1171 application cache. The function ubik_CheckCache() is provided to
1172 allow an application to insure the cache is up to date and obtain
1173 a read lock on the cache which lasts for the duration of the
1176 Note that ubik_AbortTrans() currently always invalidates the
1177 application cache by clearing ubik_dbase->cachedVersion, for which
1178 it requires a write lock on the cache_lock. This means that
1179 aborted transactions block waiting for completion of any
1180 transactions which hold the cache_lock, which may well include all
1181 outstanding transactions. In the case of read transactions, which
1182 cannot have modified the database, this is unnecessary and may
1183 significantly reduce concurrency.
1188 This section discusses changes needed to insure thread safety.
1190 The initialization sequence in ubik_ServerInitCommon() should be cleaned
1191 up, to ensure that every subsystem is fully initialized before starting
1192 any background threads or accepting RPCs. The correct initialization
1193 sequence should look something like this:
1195 + Initialize error tables
1196 + Allocate and initialize the database structure
1198 + Initialize the various subsystems:
1199 + udisk_Init() (formerly disk.c:DInit)
1200 + ulock_Init() (new, pulled out of ulock_getLock)
1202 + urecovery_Initialize()
1203 + ubeacon_InitServerList*
1204 + Create Rx services
1205 + Start an Rx server thread
1206 + updateUbikNetworkAddress() (rename and export from beacon.c)
1207 + Start the beacon and recovery threads
1209 Initialization functions may need to be added for some subsystems which
1210 do not currently have them:
1212 + src/ubik/disk.c:DInit() should be exported and renamed
1213 udisk_Init(); it can then be called from ubik_ServerInitCommon()
1214 instead of from udisk_begin().
1216 + The lock initialization currently done in ulock_getLock() should
1217 instead be done in a separate ulock_Init(), which should be called
1218 from ubik_ServerInitCommon()
1220 A new lock is needed to protect state belonging to the VOTE package,
1221 including the globals 'ubik_dbVersion' and 'ubik_dbTid', which represent
1222 the sync site's database state. See the VOTE package section for
1223 details on the use of the vote lock.
1225 A new lock is needed to protect state belonging to the BEACON package,
1226 including the globals 'ubik_amSyncSite' and 'syncSiteUntil' and several
1227 fields in the ubik_server structure. This lock also protects the
1228 per-server 'up' flag, which is shared among several modules. In
1229 addition, some beacon-specific fields are accessed directly by other
1230 modules. Thus, this lock must be visible outside the BEACON module, or
1231 else accessors must be provided for these items. See the BEACON package
1232 section for details on the use of the beacon lock.
1234 A new lock is needed to provide additional protection for the database
1235 version, transaction counter, and flags, so that these can safely be
1236 examined (but not updated) by the beacon thread without the need to
1237 acquire the database lock. See the BEACON package section for details
1238 on the use of the version lock.
1240 A new lock is needed to protect per-server lists of non-primary
1241 addresses, connections to the VOTE and DISK services on each server, and
1242 the client-mode security class objects used to set up these connections.
1243 See the section on struct ubik_server for details on the use of the
1244 server address lock.
1246 The required locking order is described by the following list. Any of
1247 these locks may be acquired singly; when acquiring multiple locks, they
1248 should be acquired in the listed order:
1250 + application cache lock (dbase->cache_lock)
1251 + database lock (DBHOLD/DBRELE)
1255 + server address lock
1257 Some cleanup is required in various places where existing locking
1258 conventions are not properly obeyed:
1260 + SVOTE_Beacon() needs to hold the database lock when calling
1261 urecovery_CheckTid()
1263 + Management of the database lock in SDISK_SendFile() needs to be
1264 cleaned up so that it is always released at the proper time.
1266 + urecovery_Interact() is not consistent about holding the database
1267 lock when it examines and updates 'urecovery_state'. It also
1268 examines the local database version at least once without benefit
1269 of this lock. See the RECOVERY package section for details on
1270 which locks are needed when by urecovery_Interact().
1272 + ubik_Flush() and ubik_Write() need to acquire the database lock
1273 before accessing the 'iovec_info' and 'iovec_data' fields.
1275 + ubik_GetVersion() needs to acquire the database lock before copying
1276 the database version.
1278 The various debugging RPCs in the VOTE packages, and the data collection
1279 routines they call in other packages, generally operate without benefit
1280 of any locks. This is probably desirable, as it avoids any possibility
1281 of debugging operations blocking on or interfering with the rest of the
1282 system, and allows debugging data to be collected even in the event of
1283 some unforeseen deadlock. It is also "safe", in that it does not risk
1284 damage to data structures or access to uninitialized memory, provided
1285 that data structure initialization is complete before these functions
1286 are called. However, it does mean that these RPCs may return data that
1287 is not entirely self-consistent.
1289 + Correctness requires various locks be held during parts of
1290 SVOTE_Debug() and SVOTE_DebugOld(); see the VOTE package section.
1292 + SVOTE_Debug() also examines 'ubik_currentTrans' without benefit of
1293 the database lock. This usage is not "safe", as that transaction
1294 may be freed by another thread while SVOTE_Debug() is examining it.
1296 + SVOTE_XSDebug() should hold the server address lock while copying
1297 out server addresses, and other locks while copying per-server
1300 + udisk_Debug() should hold the database lock while examining the
1301 database version and collecting page cache statistics.
1303 + ulock_Debug should use LOCK_LOCK/LOCK_UNLOCK when examining the
1304 internals of struct Lock. Ideally, knowledge of these internals
1305 should be abstracted to src/lwp/lock.h.
1307 + ubeacon_Debug() should hold the beacon lock while accessing the
1308 value of 'syncSiteUntil' and, if the access is moved here from
1309 SVOTE_Debug/SVOTE_DebugOld, 'ubik_amSyncSite'.
1311 CONCURRENCY THOUGHTS
1313 This section collections information discussed previously about changes
1314 that may be required or desirable in order to improve concurrency. Note
1315 that none of these changes are needed to insure thread safety, provided
1316 the changes discussed in the previous section are made.
1318 + PHYS could be made more concurrent by maintaining a separate lock for
1319 the fdcache, avoiding the need to DBHOLD while calling these.
1320 However, higher-layer consistency probably requires that certain calls
1321 or groups of calls to this package be atomic and isolated.
1323 + Could DISK be made more concurrent by maintaining a separate lock for
1324 the page cache and or free truncation record list?
1326 + Could concurrency be improved by maintaining separate per-transaction
1327 locks? If so, what is the locking hierarchy with respect to database
1328 locks and the DISK and PHYS package locks?
1330 + Could concurrency be improved by maintaining more reader/writer state
1331 and/or separate write locks to insure consistency of database file
1332 access without requiring so much DBHOLD?
1334 + Alternately, could concurrency be improved by requiring that an active
1335 transaction be accessed only by the thread that created it?
1338 MULTI-DATABASE SUPPORT
1340 There isn't any. Some of the Ubik interfaces and even the organization
1341 of some internal data structures makes it appear as though multiple
1342 databases can be supported. However, this is not the case, for several
1345 + Only a single set of peer servers is tracked, via a global linked list
1346 of server structures. Each server structure includes state such as
1347 the server's database version and whether it is up-to-date, so it is
1348 not possible to track multiple databases even for a common set of
1350 + The physical I/O package (phys.c) tracks cached open file descriptors
1351 without regard to what database a file is part of; thus, it cannot
1352 handle accessing multiple databases at the same time.
1353 + There are a variety of globals containing state which needs to be
1354 tracked on a per-database basis.
1355 + There are a variety of globals which are effectively protected by
1356 a per-database lock, or by no lock at all.
1357 + The various RPCs used to communicate between servers assume a single
1358 database and do not include a database identifier. Thus, in the
1359 presence of multiple databases, it would be impossible to determine,
1360 for example, for which database a DISK RPC was intended.
1362 The upshot of this is that considerable work would be needed to get Ubik
1363 into a condition which would allow a single process to maintain multiple
1364 independent Ubik databases at the same time. Note that this is
1365 orthogonal to the question of whether a single multi-file database is
1369 OTHER DATA STRUCTURES
1371 Some other data structures are mentioned in ubik.p.h or elsewhere, but
1372 are of relatively little interest from the point of view of threading.
1373 They are described here primarily as an alternative to deleting
1374 descriptive text that was already written.
1376 struct ubik_hdr - on-disk database file header
1378 This structure represents the on-disk format of the Ubik database
1379 file header (label). It is used only for local variables in
1380 uphys_getlabel() and uphys_setlabel(), which read and write this
1381 header, respectively.
1383 struct ubik_stat - database file status
1385 This structure is used to convey the size and modification timestamp
1386 of a database file. It is used as the type of an output parameter of
1387 uphys_stat(), and for local variables in callers of that function.
1389 struct ubik_stats - global statistics
1391 This structure is used to contain "miscellaneous" global statistics.
1392 Currently that consists of a single counter which is incremented in
1393 one place (an exceptional condition in ubik_EndTrans) and is never
1394 read. The one increment is performed only when holding the database
1395 lock; thus, this structure may be treated the same as other globals
1396 protected by non-global locks.
1400 There appears to be nothing to insure that calls to disk.c:DRead()
1401 during a write transaction won't find and use a "clean" cached page when
1402 there is already a dirty page in the cache resulting from an earlier
1403 write as part of the same transaction. This means that if a transaction
1404 reads a page after writing it, it may read back the original data
1405 instead of the new, and if a transaction writes a page more than once,
1406 the later writes may corrupt the database. It seems likely that we are
1407 merely lucky in this regard, and this problem should be fixed.
1409 src/ubik/beacon.c:ubeacon_ReinitServer() assumes 'ubik_CRXSecurityRock'
1410 is in fact an AFS configuration directory pointer, suitable as an
1411 argument to afsconf_UpToDate(). This is inappropriate;
1412 'ubik_CRXSecurityRock' is an opaque argument to
1413 (*ubik_CRXSecurityProc)(), Ubik must not make assumptions about what it
1416 It may be worth examining whether it is appropriate for SVOTE_Beacon()
1417 to call urecovery_CheckTid() only when voting yes, and not whenever
1418 a beacon is received from a server claiming to be sync site.
1420 Recently, code has been introduced which attempts to insure that the
1421 recovery process does not truncate an old database file before a new one
1422 has been fully transferred, since this could, depending on the timing of
1423 a server failure, deprive the new quorum of a valid database. This
1424 so-called "new recovery" code violates the abstraction provided by the
1425 PHYS package, encoding specific knowledge of the underlying file store
1426 into the REMOTE and RECOVERY packages. This means that the backend
1427 provided by phys.c cannot be replaced with an alternative, since in that
1428 case recovery would do the wrong thing with the transferred database.
1430 The DBWRITING loop in urecovery_Interact() is not portable, as it
1431 depends the contents of the timeout after a call to select(2). This was
1432 fine for IOMGR_Select(), which has defined behavior, but not for