1 This file contains a threading analysis of Ubik prepared by
2 Jeffrey Hutzelman and sent to the openafs-devel@openafs.org
3 mailing list in February 2011, archived at:
4 https://lists.openafs.org/pipermail/openafs-devel/2011-February/018329.html
6 A while ago, Jeff Altman asked me to do an analysis of the Ubik code with
7 repsect to threading, to try to determine how ready the code is to be moved
8 from an LWP-based environment to one using preemptive POSIX threads, and
9 what it will take to get it the rest of the way there. Now that the work
10 is complete, I'm posting the complete analysis and my recommendations for
11 discussion and as a backdrop for changes I expect to see proposed in the
14 This work was funded by Your File System, Inc.
20 This document describes the structure of Ubik, with an eye toward
21 thread-safety and concurrency. It begins by discussing the elements,
22 usage, and locking considerations of major shared data structures. This
23 is followed by a discussion of each major subsystem. The emphasis is on
24 deficiencies in thread-safety and the changes needed to correct them.
26 Most of this document focuses on the user-mode Ubik server code, which
27 comprises the bulk of the Ubik library code and also of the complexity.
28 A separate section near the end is dedicated to Ubik client code, which
29 itself is intended primarily for use in user-mode applications. The
30 OpenAFS cache manager includes its own mechanisms for accessing
31 replicated services, including the Ubik-managed VLDB, and for tracking
32 the servers that provide them.
34 Occasionally, issues related to concurrency, performance, and modularity
35 (particularly as regards support for supporting multiple independent
36 databases in a single server process) are also discussed; however, these
37 discussions are peripheral and are included only to record data obtained
38 while examining the question of thread-safety and as a basis for future
39 work. Similarly, throughout this document are detailed discussions of
40 some parts of the code and certain data structures which, while not
41 always directly relevant to thread-safety, were produced in the course
42 of analyzing the Ubik library code. These are included to provide
43 additional background and as a basis for future work in documentation of
44 Ubik library internals and interfaces.
49 This section calls out the major shared data structures used in the Ubik
50 server code and details what they are used for and how shared access to
51 them is protected. Much of this text is probably suitable as a starting
52 point for expanding on the documentation found in comments in ubik.p.h.
54 struct ubik_dbase - database
56 This structure contains nearly everything Ubik knows about an open
57 database (for an explanation of "nearly", see the section below on
58 globals), including database-wide parameters and state and a list of
61 The following structure elements are set only when the database
62 structure is being initialized, after which they are never modified:
64 + pathName - The base path used to construct database filenames
65 + physical later method pointers (read, write, truncate, sync,
66 stat, open, setlabel, getlabel, getnfiles), used to manipulate
67 the physical database files on disk. These currently always
68 point to the methods defined in phys.c, and in fact other
69 modules contain knowledge of the internal implementation of that
72 The primary synchronization mechanism used to protect this structure
73 is the database lock (versionLock), which is manipulated via the
74 DBHOLD() and DBRELE() macros. This lock protects the following
75 structure elements, as well as a number of globals as described
76 elsewhere in this document:
78 + activeTrans - list of active transactions
79 + version - database version (*)
80 + tidCounter - transaction ID of latest transaction (*)
81 + writeTidCounter - transaction ID of latest write transaction (*)
85 (*) These fields are currently protected by the database lock, except
86 that the beacon thread does not hold that lock when examining them,
87 or the global 'ubik_epochTime', which holds the corresponding epoch.
88 The recommendation is for these fields, along with ubik_epochTime, to
89 be additionally protected by a second lock, allowing the beacon
90 thread to examine them without holding the database lock; see the
91 section on the BEACON package for details.
93 The condition variable 'flags_cond' is used by udisk_end() to signal
94 that DBWRITING flag has been cleared. This wakes threads waiting in
95 ubik.c:BeginTrans() to begin a new transaction. This CV is
96 associated with the database lock. When LWP is used, this condition
97 is signalled on &ubik_dbase->flags.
99 When LWP is used, signals are also sometimes sent on &urecovery_state
100 and on &ubik_amSyncSite. However, nothing ever waits on these, and
101 no corresponding condition variables exist when pthreads is used.
103 The 'cachedVersion' field holds the database version represented by
104 the application's cache; this is compared against the current
105 database version to discover if the cache is up to date. Both
106 cachedVersion and the application cache itself are protected by
107 cache_lock, an AFS reader/writer lock.
109 When the application wishes to use the cache, it calls
110 ubik_CacheUpdate(), which verifies that the cache is current,
111 updating it if necessary (under a write lock), and then returns with
112 the application cache read-locked. The application can then rely on
113 the cache's contents not to change until it ends the transaction by
114 calling ubik_EndTrans() or ubik_AbortTrans(). When one of these
115 functions is called, the read lock is dropped, and a write lock may
116 be acquired temporarily to update both cachedVersion and the cache
117 (in the case of committing a write transaction) or to clear
118 cachedVersion, indicating the cache is invalid (in the case of
119 aborting a transaction).
121 In the case of ubik_EndTrans() on a write transaction, the cache is
122 held write-locked until udisk_commit() has completed; this is also
123 done in SDISK_Commit(), which insures that the contents of the
124 on-disk database and page cache cannot change while an active
125 transaction holds the cache lock. Note that both the recovery thread
126 and SDISK_SendFile() may replace the database contents wholesale;
127 however, these operations first abort all outstanding transactions.
129 Prior to the introduction of ubik_BeginTransReadAnyWrite(), an
130 application could choose to manage its own cache, updating it during
131 write transactions and when, at the beginning of a read transaction,
132 it was discovered to be out of date (due to a write done by another
133 server). Under this model, the application uses Ubik transaction
134 locks to protect not only the database but also its cache. However,
135 the use of ubik_BeginTransReadAnyWrite() allows some transactions to
136 proceed and read the database without acquiring any locks, which
137 makes it unsafe to depend on Ubik transaction locks to protect the
138 application cache. Applications which make use of this mode must set
139 ubik_SyncWriterCacheProc to point to a procedure to be called to
140 update the cache during ubik_EndTrans(), after udisk_commit() has
141 succeeded and before the cache_lock is released. These applications
142 must then refrain from ever updating the cache other than during that
143 procedure or during the callback passed to ubik_CheckCache().
146 struct ubik_trans - active transaction
148 This structure represents the state of an active transaction,
149 including transaction metadata and state and a list of all changes
150 made as part of the transaction.
152 Transactions are organized into a linked list referenced by the
153 corresponding database structure, protected by the database lock.
154 Each transaction contains an upward pointer to the database,
155 a transaction type, and a transaction ID, all of which are immutable
156 for as long as the transaction exists.
158 A transaction created via the REMOTE interface exists until ended by
159 a call to SDISK_Abort() or SDISK_ReleaseLocks() or aborted in
160 urecovery_CheckTid() due to a transaction ID mismatch or creation of
161 a new transaction (note that in this last case, cleanup will be
162 deferred if SDISK_Lock() is blocked on this transaction). In these
163 cases, ubik_currentTrans is cleared at the same time, under the
164 database lock, and further REMOTE calls will fail gracefully until
165 a new transaction is started. Thus, REMOTE transactions may be
166 considered valid only for as long as the database lock is held.
168 Transactions created via ubik_BeginTrans(), ubik_BeginTransReadAny(),
169 or ubik_BeginTransReadAnyWrite() exist until ended by a call to
170 ubik_AbortTrans() or ubik_EndTrans(). They are never destroyed from
171 within the UBIK package. Code which discovers a transaction by
172 traversing the database's active transaction list may access the
173 transaction only as long as the database lock is held, since once it
174 is released, an application thread may end the transaction.
176 Each transaction contains the following fields, which are protected
177 by the database lock:
179 + locktype - type of lock held by this transaction
180 + minCommitTime - unused
181 + seekFile - file number of database file pointer
182 + seekPos - position of database file pointer
185 File writes are represented by 'iovec_info', an XDR vector structure
186 (type iovec_wrt, a vector of struct ubik_iovec) containing the file,
187 position, and length of each write, and 'iovec_data', an XDR opaque
188 containing the corresponding data. These vectors and their contents
189 are protected by the database lock.
191 File truncations are represented by a linked list of struct
192 ubik_trunc, each of which contains a file number and the new size of
193 that file. This list, and the records contained in it, are protected
194 by the database lock.
196 While transactions are currently protected by the database lock, it
197 may be desirable in the future to introduce a per-transaction lock,
198 in order to facilitate increased concurrency.
201 struct ubik_server - server status
203 This structure is used to track the set of peer servers, including
204 state related to elections and recovery. The global 'ubik_servers'
205 points to a linked list of these structures, which is constructed
206 during initialization. Once the list has been constructed, no
207 entries are ever added or removed; in addition, the following fields
208 are not changed after startup:
210 + addr[0] - server's primary address
211 + magic - true if this is the magic server
212 + isClone - true if this server is a clone
214 The following fields are used by the BEACON package to track beacons
215 to this server and their results. Currently, they are sometimes
216 accessed under the database lock, and sometimes without benefit of
217 any lock. They should probably be protected by the same lock as
218 globals private to the BEACON package.
220 + lastVoteTime - last time this server voted yes for us
221 + lastBeaconSent - last time a beacon was sent to this server
222 + lastVote - true if its late vote response was yes
223 + up - true if this server is believed to be up
224 + beaconSinceDown - true if a beacon has successfully been sent to
225 this server since the last time it was down
227 The following fields are used by the RECOVERY package to track the
228 state of each server. They are also examined and modified by the
229 various ContactQuorum_* functions. Currently, they are sometimes
230 accessed under the database lock, and sometimes without benefit of
231 any lock. They should be fully protected by the database lock.
233 + version - server's database version
234 + currentDB - true if server's database is up-to-date
236 The addr[] array is used to track all known addresses for the remote
237 server; this list is used when attempting to contact a down server
238 via an alternate address, and to identify the source of incoming
239 server-to-server RPCs. The first element of this array is always the
240 primary address and does not change after initialization; other
241 elements are updated during initialization and when a remote server
242 calls DISK_UpdateInterfaceAddr().
244 The system normally maintains an active Rx connection to the VOTE and
245 DISK services on each server; pointers to these are kept in the
246 'vote_rxcid' and 'disk_rxcid' fields.
248 Currently, the addr[] array, 'vote_rxcid', and 'disk_rxcid' are not
249 protected by any lock. Because these are shared among a number of
250 subsystems and it is usually desirable to avoid blocking on the
251 database lock when making RPCs, it is recommended to introduce a new
252 lock (the "server address lock") to protect these fields and the
253 globals 'ubikSecClass' and 'ubikSecIndex', which contain the security
254 classes used for setting up such connections. The new lock would
255 need to be held in the following cases:
256 + Before beginning any RPC using 'vote_rxcid' or 'disk_rxcid', for
257 the time it takes to call rx_GetConnection() or rx_NewCall()
258 + For the duration of ubikGetPrimaryInterfaceAddr()
259 + In ubeacon_Interact(), when building the set of connections to be
260 used for multi_VOTE_Beacon().
261 + In updateUbikNetworkAddress(), first while constructing the set of
262 connections to be used for multi_DISK_UpdateInterfaceAddr(), and
263 then again each time a server's list of addresses is updated.
264 However, the lock must NOT be held while waiting for the RPC to
265 complete, to avoid deadlock when multiple servers are starting at
267 + For the duration of SDISK_UpdateInterfaceAddr()
268 + For the duration of ubeacon_ReinitServer(), with the optional
269 exception of the call to afsconf_UpToDate().
270 + In src/ubik/recovery.c:DoProbe(), first while constructing the set
271 of connections to be used for multi_DISK_Probe() and, if the probe
272 is successful, again while destroying the old connections for the
273 server under test and replacing them with new ones. See the
274 section on the RECOVERY package for additional notes.
279 The library has an unfortunate number of global variables which are not
280 part of any data structure. Nonetheless, some of these are
281 database-specific; see MULTI-DATABASE SUPPORT, below.
283 Globals private to a particular subsystem are discussed in the
284 description of that subsystem. This section discusses global variables
285 which are shared and/or part of external interfaces.
287 The following globals are used to convey configuration parameters from
288 the application to the Ubik library, and are intended to be set before
289 the first call to ubik_ServerInit(). They are not modified by Ubik, and
290 are not protected by locks:
292 + ubik_debugFlag - print debug messages (unused)
293 + ubikPrimaryAddrOnly - use only primary interface
294 + ubik_nBuffers - number of DISK package page cache buffers
295 + ubik_SRXSecurityProc - callback to create server security class
296 + ubik_SRXSecurityRock - argument for ubik_SRXSecurityProc
297 + ubik_CRXSecurityProc - callback to create client security class
298 + ubik_CRXSecurityRock - argument for ubik_CRXSecurityProc
299 + ubik_SyncWriterCacheProc - callback to update application cache
300 + ubik_CheckRXSecurityProc - callback to check if caller is OK
301 + ubik_CheckRXSecurityRock - argument for ubik_CheckRXSecurityProc
303 The following globals contain values which are computed during Ubik
304 initialization and not modified thereafter. They are not protected by
305 any lock, which is safe so long as suitable memory barriers exist across
306 creation of threads and Rx services (see the notes on initialization in
307 the UBIK package section). Note that some of these values are
308 database-specific and properly belong in the ubik_dbase structure:
310 + ubik_servers - linked list of servers (see above)
311 + amIClone - true if this server is a clone
312 + ubik_callPortal - UDP port used for server-to-server RPCs
313 + ubik_quorum - number of servers to make a quorum
314 + ubik_dbase - pointer to the (only) database (see above)
315 + ubik_host[] - array of local addresses
316 + ubik_sc[] - security classes for Ubik RPC services
318 The following globals are currently protected by the database lock,
319 though in most cases there are existing references which do not hold the
320 lock. Later sections of this document contain recommendations in each
321 case to establish new locks to protect these or to correct cases where
322 they are accessed without the correct lock.
324 + ubik_currentTrans - pointer to REMOTE's current transaction
325 + ubik_epochTime - epoch for transaction IDs
326 + urecovery_state - recovery state bits
327 + ubik_dbVersion - sync site's database version
328 + ubik_dbTid - sync site's transaction ID
329 + ubikSecIndex - security class index for making Ubik RPCs
330 + ubikSecClass - security class for making Ubik RPCs
332 The global 'ubik_amSyncSite' belongs to the BEACON package, and should
333 be protected by the same lock as that package's private globals. In
334 fact, it should probably be made private; the only references outside
335 BEACON are in SVOTE_Debug() and SVOTE_DebugOld() and can easily be moved
336 into ubeacon_Debug().
341 This section describes each of the major Ubik server subsystems. For
342 each subsystem, a brief overview is provided, along with a description
343 of that subsystem's handling of synchronization (or lack thereof).
344 Subsystems are described beginning with the lowest layer and working
349 This package is responsible for handling I/O to database files. It
350 maintains a private, global array of MAXFDCACHE(4) fdcache
351 structures, each representing an open file descriptor on a database
352 file, either presently in use or idle and available for use. Entries
353 are indexed by database file ID, and each contain a file descriptor
354 number and a refcount. There is also a global 'inited' flag, and
355 a buffer used to construct pathnames when opening database files.
357 The static function uphys_open() takes a database pointer and fileid
358 and attempts to open the corresponding file. On success, it returns
359 a file descriptor; on failure, it returns the same as open(2) (with
360 errno set). This prefers to reuse an already-open fd whose cache
361 refcount is 0; failing that, it opens a new fd and attempts to enter
362 it in the cache. It prefers first to use an unused cache entry, then
363 to reclaim an idle one whose refcount is 0; if neither is possible,
364 the newly-opened fd is returned without caching. Note that the
365 refcount on an active entry is always exactly 1; the same entry
366 cannot be referenced more than once.
368 The function uphys_close() releases a file descriptor opened by
369 uphys_open(). If the file descriptor was cached, the cache entry is
370 dereferenced but the file is left open for future use; 0 is returned.
371 If the file descriptor is not found in the cache, then the return
372 value is that of an immediate call to close(2). This function also
373 handles cleanup when a cached fd is closed after the file to which it
374 refers is invalidated by uphys_invalidate(). This function is not
375 static, but may as well be -- it is used only by other functions in
378 Functions in this package are called via method pointers in the
379 database structure, which are set during initialization and not
380 changed thereafter. These functions must be called with the database
381 lock held; this protects the file descriptor cache and other globals
382 mentioned above, and prevents conflicting access to database files.
384 The requirement that functions in this package be called with the
385 database lock held means that concurrent access to database files is
386 not possible. This situation could be improved by the addition of
387 a new global lock, which would be held for most of the duration of
388 uphys_open(), uphys_close(), and uphys_invalidate(), but need _not_
389 be held during actual file accesses. This would eliminate the
390 requirement that calls to this package be made with the database lock
391 held, allowing multiple calls to be active concurrently. However, it
392 would still be necessary for higher layers to employ appropriate
393 synchronization as needed to maintain database consistency.
396 DISK - Logical Database I/O and Transaction Management
398 This package is responsible for logical database file I/O, logging,
399 and transaction management. It maintains a private cache of database
400 file page buffers. This entire cache, including buffer space, is
401 allocated and initialized the first time a transaction is started;
402 the global initd flag keeps track of whether or not this has been
403 done. The size of this cache defaults to 20 pages, but may be
404 overridden by setting the global 'ubik_nBuffers' (all OpenAFS Ubik
405 services do this); this global is not protected by any lock, but is
406 read exactly once, when the cache is initialized.
408 Except for udisk_Debug() (see below), all udisk_* interfaces require
409 a pointer either to the database or to an active transaction, and
410 must be called with the database lock held. This protects the DISK
411 package's accesses of the active transaction list and associated
412 state variables as well as its calls into the physical access layer,
413 including insuring that operations requiring multiple calls into that
414 layer are performed atomically.
416 In addition, the database lock protects the page cache hash table and
417 LRU queue, the buffer control structures, buffer contents, the
418 truncation record free list (freeTruncList), and global statistics.
419 However, it may be desirable in the future to introduce a separate
420 lock to protect these structures, in order to facilitate multiple
423 udisk_LogOpcode, udisk_LogEnd, udisk_LogTruncate, udisk_LogWriteData
425 These are internal interfaces for writing to the transaction log.
426 They each construct a log record and then directly call the
427 physical layer to append it to the log; the log file is _not_
428 cached by the page buffer cache, and these functions do not touch
429 the cache. Each of these must be atomic with respect to other
430 writes to the log file, which is accomplished by insuring they are
431 called only with the database locked.
433 DInit, DRead, DTrunc, DRelease, DFlush, DAbort, DSync, DNew
435 These are internal interfaces for accessing the page buffer cache.
436 They must be called with the database lock held.
438 DRead() and DNew() return a pointer to the page data buffer, not
439 to the buffer control structure; this pointer is later passed to
440 DRelease() to release the buffer. This mechanism depends on the
441 fact that buffer control structures are allocated in the same
442 order as the actual page data buffers.
444 DRead() guarantees that a read transaction will always see a clean
445 copy of the requested page, without any modifications made by an
446 in-progress write transaction. However, this is true only if the
447 next layer insures that once DRead() is called with write intent
448 on a page, that no further calls to DRead() are made for that page
449 until the buffer is written and DRelease() is called with the
450 dirty flag. Presently, this is the case because udisk_read() and
451 udisk_write(), which are the only callers of DRead(), are both
452 called with the database lock held.
454 udisk_read, udisk_truncate, udisk_write, udisk_begin, udisk_commit,
455 udisk_abort, udisk_end, udisk_Invalidate
457 These are external interfaces provided by the DISK package to
458 higher layers. Presently, the locking requirements of these
459 functions and the lower-layer functions they call are satisfied by
460 the simple rule that these must always be called with the database
461 lock. In the future, support for multiple databases and/or
462 a greater level of concurrency may require relaxing this rule
463 and/or acquiring additional locks within these functions.
465 In udisk_commit(), when committing the first write transaction
466 after becoming sync site, the database is relabelled. In this
467 case, the UBIK_RECLABELDB recovery state bit should be set before
468 propagating the label change to other servers, rather than after.
469 This is because because ContactQuorum_DISK_Setversion() will
470 release the database lock, at which point the recovery state may
471 be cleared by urecovery_ResetState() running in another thread.
472 It is important that a relabelling which occurs before recovery
473 state is cleared not result in the UBIK_RECLABELDB recovery state
474 bit being set after; otherwise, the server may fail to correctly
475 relabel the database the next time it becomes sync site.
479 The udisk_Debug() interface is called as part of preparing the
480 response to Ubik debugging requests. Unlike other udisk_* APIs,
481 this interface is not called on a specific database and does not
482 require the database lock to be held. It reports the version data
483 of the single database identified by the global 'ubik_dbase' and
484 traverses the buffer cache collecting statistics, both without
485 benefit of holding any locks. This may produce inconsistent data
486 but does not normally risk damaging data structures or accessing
487 invalid or uninitialized memory (but see below), and avoiding
488 locking may be of greater benefit than guaranteeing the return of
491 The global 'ubik_dbase' pointer is set when the first (only)
492 database is initialized, and the buffer cache data structures are
493 allocated the first time a transaction is started, either locally
494 or via the REMOTE package. Once set, the pointers used by
495 udisk_Debug() are never changed, and the data structures they
496 point to remain valid and are never freed. The buffer cache is
497 traversed by iterating over the array of buffer cache elements
498 (this is done in other places as well), not by following a linked
499 list or other data structure that may reordered, and so should not
502 At a minimum, then, it should be arranged that udisk_Debug()
503 cannot be called before the database structure and page cache have
504 been initialized. This means that both of these actions must be
505 taken before any incoming RPCs are accepted. Practically, this
506 means that DInit() should be renamed udisk_init(), and called from
507 ubik_ServerInitCommon() during initialization.
509 Full correctness demands that udisk_Debug actually hold the
510 database lock while copying the database version and examining the
511 page cache. However, it may be desirable to sacrifice this
512 correctness in order to reduce interaction between the debugging
513 interfaces and the rest of the system.
516 + unthread - remove transaction from activeTrans
517 + Dlru, Dmru - move page to front/end of LRUQ
518 + MatchBuffer - check if a buffer's dbase/file/page match
519 + FixupBucket - move page to the right bucket
520 + newslot - find and allocate an unused buffer
521 + DedupBuffer - throw away stale copies after page is flushed
522 + GetTrunc, PutTrunc - trunc record allocation
523 + FindTrunc - find active trunc record for file
524 + DoTruncs - apply truncations
527 LOCK - Database Transaction Locking
529 This package is responsible for managing database locks held as part
530 of a transaction. It supports a single read/write lock, which is
531 represented by the global variable 'rwlock', a struct Lock. The
532 global 'rwlockinit' records whether Lock_Init() has been called on
533 this lock, and is protected by the database lock. Properly, this
534 lock should be treated as belonging to the database.
536 ulock_getLock() expects to be called with the database lock held; it
537 depends on this to protect access to fields within the transaction
538 structure as well as to the global 'rwlockinit'. However, note that
539 the database lock must be temporarily released while obtaining the
540 read/write lock, and so ulock_getLock() should not be called in the
541 middle of a critical section. The 'await' parameter may in theory be
542 set to 0 to effect a non-blocking lock; however, this depends on
543 (incorrect) knowledge of the internals of struct Lock. Fortunately,
544 existing Ubik code always sets await=1. This feature should be fixed
547 ulock_relLock() releases the read/write lock. It expects to be
548 called with the database lock held. Unlock ulock_getLock(), this
549 function does not release the database lock.
551 ulock_Debug() reports the state of the read/write lock, for debugging
552 purposes. It depends on knowledge of the internals of struct Lock,
553 and accesses both the Lock structure and the global rwlockinit
554 without benefit of holding any lock.
556 It would probably be best to eliminate 'rwlockinit' in favor of
557 always initializing the global 'rwlock' during startup; ulock_Debug()
558 would then have no need to examine the flag. Further, ulock_Debug()
559 should at a minimum use the appropriate locks when looking inside
560 struct Lock; ideally, knowledge of that structure's internals should
561 be abstracted into appropriate interfaces in src/lwp/lock.[ch].
564 VOTE - Voting Logic and VOTE_* RPCs
566 This package is responsible for keeping track of who is currently
567 sync site and deciding how to vote and whether this server should try
568 to become sync site. It provides the VOTE RPC package, which is
569 primarily responsible for processing vote requests from peer servers
570 (VOTE_Beacon), but also provides debugging interfaces and an
571 interface (VOTE_GetSyncSite) to allow clients to discover the current
574 Voting decisions are made using state tracked in a set of global
575 variables, listed below. Currently there is no locking protecting
576 these variables; LWP-based Ubik depends on the cooperative nature of
577 LWP and on the fact that routines which examine and manipulate them
578 do not block. Introduction of a new "vote lock" is recommended to
579 protect the following globals:
581 + ubik_lastYesTime - last time VOTE_Beacon voted "yes"
582 + lastYesHost - host voted for at ubik_lastYesTime
583 + lastYesClaim - time lastYesHost started its voting cycle
584 + lastYesState - whether lastYesHost claimed to be sync site
585 + lowestHost - lowest host seen
586 + lowestTime - last time lowestHost was updated/heard from
587 + syncHost - sync host, or 0 if none known
588 + syncTime - last time syncHost was updated
589 + ubik_dbVersion - sync site's database version
590 + ubik_dbTid - sync site's transaction ID
592 The vote lock should be initialized in uvote_Init() and held for the
593 remainder of that function and also during of uvote_ShouldIRun(),
594 uvote_GetSyncSite(). It should also be held for the bulk of
595 SVOTE_Beacon() -- specifically, it should be acquired just before
596 beginning the lowest-host calculation, and released just before
597 calling urecovery_CheckTid(), in the event of a yes vote or else just
600 In addition, the vote lock would protect the globals 'ubik_dbVersion'
601 and 'ubik_dbTid', which represent this server's knowledge of the
602 state of the database on the sync site. This variable is set in two
603 places in the REMOTE package, which change the local and sync site
604 database versions at the same time. It is also examined in one place
605 in the REMOTE package, and compared to the database version in the
606 RECOVERY package. It is imperative that when the local and remote
607 versions are changed at the same time, these changes become visible
608 to the RECOVERY package at the same time. Thus, both the changes and
609 the comparison must be done under both locks. For this purpose,
610 three new functions should be provided by the VOTE package:
612 + A function to safely set 'ubik_dbVersion'
613 + A function to compare 'ubik_dbVersion' to another value
614 + A function to check whether this is sync site and compare
615 'ubik_dbVersion' to another value, without releasing the vote lock
616 in between (to be used in recovery).
618 Existing accesses to 'ubik_dbVersion' in REMOTE and RECOVERY, all of
619 which are already made under the database lock, would be replaced
620 with calls to the new accessors.
622 The result of these changes will require in several places that the
623 vote lock be acquired while the database lock is already held.
624 Therefore, the vote lock MUST NOT be held while acquiring the
627 SVOTE_Beacon() currently calls urecovery_CheckTid() without benefit
628 of the database lock, which must be held when calling that function.
629 This is fairly simple to fix; however, the vote lock must be released
630 first. One simple approach would be to move the call to
631 urecovery_CheckTid() to after the point where the vote lock will be
632 released, wrapping it in DBHOLD/DBRELE, and conditionalizing the
633 whole thing on 'vote' being nonzero.
635 Separately, it may be worth examining whether it is appropriate to
636 call urecovery_CheckTid() only when voting yes, and not whenever
637 a beacon is received from a server claiming to be sync site.
639 As with the debug functions in other packages, SVOTE_Debug() and
640 SVOTE_DebugOld() access global variables belong to this and other
641 modules without benefit of the appropriate locks. There is also very
642 little abstraction here. It is probably desirable to introduce Debug
643 interfaces for the VOTE and RECOVERY packages, to be called from
644 these functions. Full correctness requires...
646 + Holding the vote lock while accessing the VOTE package globals
648 + Holding the database lock while accessing the database flags and
649 tidCounter, ubik_epochTime, ubik_currentTrans, and urecovery_state
652 BEACON - elections, tracking whether this is sync site
654 This package is responsible for running elections, collecting votes,
655 and deciding whether and for how long this is the sync site. It also
656 contains routines responsible for setting up the list of servers and
657 acquiring alternate addresses for other servers during startup.
659 For this purpose, it maintains a number of global variables and
660 ubik_server structure fields. Currently, these variables and fields
661 are sometimes accessed under the database lock, and sometimes without
662 benefit of any lock. In order to allow this package to operate with
663 a minimum of disruption due to database accesses, it is recommended
664 that a new lock be established to protect these. It is particularly
665 desirable to prevent loss of sync due to the potentially long-lived
666 RPCs the RECOVERY package must make under the database lock when
667 transferring full database copies between servers; for this reason,
668 it is important that both the BEACON thread and SVOTE_Beacon() be
669 able to operate without acquiring the database lock. The global
670 variables and ubik_server structure fields to be protected by the new
671 lock are the following:
673 + ubik_amSyncSite - true if this is the sync site
674 + syncSiteUntil - time until which this is the sync site
675 + ts->lastVoteTime - last time this server voted yes for us
676 + ts->lastBeaconSent - last time a beacon was sent to this server
677 + ts->lastVote - true if its late vote response was yes
678 + ts->up - true if this server is believed to be up
679 + ts->beaconSinceDown - true if a beacon has successfully been sent
680 to this server since the last time it was down
682 The new lock would need to be held in the following cases:
683 + In ubeacon_AmSyncSite(), from the first time 'ubik_amSyncSite' is
684 examined until after the potential call to urecovery_ResetState().
685 + In ubeacon_Interact(), when building the list of servers to which
686 to send beacons (to protect access to the 'up' flag). Note that
687 the server address lock must also be held during this loop, and
688 must be acquired _after_ the beacon lock.
689 + In ubeacon_Interact(), when updating status associated with
690 a server after receiving its beacon response.
691 + In ubeacon_Interact(), when updating globals after an election.
692 However, it must be released before calling urecovery_ResetState().
693 + In updateUbikNetworkAddress(), when marking a server down.
694 + In the various ContactQuorum_* functions, when checking whether
695 a server is up before calling it, and again when marking a server
696 down after a call fails.
697 + In ubik_EndTrans, when examining a server's 'beaconSinceDown' flag
698 and 'lastBeaconSent' timestamp to determine whether it has recently
699 gone down, requiring a brief hold until it either comes back up or
702 The following global variables are set during initialization and do
703 not change thereafter:
705 + nServers - number of voting servers
706 + amIMagic - true if this server gets the extra half vote
707 + ubik_singleServer - true if this is the only server
709 Currently, all initialization of this module is done _after_ the Rx
710 services have been created and an Rx server thread started. This is
711 done to avoid a situation in which all servers are blocked in
712 updateUbikNetworkAddress(), which is responsible for exchanging
713 addresses with peer servers, while none of them is prepared to
714 handled the DISK_UpdateInterfaceAddr call made by that function.
716 When using LWP, this early initialization of Rx is not a problem,
717 because the various initialization routines don't actually block until
718 updateUbikNetworkAddress() waits for RPCs to complete. However, when
719 using pthreads, it is a problem, because it means that incoming RPCs
720 may be processed when initialization is not complete.
722 To address this problem and insure orderly initialization, it is
723 recommended to change the initialization sequence so that most work,
724 including the bulk of BEACON package initialization, happens before
725 Ubik's Rx services have been created. The only work that should be
726 deferred until after that point is updateUbikNetworkAddress(),
727 followed by starting the long-running threads belonging to the BEACON
728 and RECOVERY packages. This requires updateUbikNetworkAddress() to
729 be exported (and perhaps to undergo a name change).
731 The major part of the work of this package is done in the function
732 ubeacon_Interact(), which is the body of a long-running thread known
733 as the beacon thread. This thread is responsible for periodically
734 sending "beacons" to other servers to request votes (when the sending
735 server is the sync site, these also include how long it will remain
736 so, the database version, and the active transaction ID). Beacons
737 are sent only from servers that are already sync site or believe they
738 are the best candidate, as determined by uvote_ShouldIRun().
740 On the sync site, the beacon thread is responsible for insuring that
741 new votes are collected frequently enough to avoid loss of quorum.
742 This means it must not block for an extended time; therefore, it must
743 not acquire the database lock, which in other threads may sometimes
744 be held during long-running operations (most notably, database file
745 transfers done under this lock by the recovery thread). Instead,
746 a new lock is proposed (the "version lock"), which is used to allow
747 the beacon thread to examine (but not modify) certain fields without
748 holding the database lock.
750 The version lock should be acquired _in addition to_ the database
751 lock when modifying 'ubik_epochTime' or the database 'version',
752 'flags', 'tidCounter', and 'writeTidCounter' fields; such
753 modifications occur at the following locations:
754 + in ubik.c:BeginTrans(), when beginning a new transaction
755 + in udisk_begin(), when setting DBWRITING
756 + in udisk_commit(), when updating the database version at the end of
757 a write transaction (note this includes the case of relabelling the
758 database on the first write transaction after becoming sync site)
759 + in udisk_end(), when clearing DBWRITING
760 + in recovery.c:InitializeDB(), when setting the initial version of
762 + in urecovery_Interact(), when labelling a new database the first
763 time quorum is established
764 + in urecovery_Interact(), after fetching the latest database from
765 another server (whether successful or not; there are two cases)
766 + in SDISK_SendFile(), when updating the database version both before
767 and after receiving a new database from the sync site. Note the
768 lock must _not_ be held during the transfer.
769 + in SDISK_SetVersion()
771 The version lock should be acquired by the beacon thread while
772 examining these fields, shortly before calling multi_VOTE_Beacon().
773 Since it must release the lock before making that call, the database
774 version will need to be explicitly copied into a local variable, as
775 is already done with the current transaction ID.
777 Note that if UBIK_PAUSE is defined, the DBVOTING flag poses an
778 additional problem, as it must be modified by the beacon thread. If
779 it is desirable to continue to support this variant, then it becomes
780 necessary for all accesses to the database flags to be protected by
781 the version lock. Other UBIK_PAUSE code may not have been reviewed
782 in depth and may pose additional problems.
784 The function ubeacon_AmSyncSite() is called both from the beacon
785 thread and from various other points to determine whether they are
786 running on the sync site, and thus whether various operations are
787 safe and appropriate. Calls to this function from outside the beacon
788 thread (often, via urecovery_AllBetter()) must be made with the
789 database lock held, and for operations modifying the database or
790 relying on its contents to remain constant, this lock must not then
791 be released until the operation is complete. The does not guarantee
792 that this will remain sync site for the duration of the operation,
793 but does insure that any operations done after or as a result of
794 losing sync will in fact occur after the operation holding the lock.
796 Of course, ubeacon_Interact() itself cannot call ubeacon_AmSyncSite()
797 with the database lock held, since it must not block on that lock
798 when running on the sync site (otherwise sync could be lost, as
799 described above). So, that portion of ubeacon_AmSyncSite() other
800 than the call to urecovery_ResetState() should be abstracted out into
801 a new function, which presumably returns separate flags indicating
802 both whether it is in fact running on the sync site (which becomes
803 the return value of ubeacon_AmSyncSite()) and whether a call to
804 urecovery_ResetState() is indicated. When the latter is set, the
805 beacon thread can acquire the database lock (since, if this is not
806 the sync site, blocking on this lock is not a problem) and call
807 urecovery_ResetState().
809 This difference in behavior is acceptable because the required
810 invariant is that when 'ubik_amSyncSite' is cleared, it cannot become
811 true again until urecovery_state has also cleared. This insures that
812 once a server discovers it is not sync site, urecovery_AllBetter()
813 cannot succeed on the basis of being sync site until sync is regained
814 and recovery has run, insuring the database is up to date. Since
815 only the beacon thread ever sets 'ubik_amSyncSite' true, that thread
816 can meet the invariant by calling urecovery_ResetState() before doing
817 anything else, while other threads must continue to hold the beacon
818 lock to prevent the beacon thread from setting 'ubik_amSyncSite'.
820 ubeacon_Debug() reports the value of syncSiteUntil, without benefit
821 of obtaining any locks. It should also report the value of
822 'ubik_amSyncSite', allowing that global to be made private to the
823 BEACON package. Full correctness requires that the beacon lock be
824 held while accessing these globals.
827 RECOVERY - Consistency, Down Server Recovery, Propagation, Log Replay
829 This package is responsible for tracking whether the local database
830 is current and usable and for discovering when a server comes back up
831 after having been marked down. On the sync site, it is also
832 responsible for tracking the database versions on remote servers and
833 for recovering from inconsistencies.
835 To carry out these tasks, the RECOVERY package maintains a single
836 global, 'urecovery_state'. This is not currently effectively
837 protected by any lock; however, it should be protected by the
838 database lock. This is particularly important because high-level
839 routines that manipulate the database depend on a check against this
840 field to determine that the database is up-to-date and safe to
841 modify, and that it will remain so until the operation is complete.
843 The main work of this package is done by urecovery_Interact(), which
844 is the body of a long-running thread known as the recovery thread.
845 This thread wakes up periodically to discover if any down servers
846 have come back up and, on the sync site, to locate the latest
847 available database, fetch it if necessary, and make sure it has been
848 distributed to all servers. This work is done in several steps,
849 taken on in sequence every few seconds:
851 + The first part of urecovery_Interact() is to identify servers which
852 are believed to be down, and determine whether any of them have
853 come back up. This is done every 30 seconds, even when this is not
854 the sync site. Currently, the database lock is held for the
855 duration of this loop, released only in DoProbe() while actually
856 waiting for the DISK_Probe() RPC to complete. This is mostly
857 unnecessary; the server probe loop needs this lock only to examine
858 the database 'currentDB' flag and to clear the UBIK_RECFOUNDDB bit,
859 and DoProbe() doesn't need it at all. However, the beacon lock is
860 needed to examine and update the database 'up' flag, and DoProbe()
861 needs to hold the server address lock when examining server
862 addresses and, if the server comes back up, to replace connections.
863 See the section on struct ubik_server for more details on this
866 + The next step is to determine whether this is the sync site, and
867 update the UBIK_RECSYNCSITE bit appropriately. The database lock
868 should be held for the duration of this step, as
869 ubeacon_AmSyncSite() may end up calling urecovery_ResetState(),
870 which requires that lock be held. If this is not the sync site,
873 + Step three is to determine the location of the latest database.
874 This is done by making DISK_GetVersion() calls to all active
875 servers, keeping track of the newest version found and the server
876 which reported it. Once this is done, the best version is compared
877 to the local version, update the UBIK_RECHAVEDB bit (so that it is
878 true if and only if the local version is the latest) set the
879 UBIK_RECFOUNDDB bit to indicate the latest database has been
880 located, and clear the UBIK_RECSENTDB bit to force any servers with
881 out-of-date databases to be updated.
883 In this section, it is not necessary to hold any locks while
884 collecting versions on remote servers, though as before, it is
885 necessary to hold the beacon lock while examining the server 'up'
886 flag, and the address lock to obtain a reference on each connection
887 as it is used (but this lock should not be held during the RPC!).
888 Once all remote versions have been fetched, the database lock is
889 needed while examining the local version and updating
890 'urecovery_state'. Note that write transactions may happen while
891 versions are collected, resulting in some servers having newer
892 versions than were detected. However, if this happens, the result
893 is always consistent and ends with the sync site having the latest,
896 + If the actual latest version X.N before the write was in an epoch
897 created by the current sync site, then that version must
898 necessarily already be present on the sync site, since writes are
899 always committed first on the sync site. Thus, when the sync
900 site updates the version counter, it produces a new version X.N+1
901 which no other server can believe it already has; thus, there is
902 no possibility of inconsistency.
904 + If the actual latest version X.N before the write was in an epoch
905 not created by the sync site, then it is possible that after
906 quorum was established but before any writes happened, a new
907 server came online which had a newer version X.N+1 (that was
908 committed to less than a quorum of sites before a crash). If
909 a write transaction begins before recovery detects this fact and
910 fetches the version X.N+1 database, then the new write will start
911 with the version X.N database, essentially creating a version
912 fork. However, when this happens, the resulting database will be
913 version Y.1 (with Y>X), because the first write on a new sync
914 site always results in a new database. Thus, the version X.N+1
915 database will no longer be latest (because the local version,
916 which is examined last, is Y.1), and will end up being discarded.
917 This is no worse than if the server with the version X.N+1
918 database had not come up until completely after the transaction
919 resulting in version Y.1.
921 + Step four is to fetch the latest database, if the sync site does
922 not already have it. Since this step replaces the database
923 wholesale, there can be no active transactions while it runs, and
924 all other database activity must be locked out for the duration of
925 the transfer. This means the database lock must be held for this
926 entire step, starting from the check that the UBIK_RECSYNCSITE
927 and/or UBIK_RECFOUNDDB bits are still set. (Note that the existing
928 comment which claims that it is impossible for UBIK_RECFOUNDDB not
929 to be set at this point is true only if the database is not
930 released since checking or updating that bit in step three). In
931 addition, the address lock must be held while setting up the call
932 to be used for DISK_GetFile().
934 + Step five, again after verifying that this is still the sync site
935 and that the UBIK_RECHAVEDB bit is set, is to distribute the new
936 database to any servers which do not already have it. Before this
937 can be done, if the database was newly-initialized (with label
938 epoch 1), it is relabeled with version 2.1 before it is distributed
939 to other sites. This allows readany transactions to run on such
940 a database (though probably not successfully, since the database is
941 empty); it is deferred until this point in recovery to insure that
942 if any server in the quorum contains a valid database, that version
943 is used rather than the empty one.
945 For this step, the database lock should be held starting at the
946 point at which the UBIK_RECSYNCSITE and/or UBIK_RECHAVEDB bits are
947 tested. In the event the DBWRITING flag is set, the lock must be
948 released while waiting for the active write transaction to end. In
949 this case, once the wait is completed, it is necessary to again
950 check urecovery_state to determine that this is still the sync site
951 and have an up-to-date database before proceeding.
953 It is probably simplest for urecovery_Interact() to acquire the
954 database lock at the start of step 2 described above, release it for
955 the duration of the version-collection part of step 3 and while
956 waiting for write transactions to terminate in step 5, and otherwise
957 hold it until the end of the cycle. If no work is needed, this means
958 the lock will be held relatively briefly; if it is necessary to fetch
959 and or send full database versions, these operations themselves must
960 be done with the lock held, and there is little point in releasing it
961 between them. However, it is acceptable to release the lock between
962 any steps described above, provided that each time the database lock
963 is released and reacquired, 'urecovery_state' is checked to verify
964 that no regression has occurred. Note that outside of the recovery
965 thread, the only changes that may occur to 'urecovery_state' are that
966 it may be completely cleared by urecovery_ResetState(), and that the
967 UBIK_RECLABELDB bit may be set by udisk_commit(), on the sync site.
969 The function urecovery_AllBetter() is called under the database lock
970 by routines which need to know whether there is a usable database.
971 As noted in the discussion of ubeacon_AmSyncSite() above, callers
972 must then continue to hold the database lock until any database
973 operation is complete. Further, callers which intend to write to the
974 database currently follow a call to urecovery_AllBetter() with a call
975 to ubeacon_AmSyncSite(), to determine whether a write is permissible.
976 Unfortunately, this is not safe, because it is possible (though
977 admittedly unlikely) for urecovery_AllBetter() to succeed on the
978 basis of being a non-sync-site with an up-to-date database, and then
979 for a subsequent call to ubeacon_AmSyncSite() to return a true result
980 (because the latter may block on acquiring the beacon lock, during
981 which time this may become the sync site). To insure that no write
982 operation is performed without an up-to-date, writeable database,
983 urecovery_AllBetter() should be modified to indicate whether a write
984 operation is permitted, which is the case only when UBIK_RECHAVEDB is
985 tested and found to be set.
987 The functions urecovery_AbortAll(), urecovery_CheckTid(), and
988 urecovery_ResetState() all expect to be called with the database lock
991 The function urecovery_LostServer() need not be called with any
992 locks; it is merely a wrapper for ubeacon_ReinitServer().
994 The RECOVERY package is also responsible at startup for replaying the
995 transaction log and initializing an empty database. These tasks are
996 handled by urecovery_Initialize() and its helpers, ReplayLog() and
997 InitializeDB(). Because the helpers use PHYS package functions and
998 manipulate database structure fields, urecovery_Initialize() should
999 acquire the database lock before calling them.
1002 REMOTE - remote database I/O (DISK) RPCs
1004 This package provides the DISK RPC package, which mediates remote
1005 access to the local database copy, when another server is sync site.
1006 It also includes DISK_UpdateInterfaceAddr(), which is used during
1007 initialization to verify consistency of configuration across servers,
1008 and DISK_Probe(), which is used by the RECOVERY module to determine
1009 whether the server is up.
1011 As the functions in this package are RPC implementations, they may be
1012 called at any time and are entirely responsible for their own
1013 locking. There can be at most one active remote write transaction at
1014 a time; this rule is enforced by this package, which maintains the
1015 global 'ubik_currentTrans' as a pointer to the active transaction.
1017 For the most part, the RPCs in this module use a common prologue,
1018 which includes acquiring the database lock, and do not release that
1019 lock until just before returning. However, the common prologue
1020 examines both 'ubik_currentTrans' and the transaction flags without
1021 benefit of the database lock. This can be corrected fairly easily by
1022 acquiring the database lock sooner; however, this requires referring
1023 to the database via the ubik_dbase global rather than via the 'dbase'
1024 pointer in 'ubik_currentTrans'.
1026 Note that SDISK_GetVersion() and SDISK_SetVersion() contain a sanity
1027 check to ensure they are not running on the sync site, since these
1028 are called only from that portion of the recovery loop which runs
1029 only on the sync site. These calls to ubeacon_AmSyncSite() must be
1030 made with the database lock held, for the results to be valid. The
1031 same holds true for a commented-out check in SDISK_GetFile() (which
1032 really should remain that way -- there's no reason not to allow any
1033 authorized caller to do this at any time).
1035 Additionally, note that SDISK_Commit() acquires a write lock on the
1036 application cache, to insure that no transactions are running which
1037 depend on the application cache, page cache, or underlying database
1038 to change. The locking order requires that this lock be acquired
1039 before taking the database lock; thus, both locks must be acquired
1042 SDISK_SendFile() is really quite bad in its handling of the database
1043 lock under error conditions. Presently, an error condition may cause
1044 it to release the lock too early nor not at all.
1046 SDISK_UpdateInterfaceAddr() is called by peer servers as part of
1047 their startup process, to trade lists of additional addresses and
1048 insure sufficient agreement on configuration to allow the new server
1049 to start. It will need to acquire the lock protecting the addr[]
1050 array attached to each host.
1053 UBIK (high-level API)
1055 This package contains the high-level APIs exported by Ubik to
1056 applications, including initialization, the APIs for creating and
1057 using transactions, and various utilities. It also contains a number
1058 of internal utilities.
1062 Initialization of Ubik is handled in ubik_ServerInitCommon(),
1063 which is called by two API functions, ubik_ServerInit() and
1064 ubik_ServerInitByInfo(), which provide different interfaces for
1065 passing configuration information. This code handles
1066 initialization of the database, locks, and various globals, calls
1067 the initialization routines for each subsystem, makes sure Rx has
1068 been initialized, and creates Ubik's Rx services and threads.
1070 Most of this is done before any Ubik-specific threads are created
1071 and before Ubik's RPCs can be called, and so takes place without
1072 holding any lock. However, since Rx may have been initialized and
1073 started before Ubik, it is possible that some Rx worker threads
1074 may have already been created. In practice, locks internal to Rx
1075 should insure that any memory writes made before an Rx service is
1076 created become visible to any worker running on behalf of that
1077 service; nonetheless, this should not be depended on, and proper
1078 locking should be used during all phases of initialization.
1080 Unfortunately, the current code actually starts Rx services before
1081 initialization is complete, resulting in a number of possible
1082 races. The general THREAD SAFETY section below contains
1083 recommendations for reordering the startup process to eliminate
1084 this problem, while still avoiding potential deadlock if multiple
1085 servers start at once.
1089 Nearly a third of ubik.c consists of various ContactQuorum_*
1090 functions, which are used to make a particular RPC to every server
1091 in the quorum. These used to be a single ContactQuorum()
1092 function, but have been split into multiple functions to regain
1093 type-safety. These retain a common prologue and epilogue, some
1094 small part of which has been abstracted out, but most of which
1095 remains duplicated in each of these functions. Indeed, the only
1096 part unique to each function is the actual RPC call and, in the
1097 case of ContactQuorum_DISK_WriteV(), code to fall back to multiple
1098 calls to DISK_Write() on a per-server basis.
1100 The code common to all ContactQuorum_* functions will need to
1101 acquire the beacon lock when checking whether a server is up (no
1102 calls are made to servers which are not marked up), and again when
1103 marking a server down. Note that when a server is marked down,
1104 the 'up' and 'beaconSinceDown' flags must be cleared at the same
1105 time, without releasing the beacon lock. In addition, the helper
1106 function Quorum_StartIO() must obtain the server address lock
1107 while it obtains a reference on the Rx connection that will be
1110 All of the ContactQuorum_* functions must be called with the
1111 database lock held; this is necessary to protect their access to
1112 each server's 'version' and 'currentDB' fields and to the local
1113 database version. However, these functions release the lock while
1114 actually making RPCs. This behavior is new -- ContactQuorum() in
1115 OpenAFS 1.4 did not release the database lock during RPCs -- but
1116 it is safe, because no caller of ContactQuorum_*() depends on
1117 state read under the database lock before the call to remain valid
1122 The main Ubik transaction API consists of ubik_BeginTrans(),
1123 ubik_BeginTransReadAny(), ubik_BeginTransReadAnyWrite(),
1124 ubik_AbortTrans(), ubik_EndTrans(), ubik_Read(), ubik_Write(),
1125 ubik_Flush(), ubik_Seek(), ubik_Tell(), ubik_Truncate(), and
1126 ubik_SetLock(). The ubik_Begin* functions require a database
1127 pointer; all others require an active transaction pointer. These
1128 functions are called directly by the application without any
1129 locks, and are responsible for their own locking. Note that many
1130 of these examine the transaction type before acquiring any locks;
1131 this is permissible because the transaction type is immutable once
1132 the transaction is created (but it requires that the call be made
1133 from the thread that created the transaction, or that the
1134 application have used some mechanism to insure an appropriate
1135 memory barrier exists; the same mechanism used to protect the
1136 transaction pointer should suffice).
1138 BeginTrans() contains the common code responsible for starting
1139 a new transaction; this is made available to applications as
1140 ubik_BeginTrans(), ubik_BeginTransReadAny(), and
1141 ubik_BeginTransReadAnyWrite(). If a write transaction is in
1142 progress, this function waits for it to complete. However, since
1143 doing so releases the database lock, it is necessary to call
1144 urecovery_AllBetter() again once the lock has been reacquired.
1145 Further, in the UBIK_PAUSE case, the DBVOTING flag may have been
1146 set, so this also must be retested (see the BEACON package section
1147 for more on the difficulties of UBIK_PAUSE and DBVOTING).
1149 Both ubik_Flush() and ubik_Write() examine and manipulate the
1150 transaction I/O vector without benefit of any lock. These fields
1151 ('iovec_info' and 'iovec_data') should be protected by the
1152 database lock. In the case of ubik_Write(), this may require
1153 releasing the lock before calling ubik_Flush(), or the
1154 introduction of a private ubik_Flush_r() which assumes the lock is
1155 already held (but note that in either case, the lock is released,
1156 because ubik_Flush() calls ContactQuorum_* functions).
1160 The internal function ubikGetPrimaryInterfaceAddr() is used by
1161 Ubik RPCs to determine a peer server's primary address, given the
1162 IP address from which a call arrived. This needs to hold the
1163 server address lock, as described in the section on struct
1168 The section on struct ubik_dbase describes the operation of the
1169 application cache. The function ubik_CheckCache() is provided to
1170 allow an application to insure the cache is up to date and obtain
1171 a read lock on the cache which lasts for the duration of the
1174 Note that ubik_AbortTrans() currently always invalidates the
1175 application cache by clearing ubik_dbase->cachedVersion, for which
1176 it requires a write lock on the cache_lock. This means that
1177 aborted transactions block waiting for completion of any
1178 transactions which hold the cache_lock, which may well include all
1179 outstanding transactions. In the case of read transactions, which
1180 cannot have modified the database, this is unnecessary and may
1181 significantly reduce concurrency.
1186 This section discusses changes needed to insure thread safety.
1188 The initialization sequence in ubik_ServerInitCommon() should be cleaned
1189 up, to ensure that every subsystem is fully initialized before starting
1190 any background threads or accepting RPCs. The correct initialization
1191 sequence should look something like this:
1193 + Initialize error tables
1194 + Allocate and initialize the database structure
1196 + Initialize the various subsystems:
1197 + udisk_Init() (formerly disk.c:DInit)
1198 + ulock_Init() (new, pulled out of ulock_getLock)
1200 + urecovery_Initialize()
1201 + ubeacon_InitServerList*
1202 + Create Rx services
1203 + Start an Rx server thread
1204 + updateUbikNetworkAddress() (rename and export from beacon.c)
1205 + Start the beacon and recovery threads
1207 Initialization functions may need to be added for some subsystems which
1208 do not currently have them:
1210 + src/ubik/disk.c:DInit() should be exported and renamed
1211 udisk_Init(); it can then be called from ubik_ServerInitCommon()
1212 instead of from udisk_begin().
1214 + The lock initialization currently done in ulock_getLock() should
1215 instead be done in a separate ulock_Init(), which should be called
1216 from ubik_ServerInitCommon()
1218 A new lock is needed to protect state belonging to the VOTE package,
1219 including the globals 'ubik_dbVersion' and 'ubik_dbTid', which represent
1220 the sync site's database state. See the VOTE package section for
1221 details on the use of the vote lock.
1223 A new lock is needed to protect state belonging to the BEACON package,
1224 including the globals 'ubik_amSyncSite' and 'syncSiteUntil' and several
1225 fields in the ubik_server structure. This lock also protects the
1226 per-server 'up' flag, which is shared among several modules. In
1227 addition, some beacon-specific fields are accessed directly by other
1228 modules. Thus, this lock must be visible outside the BEACON module, or
1229 else accessors must be provided for these items. See the BEACON package
1230 section for details on the use of the beacon lock.
1232 A new lock is needed to provide additional protection for the database
1233 version, transaction counter, and flags, so that these can safely be
1234 examined (but not updated) by the beacon thread without the need to
1235 acquire the database lock. See the BEACON package section for details
1236 on the use of the version lock.
1238 A new lock is needed to protect per-server lists of non-primary
1239 addresses, connections to the VOTE and DISK services on each server, and
1240 the client-mode security class objects used to set up these connections.
1241 See the section on struct ubik_server for details on the use of the
1242 server address lock.
1244 The required locking order is described by the following list. Any of
1245 these locks may be acquired singly; when acquiring multiple locks, they
1246 should be acquired in the listed order:
1248 + application cache lock (dbase->cache_lock)
1249 + database lock (DBHOLD/DBRELE)
1253 + server address lock
1255 Some cleanup is required in various places where existing locking
1256 conventions are not properly obeyed:
1258 + SVOTE_Beacon() needs to hold the database lock when calling
1259 urecovery_CheckTid()
1261 + Management of the database lock in SDISK_SendFile() needs to be
1262 cleaned up so that it is always released at the proper time.
1264 + urecovery_Interact() is not consistent about holding the database
1265 lock when it examines and updates 'urecovery_state'. It also
1266 examines the local database version at least once without benefit
1267 of this lock. See the RECOVERY package section for details on
1268 which locks are needed when by urecovery_Interact().
1270 + ubik_Flush() and ubik_Write() need to acquire the database lock
1271 before accessing the 'iovec_info' and 'iovec_data' fields.
1273 + ubik_GetVersion() needs to acquire the database lock before copying
1274 the database version.
1276 The various debugging RPCs in the VOTE packages, and the data collection
1277 routines they call in other packages, generally operate without benefit
1278 of any locks. This is probably desirable, as it avoids any possibility
1279 of debugging operations blocking on or interfering with the rest of the
1280 system, and allows debugging data to be collected even in the event of
1281 some unforeseen deadlock. It is also "safe", in that it does not risk
1282 damage to data structures or access to uninitialized memory, provided
1283 that data structure initialization is complete before these functions
1284 are called. However, it does mean that these RPCs may return data that
1285 is not entirely self-consistent.
1287 + Correctness requires various locks be held during parts of
1288 SVOTE_Debug() and SVOTE_DebugOld(); see the VOTE package section.
1290 + SVOTE_Debug() also examines 'ubik_currentTrans' without benefit of
1291 the database lock. This usage is not "safe", as that transaction
1292 may be freed by another thread while SVOTE_Debug() is examining it.
1294 + SVOTE_XSDebug() should hold the server address lock while copying
1295 out server addresses, and other locks while copying per-server
1298 + udisk_Debug() should hold the database lock while examining the
1299 database version and collecting page cache statistics.
1301 + ulock_Debug should use LOCK_LOCK/LOCK_UNLOCK when examining the
1302 internals of struct Lock. Ideally, knowledge of these internals
1303 should be abstracted to src/lwp/lock.h.
1305 + ubeacon_Debug() should hold the beacon lock while accessing the
1306 value of 'syncSiteUntil' and, if the access is moved here from
1307 SVOTE_Debug/SVOTE_DebugOld, 'ubik_amSyncSite'.
1309 CONCURRENCY THOUGHTS
1311 This section collections information discussed previously about changes
1312 that may be required or desirable in order to improve concurrency. Note
1313 that none of these changes are needed to insure thread safety, provided
1314 the changes discussed in the previous section are made.
1316 + PHYS could be made more concurrent by maintaining a separate lock for
1317 the fdcache, avoiding the need to DBHOLD while calling these.
1318 However, higher-layer consistency probably requires that certain calls
1319 or groups of calls to this package be atomic and isolated.
1321 + Could DISK be made more concurrent by maintaining a separate lock for
1322 the page cache and or free truncation record list?
1324 + Could concurrency be improved by maintaining separate per-transaction
1325 locks? If so, what is the locking hierarchy with respect to database
1326 locks and the DISK and PHYS package locks?
1328 + Could concurrency be improved by maintaining more reader/writer state
1329 and/or separate write locks to insure consistency of database file
1330 access without requiring so much DBHOLD?
1332 + Alternately, could concurrency be improved by requiring that an active
1333 transaction be accessed only by the thread that created it?
1336 MULTI-DATABASE SUPPORT
1338 There isn't any. Some of the Ubik interfaces and even the organization
1339 of some internal data structures makes it appear as though multiple
1340 databases can be supported. However, this is not the case, for several
1343 + Only a single set of peer servers is tracked, via a global linked list
1344 of server structures. Each server structure includes state such as
1345 the server's database version and whether it is up-to-date, so it is
1346 not possible to track multiple databases even for a common set of
1348 + The physical I/O package (phys.c) tracks cached open file descriptors
1349 without regard to what database a file is part of; thus, it cannot
1350 handle accessing multiple databases at the same time.
1351 + There are a variety of globals containing state which needs to be
1352 tracked on a per-database basis.
1353 + There are a variety of globals which are effectively protected by
1354 a per-database lock, or by no lock at all.
1355 + The various RPCs used to communicate between servers assume a single
1356 database and do not include a database identifier. Thus, in the
1357 presence of multiple databases, it would be impossible to determine,
1358 for example, for which database a DISK RPC was intended.
1360 The upshot of this is that considerable work would be needed to get Ubik
1361 into a condition which would allow a single process to maintain multiple
1362 independent Ubik databases at the same time. Note that this is
1363 orthogonal to the question of whether a single multi-file database is
1367 OTHER DATA STRUCTURES
1369 Some other data structures are mentioned in ubik.p.h or elsewhere, but
1370 are of relatively little interest from the point of view of threading.
1371 They are described here primarily as an alternative to deleting
1372 descriptive text that was already written.
1374 struct ubik_hdr - on-disk database file header
1376 This structure represents the on-disk format of the Ubik database
1377 file header (label). It is used only for local variables in
1378 uphys_getlabel() and uphys_setlabel(), which read and write this
1379 header, respectively.
1381 struct ubik_stat - database file status
1383 This structure is used to convey the size and modification timestamp
1384 of a database file. It is used as the type of an output parameter of
1385 uphys_stat(), and for local variables in callers of that function.
1387 struct ubik_stats - global statistics
1389 This structure is used to contain "miscellaneous" global statistics.
1390 Currently that consists of a single counter which is incremented in
1391 one place (an exceptional condition in ubik_EndTrans) and is never
1392 read. The one increment is performed only when holding the database
1393 lock; thus, this structure may be treated the same as other globals
1394 protected by non-global locks.
1398 There appears to be nothing to insure that calls to disk.c:DRead()
1399 during a write transaction won't find and use a "clean" cached page when
1400 there is already a dirty page in the cache resulting from an earlier
1401 write as part of the same transaction. This means that if a transaction
1402 reads a page after writing it, it may read back the original data
1403 instead of the new, and if a transaction writes a page more than once,
1404 the later writes may corrupt the database. It seems likely that we are
1405 merely lucky in this regard, and this problem should be fixed.
1407 src/ubik/beacon.c:ubeacon_ReinitServer() assumes 'ubik_CRXSecurityRock'
1408 is in fact an AFS configuration directory pointer, suitable as an
1409 argument to afsconf_UpToDate(). This is inappropriate;
1410 'ubik_CRXSecurityRock' is an opaque argument to
1411 (*ubik_CRXSecurityProc)(), Ubik must not make assumptions about what it
1414 It may be worth examining whether it is appropriate for SVOTE_Beacon()
1415 to call urecovery_CheckTid() only when voting yes, and not whenever
1416 a beacon is received from a server claiming to be sync site.
1418 Recently, code has been introduced which attempts to insure that the
1419 recovery process does not truncate an old database file before a new one
1420 has been fully transferred, since this could, depending on the timing of
1421 a server failure, deprive the new quorum of a valid database. This
1422 so-called "new recovery" code violates the abstraction provided by the
1423 PHYS package, encoding specific knowledge of the underlying file store
1424 into the REMOTE and RECOVERY packages. This means that the backend
1425 provided by phys.c cannot be replaced with an alternative, since in that
1426 case recovery would do the wrong thing with the transferred database.
1428 The DBWRITING loop in urecovery_Interact() is not portable, as it
1429 depends the contents of the timeout after a call to select(2). This was
1430 fine for IOMGR_Select(), which has defined behavior, but not for