From: Marcio Barbosa Date: Fri, 21 Jul 2017 03:02:15 +0000 (-0400) Subject: ubik: update epoch as soon as sync-site is elected X-Git-Tag: BP-openafs-stable-1_8_x~9 X-Git-Url: https://git.openafs.org/?p=openafs.git;a=commitdiff_plain;h=da704137f4bf766250ca87dbdc5a85c2024cb0a6 ubik: update epoch as soon as sync-site is elected The ubik_epochTime represents the time at which the coordinator first received its coordinator mandate. However, this global is currently not updated at the moment when a new sync-site is elected. Instead, ubik_epochTime is only updated at the very end of the first write transaction, when a new database label is written (in udisk_commit). This causes at least 2 different issues: For one, this means that we change ubik_epochTime while a remote transaction is in progress. If VOTE_Beacon is called after ubik_epochTime is updated, but before the remote transaction ends, the remote sites will detect that the transaction id in ubik_currentTrans is wrong (via urecovery_CheckTid(), since the epoch doesn't match), and they will abort the transaction. This means the transaction will fail, and it may cause a loss of quorum until another election is completed. Another issue is that ubik_epochTime can be 0 at the beginning of a write transaction, if this is the first election that this site has won. Since ubik_epochTime is used to construct transaction ids, this means that we can have different transactions that originate from different sites at different times, but they have the same epoch in their tid. For example, say a write transaction starts with epoch 0, but the originating site is killed/interrupted before finishing. That write transaction will linger on remote sites in ubik_currentTrans with an epoch of 0 (since the originating site will never call DISK_ReleaseLocks, or DISK_Abort, etc). Normally the sync site will kill such a lingering transaction via urecovery_CheckTid, but since the epoch is 0, and the election winner's epoch is also 0, the transaction looks valid and may never be killed. If that transaction is holding a lock on the database, this means that the database will forever remain locked, effectively preventing any access to the db on that site. To fix both of these issues, update ubik_epochTime with the current time as soon as we win the election. This ensures that the epoch is not updated in the middle of a transaction, and it ensures that all transactions are created with a unique epoch: the epoch of the election that we won. Note that with this commit, we do not ever set ubik_epochTime to the magic value of '2' during database init. The special '2' epoch only needs to be set in the database itself, and it is never an actual epoch that represents a real quorum that went through the election process. The database will be labelled with a 'real' epoch after the first write, like normal. [kaduk@mit.edu: comment the locking strategy in ubeacon_Interact()] Change-Id: I6cdcf5a73c1ea564622bfc8ab7024d9901af4bc8 Reviewed-on: https://gerrit.openafs.org/12609 Tested-by: BuildBot Reviewed-by: Marcio Brito Barbosa Reviewed-by: Benjamin Kaduk --- diff --git a/src/ubik/beacon.c b/src/ubik/beacon.c index bf32432..1e37e63 100644 --- a/src/ubik/beacon.c +++ b/src/ubik/beacon.c @@ -399,6 +399,11 @@ ubeacon_InitServerListCommon(afs_uint32 ame, struct afsconf_cell *info, beacon_globals.ubik_amSyncSite = 1; /* let's start as sync site */ beacon_globals.syncSiteUntil = 0x7fffffff; /* and be it quite a while */ beacon_globals.ubik_syncSiteAdvertised = 1; + DBHOLD(ubik_dbase); + UBIK_VERSION_LOCK; + version_globals.ubik_epochTime = FT_ApproxTime(); + UBIK_VERSION_UNLOCK; + DBRELE(ubik_dbase); } } else { if (nServers == 1) /* special case 1 server */ @@ -406,8 +411,14 @@ ubeacon_InitServerListCommon(afs_uint32 ame, struct afsconf_cell *info, } if (ubik_singleServer) { - if (!beacon_globals.ubik_amSyncSite) + if (!beacon_globals.ubik_amSyncSite) { ubik_dprint("Ubik: I am the sync site - 1 server\n"); + DBHOLD(ubik_dbase); + UBIK_VERSION_LOCK; + version_globals.ubik_epochTime = FT_ApproxTime(); + UBIK_VERSION_UNLOCK; + DBRELE(ubik_dbase); + } beacon_globals.ubik_amSyncSite = 1; beacon_globals.syncSiteUntil = 0x7fffffff; /* quite a while */ beacon_globals.ubik_syncSiteAdvertised = 1; @@ -431,6 +442,7 @@ ubeacon_Interact(void *dummy) afs_int32 i; struct ubik_server *ts; afs_int32 temp, yesVotes, lastWakeupTime, oldestYesVote, syncsite; + int becameSyncSite; struct ubik_tid ttid; struct ubik_version tversion; afs_int32 startTime; @@ -604,17 +616,20 @@ ubeacon_Interact(void *dummy) /* now decide if we have enough votes to become sync site. * Note that we can still get enough votes even if we didn't for ourself. */ + becameSyncSite = 0; if (yesVotes > nServers) { /* yesVotes is bumped by 2 or 3 for each site */ UBIK_BEACON_LOCK; - if (!beacon_globals.ubik_amSyncSite) + if (!beacon_globals.ubik_amSyncSite) { ubik_dprint("Ubik: I am the sync site\n"); - else { + /* Defer actually changing any variables until we can take the + * DB lock (which is before the beacon lock in the lock order). */ + becameSyncSite = 1; + } else { + beacon_globals.syncSiteUntil = oldestYesVote + SMALLTIME; /* at this point, we have the guarantee that at least quorum * received a beacon packet informing we have a sync-site. */ beacon_globals.ubik_syncSiteAdvertised = 1; } - beacon_globals.ubik_amSyncSite = 1; - beacon_globals.syncSiteUntil = oldestYesVote + SMALLTIME; #ifndef AFS_PTHREAD_ENV /* I did not find a corresponding LWP_WaitProcess(&ubik_amSyncSite) -- this may be a spurious signal call -- sjenkins */ @@ -632,6 +647,25 @@ ubeacon_Interact(void *dummy) urecovery_ResetState(); /* tell recovery we're no longer the sync site */ DBRELE(ubik_dbase); } + /* We cannot take the DB lock around the entire preceding conditional, + * because if we are currently the sync site and this election serves + * to confirm that status, the DB lock may already be held for a long-running + * write transaction. In such a case, attempting to acquire the DB lock + * would cause the beacon thread to block and disrupt election processing. + * However, if we are transitioning from not-sync-site to sync-site, there + * can be no outstanding transactions and acquiring the DB lock should be + * safe without extended blocking. */ + if (becameSyncSite) { + DBHOLD(ubik_dbase); + UBIK_BEACON_LOCK; + UBIK_VERSION_LOCK; + version_globals.ubik_epochTime = FT_ApproxTime(); + beacon_globals.ubik_amSyncSite = 1; + beacon_globals.syncSiteUntil = oldestYesVote + SMALLTIME; + UBIK_VERSION_UNLOCK; + UBIK_BEACON_UNLOCK; + DBRELE(ubik_dbase); + } } /* while loop */ return NULL; diff --git a/src/ubik/disk.c b/src/ubik/disk.c index f4434e9..926c825 100644 --- a/src/ubik/disk.c +++ b/src/ubik/disk.c @@ -877,7 +877,7 @@ udisk_commit(struct ubik_trans *atrans) if (ubeacon_AmSyncSite() && !(urecovery_state & UBIK_RECLABELDB)) { UBIK_VERSION_LOCK; oldversion = dbase->version; - newversion.epoch = FT_ApproxTime();; + newversion.epoch = version_globals.ubik_epochTime; newversion.counter = 1; code = (*dbase->setlabel) (dbase, 0, &newversion); @@ -886,7 +886,6 @@ udisk_commit(struct ubik_trans *atrans) return code; } - version_globals.ubik_epochTime = newversion.epoch; dbase->version = newversion; UBIK_VERSION_UNLOCK; diff --git a/src/ubik/recovery.c b/src/ubik/recovery.c index 28e71c1..0f3e714 100644 --- a/src/ubik/recovery.c +++ b/src/ubik/recovery.c @@ -750,8 +750,7 @@ urecovery_Interact(void *dummy) if (ubik_dbase->version.epoch == 1) { urecovery_AbortAll(ubik_dbase); UBIK_VERSION_LOCK; - version_globals.ubik_epochTime = 2; - ubik_dbase->version.epoch = version_globals.ubik_epochTime; + ubik_dbase->version.epoch = 2; ubik_dbase->version.counter = 1; code = (*ubik_dbase->setlabel) (ubik_dbase, 0, &ubik_dbase->version);