From: Andrew Deason
Date: Sun, 1 May 2016 16:24:30 +0000 (-0500)
Subject: ubik: Don't RECFOUNDDB if can't contact most sites
X-Git-Tag: openafs-stable-1_8_0pre1~62
X-Git-Url: http://git.openafs.org/?p=openafs.git;a=commitdiff_plain;h=d3dbdade7e8eaf6da37dd6f1f53d9f1384626071

ubik: Don't RECFOUNDDB if can't contact most sites

Currently, the ubik recovery code will always set UBIK_RECFOUNDDB
during recovery, after asking all other sites for their dbversions.
This happens regardless of how many sites we were actually able to
successfully contact, even if we couldn't contact any of them.

This can cause problems when we are unable to contact a majority of
sites with DISK_GetVersion: if we haven't contacted a majority of
sites, we cannot say with confidence that we know what the best db
version available is (which is what UBIK_RECFOUNDDB represents: that
we've found which database is the one we should be using). This can
also result in UBIK_RECHAVEDB being set in a similar situation,
indicating that we have the best db version locally, even though we
never actually asked anyone else what their db version was.

For example, say site A is the sync site going through recovery, and
DISK_GetVersion fails for the only other sites B and C. Site A will
then set UBIK_RECFOUNDDB, and will claim that site A has the best db
version available (UBIK_RECHAVEDB). This allows site A to process ubik
write transactions (causing the db to be labelled with a new epoch),
or possibly to send the db to the other sites via DISK_SendFile, if
they quickly become available during recovery. Ubik write transactions
can succeed in this situation, because our ContactQuorum_* calls will
succeed if we never try to contact a remote site ('rcode' defaults to
0).

This situation should be rather rare, because normally a majority of
sites must be reachable by site A for site A to be voted the sync site
in the first place.
However, it is possible for site A to lose connectivity to all other
sites immediately after sync site election. It is also possible for
site A to proceed far enough in the recovery process to set
UBIK_RECHAVEDB before it loses its sync site status.

As a result of all of this, if a site with an old database comes
online while there are network connectivity problems between the other
sites, and a ubik write request comes in, it's possible for the "old"
database to overwrite the "new" database. This makes it look as if the
database has "rolled back" to an earlier version.

This should be possible with any ubik database, though how to actually
trigger this bug can vary, because different ubik servers set different
network timeouts. It is probably most likely with the VLDB, because the
VLDB is typically the most frequently written database.

If a VLDB reverts to an earlier version, existing volumes can appear
not to exist in the VLDB, and new volumes can reuse volume IDs from
existing volumes. This can result in rather confusing errors.

To fix this, ensure that we have contacted a majority of sites with
DISK_GetVersion before indicating that we have located the best db
version. If we've contacted a majority of sites, then we are guaranteed
(under ubik assumptions) that we've found the best version, since
previous writes to the database should be guaranteed to hit a majority
of sites (otherwise they wouldn't be successful).

If we cannot reach a majority of sites, we just don't set
UBIK_RECFOUNDDB, and the recovery process restarts. Presumably on the
next iteration we'll be able to contact them, or we'll lose sync site
status if we can't reach the other sites for long enough.

Change-Id: I84f745b5e017bb62d93b538dbc9c7de845bee1bd
Reviewed-on: https://gerrit.openafs.org/12281
Tested-by: BuildBot
Reviewed-by: Benjamin Kaduk
---

diff --git a/src/ubik/recovery.c b/src/ubik/recovery.c
index fcbaec6..28e71c1 100644
--- a/src/ubik/recovery.c
+++ b/src/ubik/recovery.c
@@ -530,6 +530,7 @@ urecovery_Interact(void *dummy)
          * most current database, then go find the most current db.
          */
         if (!(urecovery_state & UBIK_RECFOUNDDB)) {
+            int okcalls = 0;
             DBRELE(ubik_dbase);
             bestServer = (struct ubik_server *)0;
             bestDBVersion.epoch = 0;
@@ -547,6 +548,7 @@ urecovery_Interact(void *dummy)
                 code = DISK_GetVersion(ts->disk_rxcid, &ts->version);
                 UBIK_ADDR_UNLOCK;
                 if (code == 0) {
+                    okcalls++;
                     /* perhaps this is the best version */
                     if (vcmp(ts->version, bestDBVersion) > 0) {
                         /* new best version */
@@ -555,23 +557,35 @@ urecovery_Interact(void *dummy)
                     }
                 }
             }
-            /* take into consideration our version. Remember if we,
-             * the sync site, have the best version. Also note that
-             * we may need to send the best version out.
-             */
+
             DBHOLD(ubik_dbase);
-            if (vcmp(ubik_dbase->version, bestDBVersion) >= 0) {
-                bestDBVersion = ubik_dbase->version;
-                bestServer = (struct ubik_server *)0;
-                urecovery_state |= UBIK_RECHAVEDB;
-            } else {
-                /* Clear the flag only when we know we have to retrieve
-                 * the db. Because urecovery_AllBetter() looks at it.
-                 */
-                urecovery_state &= ~UBIK_RECHAVEDB;
-            }
-            urecovery_state |= UBIK_RECFOUNDDB;
-            urecovery_state &= ~UBIK_RECSENTDB;
+
+            if (okcalls + 1 >= ubik_quorum) {
+                /* If we've asked a majority of sites about their db version,
+                 * then we can say with confidence that we've found the best db
+                 * version. If we haven't contacted most sites (because
+                 * GetVersion failed or because we already know the server is
+                 * down), then we don't really know if we know about the best
+                 * db version. So we can only proceed in here if 'okcalls'
+                 * indicates we managed to contact a majority of sites. */
+
+                /* take into consideration our version. Remember if we,
+                 * the sync site, have the best version. Also note that
+                 * we may need to send the best version out.
+                 */
+                if (vcmp(ubik_dbase->version, bestDBVersion) >= 0) {
+                    bestDBVersion = ubik_dbase->version;
+                    bestServer = (struct ubik_server *)0;
+                    urecovery_state |= UBIK_RECHAVEDB;
+                } else {
+                    /* Clear the flag only when we know we have to retrieve
+                     * the db. Because urecovery_AllBetter() looks at it.
+                     */
+                    urecovery_state &= ~UBIK_RECHAVEDB;
+                }
+                urecovery_state |= UBIK_RECFOUNDDB;
+                urecovery_state &= ~UBIK_RECSENTDB;
+            }
         }
         if (!(urecovery_state & UBIK_RECFOUNDDB)) {
             DBRELE(ubik_dbase);
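
The quorum check described in the commit message can be looked at in
isolation. The following is only a minimal sketch under stated
assumptions, not the OpenAFS code: 'struct version', 'struct site',
version_cmp() and find_best_version() are hypothetical stand-ins for
the ubik structures, vcmp() and the recovery loop, and the 'reachable'
flag simply models whether DISK_GetVersion would have succeeded for
that site.

```c
#include <assert.h>

/* Hypothetical stand-ins for illustration; not the OpenAFS definitions. */
struct version {
    int epoch;
    int counter;
};

struct site {
    int reachable;          /* models DISK_GetVersion returning 0 */
    struct version v;       /* version the site would report */
};

/* Order versions by epoch first, then by counter. */
static int
version_cmp(struct version a, struct version b)
{
    if (a.epoch != b.epoch)
        return (a.epoch > b.epoch) - (a.epoch < b.epoch);
    return (a.counter > b.counter) - (a.counter < b.counter);
}

/*
 * Return 1 and store the best known version in *best only if the local
 * site plus the remotes that answered form a quorum. Return 0 (leaving
 * *best untouched) otherwise, so the caller can retry later.
 */
static int
find_best_version(struct version local, const struct site *remotes,
                  int nremotes, int quorum, struct version *best)
{
    struct version candidate = { 0, 0 };
    int okcalls = 0;
    int i;

    assert(quorum > 0);

    for (i = 0; i < nremotes; i++) {
        if (!remotes[i].reachable)
            continue;       /* failed call: does not count toward quorum */
        okcalls++;
        if (version_cmp(remotes[i].v, candidate) > 0)
            candidate = remotes[i].v;
    }

    /* The "+ 1" counts the local site, as in the patch. */
    if (okcalls + 1 < quorum)
        return 0;

    /* Ties go to the local copy, mirroring the ">= 0" test in the patch. */
    if (version_cmp(local, candidate) >= 0)
        candidate = local;

    *best = candidate;
    return 1;
}
```

With three sites and a quorum of 2, calling find_best_version() for
site A while both B and C are unreachable returns 0, which corresponds
to leaving UBIK_RECFOUNDDB unset. Once either B or C answers, the call
returns 1 and reports whichever version is newest.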