From fac0b742960899123dca6016f6ffc6ccc944f217 Mon Sep 17 00:00:00 2001 From: Andrew Deason Date: Sun, 22 May 2016 21:54:30 -0500 Subject: [PATCH 1/1] ubik: Return an error from ContactQuorum when inquorate Currently, when we need to contact all other servers in the ubik quorum (to create a write transaction, and send db changes, etc), we call the ContactQuorum_* family of functions. To contact each server, those functions follow an algorithm like the following pseudocode: { int rcode = 0; int code; int okcalls = 0; for (ts = ubik_servers; ts; ts = ts->next) { if (ts->up) { code = contact_server(ts); if (code) { rcode = code; } else { okcalls++; } } } if (okcalls + 1 >= ubik_quorum) { return 0; } else { return rcode; } } This means that if we successfully contact a majority of ubik sites, we return success, even if some sites returned an error. If most sites fail, then we return an error (we arbitrarily pick the last error we got). This means that in most situations, a successful write transaction is guaranteed to have been transmitted to a majority of ubik sites, so the written data cannot be lost (at least one of the sites that got the new data will be in a future elected quorum). However, if a site is already known to be down (ts->up is 0), then we skip trying to contact that site, but we also don't set any errors. This means that if a majority of sites are already known to be down (ts->up is 0), then we can indicate success for a write transaction, even though the relevant data has not been written to a majority of sites. In that situation, it is possible to lose data. Most of the time this is not possible, since a majority of sites must be 'up' for the sync site to be elected and to allow write transactions at all. There are a few ways, though, in which we can get into a situation where most other sites are 'down', but we still let a write transaction go through. An example scenario: Say we have sites A, B, and C. All 3 sites come up at the same time, and A is the lowest IP so it starts an election (after around BIGTIME seconds). Right after A is elected the sync site, sites B and C will have 'lastYesState' set to 0, since site A hasn't yet sent out a beacon as the sync site. A client can then start a write to the ubik database on site A, which site A will allow since it's the sync site (and presumably all the relevant recovery flags are set). Site A will try to contact sites B and C for a DISK_Begin call, but lastYesState is set to 0 on those sites. This will cause DISK_Begin to return UNOQUORUM (urecovery_AllBetter will return 0, because uvote_HaveSyncAndVersion will return 0, because lastYesState is not set). So site A will get a UNOQUORUM error from sites B and C, and so site A will set 'ts->up' to 0 for sites B and C, and will return UNOQUORUM to the client. The client may then try to retry the call (because UNOQUORUM is not treated as a 'global' error in ubikclient.c's ubik_Call_New), or another client write request could come in. Now that 'ts->up' is unset for both sites B and C, we skip trying to contact any remote sites, and the ContactQuorum functions will return success. So the ubik write will go through successfully, but the new data will only be on site A. At this point, if site A crashes, then sites B and C will elect a quorum, and will not have the modifications that were written to site A (so the data written to site A is lost). If site A stays up, then it will go through database recovery, sending the entire database file to sites B and C. In addition, it's very possible in this scenario for a client to write to the database, and then try to read back data and confusingly get a different result. For example, if someone issues the following two commands while triggering the above scenario: $ pts createuser testuser $ pts examine testuser If the second command contacts site B or C, then it will always fail, saying that the user doesn't exist (even though the first command succeeded). This is because sites B and C don't have the new data written to site A, at least temporarily. While this confusing behavior is not completely avoidable in ubik (this can always happen 'sometimes' due to network errors and such), with the scenario described here, it happens 100% of the time. The general scenario described above can also happen if sites B and C are suddenly legitimately unreachable from site A, instead of throwing the UNOQUORUM error. All of the steps are pretty much the same, but there is a bit of a delay while we wait for the DISK_Begin call to fail. To fix this, do not let 0 be returned if a quorum has not been reached. In some sense, UNOQUORUM could *always* be returned in that case, but it is more in keeping with historical behavior to return a "real" error if there is one available. It is somewhat questionable whether we should even be propagating errors received from calls like DISK_Begin/DISK_Commit to the ubik client (e.g. if we get a -1 from trying to contact a remote site, we return -1 to the client, so the client may think it couldn't reach the site at all). But this commit does not change any of that logic, and should only change behavior when a majority of sites have 'ts->up' unset. A later commit might effect the change to always return UNOQUORUM and ignore the actual error values from the DISK_ calls, but that is not needed to fix the immediate issue. An important note: Before this commit, there was a window of about 15 seconds after a sync site is elected where a write to the ubik db would appear to be successful, but would only modify the ubik db on the sync site. (Details described above.) With this commit, writes during that 15-second window will instead fail, because we cannot guarantee that we won't lose that data. If someone relies on 'udebug' data from the sync site to let them know when writes will go through successfully, this commit could appear to cause new errors. [kaduk@mit.edu: transfer long commit message describing the issue from an alternative fix, and tidy up accordingly] Change-Id: If6842d7122ed4d137f298f0f8b7f20350b1e9de6 Reviewed-on: https://gerrit.openafs.org/12289 Reviewed-by: Mark Vitale Tested-by: BuildBot Reviewed-by: Benjamin Kaduk --- src/ubik/ubik.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/ubik/ubik.c b/src/ubik/ubik.c index a22aaff..229c596 100644 --- a/src/ubik/ubik.c +++ b/src/ubik/ubik.c @@ -201,7 +201,7 @@ ContactQuorum_rcode(int okcalls, afs_int32 rcode) if (okcalls + 1 >= ubik_quorum) return 0; else - return rcode; + return (rcode != 0) ? rcode : UNOQUORUM; } /*! -- 1.9.4