Windows: cm_Analyze mark server down for misc rx errors In cm_Analyze() replace the token error retry logic for miscellaneous rx errors and simply mark the server down. The most common error that will be seen in this category is RX_INVALID_OPERATION which would be received if the Rx service id or security class is not recognized by the peer. This could happen if an AuriStor server is replaced by an AFS3 server or if a packet is reflected. A side effect of this change is that V* and CM_ERROR_* errors will once again be retried. This will permit proper failover to occur. Change-Id: I77e6325eb05643ea6df1fc0bc877bd4ef496c974 Reviewed-on: http://gerrit.openafs.org/11920 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Jeffrey Altman <jaltman@your-file-system.com>
rx: Remove RX_CALL_BUSY Commit 23d6287f7f494383891a497038e8c0e870e824bf introduced the behavior where a client can immediately retry a call if it receives a "busy" packet from the server (meaning, the call channel is already in use). This happened via Rx returning the error code RX_CALL_BUSY, and the caller was supposed to immediately retry the call, so Rx could reissue the RPC on a different call channel. However, this behavior makes it more likely for the server to process an RPC that the client thinks has not been processed. Say the client issues an RPC, the server replies with a "busy" packet, and the client resends the original packet before it sees the "busy" packet. In this case, the server will get the resent packet for the RPC request and process it, but the client will think the call has failed (and presumably will retry the call on a new channel). For calls that are non-idempotent (e.g. MakeDir), this can result in incorrect errors (e.g. EEXIST) as well as incorrect cache state in the client. There may be some ways to mitigate at least some of the problems here, but this kind of "instant" retry behavior is often not really that helpful. Calls that take a very long time to run on the server are very rare (and usually indicate some other problem), while the occasional short-lived "busy" packet is relatively common (sometimes the server just hasn't cleaned up the call by the time we issue a new call). So just get rid of the retrying behavior to ensure we don't continue to encounter any problems like this. To get rid of this behavior, we remove the RX_CALL_BUSY code, and all code dealing with processing it. This means removing the RX_CALL_BUSY handling from the client, as well as removing rx_SetBusyChannelError(). This effectively reverts most of 23d6287f7f494383891a497038e8c0e870e824bf, and a few other commits related to RX_CALL_BUSY. With this change, if all we get from the server are BUSY packets when we try to issue an RPC, the call will eventually error out with RX_CALL_TIMEOUT (or hang forever, if no timeouts are configured). This can be thought of intuitively as similar to "idle dead" behavior, since we are just waiting for the server to proceed with processing the call. So, if "idle dead" is configured, we still timeout after the "idle dead" timeout. And if no idle or hard dead timeout is configured, we will hang forever; just like if the server started processing the call but then hangs forever. Note that not all of 23d6287f7f494383891a497038e8c0e870e824bf is reverted. Namely, the logic to have rx_NewCall try to pick the "least busy" channel is retained. Thanks to Simon Wilkinson for bringing up and discussing this issue in this thread: <http://thread.gmane.org/gmane.comp.file-systems.openafs.devel/10931> <https://lists.openafs.org/pipermail/openafs-devel/2013-April/019297.html> Change-Id: I272e51f252356aa14bc4b8a3b7c594700deb432c Reviewed-on: http://gerrit.openafs.org/10784 Reviewed-by: Daria Brashear <shadow@your-file-system.com> Reviewed-by: Benjamin Kaduk <kaduk@mit.edu> Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Jeffrey Altman <jaltman@your-file-system.com>
rx: Remove RX_CALL_IDLE After change Ie0497d24f1bf4ad7d30ab59061f96c3298f47d17, RX_CALL_IDLE is not generated by Rx anymore; "idle dead" timeouts just cause RX_CALL_TIMEOUT errors. Any code dealing with it is thus now dead code (this value was deliberately never sent over the wire), so remove the dead code. Change-Id: I2b38327f77ffc8168712b83506afa1da3eea1224 Reviewed-on: http://gerrit.openafs.org/10783 Reviewed-by: Benjamin Kaduk <kaduk@mit.edu> Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Jeffrey Altman <jaltman@your-file-system.com>
Windows: cm_Analyze retries vs CM_REQ_NORETRY (2) Commit a1b5a1d42280753de13094006dcc130fede978a1 left out a critical part of the patch. The check for "retry < 2" when determining whether retries should be skipped due to CM_REQ_NORETRY. Change-Id: I9b750e2bab11d28813447b2ee92287b8dcfbbba3 Reviewed-on: http://gerrit.openafs.org/11131 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Jeffrey Altman <jaltman@your-file-system.com>
Windows: cm_ForceNewConnections serverp == NULL If serverp == NULL, return immediately. Do not dereference a NULL pointer. Change-Id: I47781a03f22a83289a23b30ff375249fec18ff51 Reviewed-on: http://gerrit.openafs.org/10845 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Perry Ruiter <pruiter@sinenomine.net> Reviewed-by: Jeffrey Altman <jaltman@your-file-system.com>
Windows: cm_ConnByServer fix search for replication Separate connection objects are maintained for use when accessing replicated and single source volumes. If the matching connection type cannot be found while holding the cm_connLock shared a second search is performed after the lock is upgraded to an exclusive lock. This second connection search was not enforcing the replication criteria. Change-Id: I408a5d87c3a82da5235fa2255db7d1d7a6bcb6d9 Reviewed-on: http://gerrit.openafs.org/10681 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Jeffrey Altman <jaltman@your-file-system.com>
Windows: cm_connLock not required for cm_GetUCell In cm_ConnByServer() there is no need to hold the cm_connLock across the cm_GetUCell() call. Obtain the cm_ucell_t object before the cm_connLock is obtained. Change-Id: I971b55e0aae7748b59895785c1c22b5461c4fd35 Reviewed-on: http://gerrit.openafs.org/10680 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Jeffrey Altman <jaltman@your-file-system.com>
Windows: cm_Analyze retries vs CM_REQ_NORETRY CM_REQ_NORETRY is set by threads that want all errors returned immediately. However, there are some errors that should never be returned: RX_MSGSIZE RX_CALL_BUSY VNOSERVICE RX_CALL_IDLE RXKADEXPIRED VICECONNBAD VICETOKENDEAD For these errors even if the thread has requested no retries a RPC retry must be performed. Change-Id: I692f65a9fdbbf27fc880ac8912fc72c1d1357c6d Reviewed-on: http://gerrit.openafs.org/10470 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Jeffrey Altman <jaltman@your-file-system.com>
Windows: cm_FindVolumeByFID cm_GetVolumeByFID() does not query the vldb if the volume group is not known to the cache manager. cm_FindVolumeByFID() is to be used in cases where the volume group data must be known for the operation to successfully complete. Change-Id: I9bb3bd13f14dea534952495b00a3348aafd2d591 Reviewed-on: http://gerrit.openafs.org/10465 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Jeffrey Altman <jaltman@your-file-system.com>
Windows: Do not remove scp from hash table on deletion If the CM_SCACHEFLAG_DELETED flag is going to have any benefit, the cm_scache object must not be removed from the hash table in response to a VNOVNODE error. Otherwise, a new cm_scache object is allocated, the CM_SCACHEFLAG_DELETED is not found, and a new callback request is issued to the file server which in response returns VNOVNODE. Do this enough times and the abort threshold is triggered and then the application becomes very unhappy with performance. Change-Id: I5c6e2495c149f52ca192d195897e2a1822cf0d14 Reviewed-on: http://gerrit.openafs.org/10141 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Jeffrey Altman <jaltman@your-file-system.com>
Windows: CM_SCACHEFLAG_DELETED use InterlockedOr When setting CM_SCACHEFLAG_DELETED use InterlockedOr. Change-Id: Ie6ccc8e0a167e5bb6a8b74689606166821168989 Reviewed-on: http://gerrit.openafs.org/10140 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Jeffrey Altman <jaltman@your-file-system.com>
Windows: Protect against cm_GetVolServerList failures In cm_Analyze, if cm_GetVolServerList() fails volServerpp will be NULL which will trigger an exception if passed to either cm_SetServerBusyStatus or cm_ResetServerBusyStatus. Change-Id: I75b4b855b8c3ccfc014532b0c2eb3135807647ef Reviewed-on: http://gerrit.openafs.org/9960 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Jeffrey Altman <jaltman@your-file-system.com> Tested-by: Jeffrey Altman <jaltman@your-file-system.com>
Windows: Protect against infinite VIO retries Keep track of the number of VIO errors reported by the file servers. If the count exceeds 100, abandon the request. Change-Id: I4d18ccca732802752c94c9ca1b36ca9a827c72de Reviewed-on: http://gerrit.openafs.org/9923 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Jeffrey Altman <jaltman@your-file-system.com>
Windows: cm_Analyze if no retry don't sleep If error handling response is not going to result in a retry of the call, do not sleep. Change-Id: I12435612f94a2e6afb77b5a2975f90f66e02823a Reviewed-on: http://gerrit.openafs.org/9881 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Jeffrey Altman <jaltman@your-file-system.com> Tested-by: Jeffrey Altman <jaltman@your-file-system.com>
Windows: only retry ALLBUSY for five minutes Add a volbusyCount field to cm_req_t. Increment the count each time CM_ERROR_ALLBUSY is processed by cm_Analyze for a given request. Wait 15 seconds between retries and retry up to 20 times and then fail. This prevents requests from blocking for a volume that isn't going to come back online for hours. Change-Id: I25e68565700dddceebecedf552d1e04cbe39b22a Reviewed-on: http://gerrit.openafs.org/9876 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Jeffrey Altman <jaltman@your-file-system.com> Tested-by: Jeffrey Altman <jaltman@your-file-system.com>
Windows: Introduce CM_CONN_FLAG_NEW The new CM_CONN_FLAG_NEW flag is set on the cm_conn object whenever a new rx_connection has been created. The flag is cleared in cm_Analyze if the call succeeded or if the error is one that is generated as a result of communicating with the peer. If no communication with the peer has taken place the connection is considered "new". For errors that would result in forcing a new connection, check whether the existing connection is already "new". This avoids an extra RX_CALL_DEAD timeout period in the case where a "new" connection was already in use. Change-Id: If23a5f4b98e7599e4b4e62b474661e9d91aba81b Reviewed-on: http://gerrit.openafs.org/9847 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Jeffrey Altman <jaltman@your-file-system.com>
Windows: Use interlocked ops for cm_conn flags cm_conn flags can be modified by multiple threads. Use interlocked operations for thread safety. Change-Id: Iaaec54ca0962f8f78e1ddaee2c0a8a68041f5ed9 Reviewed-on: http://gerrit.openafs.org/9846 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Jeffrey Altman <jaltman@your-file-system.com>
Windows: cm_Analyze VICECONNBAD and VICETOKENDEAD cm_Analyze forces new rx connections in response to VICECONNBAD and VICETOKENDEAD errors but failed to mark the cm_req_t with CM_REQ_NEW_CONN_FORCED and failed to set 'forcing_new' to true ensuring that a retry would take place even if the cm_req_t included the no retry flag. Change-Id: Ieb2bf141279192a591eb66eacab8150c10d029ce Reviewed-on: http://gerrit.openafs.org/9773 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Derrick Brashear <shadow@your-file-system.com>
Windows: Force new connection upon RXKADEXPIRED cm_Analyze invalidated the credentials for the cell upon receiving an RXKADEXPIRED error from a server but failed for force the establishment of a new rx connection to the server. As a result, the expired credentials would continue to be used until the credential expires. Change-Id: I93a4146d5ca708ce1cca467e7e5f72fea950f8ae Reviewed-on: http://gerrit.openafs.org/9772 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Derrick Brashear <shadow@your-file-system.com>
Windows: Avoid cm_Analyze race on cm_serverRef lists cm_Analyze() accepted as a parameter a pointer to the first element on a cm_serverRef list which is only ever used for VL operations. cm_Analyze() would separately call cm_GetVolServerList() to obtain the cm_serverRef list for RXAFS operations. Then the variable 'serversp' would be set to the first element of the list. 'serversp' was then used to refer to the list and would be passed to cm_SetServerBusyStatus() and cm_ResetServerBusyStatus() which would in turn obtain the cm_serverLock while it manipulated the cm_serverRef status flags for the elements in the list. The problem is that passing a pointer to the first element of the cm_serverRef list without holding cm_serverLock can permit the list contents to be altered including removal of the first element. If the race is lost and the memory associated with the first element is freed before access, the afsd_service.exe will crash. This patchset makes a number of changes. First, the cm_serverRef_t parameter is changed from a pointer to the first element of the list to be a pointer to the HEAD pointer of the list. Since it is ever only used for cm_cell.vlServerp lists, the parameter is renamed to 'vlServerspp'. Second, a separate "cm_serverRef_t ** volServerspp" variable is allocated for the return value from the cm_GetVolServerList() operations. cm_SetServerBusyStatus() and cm_ResetServerBusyStatus() are altered to accept a pointer to the HEAD of the list instead of a pointer to the first element. The cm_serverLock is now held read instead of write because the list itself is not being altered. All of the state changes being applied to the cm_serverRef objects are atomic. Finally, cm_serverLock is held across all list traversals within cm_Analyze(). A read lock is obtained if the elements of the list are not being removed or inserted and a write lock is obtained if they are. Change-Id: I48464e90a828706dad442c019c75a717b06d690b Reviewed-on: http://gerrit.openafs.org/9625 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Jeffrey Altman <jaltman@your-file-system.com> Tested-by: Jeffrey Altman <jaltman@your-file-system.com>