From d03e3c392eee2bd9158c417f42bcbf2009dabfc3 Mon Sep 17 00:00:00 2001
From: Andrew Deason
Date: Thu, 13 Sep 2012 17:58:50 -0500
Subject: [PATCH 1/1] rx: Normalize use of some MTU-discovery fields

When we store MTUs (peer->ifMTU, peer->natMTU, etc.), we store the
maximum transmission unit usable by RX, i.e., excluding the IP and UDP
headers, but including the RX header. Contrariwise, when we track the
size of packets we've sent (conn->lastPacketSize, peer->maxPacketSize),
we track logical packet lengths which exclude the RX header (and the IP
and UDP headers).

However, the consumers of lastPacketSize and maxPacketSize were not
always interpreting their values correctly as excluding the RX (and
other) headers. Add comments to these fields in their respective
structure definitions to help make clear what they contain (and the
difference between them). Correct several checks which were using the
wrong constant for converting between these two worldviews (and the
wrong sign). Modernize the style of lines that are touched.

The lastPacketSize and maxPacketSize variables are only accessed from
five places: while sending packets, while processing acks, while
sending acks, while handling growMTU events, and in rxi_CheckCall().
They are used to track the size of packets that have been successfully
sent (and thus, indirectly, to control the size of packets we send).
The maxPacketSize is only set once we have received an ack from the
peer for a packet of that size, and lastPacketSize tracks the size of a
speculative packet larger than that size, until it is acked.

When sending packets, we check if the size of the packet we are sending
exceeds the recorded maxPacketSize, and if so, record that speculative
size in the lastPacketSize variable, along with the sequence number of
that packet in lastPacketSizeSeq.
Correspondingly, when processing acks, if the packet tracked in
lastPacketSizeSeq is being acked, we know that our speculative large
packet was successfully received, and can attempt to update the
recorded maxPacketSize for the peer. This is done through an
intermediate variable, 'pktsize', which holds either the value of
lastPacketSize or lastPingSize, without adjustment for the size of any
headers.

The ack processing has a bit of code to handle the case where
maxPacketSize has been reset to zero, initializing it to a small value
which should be expected to always work. The intention seems to have
been to initialize maxPacketSize to the smallest permitted value (since
RX_MIN_PACKET_SIZE is the amount of data available to RX in the
smallest permitted IP packet), but the old code was actually
initializing maxPacketSize from zero to something a bit larger than the
minimum, by RX_IPUDP_SIZE + RX_HEADER_SIZE. This over-large
initialization was mostly harmless; see below.

After this potential initialization, 'pktsize' was incorrectly used to
set a new ifMTU and natMTU for the peer. It correctly determined that a
packet larger than the previous maxPacketSize had been acked, but then
set the peer's ifMTU and natMTU to smaller values than the acked packet
actually indicates. (It is careful to only increase the ifMTU, though.)
The actual peer->MTU is *not* updated here, but will presumably trickle
through eventually via rxi_Resend() or similar. It is possible that
this code should be using rxi_SetPeerMtu() or similar logic, but that
might involve locking issues which are outside the scope of this
analysis.

The over-large initialization of maxPacketSize (from zero) was
fortuitously mostly harmless on systems using minimum-sized IP packets,
since a correspondingly wrong check was used to detect if a new MTU
invalidates this maxPacketSize, with the constants offsetting.
Likewise, the checks in rxi_SendAck() had the wrong constants, but they
offset to produce the correct boundary between small and large packets
while trying to grow the MTU. Unfortunately, the old behavior in the
"small" case is not correct, and the grow MTU event would try to send a
packet with more padding than was intended. In networks allowing
packets slightly larger than the minimum (but not much larger than the
minimum), the old code may have been unable to discover the true MTU.

In the main (MTU-related) logic of rxi_SendAck, a variable 'padbytes'
is set to a small increment of maxPacketSize in the "small" case, and a
larger increment of maxMTU in the "large" case. There is a floor for
padbytes based on RX_MIN_PACKET_SIZE, which ended up being larger than
intended in the old code by approximately the size of the rx header.
(Some of the adjustments performed are rather opaque, so the
motivations are unclear.)

The more interesting places where accesses to lastPacketSize and
maxPacketSize happen are during the MTU grow events and in
rxi_CheckCall().

In rxi_CheckCall(), the packet size variables are only accessed if the
connection has the msgsizeRetryErr flag set, the call is not timed out
(whether for idleness or during active waiting), and the call has
actually received data. In this case, we conclude that sending packets
failed and decrease the MTU. The old code was quite broken in this
regard, with a reversed sense of the conditional for whether a
speculative large packet was outstanding, and a rather large decrease
in MTU size of effectively 128 + RX_HEADER_SIZE + RX_IPUDP_SIZE = 212,
when only a decrease of 128 was intended. The new code corrects the
sense of the conditional and sets the decrease in MTU to the intended
value of 128.

With respect to MTU grow events, this change only touches
rxi_SetPeerMtu(), to correct the conditional on whether the MTU update
invalidates/overrides the cached maxPacketSize.
There is a window of values which could cause the old code to
incorrectly fail to invalidate the cached maxPacketSize. Values in this
window could result in the old code being stuck on a bad MTU guess, but
should not cause an actual failure to communicate. That conditional
zeroing of maxPacketSize is the only access to the PacketSize variables
in rxi_SetPeerMtu().

maxPacketSize is also checked in rxi_GrowMTUEvent(), but only against
zero to avoid sending RX_ACK_MTU packets needlessly, so it is
unaffected by the issue at hand.

In summary, in certain network conditions, the old code could fail to
find an optimum MTU size, but would always continue to operate. The new
code is more likely to find a better MTU size, and the old and the new
code should interoperate fine with both each other and themselves.

[kaduk@mit.edu add a few missed cases; expound on analysis in commit
message]

Change-Id: I7494892738d38ffe35bdf1dfd483dbf671cc799a
Reviewed-on: http://gerrit.openafs.org/8111
Reviewed-by: Jeffrey Altman
Tested-by: BuildBot
Reviewed-by: Daria Brashear
---
 src/rx/rx.c      | 23 ++++++++++++-----------
 src/rx/rx_conn.h |  4 ++--
 src/rx/rx_peer.h |  6 +++---
 3 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/src/rx/rx.c b/src/rx/rx.c
index 6c71368..5079a8c 100644
--- a/src/rx/rx.c
+++ b/src/rx/rx.c
@@ -2888,7 +2888,7 @@ rxi_SetPeerMtu(struct rx_peer *peer, afs_uint32 host, afs_uint32 port, int mtu)
     if (peer->ifMTU < OLD_MAX_PACKET_SIZE)
         peer->maxDgramPackets = 1;
     /* We no longer have valid peer packet information */
-    if (peer->maxPacketSize-RX_IPUDP_SIZE > peer->ifMTU)
+    if (peer->maxPacketSize + RX_HEADER_SIZE > peer->ifMTU)
         peer->maxPacketSize = 0;
 
     MUTEX_EXIT(&peer->peer_lock);
@@ -4396,12 +4396,12 @@ rxi_ReceiveAckPacket(struct rx_call *call, struct rx_packet *np,
          * but we are clearly receiving.
          */
         if (!peer->maxPacketSize)
-            peer->maxPacketSize = RX_MIN_PACKET_SIZE+RX_IPUDP_SIZE;
+            peer->maxPacketSize = RX_MIN_PACKET_SIZE - RX_HEADER_SIZE;
 
         if (pktsize > peer->maxPacketSize) {
             peer->maxPacketSize = pktsize;
-            if ((pktsize-RX_IPUDP_SIZE > peer->ifMTU)) {
-                peer->ifMTU=pktsize-RX_IPUDP_SIZE;
+            if ((pktsize + RX_HEADER_SIZE > peer->ifMTU)) {
+                peer->ifMTU = pktsize + RX_HEADER_SIZE;
                 peer->natMTU = rxi_AdjustIfMTU(peer->ifMTU);
                 rxi_ScheduleGrowMTUEvent(call, 1);
             }
@@ -5531,7 +5531,7 @@ rxi_SendAck(struct rx_call *call,
          */
         if (call->conn->peer->maxPacketSize &&
             (call->conn->peer->maxPacketSize < OLD_MAX_PACKET_SIZE
-             +RX_IPUDP_SIZE))
+             - RX_HEADER_SIZE))
             padbytes = call->conn->peer->maxPacketSize+16;
         else
             padbytes = call->conn->peer->maxMTU + 128;
@@ -6471,13 +6471,14 @@ mtuout:
             call->lastReceiveTime) {
             int oldMTU = conn->peer->ifMTU;
 
-            /* if we thought we could send more, perhaps things got worse */
-            if (conn->peer->maxPacketSize > conn->lastPacketSize)
-                /* maxpacketsize will be cleared in rxi_SetPeerMtu */
-                newmtu = MAX(conn->peer->maxPacketSize-RX_IPUDP_SIZE,
-                             conn->lastPacketSize-(128+RX_IPUDP_SIZE));
+            /* If we thought we could send more, perhaps things got worse.
+             * Shrink by 128 bytes and try again. */
+            if (conn->peer->maxPacketSize < conn->lastPacketSize)
+                /* maxPacketSize will be cleared in rxi_SetPeerMtu */
+                newmtu = MAX(conn->peer->maxPacketSize + RX_HEADER_SIZE,
+                             conn->lastPacketSize - 128 + RX_HEADER_SIZE);
             else
-                newmtu = conn->lastPacketSize-(128+RX_IPUDP_SIZE);
+                newmtu = conn->lastPacketSize - 128 + RX_HEADER_SIZE;
 
             /* minimum capped in SetPeerMtu */
             rxi_SetPeerMtu(conn->peer, 0, 0, newmtu);
diff --git a/src/rx/rx_conn.h b/src/rx/rx_conn.h
index 2f3a16e..5007684 100644
--- a/src/rx/rx_conn.h
+++ b/src/rx/rx_conn.h
@@ -42,9 +42,9 @@ struct rx_connection {
                                          * RX_PACKET_TYPE_BUSY packet for this
                                          * call slot, or 0 if the slot is not busy */
     afs_uint32 serial;          /* Next outgoing packet serial number */
-    afs_int32 lastPacketSize;   /* last >max attempt */
+    afs_int32 lastPacketSize;   /* size of last >max attempt, excludes headers */
     afs_int32 lastPacketSizeSeq;        /* seq number of attempt */
-    afs_int32 lastPingSize;     /* last MTU ping attempt */
+    afs_int32 lastPingSize;     /* size of last MTU ping attempt, w/o headers */
     afs_int32 lastPingSizeSer;  /* serial of last MTU ping attempt */
     struct rxevent *challengeEvent;     /* Scheduled when the server is challenging a */
     struct rxevent *delayedAbortEvent;  /* Scheduled to throttle looping client */
diff --git a/src/rx/rx_peer.h b/src/rx/rx_peer.h
index eef3677..a4acd79 100644
--- a/src/rx/rx_peer.h
+++ b/src/rx/rx_peer.h
@@ -27,7 +27,7 @@ struct rx_peer {
     u_short port;               /* Remote UDP port, in net byte order */
 
     /* interface mtu probably used for this host - includes RX Header */
-    u_short ifMTU;              /* doesn't include IP header */
+    u_short ifMTU;              /* doesn't include IP/UDP header */
 
     /* For garbage collection */
     afs_uint32 idleWhen;        /* When the refcount went to zero */
@@ -39,7 +39,7 @@ struct rx_peer {
     int reSends;                /* Total number of retransmissions for this
                                 * peer, since this structure was created */
 
/* the "natural" MTU, excluding IP,UDP headers, is negotiated by the endpoints */
-    u_short natMTU;
+    u_short natMTU;             /* includes rx header */
     u_short maxMTU;
     /* negotiated maximum number of packets to send in a single datagram. */
     u_short maxDgramPackets;
@@ -62,7 +62,7 @@ struct rx_peer {
     afs_uint64 bytesReceived;   /* Number of bytes received from this peer */
     struct opr_queue rpcStats;  /* rpc statistic list */
     int lastReachTime;          /* Last time we verified reachability */
-    afs_int32 maxPacketSize;    /* peer packetsize hint */
+    afs_int32 maxPacketSize;    /* Max size we sent that got acked (w/o hdrs) */
 
 #ifdef AFS_RXERRQ_ENV
     rx_atomic_t neterrs;
-- 
1.9.4