From d03e3c392eee2bd9158c417f42bcbf2009dabfc3 Mon Sep 17 00:00:00 2001
From: Andrew Deason
Date: Thu, 13 Sep 2012 17:58:50 -0500
Subject: [PATCH 1/1] rx: Normalize use of some MTU-discovery fields

When we store MTUs (peer->ifMTU, peer->natMTU, etc.), we store the
maximum transmission unit usable by RX, i.e., excluding the IP and UDP
headers, but including the RX header. Contrariwise, when we track the
size of packets we've sent (conn->lastPacketSize, peer->maxPacketSize),
we track logical packet lengths which exclude the RX header (and the IP
and UDP headers).

However, the consumers of lastPacketSize and maxPacketSize were not
always interpreting their values correctly as excluding the RX (and
other) headers. Add comments to these fields in their respective
structure definitions to help make clear what they contain (and the
difference between them). Correct several checks which were using the
wrong constant for converting between these two worldviews (and the
wrong sign). Modernize the style of lines that are touched.

The lastPacketSize and maxPacketSize variables are only accessed from
five places: while sending packets, while processing acks, while
sending acks, while handling growMTU events, and in rxi_CheckCall().
They are used to track the size of packets that have been successfully
sent (and thus, indirectly, to control the size of packets we send).
The maxPacketSize is only set once we have received an ack from the
peer for a packet of that size, and lastPacketSize tracks the size of a
speculative packet larger than that size, until it is acked.

When sending packets, we check if the size of the packet we are sending
exceeds the recorded maxPacketSize, and if so, record that speculative
size in the lastPacketSize variable, along with the sequence number of
that packet in lastPacketSizeSeq.
Correspondingly, when processing acks, if the packet tracked in
lastPacketSizeSeq is being acked, we know that our speculative large
packet was successfully received, and can attempt to update the
recorded maxPacketSize for the peer. This is done through an
intermediate variable, 'pktsize', which holds either the value of
lastPacketSize or lastPingSize, without adjustment for the size of any
headers.

The ack processing has a bit of code to handle the case where
maxPacketSize has been reset to zero, initializing it to a small value
which should be expected to always work. The intention seems to have
been to initialize maxPacketSize to the smallest permitted value (since
RX_MIN_PACKET_SIZE is the amount of data available to RX in the
smallest permitted IP packet), but the old code was actually
initializing maxPacketSize from zero to something a bit larger than the
minimum, by RX_IPUDP_SIZE + RX_HEADER_SIZE. This over-large
initialization was mostly harmless; see below.

After this potential initialization, 'pktsize' was incorrectly used to
set a new ifMTU and natMTU for the peer. It correctly determined that a
packet larger than the previous maxPacketSize had been acked, but then
set the peer's ifMTU and natMTU to smaller values than the acked packet
actually indicates. (It is careful to only increase the ifMTU, though.)
The actual peer->MTU is *not* updated here, but will presumably trickle
through eventually via rxi_Resend() or similar. It is possible that
this code should be using rxi_SetPeerMtu() or similar logic, but that
might involve locking issues which are outside the scope of this
analysis.

The over-large initialization of maxPacketSize (from zero) was
fortuitously mostly harmless on systems using minimum-sized IP packets,
since a correspondingly wrong check was used to detect if a new MTU
invalidates this maxPacketSize, with the constants offsetting.
Likewise, the checks in rxi_SendAck() had the wrong constants, but they
offset to produce the correct boundary between small and large packets
while trying to grow the MTU. Unfortunately, the old behavior in the
"small" case is not correct, and the grow MTU event would try to send a
packet with more padding than was intended. In networks allowing
packets slightly larger than the minimum (but not much larger than the
minimum), the old code may have been unable to discover the true MTU.

In the main (MTU-related) logic of rxi_SendAck, a variable 'padbytes'
is set to a small increment of maxPacketSize in the "small" case, and a
larger increment of maxMTU in the "large" case. There is a floor for
padbytes based on RX_MIN_PACKET_SIZE, which ended up being larger than
intended in the old code by approximately the size of the rx header.
(Some of the adjustments performed are rather opaque, so the
motivations are unclear.)

The more interesting places where accesses to lastPacketSize and
maxPacketSize happen are during the MTU grow events and in
rxi_CheckCall().

In rxi_CheckCall(), the packet size variables are only accessed if the
connection has the msgsizeRetryErr flag set, the call is not timed out
(whether for idleness or during active waiting), and the call has
actually received data. In this case, we conclude that sending packets
failed and decrease the MTU. The old code was quite broken in this
regard, with a reversed sense of the conditional for whether a
speculative large packet was outstanding, and a rather large decrease
in MTU size of effectively 128 + RX_HEADER_SIZE + RX_IPUDP_SIZE = 212,
when only a decrease of 128 was intended. The new code corrects the
sense of the conditional and sets the decrease in MTU to the intended
value of 128.

With respect to MTU grow events, this change only touches
rxi_SetPeerMtu(), to correct the conditional on whether the MTU update
invalidates/overrides the cached maxPacketSize.
There is a window of values which could cause the old code to
incorrectly fail to invalidate the cached maxPacketSize. Values in this
window could result in the old code being stuck on a bad MTU guess, but
should not cause an actual failure to communicate. That conditional
zeroing of maxPacketSize is the only access to the PacketSize variables
in rxi_SetPeerMtu().

maxPacketSize is also checked in rxi_GrowMTUEvent(), but only against
zero to avoid sending RX_ACK_MTU packets needlessly, so it is
unaffected by the issue at hand.

In summary, in certain network conditions, the old code could fail to
find an optimum MTU size, but would always continue to operate. The new
code is more likely to find a better MTU size, and the old and the new
code should interoperate fine with both each other and themselves.

[kaduk@mit.edu add a few missed cases; expound on analysis in commit
message]

Change-Id: I7494892738d38ffe35bdf1dfd483dbf671cc799a
Reviewed-on: http://gerrit.openafs.org/8111
Reviewed-by: Jeffrey Altman
Tested-by: BuildBot
Reviewed-by: Daria Brashear
---
 src/rx/rx.c      | 23 ++++++++++++-----------
 src/rx/rx_conn.h |  4 ++--
 src/rx/rx_peer.h |  6 +++---
 3 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/src/rx/rx.c b/src/rx/rx.c
index 6c71368..5079a8c 100644
--- a/src/rx/rx.c
+++ b/src/rx/rx.c
@@ -2888,7 +2888,7 @@ rxi_SetPeerMtu(struct rx_peer *peer, afs_uint32 host, afs_uint32 port, int mtu)
     if (peer->ifMTU < OLD_MAX_PACKET_SIZE)
         peer->maxDgramPackets = 1;
     /* We no longer have valid peer packet information */
-    if (peer->maxPacketSize-RX_IPUDP_SIZE > peer->ifMTU)
+    if (peer->maxPacketSize + RX_HEADER_SIZE > peer->ifMTU)
         peer->maxPacketSize = 0;
 
     MUTEX_EXIT(&peer->peer_lock);
@@ -4396,12 +4396,12 @@ rxi_ReceiveAckPacket(struct rx_call *call, struct rx_packet *np,
          * but we are clearly receiving.
          */
         if (!peer->maxPacketSize)
-            peer->maxPacketSize = RX_MIN_PACKET_SIZE+RX_IPUDP_SIZE;
+            peer->maxPacketSize = RX_MIN_PACKET_SIZE - RX_HEADER_SIZE;
 
         if (pktsize > peer->maxPacketSize) {
             peer->maxPacketSize = pktsize;
-            if ((pktsize-RX_IPUDP_SIZE > peer->ifMTU)) {
-                peer->ifMTU=pktsize-RX_IPUDP_SIZE;
+            if ((pktsize + RX_HEADER_SIZE > peer->ifMTU)) {
+                peer->ifMTU = pktsize + RX_HEADER_SIZE;
                 peer->natMTU = rxi_AdjustIfMTU(peer->ifMTU);
                 rxi_ScheduleGrowMTUEvent(call, 1);
             }
@@ -5531,7 +5531,7 @@ rxi_SendAck(struct rx_call *call,
          */
         if (call->conn->peer->maxPacketSize &&
             (call->conn->peer->maxPacketSize < OLD_MAX_PACKET_SIZE
-             +RX_IPUDP_SIZE))
+             - RX_HEADER_SIZE))
             padbytes = call->conn->peer->maxPacketSize+16;
         else
             padbytes = call->conn->peer->maxMTU + 128;
@@ -6471,13 +6471,14 @@ mtuout:
             call->lastReceiveTime) {
             int oldMTU = conn->peer->ifMTU;
 
-            /* if we thought we could send more, perhaps things got worse */
-            if (conn->peer->maxPacketSize > conn->lastPacketSize)
-                /* maxpacketsize will be cleared in rxi_SetPeerMtu */
-                newmtu = MAX(conn->peer->maxPacketSize-RX_IPUDP_SIZE,
-                             conn->lastPacketSize-(128+RX_IPUDP_SIZE));
+            /* If we thought we could send more, perhaps things got worse.
+             * Shrink by 128 bytes and try again. */
+            if (conn->peer->maxPacketSize < conn->lastPacketSize)
+                /* maxPacketSize will be cleared in rxi_SetPeerMtu */
+                newmtu = MAX(conn->peer->maxPacketSize + RX_HEADER_SIZE,
+                             conn->lastPacketSize - 128 + RX_HEADER_SIZE);
             else
-                newmtu = conn->lastPacketSize-(128+RX_IPUDP_SIZE);
+                newmtu = conn->lastPacketSize - 128 + RX_HEADER_SIZE;
 
             /* minimum capped in SetPeerMtu */
             rxi_SetPeerMtu(conn->peer, 0, 0, newmtu);
diff --git a/src/rx/rx_conn.h b/src/rx/rx_conn.h
index 2f3a16e..5007684 100644
--- a/src/rx/rx_conn.h
+++ b/src/rx/rx_conn.h
@@ -42,9 +42,9 @@ struct rx_connection {
                                          * RX_PACKET_TYPE_BUSY packet for this
                                          * call slot, or 0 if the slot is not busy */
     afs_uint32 serial;          /* Next outgoing packet serial number */
-    afs_int32 lastPacketSize;   /* last >max attempt */
+    afs_int32 lastPacketSize;   /* size of last >max attempt, excludes headers */
     afs_int32 lastPacketSizeSeq;        /* seq number of attempt */
-    afs_int32 lastPingSize;     /* last MTU ping attempt */
+    afs_int32 lastPingSize;     /* size of last MTU ping attempt, w/o headers */
     afs_int32 lastPingSizeSer;  /* serial of last MTU ping attempt */
     struct rxevent *challengeEvent;     /* Scheduled when the server is challenging a */
     struct rxevent *delayedAbortEvent;  /* Scheduled to throttle looping client */
diff --git a/src/rx/rx_peer.h b/src/rx/rx_peer.h
index eef3677..a4acd79 100644
--- a/src/rx/rx_peer.h
+++ b/src/rx/rx_peer.h
@@ -27,7 +27,7 @@ struct rx_peer {
     u_short port;               /* Remote UDP port, in net byte order */
 
     /* interface mtu probably used for this host - includes RX Header */
-    u_short ifMTU;              /* doesn't include IP header */
+    u_short ifMTU;              /* doesn't include IP/UDP header */
 
     /* For garbage collection */
     afs_uint32 idleWhen;        /* When the refcount went to zero */
@@ -39,7 +39,7 @@ struct rx_peer {
     int reSends;                /* Total number of retransmissions for this
                                 * peer, since this structure was created */
 
/* the "natural" MTU, excluding IP,UDP headers, is negotiated by the endpoints */
-    u_short natMTU;
+    u_short natMTU;             /* includes rx header */
     u_short maxMTU;
     /* negotiated maximum number of packets to send in a single datagram. */
     u_short maxDgramPackets;
@@ -62,7 +62,7 @@ struct rx_peer {
     afs_uint64 bytesReceived;   /* Number of bytes received from this peer */
     struct opr_queue rpcStats;  /* rpc statistic list */
     int lastReachTime;          /* Last time we verified reachability */
-    afs_int32 maxPacketSize;    /* peer packetsize hint */
+    afs_int32 maxPacketSize;    /* Max size we sent that got acked (w/o hdrs) */
 
 #ifdef AFS_RXERRQ_ENV
     rx_atomic_t neterrs;
-- 
1.9.4