Pages

Tuesday, August 7, 2012

TCP (Transmission Control Protocol)

TCP is a transport layer protocol used by applications that require guaranteed delivery. It is a sliding window protocol that provides handling for both timeouts and retransmissions.
TCP establishes a full duplex virtual connection between two endpoints. Each endpoint is defined by an IP address and a TCP port number. The operation of TCP is implemented as a finite state machine.
The byte stream is transferred in segments. The window size determines the number of bytes of data that can be sent before an acknowledgment from the receiver is necessary.

MAC headerIP headerTCP headerData :::

TCP header:

0001020304050607080910111213141516171819202122232425262728293031
Source PortDestination Port
Sequence Number
Acknowledgment Number
Data OffsetreservedECNControl BitsWindow
ChecksumUrgent Pointer
Options and padding :::
Data :::

Source Port. 16 bits.

Destination Port. 16 bits.

Sequence Number. 32 bits.
The sequence number of the first data byte in this segment. If the SYN bit is set, the sequence number is the initial sequence number and the first data byte is initial sequence number + 1.

Acknowledgment Number. 32 bits.
If the ACK bit is set, this field contains the value of the next sequence number the sender of the segment is expecting to receive. Once a connection is established this is always sent.

Data Offset. 4 bits.
The number of 32-bit words in the TCP header. This indicates where the data begins. The length of the TCP header is always a multiple of 32 bits.

reserved. 3 bits.
Must be cleared to zero.

ECN, Explicit Congestion Notification. 3 bits.
Added in RFC 3168.
000102
NCE
N, NS, Nonce Sum. 1 bit.
Added in RFC 3540. This is an optional field added to ECN intended to protect against accidental or malicious concealment of marked packets from the TCP sender.
C, CWR. 1 bit.
E, ECE, ECN-Echo. 1 bit.
Control Bits. 6 bits.
000102030405
UAPRSF
U, URG. 1 bit.
Urgent pointer valid flag.
A, ACK. 1 bit.
Acknowledgment number valid flag.
P, PSH. 1 bit.
Push flag.
R, RST. 1 bit.
Reset connection flag.
S, SYN. 1 bit.
Synchronize sequence numbers flag.
F, FIN. 1 bit.
End of data flag.
Window. 16 bits, unsigned.
The number of data bytes beginning with the one indicated in the acknowledgment field which the sender of this segment is willing to accept.

Checksum. 16 bits.
This is computed as the 16-bit one's complement of the one's complement sum of a pseudo header of information from the IP header, the TCP header, and the data, padded as needed with zero bytes at the end to make a multiple of two bytes. The pseudo header contains the following fields:

0001020304050607080910111213141516171819202122232425262728293031
Source IP address
Destination IP address
0IP ProtocolTotal length

Urgent Pointer. 16 bits, unsigned.
If the URG bit is set, this field points to the sequence number of the last byte in a sequence of urgent data.

Options. 0 to 40 bytes.
Options occupy space at the end of the TCP header. All options are included in the checksum. An option may begin on any byte boundary. The TCP header must be padded with zeros to make the header length a multiple of 32 bits.

KindLengthDescriptionReferences
01End of option list.RFC 793
11No operation.RFC 793
24MSS, Maximum Segment Size.RFC 793
33WSOPT, Window scale factor.RFC 1323
42SACK permitted.RFC 2018
5Variable.SACK.RFC 2018RFC 2883
66Echo. (obsolete).RFC 1072
76Echo reply. (obsolete).RFC 1072
810TSOPT, Timestamp.RFC 1323
92Partial Order Connection permitted.RFC 1693
103Partial Order service profile.RFC 1693
116CC, Connection Count.RFC 1644
126CC.NEWRFC 1644
136CC.ECHORFC 1644
143Alternate checksum request.RFC 1146
15Variable.Alternate checksum data.RFC 1146
16Skeeter.
17Bubba.
183Trailer Checksum Option.
1918MD5 signature.RFC 2385
20SCPS Capabilities.
21Selective Negative Acknowledgements.
22Record Boundaries.
23Corruption experienced.
24SNAP.
25
26TCP Compression Filter.
278Quick-Start Response.RFC 4782
284User Timeout.RFC 5482
29TCP-AO, TCP Authentication Option.RFC 5925
30MPTCP
31
-
252
253RFC3692-style Experiment 1.RFC 4727
254RFC3692-style Experiment 2.RFC 4727
255

Data. Variable length.

TCP State machine:
StateDescription
CLOSE-WAITWaits for a connection termination request from the remote host.
CLOSEDRepresents no connection state at all.
CLOSINGWaits for a connection termination request acknowledgment from the remote host.
ESTABLISHEDRepresents an open connection, data received can be delivered to the user. The normal state for the data transfer phase of the connection.
FIN-WAIT-1
FIN-WAIT-2Waits for a connection termination request from the remote host.
LAST-ACKWaits for an acknowledgment of the connection termination request previously sent to the remote host (which includes an acknowledgment of its connection termination request).
LISTENWaits for a connection request from any remote TCP and port.
SYN-RECEIVEDWaits for a confirming connection request acknowledgment after having both received and sent a connection request.
SYN-SENTWaits for a matching connection request after having sent a connection request.
TIME-WAITWaits for enough time to pass to be sure the remote host received the acknowledgment of its connection termination request.


The CLOSED state is the entry point to the TCP state machine.

Connection establishment

To establish a connection, TCP uses a three-way handshake. Before a client attempts to connect with a server, the server must first bind to a port to open it up for connections: this is called a passive open. Once the passive open is established, a client may initiate an active open. To establish a connection, the three-way (or 3-step) handshake occurs:
  1. SYN: The active open is performed by the client sending an SYN to the server. The client sets the segment's sequence number to a random value A.
  2. SYN-ACK: In response, the server replies with an SYN-ACK. The acknowledgment number is set to one more than the received sequence number (A + 1), and the sequence number that the server chooses for the packet is another random number, B.
  3. ACK: Finally, the client sends an ACK back to the server. The sequence number is set to the received acknowledgment value i.e. A + 1, and the acknowledgment number is set to one more than the received sequence number i.e. B + 1.
At this point, both the client and server have received an acknowledgment of the connection.

[edit]Connection termination


Connection termination
The connection termination phase uses, at most, a four-way handshake, with each side of the connection terminating independently. When an endpoint wishes to stop its half of the connection, it transmits a FIN packet, which the other end acknowledges with an ACK. Therefore, a typical tear-down requires a pair of FIN and ACK segments from each TCP endpoint. After both FIN/ACK exchanges are concluded, the terminating side waits for a timeout before finally closing the connection, during which time the local port is unavailable for new connections; this prevents confusion due to delayed packets being delivered during subsequent connections.
A connection can be "half-open", in which case one side has terminated its end, but the other has not. The side that has terminated can no longer send any data into the connection, but the other side can. The terminating side should continue reading the data until the other side terminates as well.
It is also possible to terminate the connection by a 3-way handshake, when host A sends a FIN and host B replies with a FIN &; ACK (merely combines 2 steps into one) and host A replies with an ACK. This is perhaps the most common method.
It is possible for both hosts to send FINs simultaneously then both just have to ACK. This could possibly be considered a 2-way handshake since the FIN/ACK sequence is done in parallel for both directions.
Some host TCP stacks may implement a half-duplex close sequence, as Linux or HP-UX do. If such a host actively closes a connection but still has not read all the incoming data the stack already received from the link, this host sends a RST instead of a FIN (Section 4.2.2.13 in RFC 1122). This allows a TCP application to be sure the remote application has read all the data the former sent—waiting the FIN from the remote side, when it actively closes the connection. However, the remote TCP stack cannot distinguish between connection Aborting RST and this Data Loss RST. Both cause the remote stack to throw away all the data it received, but that the application still didn't read

Glossary:

ABC, Appropriate Byte Counting.
Congestion control algorithm. A modification to the algorithm for increasing TCP's congestion window (cwnd) that improves both performance and security. Rather than increasing a TCP's congestion window based on the number of acknowledgments (ACKs) that arrive at the data sender, the congestion window is increased based on the number of bytes acknowledged by the arriving ACKs. The algorithm improves performance by mitigating the impact of delayed ACKs on the growth of cwnd. At the same time, the algorithm provides cwnd growth in direct relation to the probed capacity of a network path, therefore providing a more measured response to ACKs that cover only small amounts of data (less than a full segment size) than ACK counting. This more appropriate cwnd growth can improve both performance and can prevent inappropriate cwnd growth in response to a misbehaving receiver. On the other hand, in some cases the modified cwnd growth algorithm causes larger bursts of segments to be sent into the network. In some cases this can lead to a non-negligible increase in the drop rate and reduced performance.
active open.

AIMD, Additive Increase, Multiplicative Decrease.
Congestion control algorithm. (RFC 2914) In the absence of congestion, the TCP sender increases its congestion window by at most one packet per roundtrip time. In response to a congestion indication, the TCP sender decreases its congestion window by half. More precisely, the new congestion window is half of the minimum of the congestion window and the receiver's advertised window.

Congestion Avoidance.
Congestion control algorithm.

Connection.
A logical communication path identified by a pair of endpoints.

cwnd, congestion window.
TCP state variable. This variable limits the amount of data a TCP can send. At any given time, a TCP MUST NOT send data with a sequence number higher than the sum of the highest acknowledged sequence number and the minimum of cwnd and rwnd.
TCP uses two algorithms for increasing the congestion window. During steady-state, TCP uses the Congestion Avoidance algorithm to linearly increase the value of cwnd. At the beginning of a transfer, after a retransmission timeout or after a long idle period (in some implementations), TCP uses the Slow Start algorithm to increase cwnd exponentially. Slow Start bases the cwnd increase on the number of incoming acknowledgments. During congestion avoidance RFC 2581 allows more latitude in increasing cwnd, but traditionally implementations have based the increase on the number of arriving ACKs.

CWV, Congestion Window Validation. Algorithm.
This algorithm limits the amount of unused cwnd a TCP connection can accumulate. ABC can be used in conjunction with CWV to obtain an accurate measure of the network path.
Eifel. Algorithm.
(RFC 3522) This algorithm allows a TCP sender to detect a posteriori whether it has entered loss recovery unnecessarily. It requires that the TCP Timestamp option is enabled for a connection. Eifel makes use of the fact that the TCP Timestamp option eliminates the retransmission ambiguity in TCP. Based on the timestamp of the first acceptable ACK that arrives during loss recovery, it decides whether loss recovery was entered unnecessarily. The Eifel detection algorithm provides a basis for future TCP enhancements. This includes response algorithms to back out of loss recovery by restoring a TCP sender's congestion control state.

Fast Recovery. Congestion control algorithm.
A sender invokes the Fast Recovery after Fast Retransmit. This algorithm allows the sender to transmit at half its previous rate (regulating the growth of its window based on congestion avoidance), rather than having to begin a Slow Start. This also saves time.

Fast Retransmit. Congestion control algorithm.
(RFC 2757) When a TCP sender receives several duplicate ACKs, fast retransmit allows it to infer that a segment was lost. The sender retransmits what it considers to be this lost segment without waiting for the full timeout, thus saving time.

flight size.
The amount of data that has been sent but not yet acknowledged.

full sized segment.
A segment that contains the maximum number of data bytes permitted.

IW, Initial Window.
The size of the sender's congestion window after the three-way handshake is completed.

LFN, Long Fat Network.
A communications path with a large bandwidth * delay product.

LW, Loss Window.
The size of the congestion window after a TCP sender detects loss using its retransmission timer.

MSL, Maximum Segment Lifetime.
The maximum time in seconds that a segment may be held before being discarded.

MSS, Maximum Segment Size.
When IPv4 is used as the network protocol, the MSS is calculated as the maximum size of an IPv4 datagram minus 40 bytes.
When IPv6 is used as the network protcol, the MSS is calculated as the maximum packet size minus 60 bytes. An MSS of 65535 should be interpreted as infinity.

passive open.
PAWS, Protect Against Wrapped Sequences.
A mechanism to reject old duplicate segments that might corrupt an open TCP connection. PAWS uses the same TCP timestamp option as the RTTM mechanism and assumes that every received TCP segment (including data and ACK segments) contains a timestamp whose values are monotone non-decreasing in time. The basic idea is that a segment can be discarded as an old duplicate if it is received with a timestamp less than some timestamp recently received on this connection.

RMSS, Receiver Maximum Segment Size.
The size of the largest segment the receiver is willing to accept. This is the value specified in the MSS option sent by the receiver during connection startup. Or, if the MSS option is not used, 536 bytes. The size does not include the TCP headers and options.

RTT, Round trip time.
RTTM, Round-Trip Time Measurement.
A technique for measuring the RTT by use of timestamps. The data segments are timestamped using the TSOPT option. The resulting ACK packets contain timestamps from the receiver. The resulting RTT can then be determined by the difference in the timestamps.

RW, Restart Window.
The size of the congestion window after a TCP restarts transmission after an idle period.

rwmd, Receiver Window. TCP state variable.
The most recently advertised receiver window.

SACK, Selective Acknowledgement. Algorithm.
This technique allows the data receiver to inform the sender about all segments that have arrived successfully, so the sender need retransmit only the segments that have actually been lost. This extension uses two TCP options. The first is an enabling option, SACK permitted, which may be sent in a SYN segment to indicate that the SACK option can be used once the connection is established. The other is the SACK option itself, which may be sent over an established connection once permission has been given.

segment.
A TCP data or acknowledgment packet.

Slow Start. Congestion control algorithm.
This algorithm is used to gradually increase the size of the TCP congestion window. It operates by observing that the rate at which new packets should be injected into the network is the rate at which the acknowledgments are returned by the other end.

SMSS, Sender Maximum Segment Size.
The size of the largest segment that the sender can transmit. This value can be based on the maximum transmission unit of the network, the path MTU discovery algorithm, RMSS, or other factors. The size does not include the TCP headers and options.

SWS, Silly Window Syndrome.
TFRC, TCP Friendly Rate Control. Algorithm.
A congestion control mechanism for unicast flows operating in a best effort Internet environment. It is reasonably fair when competing for bandwidth with TCP flows, but has a much lower variation of throughput over time compared with TCP, making it more suitable for applications such as telephony or streaming media where a relatively smooth sending rate is of importance. TFRC is designed for applications that use a fixed packet size and vary their sending rate in packets per second in response to congestion.

Congestion window

In Transmission Control Protocol (TCP), the congestion window is one of the factors that determines the number of bytes that can be outstanding at any time. This is not to be confused with the TCP window size which is maintained by the receiver. This is a means of stopping the link between two places from getting overloaded with too much traffic. The size of this window is calculated by estimating how much congestion there is between the two places. The sender maintains the congestion window.
When a connection is set up, the congestion window is set to the maximum segment size (MSS) allowed on that connection. Further variance in the congestion window is dictated by an Additive Increase/Multiplicative Decrease approach.
This means that if all segments are received and the acknowledgments reach the sender on time, some constant is added to the window size. The window keeps growing exponentially until a timeout occurs or the receiver reaches its limit (a threshold value "ssthresh"). After this, the congestion window increases linearly at the rate of 1/(congestion window)packets on each new acknowledgment received.
On timeout:
  1. Congestion window is reset to 1 MSS
  2. "ssthresh" is set to half the window size before packet loss started
  3. "slow start" is initiated.

system administrator may adjust the maximum window size limit, or adjust the constant added during additive increase, as part of TCP tuning.
The flow of data over a TCP connection is also controlled by the use of the receiver advertised TCP Receive Window. By comparing its own congestion window with the receive window of the receiver, a sender can determine how much data it may send at any given time.

Maximum segment size

The maximum segment size (MSS) is a parameter of the TCP protocol that specifies the largest amount of data, specified in octets, that a computer or communications device can receive in a single TCP segment, and therefore in a single IP datagram. It does not count the TCP header or the IP header. The IP datagram containing a TCP segment may be self-contained within a single packet, or it may be reconstructed from several fragmented pieces; either way, the MSS limit applies to the total amount of data contained within the final reconstructed TCP segment.

Therefore: Headers + MSS ≤ MTU

The Minimum MSS = Largest datagram size that any host is required to be able to reassemble - IP header size - TCP header size
So every IPv4 host is required to be able to handle an MSS of at least 536 octets (= 576 - 20 - 20)
and every IPv6 host is required to be able to handle an MSS of at least 1220 octets (= 1280 - 40 - 20).[2]

For most computer users, the MSS option is established by operating system on the SYN packet during the TCP handshake. Each direction of data flow can use a different MSS.

Slow-start

Slow-start is one of the algorithms that TCP uses to control congestion inside the network. It is also known as the exponential growth phase.
During the exponential growth phase, slow-start works by increasing the TCP congestion window each time the acknowledgment is received. It increases the window size by the number of segments acknowledged. This happens until either an acknowledgment is not received for some segment or a predetermined threshold value is reached. If a loss event occurs, TCP assumes that it is due to network congestion and takes steps to reduce the offered load on the network. Once a loss event has occurred or the threshold has been reached, TCP enters the linear growth (congestion avoidance) phase. At this point, the window is increased by 1 segment for each RTT. This happens until a loss event occurs.

Although the strategy is referred to as "slow-start", its congestion window growth is quite aggressive, more aggressive than the congestion avoidance phase (Jacobson, 1988). Before "slow start" was introduced in TCP, the initial pre-congestion avoidance phase was even faster.
The TCP congestion avoidance algorithm is the primary basis for congestion control in the Internet
To avoid congestion collapse, TCP uses a multi-faceted congestion control strategy. For each connection, TCP maintains a congestion window, limiting the total number of unacknowledged packets that may be in transit end-to-end. This is somewhat analogous to TCP's sliding window used for flow control. TCP uses a mechanism called slow start[7]to increase the congestion window after a connection is initialized and after a timeout. It starts with a window of two times the maximum segment size (MSS). Although the initial rate is low, the rate of increase is very rapid: for every packet acknowledged, the congestion window increases by 1 MSS so that the congestion window effectively doubles for every round trip time (RTT). When the congestion window exceeds a threshold ssthresh the algorithm enters a new state, called congestion avoidance. In some implementations (e.g., Linux), the initial ssthresh is large, and so the first slow start usually ends after a loss. However, ssthresh is updated at the end of each slow start, and will often affect subsequent slow starts triggered by timeouts.
Congestion avoidance: As long as non-duplicate ACKs are received, the congestion window is additively increased by one MSS every round trip time. When a packet is lost, the likelihood of duplicate ACKs being received is very high (it's possible though unlikely that the stream just underwent extreme packet reordering, which would also prompt duplicate ACKs). The behavior of Tahoe and Reno differ in how they detect and react to packet loss:
  • Tahoe: Triple duplicate ACKS are treated the same as a timeout. Tahoe will perform "fast retransmit", reduce congestion window to 1 MSS, and reset to slow-start state.
  • Reno: If three duplicate ACKs are received (i.e., four ACKs acknowledging the same packet, which are not piggybacked on data, and do not change the receiver's advertised window), Reno will halve the congestion window, perform a fast retransmit, and enter a phase called Fast Recovery. If an ACK times out, slow start is used as it is with Tahoe.
Fast Recovery. (Reno Only) In this state, TCP retransmits the missing packet that was signaled by three duplicate ACKs, and waits for an acknowledgment of the entire transmit window before returning to congestion avoidance. If there is no acknowledgment, TCP Reno experiences a timeout and enters the slow-start state.
Both algorithms reduce congestion window to 1 MSS on a timeout event.

How does TCP try to avoid network meltdown?
TCP includes several mechanisms that attempt to sustain good data transfer rates while avoiding placing excessive load on the network. TCP's "Slow Start", "Congestion Avoidance", "Fast Retransmit" and "Fast Recovery" algorithms are summarised in RFC 2001. TCP also mandates an algorithm that avoids "Silly Window Syndrome" (SWS), an undesirable condition that results in very small chunks of data being transferred between sender and receiver. SWS Avoidance is discussed in RFC 813. The "Nagle Algorithm", which prevents the sending side of TCP from flooding the network with a train of small frames, is described in RFC 896.

1 comment: