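When TCP sends data it starts a timer; if the ACK does not arrive in time, a retransmission is triggered. The interval the timer is set to is called the RTO. The original wording is a little awkward: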

The timeout occurs after an interval called the retransmission timeout (RTO)

### Simple Timeout and Retransmission Example

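net.ipv4.tcp_retries1: 3

net.ipv4.tcp_retries2: 15, roughly 13-30 minutes

net.ipv4.tcp_syn_retries: 5, the number of SYN retransmissions by the side doing the active open

net.ipv4.tcp_synack_retries: 5, the number of SYN+ACK retransmissions by the side doing the passive open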
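
To get a feel for what these retry counts mean in wall-clock time, the sketch below sums the exponentially backed-off timeouts. It assumes the timeout doubles after every retransmission up to a cap. The 200 ms starting RTO, the 1 s initial RTO for SYNs, and the helper name `total_timeout` are assumptions made for illustration; the 120 s cap is the TCP_RTO_MAX value quoted later in these notes.

```python
def total_timeout(retries, initial_rto=0.2, rto_max=120.0):
    """Seconds until TCP gives up after `retries` retransmissions.

    The original transmission and each retransmission wait one timeout;
    each timeout is double the previous one, capped at rto_max.
    """
    return sum(min(initial_rto * 2 ** i, rto_max) for i in range(retries + 1))


retries2_seconds = total_timeout(15)              # tcp_retries2 = 15
syn_seconds = total_timeout(5, initial_rto=1.0)   # tcp_syn_retries = 5, assumed 1 s initial RTO
print(round(retries2_seconds / 60, 1), "minutes")
print(syn_seconds, "seconds")
```

Under these assumptions tcp_retries2 = 15 works out to a little over 15 minutes, which sits inside the 13-30 minute range noted above. A connection with a larger measured RTO starts from a larger timeout and takes longer to give up.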

### Setting the Retransmission Timeout (RTO)

#### The Classic Method

$$SRTT = \alpha (SRTT)+(1-\alpha)RTT_{s}$$
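SRTT is updated from its previous value and the newly obtained RTT sample. The recommended value for the constant $\alpha$ is 0.8 to 0.9. With this method (assuming $\alpha$ = 0.8), 80% of the new SRTT comes from its previous value and 20% from the RTT sample. This kind of average is also called an exponentially weighted moving average (EWMA) or a low-pass filter.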

$$RTO=min(ubound,max(lbound,(SRTT)\beta))$$

It generally results in the RTO being set either to 1s, or to about twice SRTT.
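
A minimal sketch of the classic method, combining the two formulas above. The default values for `beta`, `lbound`, and `ubound` are assumptions (RFC 793 offers 1.3 to 2.0 for β, 1 second for the lower bound, and 1 minute for the upper bound as examples), and the function name is made up for illustration.

```python
def classic_update(srtt, rtt_sample, alpha=0.8, beta=2.0, lbound=1.0, ubound=60.0):
    """One step of the classic estimator; all times are in seconds."""
    srtt = alpha * srtt + (1 - alpha) * rtt_sample   # EWMA of the RTT
    rto = min(ubound, max(lbound, beta * srtt))      # clamp beta * SRTT
    return srtt, rto


small_srtt, small_rto = classic_update(srtt=0.100, rtt_sample=0.200)
large_srtt, large_rto = classic_update(srtt=2.0, rtt_sample=2.0)
print(small_srtt, small_rto)
print(large_srtt, large_rto)
```

The two calls show the behaviour described in the quote: with a small SRTT the lower bound wins and the RTO is 1 s, and with a large SRTT the RTO is about twice SRTT.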

#### The Standard Method

that the timer specified by [RFC0793] cannot keep up with wide fluctuations in the RTT (and in particular, it causes unnecessary retransmissions when the real RTT is much larger than expected).

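M: the RTT sample

g: 1/8, i.e. $2^{-3}$

h: 1/4, i.e. $2^{-2}$

PS: the initial values of srtt and rttvar are covered below.

g and h are chosen as negative powers of 2 because that makes the computation simpler: multiplying by 1/4 is the same as shifting right by 2 bits.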
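
The update equations themselves are not written out in these notes. The sketch below follows RFC 6298 as I understand it: rttvar ← (1 − h)·rttvar + h·|srtt − M|, then srtt ← (1 − g)·srtt + g·M, and RTO = srtt + 4·rttvar. The function names are hypothetical, and the second function is only there to illustrate the shift trick on integer clock ticks.

```python
G_GAIN = 1 / 8   # g
H_GAIN = 1 / 4   # h


def standard_update(srtt, rttvar, m):
    """Fold one RTT sample m into (srtt, rttvar) and return the new RTO."""
    # rttvar is updated first, using the srtt from before this sample.
    rttvar = (1 - H_GAIN) * rttvar + H_GAIN * abs(srtt - m)
    srtt = (1 - G_GAIN) * srtt + G_GAIN * m
    return srtt, rttvar, srtt + 4 * rttvar


def standard_update_shift(srtt, rttvar, m):
    """Same update on integer ticks, with shifts replacing the multiplications."""
    rttvar += (abs(srtt - m) - rttvar) >> 2   # h = 1/4
    srtt += (m - srtt) >> 3                   # g = 1/8
    return srtt, rttvar, srtt + 4 * rttvar


float_result = standard_update(100, 50, 120)
shift_result = standard_update_shift(100, 50, 120)
print(float_result)
print(shift_result)
```

The integer version rounds at each step, so its result differs slightly from the floating-point one; that loss of precision is the price of replacing multiplications with shifts.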

##### Clock Granularity and RTO Bounds

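TCP also keeps its own internal clock. It is not the system clock; it is just a variable that increases as the system clock advances, though not necessarily one-for-one.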

Rather, the TCP clock is usually the value of a variable that is updated as the system clock advances, not necessarily one-for-one

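The length of one "tick" of the TCP clock is called its granularity. Traditionally this value was fairly large (on the order of 500 ms), but in recent Linux it is typically 1 ms. The granularity also affects how the RTO is set.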

In [RFC6298], the granularity is used to refine how updates to the RTO are made.

$$RTO=max(srtt+max(G,4(rttvar)),1000)$$

##### Initial Values

$$srtt=M \\ rttvar=M/2$$
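
A small sketch tying the initial values to the bounded RTO formula from the previous subsection. Times are in milliseconds; the function names and the default granularity of 1 ms are assumptions chosen for illustration.

```python
def first_measurement(m):
    """Initial estimators from the first RTT sample m (milliseconds)."""
    return m, m / 2          # srtt, rttvar


def rto_with_bounds(srtt, rttvar, granularity_ms=1, floor_ms=1000):
    """RTO in ms, using the clock granularity G and the 1 s lower bound."""
    return max(srtt + max(granularity_ms, 4 * rttvar), floor_ms)


srtt0, rttvar0 = first_measurement(16)
first_rto = rto_with_bounds(srtt0, rttvar0)   # small RTT: the 1000 ms floor wins
later_rto = rto_with_bounds(800, 100)         # 800 + 4 * 100 exceeds the floor
print(first_rto, later_rto)
```

On a low-latency path the 1 s floor dominates, so the estimators only start to matter once srtt + 4(rttvar) grows past 1000 ms.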

##### Retransmission Ambiguity and Karn's Algorithm

It happens because unless the Timestamps option is being used, an ACK provides only the ACK number with no indication of which copy (e.g., first or second) of a sequence number is being ACKed

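Karn's algorithm exists to avoid this problem. It has two parts.

Part one of Karn's algorithm: if a segment has been retransmitted, the RTT measured from its ACK must not be used to update the RTT estimators (srtt and rttvar from above).

Part two of Karn's algorithm: if we simply ignored the ACKs of retransmitted segments when computing the RTO, we would also be throwing away what they say about the state of the network. For example, a retransmission suggests that the network may be congested, so the sender should lower its sending rate; this is the rationale for the exponential backoff mentioned earlier. TCP applies a backoff factor to the RTO, doubling the previous timeout on each retransmission until a correct ACK is received, at which point the backoff factor is reset to 1. That is part two of Karn's algorithm. Note that "correct" here means an ACK for data that was sent only once, not an ACK for retransmitted data.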

Note that when TCP times out, it also invokes congestion control procedures that alter its sending rate.

When an acknowledgement arrives for a packet that has been sent more than once (i.e., is retransmitted at least once), ignore any round-trip measurement based on this packet, thus avoiding the retransmission ambiguity problem. In addition, the backed-off RTO for this packet is kept for the next packet. Only when it (or a succeeding packet) is acknowledged without an intervening retransmission will the RTO be recalculated from SRTT. (Quoted directly from the 1987 paper [KP87].)

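Karn's algorithm has been a required part of TCP implementations since RFC 1122. There is one exception: if the TCP Timestamps option is in use, part one of Karn's algorithm can be avoided.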
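
The two parts of Karn's algorithm can be sketched as a small state machine. This is an illustration under simplifying assumptions, not how any particular stack implements it: `base_rto` stands for the RTO computed from srtt and rttvar, and the class and method names are hypothetical.

```python
class KarnTimer:
    """Tracks the backoff factor and decides which RTT samples may be used."""

    def __init__(self, base_rto):
        self.base_rto = base_rto   # RTO computed from srtt/rttvar
        self.backoff = 1           # multiplicative backoff factor
        self.samples = []          # RTT samples accepted for the estimators

    @property
    def rto(self):
        return self.base_rto * self.backoff

    def on_timeout(self):
        # Part two: double the timeout on every timer expiry.
        self.backoff *= 2

    def on_ack(self, rtt_sample, was_retransmitted):
        if was_retransmitted:
            # Part one: ambiguous sample, ignore it and keep the backed-off RTO.
            return
        self.samples.append(rtt_sample)
        self.backoff = 1           # unambiguous ACK ends the backoff


timer = KarnTimer(base_rto=1.0)
timer.on_timeout()
timer.on_timeout()
timer.on_ack(0.3, was_retransmitted=True)
rto_after_ambiguous_ack = timer.rto
timer.on_ack(0.25, was_retransmitted=False)
rto_after_clean_ack = timer.rto
```

After two timeouts the RTO has been doubled twice. The ACK for the retransmitted segment changes nothing, and only the ACK for a segment sent exactly once resets the backoff and contributes a sample.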

##### RTT Measurement (RTTM) with the Timestamps Option

Round Trip Time (RTT) is the length time it takes for a data packet to be sent to a destination plus the time it takes for an acknowledgment of that packet to be received back at the origin. (MDN, RTT)

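The RTT is the time it takes to send a segment plus the time it takes for its ACK to come back.

The timestamp value (TSV) is carried in the first part of the TSOPT and sent out with the SYN; it comes back in the TSER field of the SYN+ACK. These timestamps can be used to set srtt, rttvar, and the RTO. The approach is intuitive, but the receiver does not return an ACK for every segment: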

This seems straightforward enough but is made more complex because TCP does not always provide an ACK for each segment it receives.

For example, TCP often provides one ACK for every other segment (see Chapter 15) when large volumes of data are transferred

In addition, when data is lost, reordered, or successfully retransmitted, the cumulative ACK mechanism of TCP means that there is not necessarily any fixed correspondence between a segment and its ACK.

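1. Every segment the sender transmits carries a 32-bit timestamp in the TSV field, holding the current value of the TCP clock (not the system clock; see the Clock Granularity and RTO Bounds section).
2. The receiving TCP remembers the TSV it will echo in its next ACK (in TsRecent) and the ACK number of the last ACK it sent (in LastACK). Remember that an ACK number is the sequence number of the next segment the receiver expects.
3. If the sequence number of a newly arrived segment matches LastACK, the TSV it carries is stored in TsRecent.
4. Whenever the receiver sends an ACK, it places the current TsRecent in the TSER field.
5. When the sender receives an ACK that advances its window, it subtracts the returned TSER from its current TCP clock to obtain an RTT sample and uses that sample to update the RTT estimators.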
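
The five steps above can be sketched as follows. This is a simplified model: the receiver only handles in-order data, timestamps are plain integers in TCP clock ticks, and the class and function names are hypothetical.

```python
class TimestampReceiver:
    def __init__(self, initial_seq):
        self.next_expected = initial_seq   # next in-order sequence number
        self.last_ack = initial_seq        # LastACK: ACK number last sent
        self.ts_recent = 0                 # TsRecent: TSV to echo next

    def on_segment(self, seq, length, tsv):
        if seq == self.last_ack:           # step 3
            self.ts_recent = tsv
        if seq == self.next_expected:      # in-order data advances the ACK point
            self.next_expected += length

    def make_ack(self):
        self.last_ack = self.next_expected
        return self.next_expected, self.ts_recent   # (ACK number, TSER), step 4


def rtt_sample(sender_clock_now, tser):
    return sender_clock_now - tser         # step 5


receiver = TimestampReceiver(initial_seq=1)
receiver.on_segment(seq=1, length=1400, tsv=100)
receiver.on_segment(seq=1401, length=1400, tsv=105)
ack_number, tser = receiver.make_ack()     # one delayed ACK covers both segments
sample = rtt_sample(sender_clock_now=130, tser=tser)
```

When one ACK covers two segments, the echoed timestamp is that of the earlier segment, so the sample includes the time the receiver spent waiting before it acknowledged. Note also that the sender in this sketch keeps no per-segment send times; everything it needs comes back in the ACK.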

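PS: At first I wondered why it has to be this complicated. Wouldn't it be enough to keep a temporary variable locally, read the clock again when the ACK arrives, and subtract?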

#### The Linux Method

The combination of frequent measurements of the RTT and the fine-grain clock contributes to a more accurate estimate of the RTT but also tends to minimize the value of rttvar over time.

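Besides the two variables already introduced, srtt and rttvar, Linux adds two more: mdev and mdev_max.

mdev: computed with the same algorithm the standard method uses for rttvar.

mdev_max: the largest value of mdev seen over the last measured RTT, never allowed to drop below 50 ms. rttvar is kept at least as large as mdev_max. Since the formula above gives RTO = srtt + 4(rttvar) and rttvar is at least mdev_max, the RTO is never less than 200 ms.

rttvar is updated to the value of mdev_max, and the RTO is always set to the sum of srtt and 4(rttvar), capped so that it never exceeds TCP_RTO_MAX, which defaults to 120 s.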

Linux updates rttvar to the value of mdev_max whenever the maximum increases. It always sets the RTO to be the sum of srtt and 4(rttvar) and ensures that the RTO never exceeds TCP_RTO_MAX, which defaults to 120s.

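mdev starts out the same way rttvar does in the standard method, as M/2, so with a first sample of M = 16 ms, mdev = 8 ms.

mdev_max is the larger of mdev and its 50 ms lower bound (TCP_RTO_MIN in the book's notation, not TCP_RTO_MAX): mdev_max = max(mdev, TCP_RTO_MIN) = max(8, 50) = 50 ms.

rttvar is set to the value of mdev_max, so rttvar = 50 ms.

RTO = srtt + 4(rttvar) = 16 + 4(50) = 216 ms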
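
The arithmetic of this first-sample example can be checked with a few lines of code. This is only a sketch of the bookkeeping described in these notes, not the kernel's actual code; the constant name `MDEV_FLOOR_MS` is hypothetical and stands for the 50 ms lower bound on mdev_max.

```python
MDEV_FLOOR_MS = 50          # lower bound on mdev_max quoted in the notes
TCP_RTO_MAX_MS = 120_000    # 120 s upper bound on the RTO


def linux_first_sample(m):
    """Estimator state after the first RTT sample m (milliseconds)."""
    srtt = m
    mdev = m / 2
    mdev_max = max(mdev, MDEV_FLOOR_MS)
    rttvar = mdev_max
    rto = min(srtt + 4 * rttvar, TCP_RTO_MAX_MS)
    return srtt, mdev, mdev_max, rttvar, rto


first_state = linux_first_sample(16)
print(first_state)
```

Because rttvar can never fall below the 50 ms floor, the 4(rttvar) term alone contributes at least 200 ms to the RTO, whatever the measured RTT is.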

### Timer-Based Retransmission

The other way is to keep increasing a multiplicative backoff factor applied to the RTO each time a retransmitted segment is again retransmitted. This is implemented in the “second part” of Karn’s algorithm mentioned previously

### Fast Retransmit

As a result, packet loss can often be more quickly and efficiently repaired using fast retransmit than with timer-based retransmission.

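Once the TCP sender has seen dupthresh duplicate ACKs, it retransmits one or more segments without waiting for the retransmission timer to expire. The arrival of duplicate ACKs is also treated as a sign of network congestion. Without SACK, only one missing segment can be retransmitted at a time; with selective acknowledgment (SACK), several missing segments can be retransmitted within one RTT.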

Without SACK, no more than one segment is typically retransmitted until an acceptable ACK is received. With SACK, ACKs contain additional information allowing the sender to fill more than one hole in the receiver per RTT.
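
A minimal sketch of the duplicate-ACK counting that triggers a fast retransmit. It ignores the extra conditions a real TCP checks before counting an ACK as a duplicate (for instance that the ACK carries no data and no window update), and the class name is hypothetical.

```python
class DupAckCounter:
    def __init__(self, dupthresh=3):
        self.dupthresh = dupthresh
        self.highest_ack = None
        self.dup_count = 0

    def on_ack(self, ack):
        """Return the sequence number to fast-retransmit, or None."""
        if ack == self.highest_ack:
            self.dup_count += 1
            if self.dup_count == self.dupthresh:
                return ack          # the segment starting at `ack` looks lost
            return None
        self.highest_ack = ack      # new cumulative ACK: reset the counter
        self.dup_count = 0
        return None


counter = DupAckCounter()
decisions = [counter.on_ack(a) for a in (23801, 23801, 23801, 23801, 26601)]
print(decisions)
```

The first ACK for 23801 is not a duplicate; the three that follow are, and the third duplicate triggers the retransmission of the segment starting at 23801.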

#### Example

Thus, it is not counted toward the three-duplicate-ACK threshold required to initiate a fast retransmit.

TCP is considered to be recovering from loss after a retransmission until it receives an ACK that matches or exceeds the sequence number of the recovery point

When partial ACKs arrive, the sending TCP immediately sends the segment that appears to be missing (26601 in this case)

If permitted by congestion control procedures (see Chapter 16), it may also send new data it has not yet sent

Because no SACKs are being used, the sender can learn of at most one receiver hole per round-trip time

### Retransmission with Selective Acknowledgement

In many circumstances, the properly operating SACK sender is able to fill these holes more quickly and with fewer unnecessary retransmissions than a comparable non-SACK sender because it does not have to wait an entire RTT to learn about additional holes

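When SACK is enabled, a SACK option holds at most 3 or 4 SACK blocks (depending on whether the Timestamps option is in use; see the section on options in the previous chapter). Each SACK block holds two 32-bit sequence numbers: the sequence number of the left edge of an out-of-order block of data, and the sequence number of its right edge plus 1. For example: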

Here we see that the ACK for 23801 contains a SACK block of [25201,26601], indicating a hole at the receiver. The receiver is missing the sequence number range [23801,25200], which corresponds to the single 1400-byte packet starting with sequence number 23801

ACK = 23801 and the first SACK block is [25201,26601], so the sender knows that the data in [23801,25200] is missing and has to be sent again.
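
Two small helpers make the numbers above concrete. The first derives the 3-or-4 block limit from the 40 bytes available for TCP options (a SACK option is 2 bytes plus 8 bytes per block, and a Timestamps option is 10 bytes). The second computes the receiver's holes from the cumulative ACK and the SACK blocks. Both function names are made up for illustration.

```python
def max_sack_blocks(timestamps_in_use):
    option_space = 40                       # bytes available for TCP options
    if timestamps_in_use:
        option_space -= 10                  # Timestamps option: 10 bytes
    return (option_space - 2) // 8          # SACK: 2-byte header + 8 bytes per block


def receiver_holes(ack, sack_blocks):
    """Missing ranges, as inclusive (first, last) sequence numbers."""
    holes = []
    cursor = ack                            # first byte not yet known to be received
    for left, right in sorted(sack_blocks):
        if left > cursor:
            holes.append((cursor, left - 1))
        cursor = max(cursor, right)         # right edge is last byte + 1
    return holes


single_hole = receiver_holes(23801, [(25201, 26601)])
two_holes = receiver_holes(1, [(5601, 7001), (2801, 4201)])
print(single_hole, two_holes)
```

With several SACK blocks the sender learns about several holes from a single ACK, which is what lets it repair more than one loss per RTT.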

If not limited by congestion control (see Chapter 16), all three could be filled within one round-trip time using a SACK-capable sender

Because the space in a SACK option is limited, it is best to ensure that the most recent information is always provided to the sending TCP, if possible.

Other SACK blocks are listed in the order in which they appeared as first blocks in previous SACK options.

### Spurious Timeouts and Retransmissions

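PS: The name "spurious retransmission" looks a bit odd. It is better described as an unnecessary retransmission that happens when a large RTT makes the ACK arrive only after the timer has already fired, so the segment is wrongly treated as lost.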

#### Duplicate SACK Extension

that causes the first SACK block to indicate the sequence numbers of a duplicate segment that has arrived at the receiver

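DSACK is simple to implement and needs few changes on top of plain SACK, but both sides have to change how they interpret the first SACK block. If a non-DSACK TCP receives a SACK block from a DSACK TCP, it will not recognize the duplicate report.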

If a non-DSACK TCP shares a connection with a DSACK TCP, they will interoperate, but without any of the benefits of DSACK.

DSACK information is not repeated across multiple SACKs as conventional SACK information is. As a consequence, DSACKs are less robust to ACK loss than regular SACKs.
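
How a sender might recognize a DSACK can be sketched as below. The two checks follow my reading of RFC 2883: the first SACK block reports a duplicate if it lies below the cumulative ACK, or if it is contained in the second SACK block. The function name and the example sequence numbers are made up.

```python
def is_dsack(ack, sack_blocks):
    """True if the first SACK block reports a duplicate segment."""
    if not sack_blocks:
        return False
    left, right = sack_blocks[0]
    if right <= ack:
        return True                         # block lies below the cumulative ACK
    if len(sack_blocks) > 1:
        left2, right2 = sack_blocks[1]
        if left2 <= left and right <= right2:
            return True                     # block is contained in the second block
    return False


below_ack = is_dsack(4001, [(1001, 2401)])
ordinary_sack = is_dsack(23801, [(25201, 26601)])
inside_second = is_dsack(1001, [(3801, 5201), (2401, 6601)])
```

A sender that does not implement these checks simply sees a SACK block that tells it nothing new, which is why the two kinds of TCP can still interoperate.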

#### The Eifel Detection Algorithm

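The idea behind the Eifel algorithm: after a retransmission, if the next ACK that comes back is an acceptable ACK but was triggered by the original transmission rather than by the retransmission, we can conclude that the retransmission was spurious.

The Eifel algorithm is simple, but it requires the TCP Timestamps option. When a retransmission is sent, the sender records the TSV it carries. When an acceptable ACK comes back, the sender checks its TSER. If the ACK was triggered by the retransmission, the TSER equals the recorded TSV. If not, the ACK was triggered by the earlier, original transmission, so its TSER is smaller than the recorded TSV, and we know the retransmission was spurious.

PS: "Acceptable" here should mean something like this: if the current sequence number is 1401 and 1400 bytes of data are sent, the ACK that comes back should be 2801. Any other ACK does not count as an acceptable ACK.

The advantage of Eifel over DSACK: with DSACK, after retransmitting, the sender has to wait for the receiver to send a DSACK back. Eifel is more proactive; it only has to wait for the next ACK to arrive.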
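
The detection step boils down to one comparison. The sketch below assumes timestamps are plain integers and ignores wraparound of the 32-bit timestamp field, which a real implementation has to handle; the function name is hypothetical.

```python
def eifel_is_spurious(retransmit_tsv, ack_tser):
    """Eifel check on the first acceptable ACK after a retransmission."""
    # An ACK echoing a timestamp older than the retransmission must have
    # been triggered by the original transmission.
    return ack_tser < retransmit_tsv


ack_for_original = eifel_is_spurious(retransmit_tsv=200, ack_tser=150)
ack_for_retransmission = eifel_is_spurious(retransmit_tsv=200, ack_tser=200)
```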

#### Forward-RTO Recovery (F-RTO)

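F-RTO is the standard method for detecting spurious retransmissions. It does not need the Timestamps option, so it also works with older TCPs. However, it only detects spurious retransmissions caused by the retransmission timer expiring; it does not deal with the other causes of spurious retransmissions.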

It attempts to detect only spurious retransmissions caused by expiration of the retransmission timer; it does not deal with the other causes for spurious retransmissions or duplications mentioned before.

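After the timeout-based retransmission, once the first ACK for it arrives, F-RTO sends new data that has not been sent before and waits for another ACK. If the two ACKs differ, the new data moved the receiver's window forward, which means all the earlier data had been received correctly, so the retransmission was spurious. If the returned ACK is the same, the window could not move, meaning there really is a hole in the data, so the retransmission was needed.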

If such data is only causing duplicate ACKs, there must be one or more holes at the receiver.
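
The decision described above can be written as a small classifier. This is a simplified sketch that looks only at the ACK numbers of the two ACKs following the retransmission; the function name and the return labels are made up for illustration.

```python
def frto_verdict(snd_una, first_ack, second_ack):
    """Classify a timeout from the two ACKs that follow the retransmission.

    snd_una is the oldest unacknowledged sequence number when the timer fired.
    """
    if first_ack <= snd_una:
        return "loss"        # first ACK is a duplicate: nothing new was acknowledged
    if second_ack <= first_ack:
        return "loss"        # new data produced only a duplicate ACK: a hole remains
    return "spurious"        # both ACKs advanced the window


both_advance = frto_verdict(snd_una=1001, first_ack=2401, second_ack=5201)
second_is_duplicate = frto_verdict(snd_una=1001, first_ack=2401, second_ack=2401)
```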

#### The Eifel Response Algorithm

Because the response algorithm is logically decoupled from the Eifel Detection Algorithm, it can be used with any of the detection algorithms we just discussed

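The Eifel Response Algorithm is currently specified only for the first kind of retransmission, the timer-based one. It is introduced below.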

$$srttPrev = srtt + 2(G)$$

$$rttvarPrev=rttvar$$

m is the RTT sample taken from the first ACK received after the timeout.
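
The notes record what is saved when the timer fires but not how m is used afterwards. The sketch below fills that in from my reading of RFC 4015, so treat the three update rules as an assumption: srtt ← max(srttPrev, m), rttvar ← max(rttvarPrev, m/2), RTO = srtt + max(G, 4(rttvar)). The function names are hypothetical.

```python
def save_before_timeout(srtt, rttvar, granularity):
    """State remembered when the retransmission timer fires."""
    return srtt + 2 * granularity, rttvar        # srttPrev, rttvarPrev


def eifel_response(srtt_prev, rttvar_prev, m, granularity):
    """Estimators and RTO once the timeout has been judged spurious."""
    srtt = max(srtt_prev, m)
    rttvar = max(rttvar_prev, m / 2)
    rto = srtt + max(granularity, 4 * rttvar)
    return srtt, rttvar, rto


saved = save_before_timeout(srtt=100, rttvar=20, granularity=1)
restored = eifel_response(*saved, m=300, granularity=1)
print(saved, restored)
```

The effect is that the estimators are restored to at least their pre-timeout values and pushed up further if the sample m shows that the path's RTT really has grown.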

### Packet Reordering and Duplication

#### Reordering

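TCP is a transport-layer protocol built on top of IP, and IP does not guarantee reliable delivery. For IP, not guaranteeing the order of datagrams is actually an advantage: because datagrams may take different paths to the destination host, data injected into the network later can arrive ahead of data injected earlier. As a result, the order in which data arrives at the receiver does not match the order in which it was sent.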

cause traffic freshly injected into the network to pass ahead of older traffic, resulting in the order of packet arrivals at the receiver not matching the order of transmission at the sender

This can lead to an unwanted burstiness (instantaneous high-speed sending) behavior in the sending pattern of TCP and also trouble in taking advantage of available network bandwidth, because of the behavior of TCP’s congestion control

Fortunately, severe reordering on the Internet is not common [J03], so setting dupthresh to a relatively small number (such as the default of 3) handles most circumstances.
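
To see why a dupthresh of 3 tolerates mild reordering, the sketch below counts the duplicate ACKs a receiver would generate for a given arrival order. It is a toy model: packets are numbered 1, 2, 3, ..., the receiver ACKs every arrival with the number of the next packet it expects, and the function name is made up.

```python
def max_duplicate_acks(arrival_order):
    """Longest run of duplicate ACKs a receiver sends for the given arrivals."""
    received = set()
    next_expected = 1
    last_ack = None
    run = longest = 0
    for packet in arrival_order:
        received.add(packet)
        while next_expected in received:
            next_expected += 1
        if next_expected == last_ack:
            run += 1                      # same ACK number again: a duplicate ACK
            longest = max(longest, run)
        else:
            run = 0
        last_ack = next_expected
    return longest


mild = max_duplicate_acks([1, 3, 2, 4])        # packet 2 arrives one position late
severe = max_duplicate_acks([1, 3, 4, 5, 2])   # packet 2 arrives three positions late
print(mild, severe)
```

With packet 2 delayed by one position the sender sees a single duplicate ACK and does nothing; delayed by three positions, it sees three duplicate ACKs and would fast-retransmit a packet that was never lost.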

