Selective Packet Loss Troubleshooting on International Lines

Background

First, let me explain what I mean by selective (or targeted) packet loss. This is my own label for a class of problem scenarios, not a standard term. An MTU problem is one example: packets that exceed a fixed size limit cannot be transmitted normally, while smaller packets get through.

This article covers one such special packet loss scenario. After a new international line was brought into service, an application developer found that he could not connect to the server. He suspected network packet loss and escalated the problem for investigation.

Case study from SharkFest 2011 “Packet Trace Whispering”

Problem Information

The basic information of the packet trace file is as follows:

λ capinfos Session-I1-Case1-HK.pcap
File name:           Session-I1-Case1-HK.pcap
File type:           Wireshark/tcpdump/... - pcap
File encapsulation:  Ethernet
File timestamp precision:  microseconds (6)
Packet size limit:   file hdr: 1618 bytes
Packet size limit:   inferred: 54 bytes
Number of packets:   133
File size:           11 kB
Data size:           28 kB
Capture duration:    741.679716 seconds
First packet time:   2011-03-16 17:53:09.318379
Last packet time:    2011-03-16 18:05:30.998095
Data byte rate:      39 bytes/s
Data bit rate:       312 bits/s
Average packet size: 218.00 bytes
Average packet rate: 0 packets/s
SHA256:              12e030a7c5abbe16991abf0817091d0d758ce60384cad3666c29a209b457dba8
RIPEMD160:           17f23fb2e4776b7670f3740c2ecfa82b49c0832e
SHA1:                5f645d95ac02c16c8cf31170376bfa4a077ebb59
Strict time order:   True
Number of interfaces in file: 1
Interface #0 info:
                     Encapsulation = Ethernet (1 - ether)
                     Capture length = 1618
                     Time precision = microseconds (6)
                     Time ticks per second = 1000000
                     Number of stat entries = 0
                     Number of packets = 133

The trace file was captured with tcpdump on Linux. It contains only 133 packets, each truncated to 54 bytes. The data size is about 28 kB, the capture lasts a relatively long 741.68 seconds, and the average rate is only 312 bits/s.
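As a quick sanity check, the rates reported by capinfos can be reproduced from the packet count, the average packet size, and the capture duration. The following is a minimal sketch that uses only the numbers from the output above; the variable names are my own, and the results are approximate because capinfos rounds what it prints.

```python
# Values copied from the capinfos output above.
packets = 133
avg_packet_size = 218.00        # bytes (as printed, possibly rounded)
duration = 741.679716           # seconds

data_size = packets * avg_packet_size      # total original bytes
byte_rate = data_size / duration           # bytes per second
bit_rate = byte_rate * 8                   # bits per second
packet_rate = packets / duration           # packets per second

print(f"data size:   {data_size:.0f} bytes")
print(f"byte rate:   {byte_rate:.1f} bytes/s")
print(f"bit rate:    {bit_rate:.1f} bits/s")
print(f"packet rate: {packet_rate:.2f} packets/s")
```

The results land at roughly 39 bytes/s and 312 bits/s, matching the capinfos figures, and the packet rate of well under one packet per second explains why capinfos prints it as 0.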

The conversation statistics show the IP addresses, which appear to have been anonymized, and 4 TCP streams.

The expert information is as follows. There are a few Errors at the protocol parsing level, Warnings such as TCP ACKed unseen segment, and the fairly common (suspected) retransmission and Dup ACK notes. The counts are small, so the packets themselves need further analysis.

Problem Analysis

Expand the packet trace file to see the following packet details:

The conversation statistics show only 4 TCP streams. A quick filter and browse shows that TCP Streams 0-2 are basically normal, with no packet loss or retransmission.

tcp.stream in {0,1,2}

Alternatively, click the black arrow to jump directly to the problem packets. The TCP retransmission and Dup ACK problems are clearly visible, and they all belong to TCP Stream 3.

To analyze TCP Stream 3, start with the TCP three-way handshake information.

The server port is 22, and both ends run SSH protocol version 2.0;
The IRTT is 0.243327 seconds, about 243 ms, which confirms that the client reaches the server over a long international path;
The MSS of both the client and the server is 1460;
The client supports SACK, but the server does not;
The TTL is 122 from the client and 52 from the server, which indicates that the capture point sits somewhere in the middle of the path.
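The TTL observation can be turned into a rough hop estimate. The sketch below assumes that each sender started from one of the common initial TTL values (64, 128, or 255). That is a heuristic the trace itself cannot confirm, and the function name is hypothetical.

```python
COMMON_INITIAL_TTLS = (64, 128, 255)  # assumption: typical operating system defaults

def estimate_hops(observed_ttl):
    """Return (assumed initial TTL, hops travelled before the capture point)."""
    # Pick the smallest common default that is not below the observed value.
    initial = min(t for t in COMMON_INITIAL_TTLS if t >= observed_ttl)
    return initial, initial - observed_ttl

for name, ttl in (("client", 122), ("server", 52)):
    initial, hops = estimate_hops(ttl)
    print(f"{name}: observed TTL {ttl}, assumed initial {initial}, about {hops} hops")
```

Under these assumptions both sides have already crossed several routers before reaching the capture point, which fits a capture taken in the middle of the path rather than on either host.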

Moving on to the TCP retransmissions, the main findings are as follows:

1. After receiving data segment No.112 from the client, the server acknowledges it with ACK No.113. From No.114 to the end of the trace, however, there are only one-way packets from the server;

2. The 7 consecutive data segments sent by the server in No.114 to No.120 were not acknowledged by the client, which suggests packet loss: the client apparently never received them;

3. After that, because the server received no acknowledgment from the client, retransmission timeouts fired. TCP retransmitted 7 times with exponential backoff (No.121-No.125, No.130, No.133), at intervals of about 2.5s, 5s, 10s, 20s, 40s, 64s and 64s. Another unusual detail is that TCP aggregated the retransmission: instead of retransmitting the 7 original data segments separately, it sent them all at once in a single 928-byte TCP segment;

a. This can be verified by the TCP Seq Num, 2401 to 3329.

b. It can also be verified by the TCP Len: 88+212+68+308+100+100+52=928, which equals 982 - 54 (14-byte Ethernet II header + 20-byte IP header + 20-byte TCP header).

c. This timeout retransmission behavior is quite unusual. It is not clear whether it comes from a particular retransmission algorithm or is specific to a certain kernel version. If anyone knows, please let me know. Thank you.

4. Of course, data is not retransmitted indefinitely. Once a certain number of retransmissions has been reached without any acknowledgment, TCP concludes that something is wrong with the network or the peer, forcibly closes the connection, and notifies the upper-layer application that the communication was terminated abnormally;

5. There are also TCP ACKed unseen segment warnings. Comparing ACK Num 1313 with 1312 suggests that a FIN sent by the client was acknowledged. The subsequent Dup ACKs show the same thing, so the trace file appears to be missing some packets from the client.
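The numbers in item 3 can be cross-checked with a few lines of arithmetic. The sketch below models the backoff as a timeout that doubles after every attempt up to a ceiling. The starting value of 2.5 s and the ceiling of 64 s are simply read off the trace, not taken from any kernel documentation, and the function name is hypothetical.

```python
def backoff_intervals(first_rto, retries, cap):
    """Double the retransmission timeout after each attempt, up to a cap."""
    intervals = []
    rto = first_rto
    for _ in range(retries):
        intervals.append(min(rto, cap))
        rto *= 2
    return intervals

# Assumptions read off the trace: first interval about 2.5 s, ceiling about 64 s.
observed = [2.5, 5, 10, 20, 40, 64, 64]
modelled = backoff_intervals(first_rto=2.5, retries=7, cap=64)
print("backoff model matches trace:", modelled == observed)

# Cross-check of the aggregated retransmission.
segment_lengths = [88, 212, 68, 308, 100, 100, 52]   # TCP Len of No.114 to No.120
headers = 14 + 20 + 20                               # Ethernet II + IP + TCP, in bytes
frame_length = 982                                   # length of the retransmitted frame
seq_start, seq_end = 2401, 3329                      # sequence range from item 3a

print("sum of original segments:", sum(segment_lengths))
print("retransmitted payload:   ", frame_length - headers)
print("sequence number span:    ", seq_end - seq_start)
```

All three length calculations give 928 bytes, which supports the reading that the single retransmitted segment carries the payload of all 7 original segments.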

So what is the actual cause of the packet loss? Comparing the packet fields one by one reveals the following root cause:

1. The server's data segment No.111 carries the default DSCP value 0, as do its ACK No.113 and all server-side packets before No.111, and during that time the two-way exchange is normal;

2. However, starting from No.114, the DSCP value of every server-side packet becomes 4, which is shown as Unknown, that is, an undefined code point.
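For readers who want to check the field by hand: DSCP occupies the upper 6 bits of the IPv4 ToS (Differentiated Services) byte, and the lower 2 bits are ECN. The sketch below shows the bit arithmetic. It assumes the ECN bits are 0, which the article does not state, and the helper names are hypothetical.

```python
def split_tos(tos_byte):
    """Split the IPv4 ToS / DS byte into (DSCP, ECN)."""
    return tos_byte >> 2, tos_byte & 0b11

def tos_from_dscp(dscp, ecn=0):
    """Rebuild the raw byte from a DSCP value and ECN bits (ECN assumed 0)."""
    return (dscp << 2) | ecn

print(split_tos(0x00))          # packets up to No.113: DSCP 0
print(hex(tos_from_dscp(4)))    # raw byte for DSCP 4 with ECN 0
print(split_tos(0x10))          # and back again
```

With ECN at 0, DSCP 4 corresponds to a raw byte of 0x10. That happens to be the value the older ToS definition (RFC 1349) used for "minimize delay", which may or may not be related to this case. In Wireshark, the display filter ip.dsfield.dscp == 4 isolates the affected packets.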

Final explanation:

Going back to the initial problem description: because these packets cross international lines end to end, either the source server (unlikely, in my opinion) or an intermediate carrier device (much more likely) modified the DSCP value of some packets. Further along the path, packets with this DSCP value did not match normal forwarding and were dropped. The retransmissions carried the same DSCP value and were lost as well, so the connection could not recover. Therefore, from the client's perspective, the server could not be reached.

Summary of the problem

An uncommon problem, a strange symptom, and an unexpected root cause. But anything is possible, right?
