IPv4 Fragmentation – routingloop.net

One of the attributes that helped IPv4 stand the test of time is the ability to slice and dice big datagrams into multiple smaller ones on the fly. This provides tremendous flexibility for different data links and their maximum packet size, reduced MTUs inside of tunnels, and the like. The process of chopping up IP packets to accommodate smaller maximum packet size is called fragmentation. In this article we will explore how it works and how to interpret packet fragmentation in packet captures.

Demo Network

I opted to use my physical home lab equipment again for this test using the topology shown below. The two switches are only present for SPAN to enable packet captures on both sides of R2. I’m using two docking station NICs on my laptop to take two captures at the same time. R2’s interface toward R3 is configured with IP MTU 1400. This is 100 bytes less than the default, allowing me to send 1500-byte packets from R1 to R3 and experience IP fragmentation.

Fragmentation Overview

I’ve already mentioned MTU but have not defined it. MTU is the maximum transmission unit, the largest packet of data that an interface can transmit onto a link. The default MTU of Ethernet is 1500 octets, excluding the Ethernet header and trailer. The standard Ethernet header and trailer (no 802.1Q shim) add up to 18 octets (a 14-octet header plus a 4-octet FCS), so if we have a 1500-byte IP packet inside of a standard Ethernet frame, the total frame length is 1518 bytes. Many routers will allow you to configure the IP MTU on interfaces. In this lab, R2’s interface toward R3 is configured with IP MTU 1400. This means that the largest IP packet, including its IP header but excluding the data link header, is 1400 bytes long.

There are two basic ways to allow for datagram fragmentation on intermediate nodes (routers etc.). One approach is to allow for fragmentation to take place but have other routers along the way do reassembly. This approach has many drawbacks. The most significant in my opinion is that it requires all fragments to traverse the same router. If a single fragment takes a different path around the reassembly point, the entire packet must be discarded. Another issue is that this process can add delay. With this model, it is feasible for a packet to be fragmented and reassembled many times between source and destination. Each reassembling router would have to wait for all fragments to arrive before reassembling and forwarding the united bytes. Simply forwarding each fragment individually does not incur the delay of having to wait for all fragments to arrive multiple times. It also decreases the load on (and hopefully the cost of) routers.

The other option is to allow routers to do fragmentation as needed and have the receiving host perform reassembly. The creators of IPv4 seem to have put great consideration into fragmentation, and this is the method that was chosen. A packet that has been fragmented can be fragmented again as necessary. Although this decision was made in the early days of packet switching, it aligns with the current philosophy of pushing complexity and state to the edge of the network and keeping the core as simple as possible. It does increase load on the destination host though. If the destination host is a router, reassembly can cause high load and low throughput. Termination of a fragmented GRE tunnel is an example where this could happen.

Fragmentation isn’t free though. It still increases the load on routers performing fragmentation. It also increases overhead because each fragment requires its own data link and IP headers, leaving less bandwidth available for goodput. It can have implications for Quality of Service, because not every fragment contains a transport layer header that can be used to identify traffic. This can cause the fragment containing the transport layer header to be classified and marked as desired while the other fragments are treated as best effort. Fragments have also been used in various attacks.

To make fragmentation possible, the IPv4 header natively includes bits and fields to aid in fragment identification and reassembly. In order for a receiver to reassemble a datagram, the values in the Source IP, Destination IP, Protocol Number, and Identification fields must match across all of its fragments. The Identification field is a 16-bit value in the IP header that allows fragments between the same source and destination host and protocol to be uniquely identified. In the snippet below, the ID field is F in hex, or 15 in decimal.

Shown above are 3 other header values related to fragmentation. The Don’t Fragment bit (aka the DF bit) indicates whether a packet is eligible for fragmentation. DF value 0 allows fragmentation; DF value 1 means the packet should not be fragmented. If the DF bit is set and fragmentation is needed, the packet may be discarded. The DF bit may be set when transmitting to a host that does not have resources to perform reassembly. The next bit is the More Fragments (MF) bit. Value 0 means no further fragments of the same datagram are expected after this packet. Value 1 means more fragments are expected. Imagine a scenario where a packet is fragmented into 3 packets. The first two packets will have the MF bit set to 1, and the 3rd (final) packet will have it set to 0.

The final field related to fragmentation is the Fragment Offset field. This will be 0 in non-fragmented packets and in the first fragment. If you see More Fragments set to 1 and Fragment Offset 0, you are looking at the first fragment of a datagram. The fragment offset describes where the data in the packet starts in relation to the data in the original packet, in units of 8 octets. In a hypothetical scenario where the fragment offset is 1, the data contained in the packet would start at byte offset 8 of the original payload, immediately after the first 8 payload bytes. I know that’s weird; hopefully it will make more sense with a demo later. This 8-byte boundary means that every fragment except the last must carry a payload that is a multiple of 8 bytes. In other words, if the More Fragments bit is on, the payload byte count of the packet should be evenly divisible by 8. This requirement comes from the Fragment Offset field only being 13 bits. The Total Length field of an IPv4 header is 16 bits, allowing for a maximum packet length of 65,535 octets. 2**13 = 8192, and 8192*8 = 65,536. This scaling allows us to address the maximum possible packet length with just 13 bits.
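To make the bit layout concrete, here is a minimal Python sketch that decodes the 16-bit word holding the flags and the Fragment Offset. The function name and the sample values are my own illustrations, not taken from the captures; the masks follow the IPv4 header layout (one reserved bit, DF, MF, then the 13-bit offset).

```python
DF_MASK = 0x4000      # bit 14: Don't Fragment
MF_MASK = 0x2000      # bit 13: More Fragments
OFFSET_MASK = 0x1FFF  # low 13 bits: Fragment Offset, in 8-octet units


def decode_flags_offset(word: int) -> dict:
    """Split the 16-bit flags/offset word of an IPv4 header into its parts."""
    units = word & OFFSET_MASK
    return {
        "df": bool(word & DF_MASK),
        "mf": bool(word & MF_MASK),
        "offset_units": units,
        "offset_bytes": units * 8,  # the field counts 8-octet blocks
    }


# 13 bits give 8192 distinct offsets; times 8 octets covers 65,536 bytes.
print(2 ** 13, 2 ** 13 * 8)

# Illustrative values: a first fragment (MF=1, offset 0) and an offset of 1.
print(decode_flags_offset(0x2000))
print(decode_flags_offset(0x0001))
```

An offset value of 1 decodes to byte offset 8, which is the hypothetical case described above.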

Packet Captures!

With some introduction and theory out of the way, it is time to examine some real fragments. As described previously, the lab I set up will allow us to see packets before and after being transmitted from the 1400-octet MTU of R2’s northside interface.

I started on R1 by sending a ping to R3 with a 1500-byte packet size. Below is the full-size packet as captured on the switch between R1 and R2. Notice that the frame size is 1514 bytes (the capture does not include the 4-byte Ethernet trailer), the IP total length is 1500 bytes, and the IP Header Length is 20 bytes. 1500-byte total length – 20 bytes of IP header leaves us with 1480 bytes of payload. The Identification field value is 15. The TTL is 255 because this packet has not been routed yet.

Next, we can see this packet after being sent out of the 1400-byte interface on R2 and being fragmented into two packets. Below is the 1st fragment. I was surprised to see that the router is sending the smaller of the two fragments first¹. Notice the IP packet total length is 124 bytes. If we subtract 20 bytes of IP header, we’re left with 104 bytes of payload data. The Identification field value is still 15 to uniquely identify this datagram for reassembly. Note that the More Fragments bit is set, and the Fragment Offset is 0. It is noteworthy that this fragment contains the ICMP header. TTL is 254 because it has been routed one hop.

Next is the 2nd and final fragment of the original packet. Notice that the Total Length is 1396. Subtracting 20 bytes of IP header gives us the remaining 1376 bytes of payload data. 104 bytes in the first fragment + 1376 in the second fragment gives us all 1480 bytes we started with before fragmentation. The Identification field is the same as the unfragmented packet and the first fragment, 15 (or F in hex). Notice that the More Fragments bit is 0 with a nonzero Fragment Offset. This is the last expected fragment. The Fragment Offset value displayed is 104. This is Wireshark calculating the decimal byte value for us. Recall that the 1st fragment contained 104 bytes of payload. The Fragment Offset value carried in the packet is actually 0x000d, or 13 in decimal. Remember that the Fragment Offset value is expressed in 8-octet blocks. Wireshark is simply multiplying the actual Fragment Offset value by 8 to advise us of the offset in bytes. To the left of the field name, the binary values are displayed. In this example, only the least significant nibble (4 bits) contains a nonzero value. If you convert the displayed bits to hex, you get value D. It is noteworthy that this packet does not contain an ICMP header, only data that will be appended to the original data and header during reassembly. TTL is 254 after being routed by R2.

If you highlight the Fragment Offset field in Wireshark and view the raw hexadecimal data, you’ll see 000d as the value.
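The arithmetic from this capture can be reproduced with a short Python sketch. This is only my approximation of what a router might do, not R2’s actual algorithm: `plan_fragments` and its `small_first` switch are hypothetical names, and the "remainder first" ordering is an assumption modeled on what the capture shows. It also assumes a 20-byte IP header and an input datagram that is not already a fragment.

```python
def plan_fragments(payload_len, mtu, ihl=20, small_first=True):
    """Return (offset_units, payload_bytes, more_fragments) for each fragment.

    Assumes the input is an unfragmented datagram with an ihl-byte header.
    """
    if payload_len + ihl <= mtu:
        return [(0, payload_len, False)]      # fits, no fragmentation needed
    max_payload = (mtu - ihl) // 8 * 8        # largest multiple of 8 that fits
    if max_payload <= 0:
        raise ValueError("MTU too small to carry any payload")
    full, rest = divmod(payload_len, max_payload)
    sizes = [max_payload] * full
    if rest:
        # The remainder may only go first if it keeps later offsets on an
        # 8-byte boundary; otherwise it has to be the last fragment.
        if small_first and rest % 8 == 0:
            sizes.insert(0, rest)
        else:
            sizes.append(rest)
    plan, offset = [], 0
    for i, size in enumerate(sizes):
        plan.append((offset // 8, size, i < len(sizes) - 1))
        offset += size
    return plan


# 1500-byte packet (1480 bytes of payload) leaving a 1400-byte MTU interface.
for units, size, more in plan_fragments(1480, 1400):
    print(f"offset={units} ({units * 8} bytes) total_len={size + 20} MF={int(more)}")
```

With `small_first=False` the same function yields the larger fragment first (1376 bytes at offset 0, then 104 bytes at offset 172), which is the ordering I had expected before looking at the capture.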

Fragmenting the Fragmented

Datagrams that have already been fragmented can be fragmented further as required by the network. Below is an example. I won’t go through these in as much detail, but I’ll call out the packet sizes and fragment offsets.

Using the same network, I sent a 2000-byte ping from R1 to R3. Because R1’s outgoing interface has a 1500-byte MTU, R1 transmitted fragments. The first fragment has a total length of 1500, so 1480 bytes of payload data. The packet contains an ICMP header.

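The second fragment contains the remaining 500 bytes of payload data. (The 2000-byte ping is 2000 bytes of IP packet, so it carries 1980 bytes of payload after the 20-byte IP header: 1480 + 500.)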

Next, the 1500-byte first fragment hits the R2 output interface with 1400-byte MTU. It is fragmented again. The first fragment contains 104 bytes of payload data, including the ICMP header.

The second fragment contains 1376 bytes of payload data and no ICMP header. The More Fragments flag is set. Fragment Offset is 13, or 104 bytes. After these two fragments arrive, 1480 bytes of payload have been delivered.

The third and final fragment is next. More Fragments flag is off, and the packet contains 500 bytes of payload data. The Fragment Offset value is 185. 185*8=1480. The packet has no ICMP header.
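To see how a receiver could tell that these three pieces form one complete datagram, here is a simplified Python sketch of the bookkeeping. It is an illustration rather than a real IP stack: it ignores overlapping or duplicate fragments and reassembly timers, and the function names, source address, and ID value are hypothetical. The offsets and lengths are the ones called out above.

```python
from collections import defaultdict


def payload_if_complete(pieces):
    """pieces: (offset_units, payload_bytes, more_fragments) for one datagram.

    Return the total payload length if the pieces tile the datagram with no
    gaps and end with an MF=0 fragment, otherwise None.
    """
    expected, saw_last = 0, False
    for offset_units, length, more in sorted(pieces):
        if saw_last or offset_units * 8 != expected:
            return None
        expected += length
        saw_last = not more
    return expected if saw_last else None


def reassemble(fragments):
    """fragments: (key, offset_units, payload_bytes, more_fragments) tuples.

    key is (source IP, destination IP, protocol, Identification); only
    fragments with an identical key are considered together.
    """
    buckets = defaultdict(list)
    for key, offset_units, length, more in fragments:
        buckets[key].append((offset_units, length, more))
    return {key: payload_if_complete(p) for key, p in buckets.items()}


# Hypothetical source address and ID; protocol 1 is ICMP.
KEY = ("10.0.0.1", "172.16.0.1", 1, 16)
arrived = [
    (KEY, 185, 500, False),  # arrival order does not matter here
    (KEY, 0, 104, True),
    (KEY, 13, 1376, True),
]
print(reassemble(arrived)[KEY])  # 104 + 1376 + 500
```

Dropping the middle fragment from `arrived` makes the function return None for that key, because the offset of the last fragment no longer lines up with the bytes received so far.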

The DF Bit in Action

As mentioned earlier, the Don’t Fragment bit indicates when a packet should not be fragmented. If a packet that is not eligible for fragmentation needs to be transmitted on an interface where the MTU is less than the packet length, it will likely be discarded. The router or host that needed to fragment may respond with an ICMP message indicating that fragmentation is needed.

To demonstrate, I attempted to send 1401-byte ICMP Echo packets from R1 to R3 with the DF bit set. Because this is 1 byte larger than the MTU of R2’s output interface, the packets are discarded and R2 responds with an ICMP Destination Unreachable “Fragmentation Needed” message.

RoutingLoop_R1#ping 172.16.0.1 size 1401 df-bit
Type escape sequence to abort.
Sending 5, 1401-byte ICMP Echos to 172.16.0.1, timeout is 2 seconds:
Packet sent with the DF bit set
M.M.M
Success rate is 0 percent (0/5)

The snippets below display the 1401-byte packet coming from R1 and the ICMP Type 3 Code 4 message returned by R2. Notice that the ICMP message provides the MTU of the link where fragmentation was required but not permitted by the DF bit.
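The decision R2 makes here can be summarized in a few lines of Python. This is a sketch of the logic as described in this section, not router code; `forward_decision` and the dictionary keys are names I made up for illustration.

```python
ICMP_DEST_UNREACHABLE = 3   # ICMP Type 3
ICMP_FRAG_NEEDED = 4        # Code 4: fragmentation needed and DF set


def forward_decision(total_len, df, egress_mtu):
    """Decide what happens to an IP packet at an egress interface."""
    if total_len <= egress_mtu:
        return {"action": "forward"}
    if df:
        # Too big and not allowed to fragment: drop and tell the sender
        # which MTU it ran into.
        return {
            "action": "drop",
            "icmp_type": ICMP_DEST_UNREACHABLE,
            "icmp_code": ICMP_FRAG_NEEDED,
            "next_hop_mtu": egress_mtu,
        }
    return {"action": "fragment"}


print(forward_decision(1401, df=True, egress_mtu=1400))
print(forward_decision(1400, df=True, egress_mtu=1400))
print(forward_decision(1500, df=False, egress_mtu=1400))
```

The first call mirrors the ping above: one byte over the 1400-byte MTU with DF set results in a drop plus a Type 3 Code 4 message carrying the MTU.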

When Abstractions Leak

Fragmentation at the IP layer can seem to be invisible to the transport layer. Fragmented IP packets are reassembled at the destination host before the transport layer segment is presented upward for de-encapsulation. This is a form of abstraction. However, if a single IP fragment that is carrying TCP data is lost in transit, the entire segment will be retransmitted and fragmented again. The abstraction is leaking. What should be invisible to upper layers has a direct impact by forcing TCP to retransmit. In lossy networks with fragmentation, this wastes bandwidth: a sizable portion of capacity can be consumed by retransmitting the same data multiple times because a single fragment was lost each time.
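A rough way to put numbers on this is to assume each fragment is lost independently with the same probability. That assumption is mine and real networks often lose packets in bursts, so treat the Python sketch below as a back-of-the-envelope model with hypothetical function names.

```python
def delivery_probability(fragment_count, loss_rate):
    """Chance that every fragment of one datagram arrives (independent loss)."""
    return (1.0 - loss_rate) ** fragment_count


def expected_transmissions(fragment_count, loss_rate):
    """Average number of times the whole segment is sent until one survives."""
    return 1.0 / delivery_probability(fragment_count, loss_rate)


# Compare an unfragmented packet with one split into 2 or 3 fragments
# on a link that loses 1% of packets.
for count in (1, 2, 3):
    p = delivery_probability(count, 0.01)
    print(count, round(p, 4), round(expected_transmissions(count, 0.01), 4))
```

The point is that the loss of any one fragment costs the whole segment, so the chance of delivery falls with every additional fragment.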

A second possible scenario is issues with ECMP hashing. Many routers peek into the transport layer header to include port numbers in their ECMP hashing algorithm. Because typically only the first fragment contains the transport layer header, the remaining fragments have no port numbers to hash on. In some cases, this can cause different fragments to take different paths, and feasibly arrive out of order. If a fragment without the transport header arrives at the destination first, it may be dropped. This could feasibly happen endlessly, until the source host user gives up and calls the NOC.
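The hashing problem can be sketched in Python as well. Everything here is hypothetical: real routers use their own hash functions and many handle fragments specially, so `ecmp_key` and `pick_path` only model the naive behavior described above, with made-up addresses and ports.

```python
import zlib


def ecmp_key(src, dst, proto, frag_offset, sport=None, dport=None):
    """Build the tuple a naive ECMP hash would use for one packet.

    Ports are only available when the packet carries the transport header,
    which for a fragmented datagram is normally just the first fragment.
    """
    if frag_offset == 0 and sport is not None and dport is not None:
        return (src, dst, proto, sport, dport)
    return (src, dst, proto)


def pick_path(key, path_count):
    """Map a hash key onto one of path_count equal-cost paths."""
    return zlib.crc32(repr(key).encode()) % path_count


# Two fragments of the same UDP datagram (protocol 17).
first = ecmp_key("10.0.0.1", "172.16.0.1", 17, 0, sport=50000, dport=53)
second = ecmp_key("10.0.0.1", "172.16.0.1", 17, 13)

print(first, pick_path(first, 2))
print(second, pick_path(second, 2))
```

Because the two keys differ, nothing guarantees that both fragments land on the same path. Hashing fragments on addresses and protocol only would keep them together, at the cost of less even load distribution.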

A word of caution: older versions of Wireshark had a bug that caused the fragment offset values to be shifted by 3 bit positions. I spent about 5 hours trying to understand why I wasn’t seeing the expected values. After updating to the newest version, the issue went away. I was using a version that was years old though, so I should have updated sooner!

¹ RFC 1812 section 4.2.2.7 states “When a router fragments an IP datagram, it SHOULD minimize the number of fragments. When a router fragments an IP datagram, it SHOULD send the fragments in order. A fragmentation method that may generate one IP fragment that is significantly smaller than the other MAY cause the first IP fragment to be the smaller one.”