Network Protocols — Programmer’s Compendium
The network stack does several seemingly-impossible things. It does reliable transmission over our unreliable networks, usually without any detectable hiccups. It adapts smoothly to network congestion. It provides addressing to billions of active nodes. It routes packets around damaged network infrastructure, reassembling them in the correct order on the other side even if they arrived out of order. It accommodates esoteric analog hardware needs, like balancing the charge on the two ends of an Ethernet cable.
In the old days of analog telephones, making a phone call meant a continuous electrical connection from your phone to your friend’s. It was as if a wire were run directly from you to them. There was no such wire, of course — the connection was made through complex switching systems — but it was electrically equivalent to a single wire.
Each router between my laptop and google.com is connected to a number of other routers, maintaining a crude routing table showing which routers are closer to which parts of the Internet.
When a packet arrives destined for google.com, a quick lookup in the routing table tells the router where the packet should go next to bring it closer to Google. The packets are small, so each router in the chain ties up the next router for only a tiny fraction of a second.
To route quickly, routers maintain routing tables indicating the paths to various groups of IP addresses.
Routing breaks down into two sub-problems. First, addressing: what is the data’s destination? This is handled by IP, the Internet Protocol, whence IP addresses. Second, route selection: which path should the data take toward that address? This is handled by BGP, the Border Gateway Protocol.
IPv4, still the most common version of IP, provides only 32 bits of address space. It’s now fully allocated, so adding a node to the public Internet requires reusing an existing IP address.
IPv6 allows 2^128 addresses (about 10^38), but only has about 20% adoption as of 2017.
Routing happens fast, so there’s no time to query remote databases for routing information. As an example, Cisco ASR 9922 routers have a maximum capacity of 160 terabits per second. Assuming full 1,500-byte packets (12,000 bits), that’s 13,333,333,333 packets per second in a single 19-inch rack!
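We can sanity-check that figure with a little arithmetic:

```python
# Back-of-the-envelope check of the router throughput figure.
capacity_bits_per_sec = 160e12        # 160 terabits per second
packet_size_bits = 1_500 * 8          # a full 1,500-byte packet = 12,000 bits

packets_per_sec = capacity_bits_per_sec / packet_size_bits
print(f"{packets_per_sec:,.0f} packets per second")  # 13,333,333,333
```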
When a new packet arrives, the router looks up its destination in the table, which tells it which peer is closer to that destination. It sends the packet to that peer and moves on to the next packet. BGP’s job is to communicate this routing information between routers, keeping their routing tables up to date.
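Real routers do this lookup in specialized hardware, but the core idea, a longest-prefix match against a table of address blocks, can be sketched in a few lines of Python (the prefixes and peer names here are invented for illustration):

```python
import ipaddress

# Toy routing table: CIDR prefix -> next-hop peer. Entries are made up.
routing_table = {
    ipaddress.ip_network("0.0.0.0/0"):       "peer-default",
    ipaddress.ip_network("142.250.0.0/15"):  "peer-toward-google",
    ipaddress.ip_network("142.250.64.0/18"): "peer-closer-to-google",
}

def next_hop(destination: str) -> str:
    """Longest-prefix match: the most specific matching route wins."""
    addr = ipaddress.ip_address(destination)
    matches = [net for net in routing_table if addr in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return routing_table[best]

print(next_hop("142.250.72.14"))  # peer-closer-to-google
```

The `/0` default route matches everything, so there is always somewhere to send a packet, but any more specific prefix takes precedence.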
IP and BGP together don’t make a useful Internet, unfortunately, because they provide no way to transfer data reliably. If a router becomes overloaded and drops a packet, we need a way to detect the loss and request retransmission.
We could try to design a network where the 88.5 MB video file is sent from the web server to the first router, then to the second, and so on. Unfortunately, that network wouldn’t work at Internet scale, or even at intranet scale.
First, computers are finite machines with finite amounts of storage. If a given router has only 88.4 MB of buffer memory available, it simply can’t store the 88.5 MB video file. The data will be dropped on the floor and, worse, I’ll get no indication. If a router is so busy that it’s dropping data, it can’t take the time to tell me about dropped data.
For these reasons and more, we don’t send 88.5 MB messages across the Internet. Instead, we break them down into packets, usually in the neighborhood of 1,400 bytes each. Our video file will be broken into 63,214 or so separate packets for transmission.
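The packet count follows directly from the numbers above:

```python
import math

file_size_bytes = 88_500_000   # the 88.5 MB video file
payload_bytes = 1_400          # typical payload per packet

packets = math.ceil(file_size_bytes / payload_bytes)
print(packets)  # 63215
```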
TCP packet reassembly is done using the simplest imaginable mechanism: a counter. Each packet is assigned a sequence number when it’s sent. On the receiving side, the packets are put in order by sequence number. Once they’re all in order, with no gaps, we know the whole file is present.
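A toy version of that reassembly logic, using simple per-packet counters as the text does (real TCP sequence numbers count bytes, among other differences):

```python
def reassemble(packets):
    """Reassemble payloads given (sequence_number, payload) pairs.

    Returns the full byte stream only when no gaps remain between the
    lowest and highest sequence numbers seen; otherwise returns None,
    meaning we must keep waiting for retransmissions.
    """
    by_seq = dict(packets)  # duplicates simply overwrite each other
    expected = range(min(by_seq), max(by_seq) + 1)
    if any(seq not in by_seq for seq in expected):
        return None         # a gap: the file isn't complete yet
    return b"".join(by_seq[seq] for seq in expected)

# Packets arriving out of order still reassemble correctly:
print(reassemble([(2, b"lo "), (1, b"hel"), (3, b"world")]))  # b'hello world'
```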
Actual TCP sequence numbers tend not to be integers simply increasing by 1 each time, but that detail isn’t important here.
How do we know when the file is finished, though? TCP doesn’t say anything about that; it’s the job of higher-level protocols.
For example, HTTP responses contain a “Content-Length” header specifying the response length in bytes. The client reads the Content-Length, then keeps reading TCP packets, assembling them back into their original order, until it has all of the bytes specified by Content-Length. This is one reason that HTTP headers (and most other protocols’ headers) come before the response payload: otherwise, we wouldn’t know the payload’s size.
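Here is a sketch of that client-side loop. It assumes a well-formed response with a Content-Length header (no chunked encoding), and it reads from an in-order byte stream, which is exactly what the kernel’s TCP reassembly provides to an application:

```python
import io

def read_http_response(stream) -> bytes:
    """Read one HTTP response body from an in-order byte stream.

    The kernel has already reassembled the TCP packets, so `stream`
    yields bytes in order, just as a real socket would.
    """
    data = b""
    while b"\r\n\r\n" not in data:      # read until the headers end
        data += stream.read(10)
    headers, _, body = data.partition(b"\r\n\r\n")
    length = next(
        int(line.split(b":", 1)[1])
        for line in headers.split(b"\r\n")
        if line.lower().startswith(b"content-length:")
    )
    while len(body) < length:           # body: exactly Content-Length bytes
        body += stream.read(length - len(body))
    return body

response = io.BytesIO(
    b"HTTP/1.1 200 OK\r\nContent-Length: 11\r\n\r\nhello world...trailing"
)
print(read_http_response(response))  # b'hello world'
```

Note that the client stops after exactly Content-Length bytes; anything after that belongs to the next response on the connection, not to this one.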
TCP reassembly happens inside the kernel, so applications like web browsers and curl and wget don’t have to manually reassemble TCP packets.
But the kernel doesn’t handle HTTP, so applications do have to understand the Content-Length header and know how many bytes to read.
Occasionally, my computer sends a message to the server saying, for example, “I’ve received all packets up to and including packet number 564,753.” That’s an ACK, for acknowledgement: my computer acknowledges receipt of the server’s packets.
On a new connection, the Linux kernel allows ten unacknowledged packets in flight before the sender must stop and wait for an ACK. This is controlled by the TCP_INIT_CWND constant, which we can see defined in the Linux kernel’s source code.
The CWND in TCP_INIT_CWND stands for congestion window: the amount of data allowed in flight at once. If the network becomes congested — overloaded — then the window size will be reduced, slowing packet transmission.
Ten packets is about 14 KB, so we’re limited to 14 KB of data in flight at a time. This is part of TCP slow start: connections begin with small congestion windows. If no packets are lost, the sender will continually increase the congestion window, allowing more packets in flight at once.
Eventually, a packet will be lost, so the congestion window will be decreased, slowing transmission. By automatically adjusting the congestion window, as well as some other parameters, the sender and receiver keep data moving as quickly as the network will allow, but no quicker.
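A toy model of the window’s behavior: doubling during slow start and halving on loss. This is only an illustration of the shape of the algorithm, not Linux’s actual implementation, which adds congestion avoidance, fast recovery, and more:

```python
# Simplified congestion window dynamics (an illustration, not real TCP).
INIT_CWND = 10  # packets, per Linux's TCP_INIT_CWND

def next_cwnd(cwnd: int, packet_lost: bool) -> int:
    if packet_lost:
        return max(INIT_CWND, cwnd // 2)  # back off multiplicatively on loss
    return cwnd * 2                       # slow start: double per round trip

cwnd = INIT_CWND
for lost in [False, False, False, True, False]:
    cwnd = next_cwnd(cwnd, lost)
    print(cwnd)  # 20, 40, 80, 40, 80
```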
Each side ACKs the other side’s messages, and each side maintains its own congestion window. Asymmetric windows allow the protocol to take full advantage of network connections with asymmetric upstream and downstream bandwidth, like most residential and mobile Internet connections.
TCP has no special “I lost a packet!” message. Instead, ACKs are cleverly reused to indicate loss.
Any out-of-order packet causes the receiver to re-ACK the last “good” packet: the last one in the correct order. In effect, the receiver is saying “I received packet 5, which I’m ACKing. I also received something after that, but I know it wasn’t packet 6 because its sequence number didn’t follow packet 5’s.”
If the packet was truly lost, unexpected packets will continue to arrive and the receiver will continue to send duplicate ACKs of the last good packet. This can result in hundreds of duplicate ACKs.
When the sender sees three duplicate ACKs in a row, it assumes that the following packet was lost and retransmits it. This is called TCP fast retransmit because it’s faster than the older, timeout-based approach.
The protocol itself doesn’t have any explicit way to say “please retransmit this immediately!” Instead, multiple ACKs arising naturally from the protocol serve as the trigger.
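The sender-side bookkeeping can be sketched like this (a toy model with small per-packet sequence numbers, matching the text’s simplified numbering):

```python
def detect_fast_retransmit(acks):
    """Return the sequence number to retransmit after three duplicate ACKs.

    `acks` is the stream of ACK numbers the sender receives. Three
    duplicates of the same ACK suggest the packet right after the
    last-ACKed one was lost.
    """
    duplicates = 0
    last_ack = None
    for ack in acks:
        if ack == last_ack:
            duplicates += 1
            if duplicates == 3:
                return ack + 1  # the packet after the last good one
        else:
            last_ack, duplicates = ack, 0
    return None  # no loss detected yet

# Packet 6 lost: the receiver keeps re-ACKing 5 as packets 7, 8, 9 arrive.
print(detect_fast_retransmit([3, 4, 5, 5, 5, 5]))  # 6
```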
Ethernet is a physical layer protocol: it describes how the bits turn into electrical signals in a cable. It’s also a link layer protocol: it describes the direct connection of one node to another. However, it’s purely point-to-point and says nothing about how data is routed on a network. There’s no concept of a connection in the sense of a TCP connection, or of reassignable addresses in the sense of an IP address.
As a protocol, Ethernet has two primary jobs. First, each device needs to notice that it’s connected to something, and parameters like connection speed need to be negotiated.
Second, once the link is established, Ethernet needs to carry data.
Like the higher-level protocols TCP and IP, Ethernet breaks data into packets. The core of a packet is a frame, which has a 1,500 byte payload, plus another 22 bytes of header information like source and destination MAC addresses, payload length, and checksum.
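The text doesn’t itemize those 22 bytes, but a standard VLAN-tagged frame’s overhead breaks down this way (treat the breakdown as an assumption, since the article doesn’t specify it):

```python
# A plausible breakdown of a frame's 22 non-payload bytes
# (standard Ethernet II header plus an 802.1Q VLAN tag and the CRC).
frame_overhead = {
    "destination MAC": 6,
    "source MAC": 6,
    "802.1Q VLAN tag": 4,
    "EtherType/length": 2,
    "frame check sequence (CRC)": 4,
}
print(sum(frame_overhead.values()))  # 22
```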
Digital systems don’t exist; everything is analog.
Internet protocols are best thought of as a stack of layers.
Ethernet provides physical data transfer and a link between two directly connected devices.
IP provides a layer of addressing, allowing routers and large-scale networks to exist, but it’s connectionless.
Packets are fired into the ether, with no indication of whether they arrived or not.
TCP adds a layer of reliable transmission by using sequence numbers, acknowledgement, and retransmission.
IP and TCP save application developers from constantly reimplementing packet retransmission and addressing and so on.
Higher-level protocols tend to be built on lower-level ones: HTTP is built on TCP is built on IP is built on Ethernet.
To squeeze every byte out of small requests, HTTP/2 specifies compression for headers, which are usually small. Without context from TCP, IP, and Ethernet, this seems silly: why add compression to a protocol’s headers to save only a few bytes? Because, as the HTTP/2 spec says in the introduction to section 2, compression allows “many requests to be compressed into one packet”.
HTTP/2 does header compression to meet the constraints of TCP, which come from constraints in IP, which come from constraints in Ethernet, which was developed in the 1970s, introduced commercially in 1980, and standardized in 1983.
One final question: why is the Ethernet payload size set at 1,500 bytes? There’s no deep reason; it’s just a nice trade-off point. There are 42 bytes of non-payload data needed for each frame. If the payload maximum were only 100 bytes, only 70% (100/142) of time would be spent sending payload. A payload of 1,500 bytes means about 97% (1500/1542) of time is spent sending payload, which is a nice level of efficiency.
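The efficiency arithmetic, for any payload size:

```python
overhead = 42  # non-payload bytes per Ethernet packet, per the text

def efficiency(payload: int) -> float:
    """Fraction of transmission time spent sending actual payload."""
    return payload / (payload + overhead)

print(f"{efficiency(100):.0%}")   # 70%
print(f"{efficiency(1500):.0%}")  # 97%
```

Past 1,500 bytes the curve flattens out, so larger payloads buy little extra efficiency while making each frame occupy the wire longer.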