Let's code a TCP/IP stack, 4: TCP Data Flow & Socket API
Previously, we introduced ourselves to the TCP header and how a connection is established between two parties.
In this post, we will look into TCP data communication and how it is managed.
Additionally, we will provide an interface from the networking stack that applications can use for network communication. This Socket API is then utilized by our example application to send a simple HTTP request to a website.
Contents
- Transmission Control Block
- TCP Data Communication
- TCP Connection Termination
- Socket API
- Testing our Socket API
- Conclusion
- Sources
Transmission Control Block
It is beneficial to start the discussion on TCP data management by defining the variables that record the data flow state.
In short, the TCP has to keep track of the sequences of data it has sent and received acknowledgments for. To achieve this, a data structure called the Transmission Control Block (TCB) is initialized for every opened connection.
The variables for the outgoing (sending) side are:
In turn, the following data is recorded for the receiving side:
Additionally, helper variables for the current segment being processed are defined as follows:
Together, these variables constitute most of the TCP control logic for a given connection.
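As a rough sketch, the TCB variables named in RFC 793 (section 3.2) can be collected into C structs. The struct names, field names, and the tcb_init helper are illustrative assumptions and not necessarily what the project's source uses.

```c
#include <stdint.h>

/* Sketch of a Transmission Control Block, following the variable
 * names of RFC 793. All identifiers here are hypothetical. */
struct tcb {
    /* Send sequence variables */
    uint32_t snd_una; /* SND.UNA: oldest unacknowledged sequence number */
    uint32_t snd_nxt; /* SND.NXT: next sequence number to send */
    uint16_t snd_wnd; /* SND.WND: send window */
    uint16_t snd_up;  /* SND.UP:  send urgent pointer */
    uint32_t snd_wl1; /* SND.WL1: segment seq number of last window update */
    uint32_t snd_wl2; /* SND.WL2: segment ack number of last window update */
    uint32_t iss;     /* ISS:     initial send sequence number */

    /* Receive sequence variables */
    uint32_t rcv_nxt; /* RCV.NXT: next sequence number expected */
    uint16_t rcv_wnd; /* RCV.WND: receive window */
    uint16_t rcv_up;  /* RCV.UP:  receive urgent pointer */
    uint32_t irs;     /* IRS:     initial receive sequence number */
};

/* Helper variables describing the segment currently being processed. */
struct tcb_segment {
    uint32_t seq; /* SEG.SEQ: segment sequence number */
    uint32_t ack; /* SEG.ACK: segment acknowledgment number */
    uint32_t len; /* SEG.LEN: segment length */
    uint16_t wnd; /* SEG.WND: segment window */
    uint16_t up;  /* SEG.UP:  segment urgent pointer */
};

/* Zero the block and seed the send side with the chosen ISS,
 * i.e. the state before the SYN has been sent. */
void tcb_init(struct tcb *tcb, uint32_t iss)
{
    *tcb = (struct tcb){0};
    tcb->iss = iss;
    tcb->snd_una = iss;
    tcb->snd_nxt = iss;
}
```

One such block would be allocated per connection, for example when a socket is opened.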
TCP Data Communication
Once a connection is established, explicit handling of the data flow starts. Three variables from the TCB are important for basic tracking of the state:
- The sender will track the next sequence number to use in SND.NXT.
- The receiver records the next sequence number to expect in RCV.NXT.
- The sender will record the oldest unacknowledged sequence number in SND.UNA.
Once the connection has been idle long enough that nothing is in flight, all three variables are equal: everything that was sent has been received and acknowledged.
For example, when A decides to send a segment with data to B, the following happens:
- TCP A sends a segment and advances SND.NXT in its own records (TCB).
- TCP B receives the segment, acknowledges it by advancing RCV.NXT, and sends an ACK.
- TCP A receives the ACK and advances SND.UNA.
The amount by which the variables are advanced is the length of the data in the segment.
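The three steps can be sketched in C as follows. This is a simplified model under stated assumptions: the struct endpoint type and the function names are hypothetical, only in-order data is accepted, and windows and retransmission are ignored.

```c
#include <stdint.h>

/* Hypothetical per-endpoint state: only the three variables
 * discussed above. */
struct endpoint {
    uint32_t snd_una; /* SND.UNA */
    uint32_t snd_nxt; /* SND.NXT */
    uint32_t rcv_nxt; /* RCV.NXT */
};

/* Step 1: the sender emits len bytes of data and advances SND.NXT.
 * Returns the sequence number carried by the segment. */
uint32_t send_data(struct endpoint *snd, uint32_t len)
{
    uint32_t seq = snd->snd_nxt;

    snd->snd_nxt += len;
    return seq;
}

/* Step 2: the receiver accepts an in-order segment and advances
 * RCV.NXT. Returns the acknowledgment number to send back; for an
 * out-of-order segment this is simply the unchanged RCV.NXT. */
uint32_t receive_data(struct endpoint *rcv, uint32_t seq, uint32_t len)
{
    if (seq == rcv->rcv_nxt)
        rcv->rcv_nxt += len;
    return rcv->rcv_nxt;
}

/* Step 3: the sender processes an ACK. It is accepted only if
 * SND.UNA < ack <= SND.NXT; the subtractions keep the comparison
 * correct when sequence numbers wrap around 2^32. */
void receive_ack(struct endpoint *snd, uint32_t ack)
{
    if ((int32_t)(ack - snd->snd_una) > 0 &&
        (int32_t)(snd->snd_nxt - ack) >= 0)
        snd->snd_una = ack;
}
```

After one full round of send_data, receive_data and receive_ack, the sender's SND.UNA and SND.NXT and the receiver's RCV.NXT hold the same value again.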
This is the basis for TCP's control logic over the transmission of data. Let’s see what this looks like with tcpdump(1), a popular utility for capturing network traffic:
The address 10.0.0.4 (host A) initiates a connection from port 12000 to host 10.0.0.5 (host B) listening on port 8000.
After the three-way handshake, the connection is established and their internal TCP socket state is set to ESTABLISHED. Initial sequence numbers are 1525252 for A, and 825056904 for B.
Host A sends a segment with 17 bytes of data, which host B acknowledges with an ACK segment. Relative sequence numbers are shown by default with tcpdump to ease readability. Thus, ack 18 is actually 1525253 + 17.
Internally, the TCP of the receiving host (B) has advanced RCV.NXT by 17.
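The mapping between relative and absolute sequence numbers is plain unsigned arithmetic on the initial sequence number (ISN). A small sketch, with hypothetical helper names:

```c
#include <stdint.h>

/* Relative numbers count from the initial sequence number (ISN).
 * Unsigned 32-bit arithmetic wraps modulo 2^32, just like the
 * sequence number space itself. */
uint32_t seq_absolute(uint32_t isn, uint32_t relative)
{
    return isn + relative;
}

uint32_t seq_relative(uint32_t isn, uint32_t absolute)
{
    return absolute - isn;
}
```

With host A's ISN of 1525252, seq_absolute(1525252, 18) gives 1525270, which is the same value as 1525253 + 17: the SYN consumed one sequence number and the data consumed 17 more.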
The interplay of sending data and acknowledging it continues. As can be seen, the segments with length 0 only have the ACK flag set, but their acknowledgement numbers are incremented by exactly the length of the previously received segment.
Host B informs host A that it has no more data to send by generating a FIN segment. In turn, host A acknowledges this.
To finish the connection, host A also has to signal that it has no more data to send.
TCP Connection Termination
Closing a TCP connection is likewise an involved operation: a connection can be forcibly terminated (RST) or finished by mutual agreement (FIN).
The basic scenario is as follows:
- The active closer sends a FIN segment.
- The passive closer acknowledges this by sending an ACK segment.
- The passive closer starts its own close operation (when it has no more data to send) and effectively becomes an active closer.
- Once both sides have sent a FIN to each other and the FINs have been acknowledged in both directions, the connection closes.
Evidently, the closing of a TCP connection requires four segments, in contrast to the three segments of the TCP connection establishment (three-way handshake).
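The closing sequence maps onto the connection states defined in RFC 793. The following is a simplified sketch of only the close-related transitions; the enum and function names are assumptions, and segment transmission, timers and resets are left out.

```c
/* Close-related TCP states from RFC 793. Identifiers are hypothetical. */
enum tcp_state {
    TCP_ESTABLISHED,
    TCP_FIN_WAIT_1, /* active closer: FIN sent, not yet acknowledged */
    TCP_FIN_WAIT_2, /* active closer: our FIN acknowledged, waiting for peer FIN */
    TCP_CLOSING,    /* both sides sent FIN at the same time */
    TCP_TIME_WAIT,  /* waiting for stray segments to expire */
    TCP_CLOSE_WAIT, /* passive closer: peer FIN received, app not closed yet */
    TCP_LAST_ACK,   /* passive closer: FIN sent, waiting for its ACK */
    TCP_CLOSED
};

enum tcp_event {
    EV_APP_CLOSE,       /* the application closes the socket (a FIN is sent) */
    EV_RECV_FIN,        /* the peer's FIN arrived (an ACK is sent) */
    EV_RECV_ACK_OF_FIN, /* the peer acknowledged our FIN */
    EV_TIMEWAIT_EXPIRED /* the TIME-WAIT timer ran out */
};

/* Returns the next state; unknown combinations leave the state as is. */
enum tcp_state tcp_close_step(enum tcp_state s, enum tcp_event ev)
{
    switch (s) {
    case TCP_ESTABLISHED:
        if (ev == EV_APP_CLOSE) return TCP_FIN_WAIT_1;
        if (ev == EV_RECV_FIN) return TCP_CLOSE_WAIT;
        break;
    case TCP_FIN_WAIT_1:
        if (ev == EV_RECV_ACK_OF_FIN) return TCP_FIN_WAIT_2;
        if (ev == EV_RECV_FIN) return TCP_CLOSING;
        break;
    case TCP_FIN_WAIT_2:
        if (ev == EV_RECV_FIN) return TCP_TIME_WAIT;
        break;
    case TCP_CLOSING:
        if (ev == EV_RECV_ACK_OF_FIN) return TCP_TIME_WAIT;
        break;
    case TCP_CLOSE_WAIT:
        if (ev == EV_APP_CLOSE) return TCP_LAST_ACK;
        break;
    case TCP_LAST_ACK:
        if (ev == EV_RECV_ACK_OF_FIN) return TCP_CLOSED;
        break;
    case TCP_TIME_WAIT:
        if (ev == EV_TIMEWAIT_EXPIRED) return TCP_CLOSED;
        break;
    case TCP_CLOSED:
        break;
    }
    return s;
}
```

The active closer walks ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, TIME-WAIT, CLOSED, while the passive closer walks ESTABLISHED, CLOSE-WAIT, LAST-ACK, CLOSED.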
Additionally, TCP is a bi-directional protocol, so it is possible for one end to announce that it has no more data to send but stay open for incoming data. This is called TCP Half-close.
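In the BSD socket API, a half-close is requested with shutdown(2) and SHUT_WR; on a TCP socket this is what causes the FIN to be sent. The sketch below demonstrates the behavior on a local socketpair(2) of type SOCK_STREAM instead of a real TCP connection, which is an assumption made to keep it self-contained. The function name is hypothetical.

```c
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 0 if the half-close behaved as described, -1 otherwise. */
int half_close_demo(void)
{
    int sv[2];
    char buf[16];
    int ok = -1;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    /* sv[0] announces that it has no more data to send. */
    if (shutdown(sv[0], SHUT_WR) < 0)
        goto out;

    /* The peer sees end-of-stream: read() returns 0. */
    if (read(sv[1], buf, sizeof(buf)) != 0)
        goto out;

    /* The peer may still send, and the half-closed end still receives. */
    if (write(sv[1], "late", 4) != 4)
        goto out;
    if (read(sv[0], buf, sizeof(buf)) != 4 || memcmp(buf, "late", 4) != 0)
        goto out;

    ok = 0;
out:
    close(sv[0]);
    close(sv[1]);
    return ok;
}
```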
The unreliable nature of packet-switched networks introduces additional complexity to connection termination: FIN segments can disappear or intentionally never be sent, which leaves the connection in an awkward state. For example, in Linux the kernel parameter tcp_fin_timeout controls how many seconds TCP waits for a final FIN packet before forcibly closing the connection. This is a violation of the specification, but it is needed for Denial of Service (DoS) prevention.1
Aborting a connection involves a segment with the RST flag set. Resets can occur for many reasons, but some common ones are:
- Connection request to a nonexistent port or interface
- The other TCP has crashed and ends up in an out-of-sync connection state
- Attempts to disturb existing connections2
Thus, the happy path of TCP data transmission never involves a RST segment.
Socket API
To be able to utilize the networking stack, some kind of an interface has to be provided for applications. The BSD Socket API is the most famous one, and it originates from the 4.2BSD UNIX release from 1983.3 The Socket API in Linux is compatible with the BSD Socket API.4
A socket is reserved from the networking stack by calling socket(2), passing the domain, type and protocol of the socket as parameters. Common values are AF_INET for the domain and SOCK_STREAM for the type. With the protocol left as 0, this defaults to a TCP-over-IPv4 socket.
After a TCP socket has been successfully reserved from the networking stack, it is connected to a remote endpoint. This is where connect(2) is used, and calling it launches the TCP handshake.
From that point on, we can just write(2) data to and read(2) data from our socket.
The networking stack will handle queueing, retransmission, error-checking and reassembly of the data in the TCP stream. For the application, the inner workings of TCP are mostly opaque. The only thing the application can rely on is that the TCP has accepted the responsibility of sending and receiving the stream of data, and that it will inform the application of unexpected behavior through the socket API.
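The call sequence can be sketched against the standard POSIX socket API as follows. To stay self-contained, the example connects to a listener it opens itself on the loopback interface; that setup, along with the helper names listen_loopback and loopback_roundtrip, is an assumption made for illustration.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: listen on 127.0.0.1 with a kernel-chosen port.
 * The bound address is stored in *addr. Returns the fd, or -1. */
int listen_loopback(struct sockaddr_in *addr)
{
    socklen_t len = sizeof(*addr);
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    memset(addr, 0, sizeof(*addr));
    addr->sin_family = AF_INET;
    addr->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr->sin_port = 0; /* 0 lets the kernel pick a free port */
    if (bind(fd, (struct sockaddr *)addr, sizeof(*addr)) < 0 ||
        listen(fd, 1) < 0 ||
        getsockname(fd, (struct sockaddr *)addr, &len) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Sends msg over a loopback TCP connection and copies what the
 * accepting side read into out. Returns the byte count, or -1. */
ssize_t loopback_roundtrip(const char *msg, char *out, size_t out_len)
{
    struct sockaddr_in addr;
    ssize_t n = -1;
    int lfd, cfd = -1, sfd = -1;

    lfd = listen_loopback(&addr);
    if (lfd < 0)
        return -1;

    /* 1. Reserve a TCP-over-IPv4 socket. */
    cfd = socket(AF_INET, SOCK_STREAM, 0);
    if (cfd < 0)
        goto out;

    /* 2. connect(2) launches the three-way handshake. */
    if (connect(cfd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto out;
    sfd = accept(lfd, NULL, NULL);
    if (sfd < 0)
        goto out;

    /* 3. Plain write(2) and read(2) move data through the stream. */
    if (write(cfd, msg, strlen(msg)) != (ssize_t)strlen(msg))
        goto out;
    n = read(sfd, out, out_len);
out:
    if (sfd >= 0)
        close(sfd);
    if (cfd >= 0)
        close(cfd);
    close(lfd);
    return n;
}
```

Note that the application never sees sequence numbers or acknowledgments; it only sees file descriptors and byte counts.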
As an example, let’s look at the system calls that a simple invocation of curl(1) makes:
We observe the Socket API calls with strace(1), a tool for tracing system calls and signals. The steps are:
- A socket is opened with socket, and the type is specified as IPv4/TCP.
- connect launches the TCP handshake. The destination address and port are passed to the function.
- After the connection is successfully established, sendto(2) is used to write data to the socket, in this case the usual HTTP GET request.
- From that point on, curl eventually reads the incoming data with recvfrom.
The astute reader may have noticed that no read or write system calls were made. This is because the socket API proper does not contain those functions, but normal I/O operations can also be used. From man socket(7):4
In addition, the standard I/O operations like write(2), writev(2), sendfile(2), read(2), and readv(2) can be used to read and write data.
In the end, the Socket API contains multiple functions for just writing and reading data. This is complicated by the I/O family of functions, which can also be used to manipulate the socket’s file descriptor.
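The two families can be mixed freely on the same descriptor. As documented in send(2), send with a flags argument of 0 is equivalent to write(2). The sketch below pairs a socket call on one end with a generic I/O call on the other; it uses a local socketpair(2) instead of a TCP connection to stay self-contained, and the function name is hypothetical.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 0 if both API families moved the data, -1 otherwise. */
int mixed_io_demo(void)
{
    int sv[2];
    char buf[8];
    int ok = -1;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    /* Socket API on the sending side, generic I/O on the receiving side. */
    if (send(sv[0], "abc", 3, 0) != 3 || read(sv[1], buf, sizeof(buf)) != 3)
        goto out;
    if (memcmp(buf, "abc", 3) != 0)
        goto out;

    /* Generic I/O on the sending side, socket API on the receiving side. */
    if (write(sv[0], "xyz", 3) != 3 || recv(sv[1], buf, sizeof(buf), 0) != 3)
        goto out;
    if (memcmp(buf, "xyz", 3) != 0)
        goto out;

    ok = 0;
out:
    close(sv[0]);
    close(sv[1]);
    return ok;
}
```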
Testing our Socket API
Now that our networking stack provides a socket interface, we can write applications against it.
The popular tool curl is used to transmit data with a given protocol. We can replicate the HTTP GET behavior of curl by writing a minimal implementation:
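A minimal sketch of such a client is shown here, written against the standard POSIX socket calls (an assumption; the project's own application and API names may differ). The helpers tcp_connect and http_get are hypothetical.

```c
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Resolve host:port and connect a TCP socket. Returns the fd, or -1. */
int tcp_connect(const char *host, const char *port)
{
    struct addrinfo hints, *res, *ai;
    int fd = -1;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;       /* IPv4 */
    hints.ai_socktype = SOCK_STREAM; /* TCP */
    if (getaddrinfo(host, port, &hints, &res) != 0)
        return -1;
    for (ai = res; ai != NULL; ai = ai->ai_next) {
        fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0)
            continue;
        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0)
            break; /* connected */
        close(fd);
        fd = -1;
    }
    freeaddrinfo(res);
    return fd;
}

/* Send a GET request on a connected socket and read the response
 * until the peer closes or resp is full. resp is NUL-terminated.
 * Returns the number of response bytes stored, or -1 on error. */
ssize_t http_get(int fd, const char *host, const char *path,
                 char *resp, size_t resp_len)
{
    char req[512];
    size_t total = 0;
    ssize_t n = 0;
    int len;

    if (resp_len == 0)
        return -1;
    len = snprintf(req, sizeof(req),
                   "GET %s HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n",
                   path, host);
    if (len < 0 || (size_t)len >= sizeof(req))
        return -1;
    if (write(fd, req, (size_t)len) != len)
        return -1;

    /* TCP is a byte stream: keep reading until end-of-stream. */
    while (total < resp_len - 1 &&
           (n = read(fd, resp + total, resp_len - 1 - total)) > 0)
        total += (size_t)n;
    if (n < 0)
        return -1;
    resp[total] = '\0';
    return (ssize_t)total;
}
```

A caller would obtain a descriptor with something like tcp_connect("example.com", "80"), where the host is only a placeholder, pass it to http_get, print the response buffer and close the descriptor.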
In the end, sending an HTTP GET request exercises the underlying networking stack only minimally.
Conclusion
We have now essentially implemented a rudimentary TCP with simple data management and provided an interface applications can use.
However, TCP data communication is not a simple problem. Packets can get corrupted, reordered or lost in transit. Furthermore, data transmission can congest arbitrary elements in the network.
For this, the TCP data communication needs to include more sophisticated logic. In the next post, we will look into TCP Window Management and TCP Retransmission Timeout mechanisms, to better cope with more challenging settings.
The source code for the project is hosted at GitHub.