For the past several months, we have been actively working on extending our Workload IAM support from Kubernetes to virtual machines.
There were numerous challenges that we had to overcome. One of the most complex and time-consuming turned out to be the interaction with network address translation (NAT) functionality.
Aembit’s edge component relies heavily on NAT functionality, and as we moved forward, we soon reached the limits of our understanding of how it works.
One thing that amazed me about iptables, conntrack, nftables, and related tools and systems is how much mid-level documentation is available, describing both the machinery and specific commands. However, good high-level documentation is lacking. As a result, you are forced to construct a mental model on your own. And if something doesn’t behave as you expected, there are no clear steps for troubleshooting and debugging it.
The rest of this article is an attempt to fill this gap, mostly to provide a conceptual framework of how you should think about NAT in relation to iptables and conntrack, as well as some information to help with troubleshooting.
I think this post will work best for people who have some, even rudimentary, understanding of networking and have some exposure to iptables (and NAT specifically).
Pretty much, if you are like me and able to construct simple iptables rules – but are completely at a loss when something goes wrong – please read on.
Any discussion of iptables (and the Netfilter architecture) would be incomplete without a reference to the existing documentation:
- Netfilter documentation
- iptables tutorial by Oskar Andreasson
- Packet flow in Netfilter by Jan Engelhardt
History and High-Level Overview
The networking stack implementation in Linux is known for its astounding complexity. Around the year 2000, Rusty Russell initiated a project that evolved into what we now know as Netfilter. The core concept of this project was to integrate hooks into the Linux kernel. This allowed other kernel modules to register callbacks with the kernel’s networking stack, providing an alternative to directly altering it.
These hooks, introduced by Netfilter, are utilized by features such as iptables, conntrack, nftables, and NAT to interact with packets processed by the networking stack. In doing so, these modules offer a higher-level interface, like iptables rules, which enables end-users to easily control and define the behavior of the network.
The majority of the features provided by these modules operate in a stateless manner. This means that they don’t require the tracking of packet history or status. For instance, if the objective is simply to block all packets headed to a specific destination, there’s no need to maintain any state information. However, there are exceptions where stateful functionality is necessary. An example of this is NAT, which requires tracking the state of network connections.
How NAT Works
The NAT functionality in iptables monitors packets as they pass through the system, whether incoming (ingress) or outgoing (egress). These packets are checked against the NAT rules defined in iptables. When a packet matches a rule, it is modified – typically, this involves changing the destination and/or source address. Essentially, NAT can be thought of as a set of rules applied to packets during their journey from client to server and vice versa.
NAT’s operation is closely linked with the conntrack module, which is responsible for tracking all network connections or flows within the system.
This represents the level of understanding I had about six months ago. In the following sections, I’ll detail the findings we’ve gathered over the past several months. While these insights are still at a high level, they provide a much more accurate picture of the subject.
NAT Evaluates Iptables Rules Only for the Initial Packet
It is clearly stated in this documentation: “(The NAT) table is slightly different from the `filter’ table, in that only the first packet of a new connection will traverse the table: the result of this traversal is then applied to all future packets in the same connection.”
The conntrack module calls the NAT module to evaluate NAT rules for every new connection it creates (i.e., for the first packet of a connection). The NAT module evaluates the rules based on the connection tuple and the direction of the connection. To apply NAT to future packets on the matching connection, the NAT module saves the result of the match in conntrack’s connection entry.
There are two primary reasons why only the first packet is evaluated.
First, evaluating all iptables rules would be highly resource-intensive, especially if done on a per-packet basis. By keeping a hash table of connections, conntrack makes this process more performant. Second, keeping track of existing connections is what allows NAT to have stateful functionality.
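The behavior described above can be sketched as a toy model in Python. This is not how the kernel is implemented; the class `ToyConntrack`, its `dnat_rules` dictionary, and the addresses are all hypothetical stand-ins chosen for illustration. The point is only to show that the rules are consulted once per flow, and that the saved result keeps being applied afterwards.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# (src_ip, src_port, dst_ip, dst_port); a real conntrack tuple also carries the protocol
FlowTuple = Tuple[str, int, str, int]


@dataclass
class Entry:
    # DNAT target decided when the first packet was seen (None means "no NAT")
    dnat_to: Optional[Tuple[str, int]]


class ToyConntrack:
    def __init__(self, dnat_rules: Dict[Tuple[str, int], Tuple[str, int]]) -> None:
        self.dnat_rules = dnat_rules            # hypothetical stand-in for the nat table
        self.table: Dict[FlowTuple, Entry] = {}
        self.rule_evaluations = 0               # how often the "nat table" was consulted

    def process(self, pkt: FlowTuple) -> FlowTuple:
        entry = self.table.get(pkt)
        if entry is None:
            # First packet of a new flow: evaluate the rules once and save the result
            self.rule_evaluations += 1
            entry = Entry(dnat_to=self.dnat_rules.get((pkt[2], pkt[3])))
            self.table[pkt] = entry
        if entry.dnat_to is None:
            return pkt
        # Later packets reuse the saved decision without touching the rules
        return (pkt[0], pkt[1], entry.dnat_to[0], entry.dnat_to[1])


ct = ToyConntrack({("10.0.0.1", 80): ("192.168.1.10", 8080)})
pkt = ("10.0.0.5", 40000, "10.0.0.1", 80)
first = ct.process(pkt)
ct.dnat_rules.clear()        # remove the rule after the flow already exists
second = ct.process(pkt)     # still translated, because the entry remembers the match
```

In this sketch, removing the rule does not affect the flow that already matched it, which mirrors why a changed NAT rule may appear to have no effect on established traffic.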
Conntrack Doesn’t Track Connections, It Tracks Flows
You have to actually read the fine print to realize that conntrack doesn’t really track “connections”; it just sees packets going in both directions and matches their source and destination against what it already has in its table.
Firstly, it’s more appropriate to call these entities “flows” (a more generic name), because conntrack also tracks connectionless protocols like UDP, ICMP, etc.
Secondly, it’s better to think about it as a flow even for TCP. When we talk about a TCP connection, we mean that both client and server keep track of state, handle reliability (retransmission, reordering of packets), and maintain several other attributes of the connection. The interesting thing about conntrack is that it doesn’t use most of these attributes. It will see all the packets in a TCP stream (including out-of-order packets, retransmissions, and so on) and apply rules regardless. It almost doesn’t care whether it’s a connection or not; it just acts on these packets. To be fair, conntrack tracks the beginning and end of TCP connections, but this is the extent of its awareness of the connection.
An interesting anecdote about conntrack: if conntrack is not yet loaded and you apply any iptables rule that requires it, the kernel will load it and conntrack will start tracking connections. The interesting piece is that it will create entries for existing TCP connections based purely on the ongoing traffic, and these entries can be erroneous, since the TCP connections were established before the conntrack module was loaded. Something like that would be impossible if conntrack needed to understand the connection deeply (and know its state).
So, the right mental model is thinking about flows. And if you have this mental model, it becomes way more obvious why there is a long list of timeouts around tracking the flow (since it needs to handle all of these underlying edge cases which are normally abstracted away from us).
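To make the flow model concrete, here is a minimal Python sketch of a table that matches packets in both directions and forgets flows on a timer. The class `ToyFlowTable`, the helper `invert`, and the timeout values are assumptions made for illustration; the real timeouts are configurable and more numerous, and real conntrack also reacts to TCP flags, which this sketch leaves out.

```python
from typing import Dict, Tuple

# (proto, src_ip, src_port, dst_ip, dst_port)
Tuple5 = Tuple[str, str, int, str, int]

# Illustrative timeouts in seconds; the real values are tunable per protocol and state
TIMEOUTS = {"tcp": 432000.0, "udp": 30.0}


def invert(t: Tuple5) -> Tuple5:
    """Swap source and destination to get the reply direction of a tuple."""
    proto, src_ip, src_port, dst_ip, dst_port = t
    return (proto, dst_ip, dst_port, src_ip, src_port)


class ToyFlowTable:
    def __init__(self) -> None:
        self.by_tuple: Dict[Tuple5, int] = {}     # both directions map to one flow id
        self.expires_at: Dict[int, float] = {}    # flow id -> deadline
        self.next_id = 1

    def see_packet(self, t: Tuple5, now: float) -> int:
        """Return the flow id for this packet, creating a flow if none matches."""
        self.expire(now)
        flow_id = self.by_tuple.get(t)
        if flow_id is None:
            flow_id = self.next_id
            self.next_id += 1
            self.by_tuple[t] = flow_id
            self.by_tuple[invert(t)] = flow_id
        # Any packet in either direction refreshes the timer
        self.expires_at[flow_id] = now + TIMEOUTS[t[0]]
        return flow_id

    def expire(self, now: float) -> None:
        """Drop flows whose deadline has passed."""
        dead = {fid for fid, deadline in self.expires_at.items() if deadline <= now}
        for fid in dead:
            del self.expires_at[fid]
        self.by_tuple = {t: fid for t, fid in self.by_tuple.items() if fid not in dead}


table = ToyFlowTable()
request = ("udp", "10.0.0.5", 5353, "10.0.0.9", 53)
a = table.see_packet(request, now=0.0)
b = table.see_packet(invert(request), now=1.0)   # the reply lands in the same flow
c = table.see_packet(request, now=100.0)         # after the timeout it is a new flow
```

Nothing in the sketch knows what a "connection" is: UDP gets a flow purely because packets with matching tuples were seen, and the only way the flow ends is the timer.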
Conntrack is Inherently Racy
Several months ago we published an article about NAT-related race conditions. Troubleshooting race conditions involving conntrack and ephemeral port reuse was quite challenging. Recently, while working on VM support, we encountered two more issues: another race condition and an unexpected source network address translation (SNAT).
During this process, it became clear to me that conntrack essentially reconstructs parts of the kernel network stack state by observing traffic, including the source, destination, and TCP flags of packets.
This understanding clarified why race conditions are inherent around conntrack. There’s the actual networking stack state, and then there’s conntrack’s simplified version of it. Since conntrack doesn’t directly observe changes in the networking stack state, discrepancies are inevitable. For instance, in the article mentioned earlier, the networking stack released an ephemeral port, but conntrack, for a time, didn’t update its table to reflect this change, mistakenly acting as though the port was still in use.
This method of reconstructing state unfortunately means that the views of the networking stack and conntrack can become desynchronized.
To be fair, not every iptables rule necessarily leads to race conditions. However, more complex rules involving conntrack are likely to encounter these issues.
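The desynchronization described above can be illustrated with a deliberately simplified Python sketch. `ToyStack` stands for the real networking stack state and `ToyObserver` for a conntrack-like view of it; both classes and the `linger` timeout are hypothetical. Real conntrack also reacts to TCP flags, which this model omits to keep the gap between the two views easy to see.

```python
from typing import Dict, Set


class ToyStack:
    """Ground truth: the ports the networking stack has in use right now."""

    def __init__(self) -> None:
        self.ports_in_use: Set[int] = set()

    def open(self, port: int) -> None:
        self.ports_in_use.add(port)

    def close(self, port: int) -> None:
        self.ports_in_use.discard(port)


class ToyObserver:
    """Conntrack-like view: learns only from packets, forgets only on a timer."""

    def __init__(self, linger: float) -> None:
        self.linger = linger                     # hypothetical timeout, in seconds
        self.forget_at: Dict[int, float] = {}    # port -> time the entry is dropped

    def observe_packet(self, port: int, now: float) -> None:
        self.forget_at[port] = now + self.linger

    def believes_in_use(self, port: int, now: float) -> bool:
        return self.forget_at.get(port, 0.0) > now


stack = ToyStack()
observer = ToyObserver(linger=10.0)

stack.open(40000)
observer.observe_packet(40000, now=0.0)   # traffic is seen, so both views agree
stack.close(40000)                        # the stack frees the port at t=1

# No packet tells the observer about the close, so at t=2 the views disagree
stale = observer.believes_in_use(40000, now=2.0) and 40000 not in stack.ports_in_use
# Only once the timer runs out do they agree again
resynced = not observer.believes_in_use(40000, now=11.0)
```

If the stack hands port 40000 to a new socket while `stale` is true, the observer would treat the new traffic as part of the old entry, which is the shape of the ephemeral port race mentioned above.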
Another, simpler realization that dawned on me, but which is not explicitly stated anywhere, is that local traffic (traffic that originates from and returns to the same machine) passes through the NAT/conntrack functionality twice: first when it exits the local process, and again when it heads towards another local process.
I prefer to learn new things by gradually getting a deeper understanding. First, have a good high-level picture; second, get some practical experience; and, finally, dive deep into details.
As we were working on NAT-related rules, I felt that I never had a good high-level picture and was forced to step back to construct the mental model presented above, which helped immensely while drilling into specific questions.
That being said, we didn’t stop at just having a high-level understanding and putting several iptables rules together. One of our engineers dove much deeper for a truly detailed understanding of how things work.
There are too many details to lay out everything you need to know about troubleshooting issues around iptables, NAT, and related pieces. However, here is a concise list of tools and resources that can help you further.
- You can use the conntrack tool to see both the current state of the conntrack table and the events that it produces. (By the way, should you need to interact with it programmatically, there is a netlink protocol that is used under the hood to pass all this information between kernel space and userspace.)
- You can always create iptables LOG rules to see when a rule matches.
- Tcpdump helps you capture all the packets and view traffic flows.
- Bpftrace is a generic tool to trace what’s going on in the kernel. It comes in handy if you want to “add logs” to the guts of conntrack, NAT, and so on.
And finally, there is the Netfilter source code. If nothing else helps, you can always go and read how it all works.
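As a small aid for reading conntrack table listings, the Python sketch below splits an entry into its original and reply tuples and reports which side, if any, was rewritten. The helper names `parse_entry` and `nat_kind` are hypothetical, and the sample lines are hand-written to resemble the listing format rather than captured from a real system, so treat the exact field layout as an assumption to verify against your own output.

```python
from typing import Dict, List

TUPLE_KEYS = ("src", "dst", "sport", "dport")


def parse_entry(line: str) -> Dict[str, Dict[str, str]]:
    """Split a conntrack-style line into its original and reply tuples."""
    tuples: List[Dict[str, str]] = [{}]
    for token in line.split():
        key, sep, value = token.partition("=")
        if not sep or key not in TUPLE_KEYS:
            continue                    # skip protocol, state, flags, counters
        if key in tuples[-1]:
            tuples.append({})           # a repeated key starts the reply tuple
        tuples[-1][key] = value
    if len(tuples) != 2:
        raise ValueError(f"expected two tuples, found {len(tuples)}")
    return {"original": tuples[0], "reply": tuples[1]}


def nat_kind(entry: Dict[str, Dict[str, str]]) -> str:
    """Compare the reply tuple with the inverted original tuple."""
    orig, reply = entry["original"], entry["reply"]
    # Without NAT, the reply comes from the original destination ...
    dnat = (reply["src"], reply.get("sport")) != (orig["dst"], orig.get("dport"))
    # ... and goes back to the original source
    snat = (reply["dst"], reply.get("dport")) != (orig["src"], orig.get("sport"))
    if snat and dnat:
        return "snat+dnat"
    if snat:
        return "snat"
    if dnat:
        return "dnat"
    return "none"


# Hand-written sample lines (not real captures)
plain = ("tcp 6 431999 ESTABLISHED src=10.0.0.5 dst=10.0.0.9 sport=40000 dport=80 "
         "src=10.0.0.9 dst=10.0.0.5 sport=80 dport=40000 [ASSURED] mark=0 use=1")
translated = ("tcp 6 431999 ESTABLISHED src=10.0.0.5 dst=10.0.0.1 sport=40000 dport=80 "
              "src=192.168.1.10 dst=10.0.0.5 sport=8080 dport=40000 [ASSURED] mark=0 use=1")

plain_kind = nat_kind(parse_entry(plain))
translated_kind = nat_kind(parse_entry(translated))
```

Comparing the two tuples this way is often the quickest check for an unexpected translation: if the reply tuple is not the mirror image of the original one, some NAT decision was saved in that entry.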