
3. Information For Users

This section should describe everything from a user's perspective, ie. how to use the tools provided to create the network special effects you are after.

This section generally won't cover the merits of various approaches, or why you would want to do these things. That is the domain of larger documents (such as a book on networking). Sure, there are places where this document will tell you how to do terrible, terrible things which you shouldn't ever do. Ever.

On the other hand, if I make it easier for you to do them, then all those Linux consultants get paid the really big bucks to clean up after you.

3.1 Packet Filtering with iptables

iptables is the name given to the latest packet filtering system in the ipfwadm and ipchains line of Linux packet filtering methods. Like those two, it has both a kernel component and a userspace component. Unlike the others, it is usually built as a kernel module, and was designed to run on top of the netfilter framework.

The first thing you'll notice about iptables is the similarity to ipchains. This is partially to ease transition, and partially because I wrote both of them. The main benefit of iptables (for user, programmer and feature-lover) is extensibility.

So let's start with a quick guide to the differences between iptables and ipchains from a user's point of view, then go into a blow-by-blow description of the features.

iptables vs ipchains

iptables takes away...

[* Replaced by individual modules which provide this facility]

iptables gives...

Random changes...

How iptables works

iptables may look a little like ipchains, but under the skin, iptables shows itself to be a simple framework for simple IP filtering, with the possibility of adding in new and funky features.

As an example, ipchains understands TCP, UDP and ICMP, so it can filter them. iptables does not; it loads up extra modules to help with these when specified.

There are many niche needs which ipchains didn't meet: this is my way of having my cake and eating it, too. It also means that distributions shipping pre-built kernels don't need to agonize over the growing kb wasted on packet filtering: it's modular.

Also, this allowed me to get the core code smaller than the old ipfwadm kernel code (unlike, *cough*, the ipchains kernel code), and I don't have to be embarrassed next time I see Jos Vos.

All You Need To Know About The Kernel Side

The kernel part of iptables is a module, called "ip_tables": you simply insmod it. It takes one optional argument: "forward=n", where n is 0 or 1, which sets the default policy for forwarding: DROP or ACCEPT. If it's not specified, the default is DROP.

Sometimes, when a rule asks for a specific (non-builtin) target or test, such as the "REJECT" target, or "tcp" test, it will require additional modules. If placed in the right directory (usually /lib/modules/`uname -r`/net/), they should auto-load. Otherwise these modules (with names like "ipt_REJECT.o" and "ipt_tcp.o") can be manually insmod'd after the core ip_tables module.

Once this has been done, the iptables program can be run. Using `iptables --help' will give reasonable usage information.

Listing Rules And Chains

The `-L' or `--list' argument allows you to list the chains. If it is followed by a chain name, only that chain is listed (if it exists).

The `-v' or `--verbose' argument causes all information to be printed. The `-n' or `--numeric' option causes information to be printed numerically; useful for suppressing name lookup attempts. Finally, the `-x' or `--exact' option, when used with the -v option, causes the exact packet and byte counters (not the `324k'-style abbreviations) to be printed.

The `-v' or `--verbose' argument can also be used with any other command, to see exactly what is happening.

Creating and Deleting a Rule

The `-A' or `--append' argument, followed by a chain name, is used to append a rule to the end of a chain. Similarly, `-I' or `--insert' inserts a rule at the beginning of a chain (or, followed by a rule number, inserts at a given position in a chain, starting with the head at number 1).

The `-R' or `--replace' argument (followed by a chain name and a number) is used for atomically replacing one rule with another.

A rule can be deleted (`-D' or `--delete' followed by the chain name) with a similar syntax to appending; the first rule in the chain which matches the specification will be deleted. The other method is to delete by number, which simply follows the chain name.

Creating and Deleting Chains

New (user-defined) chains can be created using `-N' or `--new', followed by a chain name. An empty chain of that name is created if one does not already exist.

An empty user-defined chain can be deleted if no rule has it as a target, using the `-X' or `--delete-chain' argument followed by the victim chain name.

IP Packet Matching Options

Many fields in the standard IP header can be filtered on:

--source (-s)

Followed by an optional `!' (meaning not), then an IP address or name. If an IP address, it can be followed by a mask, such as `/8' or `/255.0.0.0', both of which mean that only the first 8 bits of the address should be compared.

--destination (-d)

Used identically to the source address specification.

--in-interface (-i)

Followed by an optional `!', and an interface name. The name of the input interface to match. A packet passing through the OUTPUT chain has an input interface of "". If the interface name ends in a `+', then it means that any interface name starting with those characters should be matched, eg `ppp+' will match `ppp0' and `ppp10'.

--out-interface (-o)

Used identically to the above. A packet passing through the INPUT chain has an output interface of "".

--TOS (-t)

Followed by an optional `!' and a number (usually a hexadecimal number prefixed with 0x), this matches only IP packets with the given Type of Service field.

--fragment (-f)

Only match non-first fragments, ie. IP packets with a non-zero offset field. This argument can be preceded by a `!' to indicate that it is to be inverted (ie. match IP packets with a zero offset field only).

--protocol (-p)

Followed by an optional `!' and a protocol name or number, this matches only IP packets of the given protocol. It also has a side effect of loading the per-protocol module, if any, as we will see below.
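To make the address mask and the `+' interface wildcard concrete, here is a minimal sketch in Python. It is not how the kernel does it; the function names are made up for illustration, and the optional `!' inversion is left out.

```python
import ipaddress


def address_matches(spec: str, addr: str) -> bool:
    """Return True if addr falls inside spec, e.g. '10.0.0.0/8'.

    Both the '/8' and the '/255.0.0.0' mask notations are accepted.
    """
    # strict=False tolerates host bits being set in the specification.
    network = ipaddress.ip_network(spec, strict=False)
    return ipaddress.ip_address(addr) in network


def interface_matches(pattern: str, name: str) -> bool:
    """A trailing '+' matches any interface name starting with the prefix."""
    if pattern.endswith("+"):
        return name.startswith(pattern[:-1])
    return pattern == name
```

With these, `10.0.0.0/8' and `10.0.0.0/255.0.0.0' both accept 10.200.3.4, and `ppp+' accepts ppp0 and ppp10 but not eth0.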

Extended Packet Matching Options

In addition to the standard fields above, extended options are available in two cases. Firstly, the `-m' or `--match' argument can be used to load up a match module, which may provide extra options. Secondly, if no `--match' is specified, but a `--protocol' argument is, and that protocol has a match module, then that module may provide extra options. If you want the help message to include help on a particular set of extended packet matching options, just use the `--protocol' or `--match' argument before the `--help' argument.

These match modules have two parts: the kernel part is described above, and the userspace part lives in a shared library. By default, iptables tries to load the library from the `/usr/local/lib/iptables/' directory.

Man, I don't get paid enough to write documentation.

Anyway, if you get an "Unknown arg `--syn'" error, or similar, it could be that you didn't specify the match or protocol argument first.

There are four extended packet matching modules included in the base package at the moment. These are:

tcp

This module is automatically loaded if `--protocol tcp' is specified, and no other match is specified. It provides the following options:

--tcp-flags

Followed by an optional `!', then two strings of flags, allows you to filter on specific TCP flags. The first string of flags is the mask: a list of flags you want to examine. The second string of flags tells which one(s) should be set. For example,

# iptables -A INPUT --protocol tcp --tcp-flags ALL SYN,ACK -j DROP

This indicates that all flags should be examined (`ALL' is synonymous with `SYN,ACK,FIN,RST,URG,PSH'), but only SYN and ACK should be set. There is also an argument `NONE' meaning no flags.

--syn

Optionally preceded by a `!', this is shorthand for `--tcp-flags SYN,RST,ACK SYN'.

--source-port

followed by an optional `!', then either a single TCP port, or a range of ports. Ports can be port names, as listed in /etc/services, or numeric. Ranges are either two port names separated by a `:', or (to specify greater than or equal to a given port) a port with a `:' appended, or (to specify less than or equal to a given port), a port preceded by a `:'.

--sport

is synonymous with `--source-port'.

--destination-port

and

--dport

are the same as above, only they specify the destination, rather than source, port to match.

--tcp-option

followed by an optional `!' and a number, matches a packet with a TCP option equalling that number.
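The mask-and-compare logic of `--tcp-flags' (and the `--syn' shorthand) is easier to see in code. What follows is a minimal Python sketch of the semantics described above, not the module's implementation; the function names are invented for the illustration, and `!' inversion is omitted.

```python
FLAG_NAMES = ("SYN", "ACK", "FIN", "RST", "URG", "PSH")


def parse_flags(spec: str) -> frozenset:
    """Turn 'ALL', 'NONE' or a comma-separated list into a set of flags."""
    if spec == "ALL":
        return frozenset(FLAG_NAMES)
    if spec == "NONE":
        return frozenset()
    flags = frozenset(spec.split(","))
    unknown = flags - frozenset(FLAG_NAMES)
    if unknown:
        raise ValueError(f"unknown TCP flags: {sorted(unknown)}")
    return flags


def tcp_flags_match(mask: str, comp: str, packet_flags) -> bool:
    """Examine only the flags in mask; exactly those in comp must be set."""
    examined = parse_flags(mask)
    must_be_set = parse_flags(comp)
    return (frozenset(packet_flags) & examined) == must_be_set


def syn_match(packet_flags) -> bool:
    """The --syn shorthand: --tcp-flags SYN,RST,ACK SYN."""
    return tcp_flags_match("SYN,RST,ACK", "SYN", packet_flags)
```

Note that `--syn' ignores flags outside its mask: a packet with SYN and PSH set still matches, while one with SYN and ACK set does not.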

udp

This module is automatically loaded if `--protocol udp' is specified, and no other match is specified. It provides the options `--source-port', `--sport', `--destination-port' and `--dport' as detailed for TCP above.
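The port range syntax shared by the tcp and udp modules can be sketched like this in Python. It is an illustration only: the function names are made up, and instead of reading /etc/services the sketch takes port names from a small dictionary supplied by the caller.

```python
MAX_PORT = 65535


def _port(token: str, names: dict) -> int:
    """Resolve a numeric port or a port name from the supplied table."""
    if token.isdigit():
        return int(token)
    return names[token]  # raises KeyError for an unknown name


def parse_port_range(spec: str, names=None) -> tuple:
    """Parse 'port', 'low:high', 'low:' or ':high' into (low, high)."""
    names = names or {}
    if ":" not in spec:
        port = _port(spec, names)
        return (port, port)
    low, high = spec.split(":", 1)
    return (
        _port(low, names) if low else 0,          # ':high' means up to high
        _port(high, names) if high else MAX_PORT,  # 'low:' means low and above
    )


def port_matches(spec: str, port: int, names=None) -> bool:
    low, high = parse_port_range(spec, names)
    return low <= port <= high
```

So `1024:' covers ports 1024 through 65535, and `:1023' covers 0 through 1023.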

icmp

This module is automatically loaded if `--protocol icmp' is specified, and no other match is specified. It provides only one option:

--icmp-type

followed by an optional `!', then either an icmp type name (eg `host-unreachable'), or a numeric type (eg. `3'), or a numeric type and code separated by a `/' (eg. `3/3'). A list of available icmp type names is given using `-p icmp --help'.
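The three forms accepted by `--icmp-type' can be sketched as follows. This is a Python illustration with invented function names; the name table holds only the one name mentioned above (host-unreachable is ICMP type 3, code 1), whereas the real list comes from `-p icmp --help'.

```python
# Deliberately tiny, illustrative table: name -> (type, code).
ICMP_NAMES = {"host-unreachable": (3, 1)}


def parse_icmp_type(spec: str) -> tuple:
    """Return (type, code); code is None when any code should match."""
    if spec in ICMP_NAMES:
        return ICMP_NAMES[spec]
    if "/" in spec:
        icmp_type, code = spec.split("/", 1)
        return (int(icmp_type), int(code))
    return (int(spec), None)


def icmp_matches(spec: str, icmp_type: int, icmp_code: int) -> bool:
    want_type, want_code = parse_icmp_type(spec)
    return icmp_type == want_type and (want_code is None or icmp_code == want_code)
```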

mac

This module must be explicitly specified with `-m mac' or `--match mac'. It is used for matching an incoming packet's source Ethernet (MAC) address, and is thus only useful for packets traversing the INPUT and FORWARD chains. It provides only one option:

--mac-source

followed by an optional `!', then an ethernet address in colon-separated hexbyte notation, eg `--mac-source 00:60:08:91:CC:B7'.

Standard Targets

Once you've specified what packets to match, you have to specify what to do with the matched packets. This is done using the `-j' option. If the end of a user-defined chain is reached, then the packet traversal resumes at the chain which called it. If the end of a built-in chain is reached, then the chain's policy is consulted: this is an unconditional rule at the end of the chain which says what to do in this case (often, DROP the packet).

The standard targets are:

ACCEPT

Pass the packet on.

DROP

Drop the packet; eat it.

RETURN

Act as if this rule was the last in its chain; if it's a user-defined chain, this returns to the calling chain. If it's a built-in chain, the chain's policy is consulted.

QUEUE

Queue the packet for userspace handling. If there is no program waiting to handle the packet, or there are too many packets queued, this has the same effect as DROP.

chainname

If the option after `-j' is the name of a user-defined chain, then any packet which matches the rule will begin traversing that chain.

none

If no `-j' option is specified, then the next rule in that chain will be consulted. As each rule has a packet and a byte counter, this is useful for counting types of packets.
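How the standard targets interact with chain traversal can be modelled in a few lines. This Python sketch is a toy, not the kernel's traversal code: the Rule class, the dictionary of chains and the function names are all assumptions made for the illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional

VERDICTS = ("ACCEPT", "DROP", "QUEUE")


@dataclass
class Rule:
    matches: Callable[[dict], bool]
    target: Optional[str] = None  # None models a rule with no -j
    packets: int = 0              # per-rule packet counter


def run_chain(chains: dict, name: str, packet: dict) -> Optional[str]:
    """Return a verdict, or None if the chain ends or hits RETURN."""
    for rule in chains[name]:
        if not rule.matches(packet):
            continue
        rule.packets += 1
        if rule.target is None:
            continue              # counting-only rule: try the next one
        if rule.target == "RETURN":
            return None
        if rule.target in VERDICTS:
            return rule.target
        # Anything else is a user-defined chain: traverse it, and resume
        # here if it does not reach a verdict.
        verdict = run_chain(chains, rule.target, packet)
        if verdict is not None:
            return verdict
    return None


def filter_packet(chains: dict, policies: dict, builtin: str, packet: dict) -> str:
    """Traverse a built-in chain; fall back to its policy at the end."""
    verdict = run_chain(chains, builtin, packet)
    return verdict if verdict is not None else policies[builtin]
```

A packet which falls off the end of a user-defined chain picks up again at the rule after the one that sent it there, and only the built-in chain's policy can finally decide its fate.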

Extended Targets

New target modules can be written for iptables, which add options in a similar manner to the way new packet-matching modules add options.

Like packet-matching modules, target modules have two parts: the kernel part is described above, and the userspace part lives in a shared library. By default, iptables tries to load the library from the `/usr/local/lib/iptables/' directory, same as extended match modules.

There are two extended target modules included in the default distribution. These are:

LOG

This module (when written) provides kernel logging of matching packets. It provides these additional options:

--log-level

Followed by a level number or name. Valid names are (case-insensitive) `debug', `info', `notice', `warning', `err', `crit', `alert' and `emerg', corresponding to numbers 7 through 0. See the man page for syslog.conf for an explanation of these levels.

--log-limit

This option specifies that the messages should be limited; bursts are allowed, but the average rate can never exceed one message every 5 seconds. This avoids severe log-flooding or overloading.

--log-prefix

Followed by a string of up to 14 characters, this message is sent at the start of the log message, to allow it to be uniquely identified.
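As a quick illustration of the `--log-level' names and the `--log-prefix' length limit described above, here is a small Python sketch. The function names are hypothetical, and this is not how the userspace library parses its arguments.

```python
# Index in this tuple is the numeric level: emerg is 0, debug is 7.
LEVEL_NAMES = ("emerg", "alert", "crit", "err", "warning", "notice", "info", "debug")
MAX_PREFIX = 14


def parse_log_level(value: str) -> int:
    """Accept a number from 0 to 7 or a case-insensitive level name."""
    if value.isdigit():
        level = int(value)
        if not 0 <= level <= 7:
            raise ValueError(f"log level out of range: {level}")
        return level
    return LEVEL_NAMES.index(value.lower())  # ValueError if unknown


def check_log_prefix(prefix: str) -> str:
    """Reject prefixes longer than the 14 characters mentioned above."""
    if len(prefix) > MAX_PREFIX:
        raise ValueError(f"prefix longer than {MAX_PREFIX} characters")
    return prefix
```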

REJECT

This module has the same effect as `DROP', except that the sender is sent an ICMP `port unreachable' error message. Note that the ICMP error message is not sent if (see RFC 1122):

3.2 Packet Filtering with ipfwadm or ipchains

One day, when I write the necessary compatibility layer, you'll be able to do a simple "modprobe ipfwadm.o" and use the normal ipfwadm tool, as it was used for Linux 2.0. A similar approach will allow use of ipchains, as was used for Linux 2.2.

Of course, I have to write it. Then I have to debug it. Then I have to decide if I'm going to do the trickier things, like masquerading and redirect.

3.3 Network Address Translation with ipnatctl

There is a new Network Address Translation system which works on top of the netfilter framework. Like iptables, it has a kernel part and a userspace part, and it is extensible to cover new protocols, and other weird cases.

An Introduction To Network Address Translation

Network Address Translation is funky. The idea is to mangle packets on the fly as they pass through one way, and hope you can recognize the reply packets passing through the other way so you can unmangle them.

Note that this requires that both the original and reply packets pass through the Network Address Translation box (this is important to realize if you're getting really tricky).

On one level, this is simple. For example, if you have a network with IP addresses 1.2.0.0/16 behind your Linux box, and you want them to have addresses 1.3.0.0/16 instead, you can get the Linux box to alter all the source IP addresses on the way out of your network from 1.2 to 1.3, and the destination IP addresses on the way into your network from 1.3 to 1.2.

That is called static NAT, or (as implemented by Alexey Kuznetsov in Linux) "Fast NAT". It's actually quite easy to do, and is controlled by the routing code.
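The 1.2.0.0/16 to 1.3.0.0/16 example is just a one-to-one rewrite of the network part of the address. A minimal Python sketch of that arithmetic follows; the function name is made up, and this says nothing about how the routing code actually does it.

```python
import ipaddress


def remap(addr: str, from_net: str, to_net: str) -> str:
    """Move addr from from_net to the same host offset in to_net."""
    src = ipaddress.ip_network(from_net)
    dst = ipaddress.ip_network(to_net)
    if src.prefixlen != dst.prefixlen:
        raise ValueError("static NAT needs two networks of the same size")
    ip = ipaddress.ip_address(addr)
    if ip not in src:
        return addr  # not ours: leave the address alone
    offset = int(ip) - int(src.network_address)
    return str(dst.network_address + offset)


# Source address on the way out, destination address on the way back in.
outgoing_source = remap("1.2.5.9", "1.2.0.0/16", "1.3.0.0/16")
incoming_destination = remap(outgoing_source, "1.3.0.0/16", "1.2.0.0/16")
```

Because the mapping is one-to-one, no state needs to be kept: the reply can be unmangled by applying the same function with the networks swapped.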

Unfortunately, life isn't always that easy. Sometimes you want to map a range of addresses onto a smaller range. The most frequent use of NAT in the world at the moment is Linux 2.0 and 2.2's "masquerading" feature, in which an entire network is mapped onto a single IP address (the IP address of the masquerading box's external interface), which is also used by the masquerading box itself!

On top of that, some standard protocols (ftp) don't like (ftp) being masqueraded (ftp), but I won't (ftp) mention any (ftp) names just yet. Proprietary protocols are even worse in this regard.

Another use of NAT is what I call RNAT (Reverse NAT), or load-sharing NAT. In this case it is the destination, not the source, which is altered: frequently this is used to map a single IP address onto a farm of servers, such as for a heavy porn... err... Web server.

The most common form of RNAT today is Linux 2.2's "port-forwarding" feature, which is usually used to direct connections to a single TCP port to another server. This is frequently used in combination with masquerading, where the server in question doesn't have a valid IP address, and so cannot be connected to directly.

One of the good things about Free Software is the cool people involved, such as Freshmeat's Patrick Lenz, who provides me with Freshmeat stats using ipchains, to count the total number of connections, and the number which are probably from masqueraded connections. Here is a recent snapshot of those stats:

Chain input (policy ACCEPT: 228314119 packets, 21789959697 bytes):
     pkts      bytes target     prot opt    tosa tosx  ifname     mark       outsize  source                destination           ports
  1083067   55439939 -          tcp  -y---- 0xFF 0x00  any                            anywhere              anywhere              61000:65095 ->   any
 12363685  621294605 -          tcp  -y---- 0xFF 0x00  any                            anywhere              anywhere              any ->   any

That is, about 8.76% of connections to Freshmeat are probably masqueraded.
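The percentage comes straight from the two packet counters in the listing. Both rules count TCP SYN packets (the `-y' flag); the first only counts those with a source port in 61000:65095, which I am assuming is used here as the tell-tale of a masquerading box. In Python, the arithmetic is:

```python
# Packet counters copied from the two rules in the listing above.
syns_from_masq_ports = 1083067   # source port in 61000:65095
syns_total = 12363685            # any source port

share = syns_from_masq_ports / syns_total * 100
print(f"{share:.2f}% of connections look masqueraded")
```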

Another common form used is "transparent proxying", where connections which would ordinarily pass through the masquerading box are RNAT'ed to the box itself.

Using The ipnatctl Tool

Your gateway to the wonderful world of Linux NAT is the tool ipnatctl. You can think of this as a wonderful tool for screwing your network over worse than you ever imagined was possible.

ipnatctl allows you to insert (`-I') and delete (`-D') rules. Rules are implicitly ordered, like the way routing information is implicitly ordered: more specific rules take precedence over less specific rules.

Each rule has three parts:

  1. An input matching part, which says what packets can match the rule.
  2. An output range part, which says what address range to map the packet onto, and what type of mapping (source or destination).
  3. An optional binding part, which says what binding to create for this packet stream (and the replies).

When a packet matches a rule, a "binding" is created, which indicates how to recognize other packets in the same stream (which have to be mangled the same way), and how the packet is to be mangled. It also specifies both these things for the replies.

Let's look at a simplified example. We have created a rule which says all UDP packets going out ppp0 should be mapped onto the source IP address 1.2.3.4:

  1. A UDP packet from 192.168.1.1 goes out ppp0. The NAT code sees it, and first it checks the bindings: no binding matches that packet. Then it checks the rules, sees the rule we set up above, and sets up a binding, which it applies to the packet (so the packet now looks like it's coming from 1.2.3.4).
  2. The reply comes back, to 1.2.3.4. The NAT code sees it: it checks the bindings, and it matches the one set up previously, thus it maps the packet back to go to 192.168.1.1.
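The two steps above (check the bindings first, then the rules, and remember how to undo the change for replies) can be sketched in Python. This is a toy model under loud assumptions: the class and method names are invented, only source NAT is handled, ports are never altered, and collisions between streams are ignored.

```python
from typing import NamedTuple


class Packet(NamedTuple):
    proto: str
    src: str
    sport: int
    dst: str
    dport: int


class SourceNat:
    def __init__(self):
        self.rules = []    # (proto, outgoing interface, new source IP)
        self.forward = {}  # original packet -> mangled packet
        self.reply = {}    # expected reply -> unmangled reply

    def add_rule(self, proto: str, out_interface: str, new_source: str) -> None:
        self.rules.append((proto, out_interface, new_source))

    def outgoing(self, pkt: Packet, out_interface: str) -> Packet:
        # Bindings are consulted before the rules.
        if pkt in self.forward:
            return self.forward[pkt]
        for proto, iface, new_source in self.rules:
            if pkt.proto == proto and out_interface == iface:
                mangled = pkt._replace(src=new_source)
                self.forward[pkt] = mangled
                # The binding also says how to recognize and fix replies.
                expected = Packet(pkt.proto, pkt.dst, pkt.dport, new_source, pkt.sport)
                self.reply[expected] = expected._replace(dst=pkt.src)
                return mangled
        return pkt  # no rule matched: leave the packet alone

    def incoming(self, pkt: Packet) -> Packet:
        return self.reply.get(pkt, pkt)


nat = SourceNat()
nat.add_rule("udp", "ppp0", "1.2.3.4")
sent = nat.outgoing(Packet("udp", "192.168.1.1", 5000, "9.9.9.9", 53), "ppp0")
back = nat.incoming(Packet("udp", "9.9.9.9", 53, "1.2.3.4", 5000))
```

Here `sent' carries the source 1.2.3.4, and `back' is addressed to 192.168.1.1 again. The address 9.9.9.9 and the port numbers are arbitrary values picked for the example.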

Like iptables, ipnatctl is extensible: new protocols and new bindings can be created, and several are included in the base distribution.

Standard Match Options

There are several standard options for matching packets: each protocol can provide extra options, as we will see below.

--source (-s)

Followed by an IP address or range, this option allows the specification of a particular source IP address, or a range of source addresses (using the `/' mask notation, such as `192.168.1.0/24' or `192.168.1.0/255.255.255.0').

--destination (-d)

Followed by an IP address or range, this is used to specify a particular destination IP address or range, similar to the above.

--manipulation-type (-m)

Followed by part or all of the words "source" or "destination", to indicate whether a source (NAT) or destination (RNAT) mapping is desired.

--interface (-i)

Followed by an interface name, indicates that the packet must be on that interface to match the rule. For source manipulations (NAT), this is the outgoing interface, and for destination manipulations (RNAT), this is the incoming interface.

--local (-l)

This flag (which can't be used with the `--interface' option) indicates that the manipulation is to be done on local packets. This is a special case, for altering the destination of locally-generated packets; this cannot be (and does not need to be) used for source mappings.

--protocol (-p)

Followed by a protocol number or name, means that only packets of the given protocol will match the rule. If the protocol contains special support, this causes the loading of extra options, as we'll see below.

Extended Match Options

A protocol can provide extended options: currently TCP and UDP do. Each extension has two parts: a kernel module (eg. "ip_nat_tcp.o"), and a shared library (eg. "libnatctl_proto_tcp.so"). The shared libraries should reside in the "/usr/local/lib/ipnatctl/" directory.

Protocol-specifics are enabled by using the `--protocol' option to ipnatctl: if the kernel modules are placed in the right directory (usually /lib/modules/`uname -r`/net/), they should auto-load. Otherwise these modules can be manually insmod'd after the core ip_nat module.

The protocols which have specific options are:

tcp

This provides the following options:

--source-port

or

--sport

Followed by a port number, indicates that only packets from this TCP port should match the rule.

--destination-port

or

--dport

Followed by a port number, indicates that only packets to this TCP port should match the rule.

udp

This provides the following options:

--source-port

or

--sport

Followed by a port number, indicates that only packets from this UDP port should match the rule.

--destination-port

or

--dport

Followed by a port number, indicates that only packets to this UDP port should match the rule.

Standard Output Options

There is only one standard output option:

--to (-t)

followed by either a single IP address or an address range, indicates the IP range onto which the source or destination IP address of the packet is to be mapped.

The range can be an address and a mask, like the `--source' option, or a `-' separated inclusive IP range, like `192.168.1.1-192.168.1.3'.
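The forms accepted by `--to' can all be reduced to a first and last address. A minimal Python sketch, with a made-up function name; whether the mask form includes the network and broadcast addresses is an assumption of this sketch, not something stated above.

```python
import ipaddress


def parse_to_range(spec: str) -> tuple:
    """Return (first, last), inclusive, for a --to style argument."""
    if "-" in spec:
        first, last = (ipaddress.ip_address(part) for part in spec.split("-", 1))
        if first > last:
            raise ValueError("range end is below range start")
        return (first, last)
    # A bare address parses as a /32 network, so first == last.
    network = ipaddress.ip_network(spec, strict=False)
    return (network.network_address, network.broadcast_address)


first, last = parse_to_range("192.168.1.1-192.168.1.3")
addresses_in_range = int(last) - int(first) + 1
```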

Extended Output Options

The `--protocol' option not only can add extra match options, but also extra output options, which add protocol-specific restrictions on how the packet source or destination address can be mapped.

tcp

This provides the following options:

--to-port

Followed by a port number, or a `-' separated port range, indicates that packets must be mapped onto this TCP port or port range.

udp

This provides the following options:

--to-port

Followed by a port number, or a `-' separated port range, indicates that packets must be mapped onto this UDP port or port range.

Binding Options

Once we've specified what packets the rule matches, and the range into which they should be mangled, we need to specify exactly how the packets are to be mapped onto that range.

This is the reason for the optional `--binding' option: it is followed by a comma-separated list of binding types which will handle creation of the binding.

If no `--binding' option is specified, the `generic' binding is used. This binding searches for an unused mapping in the given range, as follows:

  1. If it's a source (NAT) mapping, and a binding exists with the same source IP, source port and protocol, and that binding maps into the range specified, then a copy of that binding is used. This means that if you send out two UDP packets from the same port on the same machine, and both are NAT'ed, they will be NAT'ed onto the same source IP address and port, ie. they will still appear to come from the same source. This is described beautifully in Dan Kegel's submission on Internet gaming (FIXME: get URL).
  2. Otherwise, we choose the least-used IP and protocol combination in the range. (Stop telling lies: discuss the iterations if a range has multiple parts). If the least-used IP and protocol combination is full, that implies that the entire range is full. We try to figure out a mapping onto this range. If we understand the protocol (eg. TCP), this is easier, because we can map more than one packet stream onto the same IP (eg. using different TCP ports).
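The two steps of the `generic' binding can be sketched roughly as follows. This Python model is a guess at the shape of the logic, not the kernel code: the class and its fields are invented, port selection is ignored entirely, and ties between equally-used addresses simply go to the first one in the range.

```python
from collections import Counter


class GenericBinding:
    def __init__(self, range_ips):
        self.range_ips = list(range_ips)
        self.by_source = {}     # (proto, source IP, source port) -> chosen IP
        self.usage = Counter()  # (chosen IP, proto) -> number of bindings

    def choose(self, proto: str, src_ip: str, src_port: int) -> str:
        key = (proto, src_ip, src_port)
        # Step 1: the same source keeps the mapping it already has.
        if key in self.by_source:
            return self.by_source[key]
        # Step 2: otherwise take the least-used IP for this protocol.
        ip = min(self.range_ips, key=lambda candidate: self.usage[(candidate, proto)])
        self.usage[(ip, proto)] += 1
        self.by_source[key] = ip
        return ip
```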

Some bindings in the standard package are:

masquerade

Instead of mapping the source onto a fixed IP address, this maps the source onto the IP address of the interface the packet is heading out, making the packet seem to come from the box itself. Thus, with this binding, the `--to' option is ignored (but the protocol-specific options, such as TCP's `--to-port', still have effect). This only works as a source manipulation.

redirect

Instead of mapping the destination onto a fixed IP address, this maps the destination onto the IP address of the interface the packet is heading in, making the packet head to the box itself. Thus, with this binding, the `--to' option is ignored (but the protocol-specific options, such as TCP's `--to-port', still have effect). This only works as a destination manipulation.

Problems

When Binding Fails: Congestion

If you're trying to masquerade your 100,000 node network onto three TCP ports, you'll eventually have more than three of them trying to connect to the same server, and creating the binding fails. (This is because the NAT code will never create two connections which look identical). The packet which evoked the binding will be dropped.

Normally, dropping the packet is the right thing: IP is designed with the assumption that in case of congestion, packets will be dropped. If those three connections are web connections, and thus short lived, you're in with a good chance when your TCP stack retransmits. If, however, those connections are all long lived, you're SOL.
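The three-port case can be played out in a toy model. This Python sketch assumes (my simplification, not the NAT code) that a mapped connection is identified by the destination address, destination port and the port we mapped onto, so no two bindings may share all three. The class name, the addresses and the port numbers are all invented for the example.

```python
class PortPool:
    def __init__(self, ports):
        self.ports = list(ports)
        self.in_use = {}  # (dst IP, dst port, mapped port) -> (src IP, src port)

    def bind(self, src: str, sport: int, dst: str, dport: int):
        """Return a free mapped port, or None if the binding fails."""
        for port in self.ports:
            key = (dst, dport, port)
            if key not in self.in_use:
                self.in_use[key] = (src, sport)
                return port
        return None  # congestion: the packet would be dropped


pool = PortPool([61000, 61001, 61002])
results = [pool.bind(f"10.0.0.{n}", 1025, "9.9.9.9", 80) for n in range(1, 5)]
```

The first three hosts get a port each and the fourth gets None, although a connection to a different server would still succeed, since it cannot be confused with the other three.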

Crashing Machines

The classic problem at the moment is TCP connections which don't close properly (usually a Windows 95 machine or a Mac got the plug pulled; such machines have no place on a network). Established TCP connections take 6 hours to time out.

Denial of Service Attacks

To quote a post to Linux Kernel:

> Hi,
> 
> We have a problem with ip_masqurading set up as a firewall. When someone
> runs a stealth scan from the masquraded net to the outside net, it will
> very fast consume all available masqurade ports. The result is a nasty
> DoS for all adresses on the masquraded net.

Take a baseball bat to the stealth-scanning motherfucker, and the
problem will be resolved.

There are several possible DOS attacks from INSIDE a NAT host.  Fixing
this one doesn't win much.

Trust me on the baseball bat,
Rusty.

I included this for two reasons: firstly, we don't have enough vulgarities in HOWTOs (cf. Linux kernel code). Secondly, it illustrates a problem, especially if you are running a big site.

There is a solution for TCP (I call it Forgotten Unused Connection Knowledge, "Probe Me Harder" or AckNowledgement Activity Layer Probe), but it involves getting down and nasty, and I'm not sure it's a good idea. But you know that now I've found a solution I'm gonna code it eventually.

Unknown Protocols

You have to understand ICMP, UDP and TCP. ICMP, because it's used by the other protocols to report errors, and UDP and TCP because they're misdesigned, such that their internal checksum includes the IP source and destination addresses, so NAT breaks them.
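To see why changing an address breaks the checksum, here is a small Python sketch which computes a UDP checksum over the IPv4 pseudo-header (source address, destination address, protocol and length) plus the UDP header and payload, using the standard Internet checksum. The helper names are my own, and the addresses and ports are arbitrary example values.

```python
import socket
import struct


def internet_checksum(data: bytes) -> int:
    """One's complement of the one's complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold the carries back in
    return ~total & 0xFFFF


def udp_checksum(src_ip: str, dst_ip: str, sport: int, dport: int, payload: bytes) -> int:
    length = 8 + len(payload)  # UDP header plus data
    pseudo = (
        socket.inet_aton(src_ip)
        + socket.inet_aton(dst_ip)
        + struct.pack("!BBH", 0, socket.IPPROTO_UDP, length)
    )
    header = struct.pack("!HHHH", sport, dport, length, 0)  # checksum field zeroed
    checksum = internet_checksum(pseudo + header + payload)
    return checksum or 0xFFFF  # a computed zero is sent as all ones


before_nat = udp_checksum("192.168.1.1", "9.9.9.9", 5000, 53, b"hello")
after_nat = udp_checksum("1.2.3.4", "9.9.9.9", 5000, 53, b"hello")
```

The two values differ even though not a single byte of the UDP packet itself changed, so whoever rewrites the source address has to fix up the checksum as well.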

For other protocols, we give it a damn good shot. If they're NAT-friendly, they should "just work", although congestion (see above) gets worse. Of course, since we know nothing about them, we don't know when a stream is finished, so we just keep the binding around for an hour since the last packet, making congestion an even bigger possibility.

Multiple Rules Which Map To Different Ranges

Eventually, the kernel will merge this into a single rule, providing seamless integration. Birds will sing, the sun will shine, and your breath will smell sweeter.

Until that day, try breathmints (``Ooooh.... Grog-o-mint, my favourite!'').

Fragments

Compile your kernel with CONFIG_IP_ALWAYS_DEFRAG, and be happy. I have grand plans for dealing with NAT of fragments, simply because it is possible. I want to see how messy it gets, though, and I can't think of a good reason for doing it. There are plenty of bad reasons though: it would open the doors to implementing parallel NAT machines, and it would piss off the authors of the draft NAT RFC, who said it couldn't be done.

How About Some Examples, and You're Ugly.

Since you were insulting, I'm not going to give a comprehensive set of examples. But here are a few to get you started:

Getting Help

Try these:

# ipnatctl --help
# ipnatctl -p tcp --help
# ipnatctl -p udp --help

Making Things Work

Try this:

# insmod netfilter/NAT/protocols/ip_nat_tcp.o
# insmod netfilter/NAT/protocols/ip_nat_udp.o

Masquerade everything going out plip0

# insmod netfilter/NAT/bindings/ipnat_bind_masquerade.o
# ipnatctl -I -i plip0 -m source --binding masquerade

Make everything coming from 10.* and going out plip0 seem to come from 1.2.3.4

# ipnatctl -I -i plip0 -m source -s 10.0.0.0/8 --to 1.2.3.4

Redirect all requests going to 10.* to me.

# insmod netfilter/NAT/bindings/ipnat_bind_redirect.o
# ipnatctl -I -d 10.0.0.0/8 -m dest --binding redirect

Send all TCP requests coming in plip0 for port 80 to port 25

# ipnatctl -I -p tcp -i plip0 -m dest --dport 80 --to-port 25

Send all local connections to any TCP port 80 to the local port 8080

# ipnatctl -I -p tcp -l --dport 80 --to 127.0.0.1 --to-port 8080

3.4 Masquerading/REDIRECT With ipfwadm or ipchains

This may happen, with "backwards compatibility" modules. I use "backwards compatibility" in quotes, because I don't want users of these treated like second-class citizens. I prefer not to leave behind a trail of disappointed, bitter, out-for-blood Rusty haters. Let's leave that for my love-life.

3.5 Mixing iptables and ipnatctl

Should work fine. Load the ip_nat modules first, and you should always see "end-to-end" IP addresses (as far as such things exist in the real world).

So whether you're masquerading or not, the packet filter rules won't see it; it looks like your private network is directly connecting to the outside world, and vice versa.

If you're redirecting to a server farm, it looks to your packet filter as if external machines are connecting straight to each individual machine.

Of course, this means that iptables can't exploit the state-keeping nature of the NAT code to get cheap stateful inspection, but that's what the separation of NAT and packet filtering is all about.

