This section should describe everything from a user's perspective, ie. how to use the tools provided to create the network special effects you are after.
This section generally won't cover the merits of various approaches, or why you would want to do these things. That is the domain of larger documents (such as a book on networking). Sure, there are places when this document will tell you how to do terrible, terrible things which you shouldn't ever do. Ever.
On the other hand, if I make it easier for you to do them, then all those Linux consultants get paid the really big bucks to clean up after you.
iptables is the name given to the latest packet filtering system in the ipfwadm and ipchains line of Linux packet filtering methods. Like those two, it has both a kernel component and a userspace component. Unlike the others, it is usually built as a kernel module, and was designed to run on top of the netfilter framework.
The first thing you'll notice about iptables is the similarity to ipchains. This is partially to ease transition, and partially because I wrote both of them. The main benefit of iptables (for user, programmer and feature-lover) is extensibility.
So let's start with a quick guide to the differences between iptables and ipchains from a user's point of view, then go into a blow-by-blow description of the features.
[* Replaced by individual modules which provide this facility]
iptables may look a little like ipchains, but under the skin, iptables shows itself to be a simple framework for simple IP filtering, with the possibility of adding in new and funky features.
As an example, ipchains understands TCP, UDP and ICMP, so it can filter them. iptables does not; it loads up extra modules to help with these when specified.
There are many niche needs which ipchains didn't meet: this is my way of having my cake and eating it, too. It also means that distributions shipping pre-built kernels don't need to agonize over the growing kb wasted on packet filtering: it's modular.
Also, this allowed me to get the core code smaller than the old ipfwadm kernel code (unlike, *cough*, the ipchains kernel code), and I don't have to be embarrassed next time I see Jos Vos.
The kernel part of iptables is a module, called "ip_tables": you simply insmod it. It takes one optional argument: "forward=n", where n is 0 or 1, which sets the default policy for forwarding: DROP or ACCEPT. If it's not specified, the default is DROP.
Sometimes, when a rule asks for a specific (non-builtin) target or test, such as the "REJECT" target, or "tcp" test, it will require additional modules. If placed in the right directory (usually /lib/modules/`uname -r`/net/), they should auto-load. Otherwise these modules (with names like "ipt_REJECT.o" and "ipt_tcp.o") can be manually insmod'd after the core ip_tables module.
Once this has been done, the iptables program can be run. Using `iptables --help' will give reasonable usage information.
The `-L' or `--list' argument allows you to list the chains. If it is followed by a chain name, only that chain is listed (if it exists).
The `-v' or `--verbose' argument causes all information to be printed. The `-n' or `--numeric' option causes information to be printed numerically; useful for suppressing name lookup attempts. Finally, the `-x' or `--exact' option, when used with the -v option, causes the exact packet and byte counters (not the `324k'-style abbreviations) to be printed.
The `-v' or `--verbose' argument can also be used with any other command, to see exactly what is happening.
The `-A' or `--append' argument, followed by a chain name, is used to append a rule to the end of a chain. Similarly, `-I' or `--insert' inserts a rule at the beginning of a chain (or, followed by a rule number, insert at a given position in a chain, starting with the head at number 1).
The `-R' or `--replace' argument (followed by a chain name and a number) is used for atomically replacing one rule with another.
A rule can be deleted (`-D' or `--delete' followed by the chain name) with a similar syntax to appending; the first rule in the chain which matches the specification will be deleted. The other method is to delete by number, which simply follows the chain name.
New (user-defined) chains can be created using `-N' or `--new', followed by a chain name. An empty chain of that name is created if one does not already exist.
Empty user-defined chains can be deleted if no rules have it as a target, using the `-X' or `--delete-chain' arguments followed by the victim chain name.
Many fields in the standard IP header can be filtered on:
Followed by an optional `!' (meaning not), then an IP address or name. If an IP address, it can be followed by a mask, such as `/8' or `/255.0.0.0', both of which mean that only the first 8 bits of the address should be compared.
Used identically to the source address specification.
Followed by an optional `!', and an interface name. The name of the input interface to match. A packet passing through the OUTPUT chain has an input interface of "". If the interface name ends in a `+', then it means that any interface name starting with those characters should be matched, eg `ppp+' will match `ppp0' and `ppp10'.
Used identically to the above. A packet passing through the INPUT chain has an output interface of "".
Followed by an optional `!' and a number (usually a hexadecimal number prefixed with 0x), this matches only IP packets with the given Type of Service field.
Only match non-first fragments, ie. IP packets with a non-zero offset field. This argument can be preceded by a `!' to indicate that it is to be inverted (ie. match IP packets with a zero offset field only).
Followed by an optional `!' and a protocol name or number, this matches only IP packets of the given protocol. It also has a side effect of loading the per-protocol module, if any, as we will see below.
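As a rough illustration of the address and interface matching rules above, here is a small Python sketch. It is not how the kernel does it: the function names `address_matches' and `interface_matches' are made up for this example, and it leans on Python's standard `ipaddress' module to show that `/8' and `/255.0.0.0' describe the same mask.

```python
import ipaddress


def address_matches(spec, address, invert=False):
    """Return True if address falls inside spec, e.g. '10.0.0.0/8'.

    invert=True models a leading `!' in front of the specification.
    """
    # strict=False accepts specs with host bits set, such as '10.1.2.3/8'.
    network = ipaddress.ip_network(spec, strict=False)
    matched = ipaddress.ip_address(address) in network
    return matched != invert


def interface_matches(pattern, name):
    """A pattern ending in '+' matches any name starting with that prefix."""
    if pattern.endswith("+"):
        return name.startswith(pattern[:-1])
    return pattern == name
```

With these definitions, `address_matches("10.0.0.0/8", "10.200.3.4")' and `address_matches("10.0.0.0/255.0.0.0", "10.200.3.4")' give the same answer, and `interface_matches("ppp+", "ppp10")' is true.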
In addition to the standard fields above, extended options are available in two cases. Firstly, the `-m' or `--match' argument can be used to load up a match module, which may provide extra options. Secondly, if no `--match' is specified, but a `--protocol' argument is specified, and that protocol provides a match module, then that module may provide extra options. If you want the help message to include help on a particular set of extended packet matching options, just use the `--protocol' or `--match' before the `--help' argument.
These match modules have two parts: the kernel part is described above, and the userspace part lives in a shared library. By default, iptables tries to load the library from the `/usr/local/lib/iptables/' directory.
Man, I don't get paid enough to write documentation.
Anyway, if you get an "Unknown arg `--syn'" error, or similar, it could be that you didn't specify the match or protocol argument first.
There are four extended packet matching modules included in the base package at the moment. These are:
This module is automatically loaded if `--protocol tcp' is specified, and no other match is specified. It provides the following options:
Followed by an optional `!', then two strings of flags, allows you to filter on specific TCP flags. The first string of flags is the mask: a list of flags you want to examine. The second string of flags tells which one(s) should be set. For example,
# iptables -A INPUT --protocol tcp --tcp-flags ALL SYN,ACK -j DROP
This indicates that all flags should be examined (`ALL' is synonymous with `SYN,ACK,FIN,RST,URG,PSH'), but only SYN and ACK should be set. There is also an argument `NONE' meaning no flags.
Optionally preceded by a `!', this is shorthand for `--tcp-flags SYN,RST,ACK SYN'.
followed by an optional `!', then either a single TCP port, or a range of ports. Ports can be port names, as listed in /etc/services, or numeric. Ranges are either two port names separated by a `:', or (to specify greater than or equal to a given port) a port with a `:' appended, or (to specify less than or equal to a given port), a port preceded by a `:'.
is synonymous with `--source-port'.
and
are the same as above, only they specify the destination, rather than source, port to match.
followed by an optional `!' and a number, matches a packet with a TCP option equalling that number.
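The mask-and-compare behaviour of `--tcp-flags' (and the `--syn' shorthand built on it) is easier to see in code. The following Python sketch only models the semantics described above; the function names are invented for the example, and a packet is represented simply as the set of flag names it has set.

```python
ALL_FLAGS = frozenset({"SYN", "ACK", "FIN", "RST", "URG", "PSH"})


def parse_flags(text):
    """Turn 'SYN,ACK', 'ALL' or 'NONE' into a set of flag names."""
    if text == "ALL":
        return ALL_FLAGS
    if text == "NONE":
        return frozenset()
    return frozenset(text.split(","))


def tcp_flags_match(mask, comp, packet_flags):
    """True if, among the flags named in mask, exactly those in comp are set."""
    examined = frozenset(packet_flags) & parse_flags(mask)
    return examined == parse_flags(comp)


def syn_match(packet_flags):
    """Model of the shorthand: --tcp-flags SYN,RST,ACK SYN."""
    return tcp_flags_match("SYN,RST,ACK", "SYN", packet_flags)
```

Note that flags outside the mask are ignored: a packet with SYN and PSH set still satisfies `syn_match', because PSH is not among the flags examined.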
This module is automatically loaded if `--protocol udp' is specified, and no other match is specified. It provides the options `--source-port', `--sport', `--destination-port' and `--dport' as detailed for TCP above.
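The port range notation shared by the TCP and UDP options can be sketched as follows. This is an illustration in Python, not the iptables parser: `parse_port_range' and `port_matches' are hypothetical names, only numeric ports are handled (looking up names in /etc/services is left out), and the open ends are assumed to default to 0 and 65535.

```python
def parse_port_range(spec):
    """Return an inclusive (low, high) pair for '80', '6000:6010', '1024:' or ':1023'."""
    if ":" not in spec:
        port = int(spec)
        return port, port
    low, high = spec.split(":", 1)
    # An empty side means "open ended": assumed here to be 0 or 65535.
    return (int(low) if low else 0, int(high) if high else 65535)


def port_matches(spec, port):
    """True if port lies inside the range described by spec."""
    low, high = parse_port_range(spec)
    return low <= port <= high
```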
This module is automatically loaded if `--protocol icmp' is specified, and no other match is specified. It provides only one option:
followed by an optional `!', then either an icmp type name (eg `host-unreachable'), or a numeric type (eg. `3'), or a numeric type and code separated by a `/' (eg. `3/3'). A list of available icmp type names is given using `-p icmp --help'.
This module must be explicitly specified with `-m mac' or `--match mac'. It is used for matching an incoming packet's source ethernet (MAC) address, and is thus only useful for packets traversing the INPUT and FORWARD chains. It provides only one option:
followed by an optional `!', then an ethernet address in colon-separated hexbyte notation, eg `--mac-source 00:60:08:91:CC:B7'.
Once you've specified what packets to match, you have to specify what to do with the matched packets. This is done using the `-j' option. If the end of a user-defined chain is reached, then the packet traversal resumes at the chain which called it. If the end of a built-in chain is reached, then the chain's policy is consulted: this is an unconditional rule at the end of the chain which says what to do in this case (often, DROP the packet).
The standard targets are:
Pass the packet on.
Drop the packet; eat it.
Act as if this rule was the last in its chain; if it's a user-defined chain, this returns to the calling chain. If it's a built-in chain, the chain's policy is consulted.
Queue the packet for userspace handling. If there is no program waiting to handle the packet, or there are too many packets queued, this has the same effect as DROP.
If the option after `-j' is the name of a user-defined chain, then any packet which matches the rule will begin traversing that chain.
If no `-j' option is specified, then the next rule in that chain will be consulted. As each rule has a packet and a byte counter, this is useful for counting types of packets.
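To make the traversal rules above concrete, here is a toy model in Python. It is a sketch under simplifying assumptions, not the kernel's algorithm: rules are plain dictionaries with a `match' function and an optional `target', only a packet counter is kept (no byte counter), and the QUEUE target is left out. All names are invented for the example.

```python
def run_chain(chains, name, packet):
    """Walk one chain; return 'ACCEPT', 'DROP', or None if its end is reached."""
    for rule in chains[name]:
        if not rule["match"](packet):
            continue
        rule["packets"] = rule.get("packets", 0) + 1  # per-rule packet counter
        target = rule.get("target")
        if target is None:
            continue  # no -j given: the rule only counts
        if target in ("ACCEPT", "DROP"):
            return target
        if target == "RETURN":
            return None  # act as if this rule was the last in the chain
        verdict = run_chain(chains, target, packet)  # jump to user-defined chain
        if verdict is not None:
            return verdict
        # The user-defined chain ran out: resume with the next rule here.
    return None


def filter_packet(chains, policies, builtin, packet):
    """Run a built-in chain and fall back to its policy at the end."""
    verdict = run_chain(chains, builtin, packet)
    return verdict if verdict is not None else policies[builtin]
```

A rule without a target never decides a packet's fate, so it can be placed anywhere in a chain purely to count the traffic that reaches it.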
New target modules can be written for iptables, which add options in a similar manner to the way new packet-matching modules add options.
Like packet-matching modules, target modules have two parts: the kernel part is described above, and the userspace part lives in a shared library. By default, iptables tries to load the library from the `/usr/local/lib/iptables/' directory, same as extended match modules.
There are two extended target modules included in the default distribution. These are:
This module (when written) provides kernel logging of matching packets. It provides these additional options:
Followed by a level number or name. Valid names are (case-insensitive) `debug', `info', `notice', `warning', `err', `crit', `alert' and `emerg', corresponding to numbers 7 through 0. See the man page for syslog.conf for an explanation of these levels.
This option specifies that the messages should be limited; bursts are allowed, but the average rate can never exceed one message every 5 seconds. This avoids severe log-flooding or overloading.
Followed by a string of up to 14 characters, this message is sent at the start of the log message, to allow it to be uniquely identified.
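The rate limiting described for the logging option ("bursts are allowed, but the average rate can never exceed one message every 5 seconds") has the shape of a token bucket. The Python sketch below shows that general technique only; whether the module works this way internally is an assumption, the burst size of 3 is an arbitrary value picked for illustration, and the class name is made up.

```python
class TokenBucket:
    """Allow short bursts while capping the long-run rate at one event per interval."""

    def __init__(self, interval=5.0, burst=3):
        self.interval = interval    # seconds needed to earn one token
        self.burst = burst          # assumed maximum burst size
        self.tokens = float(burst)  # start with a full bucket
        self.last = None            # time of the previous call

    def allow(self, now):
        """Return True if a message may be logged at time `now` (in seconds)."""
        if self.last is not None:
            earned = (now - self.last) / self.interval
            self.tokens = min(self.burst, self.tokens + earned)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Time is passed in explicitly rather than read from a clock, which keeps the example deterministic and easy to experiment with.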
This module has the same effect as `DROP', except that the sender is sent an ICMP `port unreachable' error message. Note that the ICMP error message is not sent if (see RFC 1122):
One day, when I write the necessary compatibility layer, you'll be able to do a simple "modprobe ipfwadm.o" and use the normal ipfwadm tool, as it was used for Linux 2.0. A similar approach will allow use of ipchains, as was used for Linux 2.2.
Of course, I have to write it. Then I have to debug it. Then I have to decide if I'm going to do the trickier things, like masquerading and redirect.
There is a new Network Address Translation system which works on top of the netfilter framework. Like iptables, it has a kernel part and a userspace part, and it is extensible to cover new protocols, and other weird cases.
Network Address Translation is funky. The idea is to mangle packets on the fly as they pass through one way, and hope you can recognize the reply packets passing through the other way so you can unmangle them.
Note that this requires that both the original and reply packets pass through the Network Address Translation box (this is important to realize if you're getting really tricky).
On one level, this is simple. For example, if you have a network with IP addresses 1.2.0.0/16 behind your Linux box, and you want them to have addresses 1.3.0.0/16 instead, you can get the Linux box to alter all the source IP addresses on the way out of your network from 1.2 to 1.3, and the destination IP addresses on the way into your network from 1.3 to 1.2.
That is called static NAT, or (as implemented by Alexey Kuznetsov in Linux) "Fast NAT". It's actually quite easy to do, and is controlled by the routing code.
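The 1.2.0.0/16 to 1.3.0.0/16 example above amounts to swapping the network part of an address while keeping the host part. A minimal Python sketch of that arithmetic follows; it has nothing to do with how the routing code implements it, and the function name `remap' is invented for the example.

```python
import ipaddress


def remap(address, from_net, to_net):
    """Map address from from_net onto the same offset in to_net.

    Addresses outside from_net are returned unchanged.
    """
    src = ipaddress.ip_network(from_net)
    dst = ipaddress.ip_network(to_net)
    if src.prefixlen != dst.prefixlen:
        raise ValueError("static mapping needs networks of equal size")
    addr = ipaddress.ip_address(address)
    if addr not in src:
        return address
    offset = int(addr) - int(src.network_address)
    return str(ipaddress.ip_address(int(dst.network_address) + offset))
```

On the way out, the source address goes through `remap(src, "1.2.0.0/16", "1.3.0.0/16")'; on the way in, the destination goes through the same function with the two networks swapped.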
Unfortunately, life isn't always that easy. Sometimes you want to map a range of addresses onto a smaller range. The most frequent use of NAT in the world at the moment is Linux 2.0 and 2.2's "masquerading" feature, in which an entire network is mapped onto a single IP address (the IP address of the masquerading box's external interface), which is also used by the masquerading box itself!
On top of that, some standard protocols (ftp) don't like (ftp) being masqueraded (ftp), but I won't (ftp) mention any (ftp) names just yet. Proprietary protocols are even worse in this regard.
Another use of NAT is what I call RNAT (Reverse NAT), or load-sharing NAT. In this case it is the destination, not the source, which is altered: frequently this is used to map a single IP address onto a farm of servers, such as for a heavy porn... err... Web server.
The most common form of RNAT today is Linux 2.2's "port-forwarding" feature, which is usually used to direct connections to a single TCP port to another server. This is frequently used in combination with masquerading, where the server in question doesn't have a valid IP address, and so cannot be connected to directly.
One of the good things about Free Software is the cool people involved, such as Freshmeat's Patrick Lenz, who provides me with Freshmeat stats using ipchains, to count the total number of connections, and the number which are probably from masqueraded connections. Here is a recent snapshot of those stats:
Chain input (policy ACCEPT: 228314119 packets, 21789959697 bytes):
pkts bytes target prot opt tosa tosx ifname mark outsize source destination ports
1083067 55439939 - tcp -y---- 0xFF 0x00 any anywhere anywhere 61000:65095 -> any
12363685 621294605 - tcp -y---- 0xFF 0x00 any anywhere anywhere any -> any
That is, about 8.76% of connections to Freshmeat are probably masqueraded.
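The percentage comes straight from the two `pkts' counters in the listing: the first rule counts SYN packets from source ports 61000:65095, the second counts all SYN packets. A quick check of the arithmetic in Python (the variable names are only for illustration):

```python
masq_syns = 1083067    # SYN packets with source port in 61000:65095
all_syns = 12363685    # all SYN packets

share = 100.0 * masq_syns / all_syns
print(f"{share:.2f}% of connections look masqueraded")  # prints 8.76%
```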
Another common form used is "transparent proxying", where connections which would ordinarily pass through the masquerading box are RNAT'ed to the box itself.
Your gateway to the wonderful world of Linux NAT is the tool ipnatctl. You can think of this as a wonderful tool for screwing your network over worse than you ever imagined was possible.
ipnatctl allows you to insert (`-I') and delete (`-D') rules. Rules are implicitly ordered, like the way routing information is implicitly ordered: more specific rules take precedence over less specific rules.
Each rule has three parts:
When a packet matches a rule, a "binding" is created, which indicates how to recognize other packets in the same stream (which have to be mangled the same way), and how the packet is to be mangled. It also specifies both these things for the replies.
Let's look at a simplified example. We have created a rule which says all UDP packets going out ppp0 should be mapped onto the source IP address 1.2.3.4:
Like iptables, ipnatctl is extensible: new protocols and new bindings can be created, and several are included in the base distribution.
There are several standard options for matching packets: each protocol can provide extra options, as we will see below.
Followed by an IP address or range, this option allows the specification of a particular source IP address, or a range of source addresses (using the `/' mask notation, such as `192.168.1.0/24' or `192.168.1.0/255.255.255.0').
Followed by an IP address or range, this is used to specify a particular destination IP address or range, similar to the above.
Followed by part or all of the words "source" or "destination", to indicate whether a source (NAT) or destination (RNAT) mapping is desired.
Followed by an interface name, indicates that the packet must be on that interface to match the rule. For source manipulations (NAT), this is the outgoing interface, and for destination manipulations (RNAT), this is the incoming interface.
This flag (which can't be used with the `--interface' option) indicates that the manipulation is to be done on local packets. This is a special case, for altering the destination of locally-generated packets; this cannot be (and does not need to be) used for source mappings.
Followed by a protocol number or name, means that only packets of the given protocol will match the rule. If the protocol contains special support, this causes the loading of extra options, as we'll see below.
A protocol can provide extended options: currently TCP and UDP do. Each extension has two parts: a kernel module (eg. "ip_nat_tcp.o"), and a shared library (eg. "libnatctl_proto_tcp.so"). The shared libraries should reside in the "/usr/local/lib/ipnatctl/" directory.
Protocol-specifics are enabled by using the `--protocol' option to ipnatctl: if the kernel modules are placed in the right directory (usually /lib/modules/`uname -r`/net/), they should auto-load. Otherwise these modules can be manually insmod'd after the core ip_nat module.
The protocols which have specific options are:
This provides the following options:
or
Followed by a port number, indicates that only packets from this TCP port should match the rule.
or
Followed by a port number, indicates that only packets to this TCP port should match the rule.
This provides the following options:
or
Followed by a port number, indicates that only packets from this UDP port should match the rule.
or
Followed by a port number, indicates that only packets to this UDP port should match the rule.
There is only one standard output option:
followed by either a single IP address or an address range, indicates the IP range onto which the source or destination IP address of the packet is to be mapped.
The range can be an address and a mask, like the `--source' option, or a `-' separated inclusive IP range, like `192.168.1.1-192.168.1.3'.
The `--protocol' option not only can add extra match options, but also extra output options, which add protocol-specific restrictions on how the packet source or destination address can be mapped.
This provides the following options:
Followed by a port number, or a `-' separated port range, indicates that packets must be mapped onto this TCP port or port range.
This provides the following options:
Followed by a port number, or a `-' separated port range, indicates that packets must be mapped onto this UDP port or port range.
Once we've specified what packets the rule matches, and the range into which they should be mangled, we need to specify exactly how the packets are to be mapped onto that range.
This is the reason for the optional `--binding' option: it is followed by a comma-separated list of binding types which will handle creation of the binding.
If no `--binding' option is specified, the `generic' binding is used. This binding searches for an unused mapping in the given range, as follows:
Some bindings in the standard package are:
Instead of mapping the source onto a fixed IP address, this maps the source onto the IP address of the interface the packet is heading out, making the packet seem to come from the box itself. Thus, with this binding, the `--to' option is ignored (but the protocol-specific options, such as TCP's `--to-port' still have effect). This only works as a source manipulation.
Instead of mapping the destination onto a fixed IP address, this maps the destination onto the IP address of the interface the packet is heading in, making the packet head to the box itself. Thus, with this binding, the `--to' option is ignored (but the protocol-specific options, such as TCP's `--to-port' still have effect). This only works as a destination manipulation.
If you're trying to masquerade your 100,000 node network onto three TCP ports, you'll eventually have more than three of them trying to connect to the same server, and creating the binding fails. (This is because the NAT code will never create two connections which look identical). The packet which evoked the binding will be dropped.
Normally, dropping the packet is the right thing: IP is designed with the assumption that in case of congestion, packets will be dropped. If those three connections are web connections, and thus short lived, you're in with a good chance when your TCP stack retransmits. If, however, those connections are all long lived, you're SOL.
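The binding failure described above can be sketched with a toy source-NAT table in Python. This is only a model of the idea that no two mapped connections may look identical; the class and method names are hypothetical, and the real code's search order and data structures are not described here.

```python
class SourceNat:
    """Toy source NAT: map connections onto one IP and a small set of ports."""

    def __init__(self, to_ip, ports):
        self.to_ip = to_ip
        self.ports = list(ports)
        self.bindings = {}   # original (src, sport, dst, dport) -> mapped tuple
        self.in_use = set()  # mapped tuples already handed out

    def bind(self, src, sport, dst, dport):
        """Return the mapped tuple, or None if every candidate is taken."""
        key = (src, sport, dst, dport)
        if key in self.bindings:
            return self.bindings[key]  # same stream: reuse its binding
        for port in self.ports:
            mapped = (self.to_ip, port, dst, dport)
            if mapped not in self.in_use:  # must not look like an existing one
                self.in_use.add(mapped)
                self.bindings[key] = mapped
                return mapped
        return None  # no unused mapping: the packet would be dropped
```

With three ports, three clients can reach the same server port at once and a fourth gets `None', yet that fourth client can still connect to a different server, because the mapped connection then differs in its destination.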
The classic problem at the moment is TCP connections which don't close properly (usually a Windows 95 machine or a Mac got the plug pulled; such machines have no place on a network). Established TCP connections take 6 hours to time out.
To quote a post to Linux Kernel:
> Hi,
>
> We have a problem with ip_masqurading set up as a firewall. When someone
> runs a stealth scan from the masquraded net to the outside net, it will
> very fast consume all available masqurade ports. The result is a nasty
> DoS for all adresses on the masquraded net.
Take a baseball bat to the stealth-scanning motherfucker, and the
problem will be resolved.
There are several possible DOS attacks from INSIDE a NAT host. Fixing
this one doesn't win much.
Trust me on the baseball bat,
Rusty.
I included this for two reasons: firstly, we don't have enough vulgarities in HOWTOs (cf. Linux kernel code). Secondly, it illustrates a problem, especially if you are running a big site.
There is a solution for TCP (I call it Forgotten Unused Connection Knowledge, "Probe Me Harder" or AckNowledgement Activity Layer Probe), but it involves getting down and nasty, and I'm not sure it's a good idea. But you know that now I've found a solution I'm gonna code it eventually.
You have to understand ICMP, UDP and TCP. ICMP, because it's used by the other protocols to report errors, and UDP and TCP because they're misdesigned, such that their internal checksum includes the IP source and destination addresses, so NAT breaks them.
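To see why changing an address breaks the transport checksum, here is a Python sketch of the UDP checksum over the pseudo-header (source address, destination address, a zero byte, protocol number 17 and the UDP length) followed by the UDP segment itself. It is an illustration of the standard Internet checksum, not the NAT code; the function names are invented, and the segment passed in is assumed to have its checksum field set to zero.

```python
import socket
import struct


def ones_complement_sum(data):
    """16-bit one's complement sum of data, padded to an even length."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:  # fold the carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return total


def udp_checksum(src_ip, dst_ip, segment):
    """Checksum of a UDP segment (checksum field zeroed) plus pseudo-header."""
    pseudo = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
              + struct.pack("!BBH", 0, 17, len(segment)))
    checksum = ~ones_complement_sum(pseudo + segment) & 0xFFFF
    return checksum or 0xFFFF  # a computed zero is transmitted as all ones


# Same UDP header and payload, two different source addresses.
segment = struct.pack("!HHHH", 4000, 53, 12, 0) + b"ping"
before = udp_checksum("10.0.0.1", "8.8.8.8", segment)
after = udp_checksum("1.2.3.4", "8.8.8.8", segment)
print(before != after)  # the checksum depends on the source address
```

So whenever the source or destination address is rewritten, the checksum inside the UDP or TCP header has to be rewritten as well, otherwise the receiver discards the packet.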
For other protocols, we give it a damn good shot. If they're NAT-friendly, they should "just work", although congestion (see above) gets worse. Of course, since we know nothing about them, we don't know when a stream is finished, so we just keep the binding around for an hour since the last packet, making congestion an even bigger possibility.
Eventually, the kernel will merge this into a single rule, providing seamless integration. Birds will sing, the sun will shine, and your breath will smell sweeter.
Until that day, try breathmints (``Ooooh.... Grog-o-mint, my favourite!'').
Compile your kernel with CONFIG_IP_ALWAYS_DEFRAG, and be happy. I have grand plans for dealing with NAT of fragments, simply because it is possible. I want to see how messy it gets, though, and I can't think of a good reason for doing it. There are plenty of bad reasons though: it would open the doors to implementing parallel NAT machines, and it would piss off the authors of the draft NAT RFC, who said it couldn't be done.
Since you were insulting, I'm not going to give a comprehensive set of examples. But here are a few to get you started:
Try these:
# ipnatctl --help
# ipnatctl -p tcp --help
# ipnatctl -p udp --help
Try this:
# insmod netfilter/NAT/protocols/ip_nat_tcp.o
# insmod netfilter/NAT/protocols/ip_nat_udp.o
# insmod netfilter/NAT/bindings/ipnat_bind_masquerade.o
# ipnatctl -I -i plip0 -m source --binding masquerade
# ipnatctl -I -i plip0 -m source -s 10.0.0.0/8 --to 1.2.3.4
# insmod netfilter/NAT/bindings/ipnat_bind_redirect.o
# ipnatctl -I -d 10.0.0.0/8 -m dest --binding redirect
# ipnatctl -I -p tcp -i plip0 -m dest --dport 80 --to-port 25
# ipnatctl -I -p tcp -l --dport 80 --to 127.0.0.1 --to-port 8080
This may happen, with "backwards compatibility" modules. I use "backwards compatibility" in quotes, because I don't want users of these treated like second-class citizens. I prefer not to leave behind a trail of disappointed, bitter, out-for-blood Rusty haters. Let's leave that for my love-life.
Should work fine. Load the ip_nat modules first, and you should always see "end-to-end" IP addresses (as far as such things exist in the real world).
So whether you're masquerading or not, the packet filter rules won't see it; it looks like your private network is directly connecting to the outside world, and vice versa.
If you're redirecting to a server farm, it looks to your packet filter as if external machines are connecting straight to each individual machine.
Of course, this means that iptables can't exploit the state-keeping nature of the NAT code to get cheap stateful inspection, but that's what the separation of NAT and packet filtering is all about.