Raw IP Stack
One of the jobs of an operating system is to maintain a complete TCP/IP stack, from
the link layer all the way up to the connect and send functions found in most
programming languages. Any interface opened on a Linux system will automatically respond
to ARP and ICMP (ping) messages, which are handled by the kernel. This functionality may be
turned off and implemented in user space instead. There are two ways to do this on Linux:
one is the tuntap interface, and the other is a raw packet socket. I'll cover the latter
method here. Unlike the tuntap method, it requires the CAP_NET_RAW capability, but it may
be used on real interfaces.
Components of the Stack
The basic components of the stack I will implement are as follows:
- "Physical" Layer (Raw Packet Socket)
- Link Layer (Ethernet)
- Address Resolution Protocol (ARP)
- Internet Protocol (IP)
- Internet Control Message Protocol (ICMP)
"Physical" Layer (Raw Packet Socket)
To open up a raw socket on an interface, the raw call (in C) is
int sock = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
I would like to store the result of this call in a structure which I will call phy_conn,
to represent the physical connection. I'll initialize this in a function called
phy_conn_init. I would also like the socket to be non-blocking, so that the stack I build
can be completely asynchronous.
struct phy_conn {
    int sock;
    ...
};
int phy_conn_init(struct phy_conn *this, ...) {
    int result;
    this->sock = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (this->sock < 0)
        return -errno;
    result = fcntl(this->sock, F_SETFL, O_NONBLOCK);
    if (result < 0)
        return -errno;
    ...
    return 0;
}
Style note: in this code I've taken to returning negative errno values when a function
fails. I find this to be the cleanest way to represent errors in C.
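The fcntl call in phy_conn_init replaces the descriptor's whole set of status flags with O_NONBLOCK. That is harmless on a freshly created socket, but a more defensive variant reads the current flags first and ORs the new one in. A minimal sketch follows, using the same negative-errno convention; the helper name fd_set_nonblocking is my own and does not appear in the original code.

```c
#include <errno.h>
#include <fcntl.h>

/* Hypothetical helper (not part of the original code): enable O_NONBLOCK
 * while keeping whatever other status flags the descriptor already has.
 * Returns 0 on success, or a negative errno value on failure. */
int fd_set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -errno;
    if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        return -errno;
    return 0;
}
```

It works on any descriptor, so it can be tried on a pipe without needing CAP_NET_RAW.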
This code does not complete the set-up, however: it opens a raw socket that receives from all interfaces at once, which is certainly not what I want. To fix that, I'll declare a function to bind the socket to a single interface. The bind needs an interface index, which may be looked up from the interface name, but for now here's the raw bind call:
int phy_conn_bind_to_ifindex(struct phy_conn *this, int ifindex) {
    struct sockaddr_ll sockaddr_ll = {
        .sll_family = AF_PACKET,
        .sll_protocol = htons(ETH_P_ALL),
        .sll_ifindex = ifindex,
    };
    if (bind(this->sock, (struct sockaddr *) &sockaddr_ll, sizeof sockaddr_ll) < 0)
        return -errno;
    return 0;
}
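The interface index that the bind call needs can be obtained from an interface name with the standard if_nametoindex function from <net/if.h>. The wrapper below is only a sketch: its name, ifindex_from_name, is hypothetical, and it simply adapts the result to the negative-errno convention used above.

```c
#include <errno.h>
#include <net/if.h>

/* Hypothetical helper: look up an interface index by name (e.g. "veth0").
 * if_nametoindex returns 0 and sets errno when the name is unknown.
 * Returns the index (> 0) on success, or a negative errno value. */
int ifindex_from_name(const char *ifname) {
    unsigned int ifindex = if_nametoindex(ifname);
    if (ifindex == 0)
        return -errno;
    return (int) ifindex;
}
```

A non-negative result can be passed straight to phy_conn_bind_to_ifindex.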
To send and receive on the socket, I've gone for the following approach:
- For receiving, call the lowest-layer recv function, which will call the higher layers in turn;
- For sending, call the highest-layer send function with the data to be sent, which will wrap the data and call the lower layers in turn.
This means the signatures of the send and recv functions for phy_conn are
int phy_conn_send(struct phy_conn *this, const unsigned char *data, int len);
int phy_conn_recv(struct phy_conn *this);
I'll also need to introduce the final member of the phy_conn struct, which was omitted by
ellipses before: the next layer up, eth_conn.
struct phy_conn {
    int sock;
    struct eth_conn *eth_conn;
};
Then the implementations of send and recv are
int phy_conn_send(struct phy_conn *this, const unsigned char *data, int len) {
    ssize_t result;
    if (len < 0)
        return -EBADMSG;
    result = write(this->sock, data, (size_t) len);
    if (result < 0)
        return -errno;
    return (int) result;
}
int phy_conn_recv(struct phy_conn *this) {
    ssize_t result;
    _Alignas(struct ether_header) unsigned char buf[ETH_MAX_MTU + sizeof(struct ether_header)];
    result = read(this->sock, buf, sizeof buf);
    if (result < 0)
        return -errno;
    return eth_conn_recv(this->eth_conn, buf, (int) result);
}
These functions just wrap the system calls read and write, which on a bound socket
receive data from and send data to the correct interface.
Link Layer (Ethernet)
I'm using the suffix _conn for all of the structs in this stack, so I'll continue with
eth_conn. The Ethernet layer needs to interface with the lower layer (phy_conn), as well
as the upper layers - both IP and ARP. ARP is needed to resolve IP addresses for the IP
layer, and directly responds to ARP requests as well. Two other pieces of information are
local to the eth_conn itself - its own Ethernet address, and the MTU that it supports
(payloads above this length will be rejected with EMSGSIZE).
struct eth_conn {
    struct ether_addr src_addr;
    int mtu;
    struct phy_conn *phy_conn;
    struct arp_conn *arp_conn;
    struct ip_conn *ip_conn;
};
The method for receiving now has additional parameters, since the phy_conn will pass up
actual data. The method for sending also has additional parameters, since the eth_conn
requires a destination address for each sent message, as well as a message type
designator. Their declarations look like this:
int eth_conn_send(struct eth_conn *this, const unsigned char *data, int len, struct ether_addr dst_addr, unsigned short type);
int eth_conn_recv(struct eth_conn *this, const unsigned char *data, int len);
The send function needs to prepend a header to the data, and the recv function needs to
peel a header off, checking its contents. To avoid the pitfalls of undefined behaviour
(misaligned or type-punned access to the raw buffer), I'm going to make liberal use of
memcpy in the code that follows.
int eth_conn_send(struct eth_conn *this, const unsigned char *data, int len, struct ether_addr dst_addr, unsigned short type) {
    struct ether_header ether_header;
    _Alignas(struct ether_header) unsigned char buf[ETH_MAX_MTU + sizeof ether_header];
    if (len < 0)
        return -EBADMSG;
    if (len > this->mtu)
        return -EMSGSIZE;
    memcpy(ether_header.ether_dhost, dst_addr.ether_addr_octet, ETH_ALEN);
    memcpy(ether_header.ether_shost, this->src_addr.ether_addr_octet, ETH_ALEN);
    ether_header.ether_type = htons(type);
    memcpy(buf, &ether_header, sizeof ether_header);
    memcpy(buf + sizeof ether_header, data, (size_t) len);
    return phy_conn_send(this->phy_conn, buf, (int) sizeof ether_header + len);
}
int eth_conn_recv(struct eth_conn *this, const unsigned char *data, int len) {
    struct ether_header ether_header;
    unsigned short type;
    struct ether_addr broadcast = {{0xff, 0xff, 0xff, 0xff, 0xff, 0xff}};
    if (len < (int) sizeof ether_header)
        return -EBADMSG;
    memcpy(&ether_header, data, sizeof ether_header);
    type = ntohs(ether_header.ether_type);
    if (!ether_eq(this->src_addr, ether_header.ether_dhost))
        if (!ether_eq(broadcast, ether_header.ether_dhost))
            return 0;
    switch (type) {
    case ETHERTYPE_IP:
        return ip_conn_recv(this->ip_conn, data + sizeof ether_header, len - (int) sizeof ether_header);
    case ETHERTYPE_ARP:
        return arp_conn_recv(this->arp_conn, data + sizeof ether_header, len - (int) sizeof ether_header);
    case ETHERTYPE_IPV6:
        return 0;
    default:
        return -EPROTONOSUPPORT;
    }
}
One interesting part of the recv function here is where it discards data not sent to its
local address (or the broadcast address). Disabling this check is commonly known as
"promiscuous mode" on an interface. Here I'm restricting the set of allowed destination
addresses for incoming packets.
The recv function is also switching on the type of message and dispatching to the correct
handler. Note that I'm ignoring IPv6 packets instead of returning an error - these can
appear on interfaces (although usually those packets are blocked by the Ethernet address
check).
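The ether_eq helper used in eth_conn_recv is not shown in this post. Judging only from how it is called (a struct ether_addr on the left, the raw six-octet array from struct ether_header on the right), one plausible shape is the sketch below. Treat the signature as an assumption rather than the actual implementation.

```c
#include <string.h>
#include <net/ethernet.h>   /* struct ether_addr, ETH_ALEN */

/* Assumed shape of ether_eq: compare a struct ether_addr against a raw
 * six-octet array such as ether_header.ether_dhost.
 * Returns non-zero when the two addresses are identical. */
static int ether_eq(struct ether_addr a, const unsigned char *b) {
    return memcmp(a.ether_addr_octet, b, ETH_ALEN) == 0;
}
```

Comparing with memcmp keeps to the same approach as the rest of the code, which avoids casting the receive buffer to a struct pointer.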
Address Resolution Protocol (ARP)
A very simple implementation of ARP has to do a few things:
- Keep a list of local IP addresses and their corresponding physical (ethernet) addresses
- Keep a similar list for remote IP addresses
- Resolve an IP address to a physical address
- Respond to requests for IP addresses held locally
- Send requests to obtain unknown physical addresses from an IP address
The interface I've chosen to implement these functions is as follows:
struct arp_entry {
    struct in_addr in_addr;
    struct ether_addr ether_addr;
    int is_local;
};
struct arp_conn {
    struct arp_entry arp_entries[ARP_MAX_ENTRIES];
    struct eth_conn *eth_conn;
};
int arp_conn_init(struct arp_conn *this, struct eth_conn *eth_conn);
int arp_conn_send_request(struct arp_conn *this, struct in_addr src, struct in_addr dst);
int arp_conn_recv(struct arp_conn *this, const unsigned char *data, int len);
int arp_conn_resolv(struct arp_conn *this, struct in_addr in, struct ether_addr *out);
int arp_conn_add(struct arp_conn *this, struct arp_entry new);
Combining the local and remote ARP entries helps to simplify the code. Also, since zero
addresses are not valid/routable, setting the arp_entries table to zero clears its entries
and so there's no separate length variable to keep track of here. I've kept
ARP_MAX_ENTRIES very low, at a value of 8.
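To illustrate the "zero address means empty slot" idea, here is a minimal sketch of how adding and resolving entries could work on such a table. The names arp_table, arp_table_add and arp_table_resolv are hypothetical stand-ins for the arp_conn functions declared above, and two choices are mine rather than the original's: returning -ENOSPC when the table is full, and refusing to overwrite a local entry with a remote one.

```c
#include <errno.h>
#include <netinet/in.h>     /* struct in_addr */
#include <net/ethernet.h>   /* struct ether_addr */

#define ARP_MAX_ENTRIES 8

struct arp_entry {
    struct in_addr in_addr;
    struct ether_addr ether_addr;
    int is_local;
};

/* Only the table itself; the real arp_conn also points at its eth_conn. */
struct arp_table {
    struct arp_entry arp_entries[ARP_MAX_ENTRIES];
};

/* Refresh the entry for this IP address if one exists, otherwise store it
 * in the first slot whose address is zero (i.e. unused). */
int arp_table_add(struct arp_table *this, struct arp_entry new) {
    int i, free_slot = -1;
    if (new.in_addr.s_addr == 0)
        return -EINVAL;     /* zero marks an empty slot, so it cannot be stored */
    for (i = 0; i < ARP_MAX_ENTRIES; i++) {
        struct arp_entry *entry = &this->arp_entries[i];
        if (entry->in_addr.s_addr == new.in_addr.s_addr) {
            if (!entry->is_local || new.is_local)
                *entry = new;   /* a remote claim never replaces a local entry */
            return 0;
        }
        if (entry->in_addr.s_addr == 0 && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return -ENOSPC;
    this->arp_entries[free_slot] = new;
    return 0;
}

/* Copy the physical address for an IP address into *out, or return -ENOENT. */
int arp_table_resolv(const struct arp_table *this, struct in_addr in, struct ether_addr *out) {
    int i;
    if (in.s_addr == 0)
        return -ENOENT;
    for (i = 0; i < ARP_MAX_ENTRIES; i++) {
        if (this->arp_entries[i].in_addr.s_addr == in.s_addr) {
            *out = this->arp_entries[i].ether_addr;
            return 0;
        }
    }
    return -ENOENT;
}
```

With only 8 slots a linear scan is perfectly adequate, and zeroing the table with memset is all the initialization it needs.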
The interesting part of the implementation for arp_conn is the arp_conn_recv function,
which I show here:
int arp_conn_recv(struct arp_conn *this, const unsigned char *data, int len) {
    int i, result;
    struct ether_arp ether_arp;
    struct arp_entry src = {};
    struct arp_entry tgt = {};
    struct arp_entry *entry;
    if (len != sizeof ether_arp)
        return -EBADMSG;
    memcpy(&ether_arp, data, sizeof ether_arp);
    if (ntohs(ether_arp.arp_hrd) != ARPHRD_ETHER)
        return -EAFNOSUPPORT;
    if (ntohs(ether_arp.arp_pro) != ETHERTYPE_IP)
        return -EAFNOSUPPORT;
    if (ether_arp.arp_hln != 6)
        return -EBADMSG;
    if (ether_arp.arp_pln != 4)
        return -EBADMSG;
    memcpy(&src.ether_addr, ether_arp.arp_sha, sizeof(struct ether_addr));
    memcpy(&src.in_addr, ether_arp.arp_spa, sizeof(struct in_addr));
    memcpy(&tgt.ether_addr, ether_arp.arp_tha, sizeof(struct ether_addr));
    memcpy(&tgt.in_addr, ether_arp.arp_tpa, sizeof(struct in_addr));
    switch (ntohs(ether_arp.arp_op)) {
    case ARPOP_REQUEST:
        result = arp_conn_add(this, src);
        if (result < 0)
            return result;
        if (tgt.in_addr.s_addr == 0)
            return 0;
        for (i = 0; i < ARP_MAX_ENTRIES; i++) {
            entry = &this->arp_entries[i];
            if (entry->is_local && entry->in_addr.s_addr == tgt.in_addr.s_addr) {
                ether_arp.arp_op = htons(ARPOP_REPLY);
                memcpy(ether_arp.arp_sha, &entry->ether_addr, sizeof(struct ether_addr));
                memcpy(ether_arp.arp_spa, &entry->in_addr, sizeof(struct in_addr));
                memcpy(ether_arp.arp_tha, &src.ether_addr, sizeof(struct ether_addr));
                memcpy(ether_arp.arp_tpa, &src.in_addr, sizeof(struct in_addr));
                return eth_conn_send(this->eth_conn, (const unsigned char *) &ether_arp, sizeof ether_arp, src.ether_addr, ETHERTYPE_ARP);
            }
        }
        return 0;
    case ARPOP_REPLY:
        return arp_conn_add(this, src);
    default:
        return -EOPNOTSUPP;
    }
}
This code checks that it's receiving an ARP ethernet/IP request/reply, then (for requests) it looks through all of the entries in the table until it finds a matching one. To handle gratuitous ARPs, both the request and the reply message cause the sender's ethernet/IP address to be added to the table. Any other operation is not handled and returns an error.
Internet Protocol (IP)
The ip_conn needs to connect to a lot of the other layers. I realised here that there is
not much encapsulation of the different functions of the layers - IP needs to know about
Ethernet and ARP, for instance. There may be a nice way to encapsulate this, but it does
not look much simpler than simply putting references to the lower layers in the ip_conn
struct. IP also needs to know about ICMP, since it is a higher layer on top of IP.
struct ip_conn {
    struct in_addr src_addr;
    struct eth_conn *eth_conn;
    struct arp_conn *arp_conn;
    struct icmp_conn *icmp_conn;
    unsigned char ttl;
    unsigned short next_id;
};
The send function for IP also requires a destination address - an internet address this
time - and the protocol number of the higher layer (this will of course only ever be ICMP
in this small project).
int ip_conn_send(struct ip_conn *this, const unsigned char *data, int len, struct in_addr dst_addr, unsigned char proto);
int ip_conn_recv(struct ip_conn *this, const unsigned char *data, int len);
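The bodies of these two functions are not shown here. One step any IPv4 sender has to perform is filling in the header checksum, and a receiver can verify it the same way. The following is a generic sketch of the Internet checksum from RFC 1071, not the project's own code; the function name inet_cksum is hypothetical.

```c
#include <stddef.h>

/* Internet checksum (RFC 1071): one's complement of the one's complement
 * sum of the data, read as big-endian 16-bit words. The result is returned
 * in host order; an odd trailing byte is padded with a zero low byte. */
unsigned short inet_cksum(const unsigned char *data, size_t len) {
    unsigned long sum = 0;
    size_t i;
    for (i = 0; i + 1 < len; i += 2)
        sum += ((unsigned long) data[i] << 8) | data[i + 1];
    if (len & 1)
        sum += (unsigned long) data[len - 1] << 8;
    while (sum >> 16)                       /* fold the carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (unsigned short) (~sum & 0xffff);
}
```

Run over a header whose checksum field is zero, it yields the value to store (high byte first). Run over a header that already carries a correct checksum, it yields 0.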
Internet Control Message Protocol (ICMP)
The icmp_conn struct is actually very simple - there are no higher layers and no stored
data required. The IP layer below already checks that the destination IP address is OK,
and so all I require is that it passes the source IP address to the icmp_conn_recv
function. Then icmp_conn can send any ICMP_ECHOREPLY messages right back at the source IP
address.
struct icmp_conn {
    struct ip_conn *ip_conn;
};
int icmp_conn_recv(struct icmp_conn *this, const unsigned char *data, int len, struct in_addr src_addr);
I'm only handling ICMP_ECHO messages - not even sending pings out of the connection and
receiving a reply - so this really is a minimal implementation of ICMP.
int icmp_conn_recv(struct icmp_conn *this, const unsigned char *data, int len, struct in_addr src_addr) {
    unsigned sum;
    struct icmp icmp;
    _Alignas(struct icmp) unsigned char buf[IP_MAXPACKET];
    if (len < (int) sizeof icmp)
        return -EBADMSG;
    memcpy(&icmp, data, sizeof icmp);
    switch (icmp.icmp_type) {
    case ICMP_ECHO:
        icmp.icmp_type = ICMP_ECHOREPLY;
        sum = icmp.icmp_cksum;
        sum = (~sum & 0xffff) + (~ICMP_ECHO & 0xffff) + ICMP_ECHOREPLY;
        sum = (sum & 0xffff) + (sum >> 16);
        sum += sum >> 16;
        icmp.icmp_cksum = (unsigned short) ~sum;
        memcpy(buf, &icmp, sizeof icmp);
        memcpy(buf + sizeof icmp, data + sizeof icmp, (size_t) len - sizeof icmp);
        return ip_conn_send(this->ip_conn, buf, len, src_addr, IPPROTO_ICMP);
    default:
        return -EOPNOTSUPP;
    }
}
The latest versions of ping actually check the ICMP checksum (they only started doing this
very recently), and so I'm actually computing the sum here. The best way to do this is to
adjust the existing sum based on the difference between the ICMP_ECHO and ICMP_ECHOREPLY
type values, noting the caveats about one's complement arithmetic documented in RFC 1624.
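The adjustment follows equation 3 of RFC 1624, HC' = ~(~HC + ~m + m'), where HC is the old checksum, m the old value of the changed 16-bit word and m' its new value, all added in one's complement arithmetic. Here is a standalone sketch of that formula. The function name cksum_adjust is hypothetical, and the worked values assume the words are written in network order (type in the high byte). The code above applies the same formula to the fields as they are loaded on a little-endian host, where the type byte is the low byte of the word.

```c
/* Incremental checksum update per RFC 1624, equation 3:
 *   HC' = ~(~HC + ~m + m')
 * All three arguments must use the same byte order. The two folding steps
 * add the carries back in, which is what makes the sum one's complement. */
unsigned short cksum_adjust(unsigned short old_cksum,
                            unsigned short old_word,
                            unsigned short new_word) {
    unsigned sum = (~old_cksum & 0xffffu) + (~old_word & 0xffffu) + new_word;
    sum = (sum & 0xffff) + (sum >> 16);
    sum += sum >> 16;
    return (unsigned short) (~sum & 0xffff);
}
```

For example, take an echo request whose 16-bit words are 0x0800 (type 8, code 0), the checksum, 0x1234 (id) and 0x0001 (sequence). Its checksum is ~(0x0800 + 0x1234 + 0x0001) = 0xe5ca. Changing the first word to 0x0000 for the reply gives cksum_adjust(0xe5ca, 0x0800, 0x0000) = 0xedca, which matches a full recomputation: ~(0x1234 + 0x0001) = 0xedca.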
Wiring Everything Together
All of the structs can be created on the stack - and I've made a *_conn_init function for
each to initialize them. Once that's done, a loop can be started which polls the socket's
descriptor and calls phy_conn_recv on the physical connection.
do {
    result = poll(&(struct pollfd) { phy_conn.sock, POLLIN, 0 }, 1, -1);
    if (result >= 0) {
        result = phy_conn_recv(&phy_conn);
    }
} while (result >= 0);
The poll call here could incorporate different descriptors, so that, say, an application
could send a ping through this interface. For now, it only responds to received messages.
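As a rough sketch of what waiting on more than one descriptor might look like, the helper below polls the packet socket together with a second descriptor and reports which of them is readable. The function wait_for_input and the idea of an "application" descriptor (a pipe or a Unix socket, say) are assumptions for illustration only and are not part of the project's code.

```c
#include <errno.h>
#include <poll.h>

/* Hypothetical helper: wait until sock_fd and/or app_fd has data to read.
 * Returns a bitmask (bit 0: sock_fd readable, bit 1: app_fd readable),
 * 0 if the timeout expired first, or a negative errno value on failure. */
int wait_for_input(int sock_fd, int app_fd, int timeout_ms) {
    struct pollfd fds[2] = {
        { .fd = sock_fd, .events = POLLIN },
        { .fd = app_fd,  .events = POLLIN },
    };
    int ready = 0;
    if (poll(fds, 2, timeout_ms) < 0)
        return -errno;
    if (fds[0].revents & POLLIN)
        ready |= 1;
    if (fds[1].revents & POLLIN)
        ready |= 2;
    return ready;
}
```

The main loop would then call phy_conn_recv when bit 0 is set and hand the other descriptor to whatever code builds outgoing messages when bit 1 is set.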
The code is here. To build it, run make, and to run the code I've made a target run in the
makefile, which will need to be run with the appropriate permissions (CAP_NET_ADMIN). This
target will create a virtual Ethernet pair, disable the kernel's ARP on one end, and
connect this program to it. It may then be pinged via:
ping 10.0.0.2
To re-run, the new virtual ethernet pair must be deleted before the run script can re-create it, via
ip link del veth0