Pointless Hacks - Raw IP Stack

Raw IP Stack

One of the jobs of an operating system is to maintain a complete TCP/IP stack, from the link layer all the way up to the connect and send functions found in most programming languages. Any interface brought up on a Linux system will automatically respond to ARP and ICMP (ping) messages, which are handled by the kernel. This functionality may be turned off and implemented in user space instead. There are two ways to do this on Linux: one is the tuntap interface, and the other is a raw packet socket. I'll cover the latter method here. Unlike the tuntap method, it requires the CAP_NET_RAW capability, but it may be used on real interfaces.

Components of the Stack

The basic components of the stack I will implement are as follows:

"Physical" Layer (Raw Packet Socket)
Link Layer (Ethernet)
Address Resolution Protocol (ARP)
Internet Protocol (IP)
Internet Control Message Protocol (ICMP)

These are all the layers/protocols necessary to start up an interface that can be pinged via ICMP from another machine.

"Physical" Layer (Raw Packet Socket)

To open up a raw socket on an interface, the raw call (in C) is


int sock = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

I would like to store the result of this call in a structure which I will call phy_conn, to represent the physical connection. I'll initialize this in a function called phy_conn_init. I would also like the socket to be non-blocking, so that the stack I build can be completely asynchronous.


struct phy_conn {
    int sock;
    ...
};

int phy_conn_init(struct phy_conn *this, ...) {
    int result;

    this->sock = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (this->sock < 0)
        return -errno;

    result = fcntl(this->sock, F_SETFL, O_NONBLOCK);
    if (result < 0) {
        result = -errno;
        close(this->sock); /* don't leak the descriptor on failure */
        return result;
    }

    ...
    return 0;
}

Style note: in this code I've taken to returning negative errno values when a function fails. I find this to be the cleanest way to represent errors in C.
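The fcntl call above replaces the socket's whole set of status flags, which is harmless on a freshly created socket. As a minimal sketch of a more defensive variant, the helper below reads the current flags first and only adds O_NONBLOCK. The name set_nonblocking is hypothetical and is not part of the original code; it follows the same negative-errno convention.

```c
#include <errno.h>
#include <fcntl.h>

/* Hypothetical helper: add O_NONBLOCK to fd without clearing other status flags.
 * Returns 0 on success or a negative errno value. */
int set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL);

    if (flags < 0)
        return -errno;
    if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        return -errno;
    return 0;
}
```

Inside phy_conn_init this could stand in for the direct fcntl call, for example as result = set_nonblocking(this->sock).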

This code does not complete the set-up, however: it opens a raw socket that receives from all interfaces at once, which is certainly not what I want. To fix that, I'll declare a function to bind the socket to a single interface. The bind call takes an interface index, which may be looked up from the interface name, but for now here's the raw bind call:


int phy_conn_bind_to_ifindex(struct phy_conn *this, int ifindex) {
    struct sockaddr_ll sockaddr_ll = {
        .sll_family = AF_PACKET,
        .sll_protocol = htons(ETH_P_ALL),
        .sll_ifindex = ifindex,
    };

    if (bind(this->sock, (struct sockaddr *) &sockaddr_ll, sizeof sockaddr_ll) < 0)
        return -errno;
    return 0;
}
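The interface index lookup mentioned above can be done with the standard if_nametoindex function. This is only a sketch: the helper name phy_lookup_ifindex is an assumption of mine, and the original code may look the index up differently (for instance with the SIOCGIFINDEX ioctl).

```c
#include <errno.h>
#include <net/if.h>

/* Hypothetical helper: map an interface name such as "veth1" to its index.
 * Returns a positive index, or a negative errno value on failure. */
int phy_lookup_ifindex(const char *ifname) {
    unsigned int ifindex = if_nametoindex(ifname);

    if (ifindex == 0)
        return -errno;
    return (int) ifindex;
}
```

The result, when positive, can be passed straight to phy_conn_bind_to_ifindex.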

To send and receive on the socket, I've gone for the following approach:

For receiving, call the lowest-layer recv function which will call the higher layers in turn;
For sending, call the highest-layer send function with the data to be sent, which will wrap the data and call the lower layers in turn.

This means the signatures of the send and recv functions for phy_conn are


int phy_conn_send(struct phy_conn *this, const unsigned char *data, int len);
int phy_conn_recv(struct phy_conn *this);

I'll also need to introduce the final member of the phy_conn struct, which was hidden behind the ellipsis before: the next layer up, eth_conn.


struct phy_conn {
    int sock;

    struct eth_conn *eth_conn;
};

Then the implementations of send and recv are


int phy_conn_send(struct phy_conn *this, const unsigned char *data, int len) {
    ssize_t result;

    if (len < 0)
        return -EBADMSG;

    result = write(this->sock, data, (size_t) len);
    if (result < 0)
        return -errno;

    return (int) result;
}

int phy_conn_recv(struct phy_conn *this) {
    ssize_t result;
    _Alignas(struct ether_header) unsigned char buf[ETH_MAX_MTU + sizeof(struct ether_header)];

    result = read(this->sock, buf, sizeof buf);
    if (result < 0)
        return -errno;

    return eth_conn_recv(this->eth_conn, buf, (int) result);
}

These functions just wrap the read and write system calls, which on a bound socket receive from and send to the bound interface only.

Link Layer (Ethernet)

I'm using the suffix _conn for all of the structs in this stack, so I'll continue with eth_conn. The ethernet layer needs to interface with the lower layer (phy_conn), as well as the upper layers - both IP and ARP. ARP is needed to resolve IP addresses for the IP layer, and directly responds to ARP requests as well. Two other pieces of information are local to the eth_conn itself - its own ethernet address, and the MTU that it supports (payloads above this length will be rejected with EMSGSIZE).


struct eth_conn {
    struct ether_addr src_addr;
    int mtu;
    
    struct phy_conn *phy_conn;

    struct arp_conn *arp_conn;
    struct ip_conn *ip_conn;
};

The method for receiving now has additional parameters, since the phy_conn will pass up actual data. The method for sending also has additional parameters, since the eth_conn requires a destination address for each sent message, as well as a message type designator. Their declarations look like this:


int eth_conn_send(struct eth_conn *this, const unsigned char *data, int len, struct ether_addr dst_addr, unsigned short type);
int eth_conn_recv(struct eth_conn *this, const unsigned char *data, int len);

The send function needs to prepend a header to the data, and the recv function needs to peel a header off, checking its contents. To avoid the pitfalls of undefined behaviour (misaligned access and aliasing), I'm going to make liberal use of memcpy in the code that follows.


int eth_conn_send(struct eth_conn *this, const unsigned char *data, int len, struct ether_addr dst_addr, unsigned short type) {
    struct ether_header ether_header;
    _Alignas(struct ether_header) unsigned char buf[ETH_MAX_MTU + sizeof ether_header];

    if (len < 0)
        return -EBADMSG;
    if (len > this->mtu)
        return -EMSGSIZE;

    memcpy(ether_header.ether_dhost, dst_addr.ether_addr_octet, ETH_ALEN);
    memcpy(ether_header.ether_shost, this->src_addr.ether_addr_octet, ETH_ALEN);
    ether_header.ether_type = htons(type);

    memcpy(buf, &ether_header, sizeof ether_header);
    memcpy(buf + sizeof ether_header, data, (size_t) len);
    return phy_conn_send(this->phy_conn, buf, (int) sizeof ether_header + len);
}

int eth_conn_recv(struct eth_conn *this, const unsigned char *data, int len) {
    struct ether_header ether_header;
    unsigned short type;
    struct ether_addr broadcast = {{0xff, 0xff, 0xff, 0xff, 0xff, 0xff}};

    if (len < (int) sizeof ether_header)
        return -EBADMSG;

    memcpy(&ether_header, data, sizeof ether_header);
    type = ntohs(ether_header.ether_type);

    if (!ether_eq(this->src_addr, ether_header.ether_dhost))
        if (!ether_eq(broadcast, ether_header.ether_dhost))
            return 0;

    switch (type) {
        case ETHERTYPE_IP:
            return ip_conn_recv(this->ip_conn, data + sizeof ether_header, len - (int) sizeof ether_header);
        case ETHERTYPE_ARP:
            return arp_conn_recv(this->arp_conn, data + sizeof ether_header, len - (int) sizeof ether_header);
        case ETHERTYPE_IPV6:
            return 0;
        default:
            return -EPROTONOSUPPORT;
    }
}

One interesting part of the recv function here is where it discards data not sent to its local address (or broadcast address). Disabling this check is commonly known as "promiscuous mode" on an interface. Here I'm restricting the set of allowed destination addresses for incoming packets.

The recv function also switches on the type of message and dispatches to the correct handler. Note that I'm ignoring IPv6 packets instead of returning an error - these can appear on interfaces (although usually those packets are blocked by the ethernet address check).
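The ether_eq helper used in the address check is not shown in this post. Judging only from how it is called (a struct ether_addr first, then the raw address array from the header), one plausible shape is the following; treat the signature as an assumption rather than the actual definition.

```c
#include <net/ethernet.h>
#include <string.h>

/* Assumed shape of ether_eq: compare a struct ether_addr against the raw
 * 6-byte address field of an Ethernet header. Returns 1 if equal, else 0. */
int ether_eq(struct ether_addr a, const unsigned char *b) {
    return memcmp(a.ether_addr_octet, b, ETH_ALEN) == 0;
}
```

Using memcmp on the octet arrays keeps the comparison free of alignment concerns, in line with the memcpy approach used elsewhere.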

Address Resolution Protocol (ARP)

A very simple implementation of ARP has to do a few things:

Keep a list of local IP addresses and their corresponding physical (ethernet) addresses
Keep a similar list for remote IP addresses
Resolve an IP address to a physical address
Respond to requests for IP addresses held locally
Send requests to obtain unknown physical addresses from an IP address

The interface I've chosen to implement these functions is as follows:


struct arp_entry {
    struct in_addr in_addr;
    struct ether_addr ether_addr;
    int is_local;
};

struct arp_conn {
    struct arp_entry arp_entries[ARP_MAX_ENTRIES];

    struct eth_conn *eth_conn;
};

int arp_conn_init(struct arp_conn *this, struct eth_conn *eth_conn);
int arp_conn_send_request(struct arp_conn *this, struct in_addr src, struct in_addr dst);
int arp_conn_recv(struct arp_conn *this, const unsigned char *data, int len);
int arp_conn_resolv(struct arp_conn *this, struct in_addr in, struct ether_addr *out);
int arp_conn_add(struct arp_conn *this, struct arp_entry new);

Combining the local and remote ARP entries helps to simplify the code. Also, since zero addresses are not valid/routable, setting the arp_entries table to zero clears its entries and so there's no separate length variable to keep track of here. I've kept ARP_MAX_ENTRIES very low, at a value of 8.
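To make the "zero address means free slot" idea concrete, here is a sketch of how arp_conn_add and arp_conn_resolv could work on such a table. The struct is reduced to the table alone, and the error codes (-ENOSPC, -ENOENT), the silent skip of zero addresses, and the rule that local entries are never overwritten are all assumptions of mine, not taken from the original code.

```c
#include <errno.h>
#include <net/ethernet.h>
#include <netinet/in.h>

#define ARP_MAX_ENTRIES 8

struct arp_entry {
    struct in_addr in_addr;
    struct ether_addr ether_addr;
    int is_local;
};

/* Reduced to the table only; the eth_conn pointer is left out of this sketch. */
struct arp_conn {
    struct arp_entry arp_entries[ARP_MAX_ENTRIES];
};

/* Insert or refresh an entry. A zero IP address marks a free slot. */
int arp_conn_add(struct arp_conn *this, struct arp_entry new) {
    int i, free_slot = -1;

    if (new.in_addr.s_addr == 0)
        return 0; /* assumption: zero addresses (e.g. ARP probes) are skipped */

    for (i = 0; i < ARP_MAX_ENTRIES; i++) {
        struct arp_entry *entry = &this->arp_entries[i];

        if (entry->in_addr.s_addr == new.in_addr.s_addr) {
            if (!entry->is_local) /* assumption: never overwrite a local entry */
                *entry = new;
            return 0;
        }
        if (entry->in_addr.s_addr == 0 && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return -ENOSPC;
    this->arp_entries[free_slot] = new;
    return 0;
}

/* Look up the ethernet address for an IP address already in the table. */
int arp_conn_resolv(struct arp_conn *this, struct in_addr in, struct ether_addr *out) {
    int i;

    if (in.s_addr == 0)
        return -ENOENT;
    for (i = 0; i < ARP_MAX_ENTRIES; i++) {
        if (this->arp_entries[i].in_addr.s_addr == in.s_addr) {
            *out = this->arp_entries[i].ether_addr;
            return 0;
        }
    }
    return -ENOENT;
}
```

A linear scan is perfectly adequate for eight entries, and a memset of the whole struct to zero empties the table.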

The interesting part of the implementation for arp_conn is the arp_conn_recv function, which I show here:


int arp_conn_recv(struct arp_conn *this, const unsigned char *data, int len) {
    int i, result;
    struct ether_arp ether_arp;
    struct arp_entry src = {0};
    struct arp_entry tgt = {0};
    struct arp_entry *entry;

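    /* frames on real interfaces may be padded to the Ethernet minimum, so allow trailing bytes */
    if (len < (int) sizeof ether_arp)
        return -EBADMSG;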

    memcpy(&ether_arp, data, sizeof ether_arp);

    if (ntohs(ether_arp.arp_hrd) != ARPHRD_ETHER)
        return -EAFNOSUPPORT;
    if (ntohs(ether_arp.arp_pro) != ETHERTYPE_IP)
        return -EAFNOSUPPORT;
    if (ether_arp.arp_hln != 6)
        return -EBADMSG;
    if (ether_arp.arp_pln != 4)
        return -EBADMSG;

    memcpy(&src.ether_addr, ether_arp.arp_sha, sizeof(struct ether_addr));
    memcpy(&src.in_addr, ether_arp.arp_spa, sizeof(struct in_addr));
    memcpy(&tgt.ether_addr, ether_arp.arp_tha, sizeof(struct ether_addr));
    memcpy(&tgt.in_addr, ether_arp.arp_tpa, sizeof(struct in_addr));

    switch(ntohs(ether_arp.arp_op)) {
        case ARPOP_REQUEST:
            result = arp_conn_add(this, src);
            if (result < 0)
                return result;
            if (tgt.in_addr.s_addr == 0)
                return 0;
            for (i = 0; i < ARP_MAX_ENTRIES; i++) {
                entry = &this->arp_entries[i];
                if (entry->is_local && entry->in_addr.s_addr == tgt.in_addr.s_addr) {
                    ether_arp.arp_op = htons(ARPOP_REPLY);
                    memcpy(ether_arp.arp_sha, &entry->ether_addr, sizeof(struct ether_addr));
                    memcpy(ether_arp.arp_spa, &entry->in_addr, sizeof(struct in_addr));
                    memcpy(ether_arp.arp_tha, &src.ether_addr, sizeof(struct ether_addr));
                    memcpy(ether_arp.arp_tpa, &src.in_addr, sizeof(struct in_addr));
                    return eth_conn_send(this->eth_conn, (const unsigned char *) &ether_arp, sizeof ether_arp, src.ether_addr, ETHERTYPE_ARP);
                }
            }
            return 0;
        case ARPOP_REPLY:
            return arp_conn_add(this, src);
        default:
            return -EOPNOTSUPP;
    }
}

This code checks that it's receiving an ARP ethernet/IP request/reply, then (for requests) it looks through all of the entries in the table until it finds a matching one. To handle gratuitous ARPs, both the request and the reply message cause the sender's ethernet/IP address to be added to the table. Any other operation is not handled and returns an error.

Internet Protocol (IP)

The ip_conn needs to connect to a lot of the other layers. I realised here that there is not much encapsulation of the different functions of the layers - IP needs to know about ethernet and ARP, for instance. There may be a nice way to encapsulate this, but it does not look much simpler than simply putting references to the lower layers in the ip_conn struct. IP also needs to know about ICMP, since it is a higher layer on top of IP.


struct ip_conn {
    struct in_addr src_addr;

    struct eth_conn *eth_conn;
    struct arp_conn *arp_conn;
    struct icmp_conn *icmp_conn;
    unsigned char ttl;
    unsigned short next_id;
};

The send function for IP also requires a destination address - an internet address this time - and the protocol number of the higher layer (this will of course only ever be ICMP in this small project).


int ip_conn_send(struct ip_conn *this, const unsigned char *data, int len, struct in_addr dst_addr, unsigned char proto);
int ip_conn_recv(struct ip_conn *this, const unsigned char *data, int len);

Internet Control Message Protocol (ICMP)

The icmp_conn struct is actually very simple - there are no higher layers and no stored data required. The IP layer below already checks that the destination IP address is OK, and so all I require is that it passes the source IP address to the icmp_conn_recv function. Then icmp_conn can send any ICMP_ECHOREPLY messages right back at the source IP address.


struct icmp_conn {
   struct ip_conn *ip_conn;
};

int icmp_conn_recv(struct icmp_conn *this, const unsigned char *data, int len, struct in_addr src_addr);

I'm only handling ICMP_ECHO messages - not even sending pings out of the connection and receiving a reply - so this really is a minimal implementation of ICMP.


int icmp_conn_recv(struct icmp_conn *this, const unsigned char *data, int len, struct in_addr src_addr) {
    unsigned sum;
    struct icmp icmp;
    _Alignas(struct icmp) unsigned char buf[IP_MAXPACKET];

    if (len < (int) sizeof icmp)
        return -EBADMSG;

    memcpy(&icmp, data, sizeof icmp);

    switch (icmp.icmp_type) {
        case ICMP_ECHO:
            icmp.icmp_type = ICMP_ECHOREPLY;
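            /* RFC 1624 update on host-order values; the type is the high byte of the first 16-bit word */
            sum = ntohs(icmp.icmp_cksum);
            sum = (~sum & 0xffff) + (~(ICMP_ECHO << 8) & 0xffff) + (ICMP_ECHOREPLY << 8);
            sum = (sum & 0xffff) + (sum >> 16);
            sum += sum >> 16;
            icmp.icmp_cksum = htons((unsigned short) ~sum);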
            memcpy(buf, &icmp, sizeof icmp);
            memcpy(buf + sizeof icmp, data + sizeof icmp, (size_t) len - sizeof icmp);
            return ip_conn_send(this->ip_conn, buf, len, src_addr, IPPROTO_ICMP);
        default:
            return -EOPNOTSUPP;
    }
}

The latest versions of ping actually check the ICMP checksum (only very recently did they start to do this), and so I'm actually computing the sum here. The best way to do this is to adjust the existing checksum based on the difference between the ICMP_ECHO and ICMP_ECHOREPLY type values, noting the caveats about one's complement arithmetic documented in RFC 1624.
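As a worked check of the incremental approach, the sketch below implements equation 3 of RFC 1624, HC' = ~(~HC + ~m + m'), on host-order 16-bit values and compares it against a full recomputation. The function names are mine. For an echo request with id 0x1234, sequence 1 and payload "ping", the full checksum is 0x06fa; changing the first word from 0x0800 (type 8, code 0) to 0x0000 gives 0x0efa by either route.

```c
/* Full RFC 1071 checksum over big-endian 16-bit words, used here only to
 * check the shortcut. The checksum field must be zero in the input. */
unsigned short full_checksum(const unsigned char *data, int len) {
    unsigned long sum = 0;
    int i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((unsigned) data[i] << 8) | data[i + 1];
    if (i < len)
        sum += (unsigned) data[i] << 8;
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (unsigned short) (~sum & 0xffff);
}

/* RFC 1624 eqn. 3: new checksum after one 16-bit word changes from
 * m_old to m_new. All values are in host order. */
unsigned short cksum_update(unsigned short hc, unsigned short m_old, unsigned short m_new) {
    unsigned long sum = (~hc & 0xffffu) + (~m_old & 0xffffu) + m_new;

    sum = (sum & 0xffff) + (sum >> 16); /* fold the carries back in */
    sum += sum >> 16;
    return (unsigned short) (~sum & 0xffff);
}
```

The shortcut touches three 16-bit values no matter how large the echo payload is, which is why it is attractive for a reply that differs from the request in a single field.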

Wiring Everything Together

All of the structs can be created on the stack - and I've made a *_conn_init function for each to initialize them. Once that's done, a loop can be started which polls the socket's descriptor and calls phy_conn_recv on the physical connection.


    do {
        result = poll(&(struct pollfd) { phy_conn.sock, POLLIN, 0 }, 1, -1);
        if (result >= 0) {
            result = phy_conn_recv(&phy_conn);
        }
    } while (result >= 0);

The poll call here could incorporate different descriptors, so that, say, an application could send a ping through this interface. For now, it only responds to received messages.
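A sketch of what polling more than one descriptor could look like follows. The second descriptor, app_fd, is hypothetical (a pipe or eventfd on which an application queues outgoing pings, say), as are the function name and its return convention.

```c
#include <errno.h>
#include <poll.h>

/* Hypothetical helper: wait until either descriptor is readable.
 * Returns 0 if sock_fd is ready, 1 if only app_fd is ready,
 * -ETIMEDOUT on timeout, or another negative errno value on failure. */
int wait_either(int sock_fd, int app_fd, int timeout_ms) {
    struct pollfd fds[2] = {
        { .fd = sock_fd, .events = POLLIN },
        { .fd = app_fd, .events = POLLIN },
    };
    int result = poll(fds, 2, timeout_ms);

    if (result < 0)
        return -errno;
    if (result == 0)
        return -ETIMEDOUT;
    if (fds[0].revents & POLLIN)
        return 0;
    if (fds[1].revents & POLLIN)
        return 1;
    return -EIO; /* only error or hang-up events were reported */
}
```

The main loop would then call phy_conn_recv when the result is 0 and service the application's request when it is 1.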

The code is here. To build it, run make. To run it, I've added a run target to the makefile, which must be invoked with the appropriate permissions (CAP_NET_ADMIN). This target will create a virtual ethernet pair, disable the kernel's ARP on one end, and connect this program to it. It may then be pinged via:


ping 10.0.0.2

To re-run, the new virtual ethernet pair must be deleted before the run script can re-create it, via


ip link del veth0