Anon

Experiments in Web Services Design

Anon Technical Overview

Fibers and I/O Driven Concurrency

The story here starts with linux's epoll api and how it supports efficient delivery of i/o notifications. In the epoll model an application can register a set of file descriptors with the kernel and, for each one, specify two pieces of data:

  • What i/o events it wants to be notified about
  • A pointer that it wants to be given when an i/o event occurs for that file descriptor
Linux uses file descriptors to represent lots of different things, but from a web services perspective the most important and typical case is the network socket. An i/o event is something like the point in time when a socket becomes readable without blocking -- that is, when data has arrived from the other computer such that a call to recv will return something without blocking the calling thread. That is the EPOLLIN event.

A reasonable picture of what is going on with epoll is here:

[Linux epoll picture]

Because linux stores a pointer for you with each file descriptor, and you can use that pointer any way you want, you can, as in this example, use it to store the address of a C++ class instance. These instances can implement the pieces of code you want to run whenever it is possible to read from, or write to, a file without blocking.
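
As a concrete illustration, here is a minimal sketch of how such a registration might look. base_class and register_socket are illustrative names rather than anon's actual API; the epoll calls themselves are the real linux API.

#include <sys/epoll.h>

// a handler interface: one instance per registered socket
struct base_class {
  virtual void handle_io() = 0;   // called when i/o is possible on the socket
  virtual ~base_class() {}
};

// register one socket with epoll, storing the handler's address in the
// per-descriptor pointer slot that the kernel keeps for us
void register_socket(int epoll_fd, int sock_fd, base_class* handler)
{
  epoll_event evt = {};
  evt.events = EPOLLIN;        // notify us when the socket becomes readable
  evt.data.ptr = handler;      // handed back untouched by epoll_wait
  epoll_ctl(epoll_fd, EPOLL_CTL_ADD, sock_fd, &evt);
}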

Consider the following simplified code:

// runs forever, dispatching i/o notifications to handler objects
void epoll_loop(int epoll_fd)
{
  while (true) {
    epoll_event evt;
    // block until i/o is possible on at least one registered file descriptor
    if (epoll_wait(epoll_fd, &evt, 1, -1) > 0) {
      // the pointer we registered comes back in evt.data.ptr
      base_class* the_object = (base_class*)evt.data.ptr;
      the_object->handle_io();
    }
  }
}

epoll_wait is the linux api call that waits until i/o is possible on at least one of the registered file descriptors. When it is, epoll_wait returns information about one such file descriptor. The example above asks for only one event at a time and omits error handling, but in spirit this is how the loop operates. In this sort of model, if epoll_wait comes back with a result that represents file descriptor 48 in the picture above, then a simple virtual function call to handle_io gets us to the code that knows what to do when another.server.com has sent us information that we can now read without blocking.

In a number of ways linux's epoll_wait acts like poll and select, but with a few important differences. First, and most importantly, with epoll you don't have to pass in the set of file descriptors you are interested in each time you call the function, as you do with both poll and select. This means the kernel doesn't have to read that set on each call and figure out what to do with each entry. This can make a significant difference when you are interested in a lot of file descriptors - as is frequently the case for a web service under heavy load. It is also more convenient to let the kernel associate the file descriptors and pointers for you than to have to do it by hand.

A last important difference, largely a consequence of the first one (not needing to specify the file set on each call), is that it is now fairly straightforward to have multiple threads all waiting for epoll i/o notification on the same file set. This can further reduce the processing time needed to go from event notification to your handling code whenever the handling code is multi-threaded. In anon's epoll model, epoll_wait is called on the same file set concurrently from multiple threads, and dispatch to the handling code happens directly through function calls, as shown in the code snippet above. These handlers simply execute in the OS thread that received the notification from epoll_wait.
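
A sketch of that multi-threaded arrangement, reusing the epoll_loop function from above (start_epoll_threads is an illustrative name, not anon's API), might look like this:

#include <thread>
#include <vector>

// run the same epoll loop from several OS threads; whichever thread
// receives a notification executes the handler directly
void start_epoll_threads(int epoll_fd, unsigned n_threads)
{
  std::vector<std::thread> workers;
  for (unsigned i = 0; i < n_threads; ++i)
    workers.emplace_back([epoll_fd] { epoll_loop(epoll_fd); });
  // epoll_loop runs forever, so in this sketch join() never returns
  for (auto& w : workers)
    w.join();
}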

Callback Oriented Design

A significant drawback to this sort of epoll code is that while it is very efficient at notifying you that you can read some data from a socket, it can't tell you whether you can read all the data you might be interested in. It could be that a server has responded with only some of the information it is going to send you, and you are likely to need to start reading what it has sent so far to even determine whether it has sent you a complete reply or not.

Consider the case where epoll tells you that you can read some data from the another.server.com socket and you begin running your code to parse the http headers you expect to be at the start of this message. It could be that part way through parsing them you get to the end of the data that is currently available to read on that socket. If this happens your parser has only two realistic choices. It can be designed to use a blocking socket, in which case the next read call on that socket will block the calling thread until the next network packet shows up from that server. Or, it can use a non-blocking socket, and then be written to detect the EAGAIN state. But in that case you need to write your http parser in such a way that it can be suspended and later resumed when the packet does show up and you get the next epoll notification for it.

Each of these has drawbacks. A common pattern found in code that attempts to never block is the one used by libevent, and it corresponds to the second of the two cases above. This is sometimes called a Callback Oriented Design because of its heavy use of callbacks. The overall design starts from the idea that i/o processing can take a long time, so you begin by initiating it, and when you do you provide a callback function that will get called when the i/o operation completes. But this sort of design tends to interrupt the linear flow of source code, because what happens next can't easily be specified as the next line in the source code. In a Callback Oriented Design, the thing that happens next is frequently specified in some other function.
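
A sketch of that shape of code, with purely hypothetical names (connection, async_read_headers) rather than libevent's actual API, looks something like this:

#include <functional>

struct connection;   // opaque connection type, for the sketch only

// initiate a read; the supplied callback runs later, when the event
// loop decides that enough data has arrived to parse the headers
void async_read_headers(connection* c, std::function<void(connection*)> on_done);

// "what happens next" lives here, in a separate function...
void on_headers_parsed(connection* c)
{
  // ...continue handling the request, possibly much later
}

// ...rather than on the line after the call that started the read
void start_request(connection* c)
{
  async_read_headers(c, on_headers_parsed);
}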

Fibers and how they help

Fibers are sometimes known as user-level threads. They are supported on many different operating systems, including linux. The idea is that they are similar to OS threads: each has its own stack and saved register state, and code running in a fiber can make function calls, return from them, and do all the other "normal" things. But unlike normal operating system threads, they require an explicit call to switch from running one fiber to another. For a whole host of reasons, switching fiber contexts by calling an API is almost always faster than the OS switching thread contexts. One common reason is that a fiber switch doesn't need a particularly sophisticated strategy for deciding which fiber to run when. A good OS thread scheduling algorithm ensures that all threads are run at reasonable times; a fiber scheduling algorithm that attempted this sort of fairness would lose a great deal of its performance advantage over OS thread switching.
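
To make the "explicit call to switch" point concrete, here is a tiny self-contained example using linux's ucontext API. Anon uses its own switching code; this only shows the general mechanism.

#include <ucontext.h>
#include <cstdio>

static ucontext_t main_ctx, fiber_ctx;

static void fiber_body()
{
  std::printf("running on the fiber's stack\n");
  swapcontext(&fiber_ctx, &main_ctx);   // explicit switch back to main
}

int main()
{
  static char stack[64 * 1024];         // each fiber owns its own stack
  getcontext(&fiber_ctx);
  fiber_ctx.uc_stack.ss_sp = stack;
  fiber_ctx.uc_stack.ss_size = sizeof(stack);
  fiber_ctx.uc_link = &main_ctx;        // where to go if fiber_body returns
  makecontext(&fiber_ctx, fiber_body, 0);

  swapcontext(&main_ctx, &fiber_ctx);   // explicit switch into the fiber
  std::printf("back on the original stack\n");
  return 0;
}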

But anon has a perfect use case for fibers. In anon, execution continues in a single fiber until that fiber runs out of i/o. For example, a fiber can continue to execute and do whatever its logic is designed for until it reaches the next point of wanting to read from a socket that has no more data. That EAGAIN condition signifies an appropriate time to switch fibers, since some other fiber in the system might be at a point where there is data available for it to read from its socket.

A core piece of anon is that when a fiber hits an EAGAIN condition attempting to perform i/o, it does a fiber switch back to the point where it calls epoll_wait. This allows the OS threads that run these fibers to remain busy executing interesting logic as long as there are sockets that are either ready to be read from or written to.

Anon's fiber model permits code that is structured something like this:

// appears to run until the end of the message.
// but, in fact switches out to other fibers if
// this socket runs out of data and others have
// available data
void parse_message()
{
  while (auto token = read_next_token()) {
    do_something_with(token);
  }
}

In anon, read_next_token does a fiber-friendly read, which means that as soon as there are no bytes to be read it does a fiber-switch back to the epoll_wait code - allowing this OS thread to run other fibers whose sockets do have available data. When this, or any other, OS thread sees that there is again data for this socket (epoll_wait returns a result for it), it fiber-switches back to the point where this fiber previously switched out. So from a source code perspective the flow looks linear, the same way it does with thread-blocking operations. In the example above, the fiber is simply suspended inside of read_next_token until that function can read an entire token.
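
A minimal sketch of what such a fiber-friendly read can look like, assuming a non-blocking socket and a hypothetical scheduler hook (suspend_current_fiber_until_readable is an assumed name, not anon's actual function):

#include <sys/socket.h>
#include <cerrno>

// assumed hook into the fiber scheduler: parks the calling fiber until
// epoll reports EPOLLIN for fd, letting other fibers run meanwhile
void suspend_current_fiber_until_readable(int fd);

ssize_t fiber_read(int fd, void* buf, size_t len)
{
  while (true) {
    ssize_t n = recv(fd, buf, len, 0);
    if (n >= 0)
      return n;                            // data arrived (or orderly shutdown)
    if (errno != EAGAIN && errno != EWOULDBLOCK)
      return -1;                           // a real error
    // nothing to read right now: fiber-switch back to the epoll_wait code,
    // resume here once the socket is readable again
    suspend_current_fiber_until_readable(fd);
  }
}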

This ends up meaning that anon's fiber scheduling rules are derived from epoll's i/o event delivery mechanism. Anon implements an i/o driven concurrency model. And this is not a new idea.

Does that actually help?

Like anything, it depends on what you measure against. But you can do comparisons to similar OS thread switching by writing an application that does the following:

  • In one thread, accept socket connections and create "handlers" for each
  • In a second thread, make many connect calls to the listening socket of the first thread

Now, consider a very simple protocol where the sender (second thread) just sends a single byte. When the handler sees a byte it echoes that byte back. After the sender sends its byte it attempts to read this echoed byte from the handler. To compare the fiber solution to a normal OS thread solution, we want to look at the difference between the first thread creating a new OS thread for each new handler vs. creating a new fiber. Since the syntax available inside of a fiber is similar to what we can write inside of a thread, we can say that in both cases the handlers look something like:

// protocol handler
void handle_message(pipe& p)
{
  while (auto byte = p.read()) {
    p.write(byte);
  }
}

In the OS thread case, pipe is a type whose read and write map directly to recv and send, and the underlying file descriptor is set to blocking. In the fiber case the file descriptor is non-blocking and the methods map to versions of these calls that check for EAGAIN; when they detect it, they fiber-switch back to the epoll_wait code. In both cases, when the caller sends a zero byte the protocol is over and the routine returns, terminating the OS thread or fiber.
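
For the OS thread case, a sketch of such a pipe type might look like the following; the exact shape is an assumption, not anon's real class. The single-byte read and write match the echo protocol above.

#include <sys/socket.h>

class pipe {
public:
  explicit pipe(int fd) : fd_(fd) {}

  // blocks the calling OS thread until one byte arrives;
  // returns 0 on orderly shutdown or error, ending the protocol
  char read() {
    char byte = 0;
    if (recv(fd_, &byte, 1, 0) <= 0)
      return 0;
    return byte;
  }

  void write(char byte) {
    send(fd_, &byte, 1, 0);
  }

private:
  int fd_;
};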

In the OS thread case the code that calls accept looks something like:

// accept loop
void accept_loop(int listening_socket)
{
  while (true) {
    int new_connection = accept(listening_socket, 0, 0);
    // each accepted connection gets its own OS thread
    std::thread([new_connection] {
      pipe p(new_connection);
      handle_message(p);
    }).detach();
  }
}

where pipe is assumed to have a constructor that takes the new_connection file descriptor, as in the sketch above. What is important here is that each call to accept results in a new OS thread running handle_message with that one socket.

The fiber equivalent that we want to compare to is identical except that:

  • Its handle_message uses a pipe type that does fiber-friendly reads and writes
  • Its accept_loop creates fibers running handle_message instead of std::threads (see the sketch below)
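
A sketch of that fiber version follows. Here fiber::run stands in for whatever call creates and schedules a new fiber (an assumed name, not necessarily anon's real API), and pipe is the fiber-friendly, non-blocking flavor:

void fiber_accept_loop(int listening_socket)
{
  while (true) {
    int new_connection = accept(listening_socket, 0, 0);
    // a new fiber, not a new OS thread; it runs on the small, fixed
    // pool of OS threads that sit in epoll_wait
    fiber::run([new_connection] {
      pipe p(new_connection);   // fiber-friendly pipe
      handle_message(p);
    });
  }
}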

With these two different versions of a server, we can write a client that looks something like:

void client()
{
  int socks[400];
  for (int i=0; i<400; i++)
    socks[i] = connect( ... to the server ... );
    
  for (int msg=0; msg<10000; msg++) {
  
    // first send one message to each of the sockets
    for (int i=0; i<400; i++)
      send_one_byte(socks[i]);
      
    // now loop through and read all the responses
    for (int i=0; i<400; i++)
      read_one_byte(socks[i]);
  }
}

The client app is careful to force thread or fiber context switching in the server by writing one message to each of the connections before entering the phase where it reads replies. This makes sure that the server isn't running in a mode where it reads long sequences of messages on a single connection and just returns them without needing to switch to handling other connections. That sort of "do a small amount of work on one connection and then move on to the next" behavior is typical of a real-world web service, and is exactly what we want to measure. What we want now is to put a timer around the "msg" for loop and compare the timings for the two kinds of servers.

Some Real Timing Data

Anon's test code contains this exact sort of testing, allowing you to make this comparison on whatever machine configuration you want to test. The data below was collected on an Ubuntu 14.04 VM, hosted by VMWare running on a 2.3 GHz Intel Core i7 MacBook Pro laptop. One axis of the timing data varies the number of CPU cores given to the VM instance, using 1, 2, and 4 cores. The fiber switching code is designed to run as many OS threads as there are CPU cores, so while the client application causes the fiber server to create 400 fibers, all of those fibers run on 4 OS threads if the VM is set to 4 cores. The OS thread server creates 400 OS threads in response to the client app, regardless of how many cores the VM has to work with. The connect call, abbreviated in the code example above, uses the loopback network address, so it short-circuits a great deal of what would occur if the client and server were on different machines.

Dispatch Timing Comparison for 4,000,000 Messages

              1 Core             2 Cores            4 Cores
Fibers        3.7 - 3.8 secs.    2.4 - 2.5 secs.    2.3 - 2.4 secs.
OS Threads    8.7 - 9.0 secs.    10.6 - 10.9 secs.  12.4 - 17.2 secs.

This timing shows a number of interesting things. First, fibers always beat OS threads by a significant margin. But the two cases show interesting behavior as you vary the number of CPU cores. In the 1 core fiber case, both the client and server can get into a state where each has previously called recv from a thread whose socket didn't have data available at the time, so the thread was put to sleep. Once in that state those threads can't be woken until the right other thread is woken and writes into the other side of the socket. The fact that there is only a single CPU core servicing these threads makes it harder to get out of this state once it has been entered. This does not happen as frequently when there is more than one core, and you can see that the timings for the 2 and 4 core cases are much more similar.

Here is some speculation about why increasing the number of available cores increases thread context switching time for OS threads. There is likely some amount of synchronization between cores required to correctly schedule threads, and this synchronization cost grows as the number of cores increases. Since very little work is done in each thread before it blocks again on the next recv call, this core synchronization time dominates the total execution time.

Interested in other Cool Timing Comparisons?

Final Observations

Fibers and efficient handling of i/o events can significantly speed up processing time for a server under load. Anon provides a tool set for using these techniques that is similar to the way one uses normal threads and event dispatching mechanisms.