Anon

Experiments in Web Services Design

It Goes To 11!

The Anon Story

Anon is an experiment in server design for Services. It attempts to achieve several goals:

  • Maximize efficiency on a single CPU instance
  • Efficiently deal with "over-maximum" requests
  • Provide a reasonable dev environment

Anon is highly experimental and is unlike other service designs in several ways. For example, it is currently written entirely in C++. The name "anon" comes from the first two goals: it is the word anon, meaning something like "again". It is not an abbreviation for anonymous.

A common problem in some server designs is one that can be called the infinite queue problem. It appears when the Service design has components that look something like the following -- shown here in C++, though it can exist in any language:

#include <deque>
#include <sys/socket.h>

void process_one_connection(int conn);

// some global (synchronization deliberately omitted -- see below)
std::deque<int> g_new_connections;

// runs in one thread
void new_connections_loop(int listening_socket)
{
  while (true) {
    int new_connection = accept(listening_socket, 0, 0);
    g_new_connections.push_back(new_connection);
  }
}

// runs in another thread
void process_connections_loop()
{
  while (true) {
    int conn = g_new_connections.front();
    g_new_connections.pop_front();
    process_one_connection(conn);
  }
}

This code isn't meant to be fully correct; it's missing mutexes and other details. But it shows a central feature of many Service designs: there is some kind of queue, shown above as the std::deque g_new_connections. One thread of execution accepts new network connections as fast as it can and puts each one on the queue. Another thread pulls the recently established connections off of the queue and processes them as fast as it can.

The basic problem illustrated here is that there isn't a good way for the second thread to keep the first from getting too far ahead of it. If process_one_connection takes a long time to complete relative to how fast other machines are trying to establish new connections to this one, then the g_new_connections queue can grow arbitrarily large. The root cause of many Critical Service Outages, particularly those that occur under excessive server load, can be traced to this basic problem: a consumer of queued requests falls behind the producer. When that starts to happen, the percentage of the Service's total capacity that is dedicated to maintaining the queue itself grows. In many designs that overhead slows the consumer side (process_one_connection) more than the producer side (new_connections_loop), which then compounds the original problem.

Linux has an errno code named EAGAIN, which it uses when certain operations are set to be non-blocking and are not currently possible for one reason or another. For example, trying to read from a non-blocking socket that doesn't currently have any data to be read generates EAGAIN. A common use of EAGAIN would be to set listening_socket above to be non-blocking. Then the call to accept would return -1 and set errno to EAGAIN if there were no connections available when it was called. That kind of usage allows the calling thread to go do something else instead of staying stuck in accept until someone tries to connect to the machine.
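
A minimal sketch of that usage, assuming a plain POSIX socket API (the helper names here are illustrative, not part of Anon), might look like this:

#include <deque>
#include <sys/socket.h>
#include <fcntl.h>
#include <errno.h>

// the queue from the earlier example
std::deque<int> g_new_connections;

// call this once on listening_socket so accept can return EAGAIN
void set_non_blocking(int fd)
{
  int flags = fcntl(fd, F_GETFL, 0);
  fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

// accept everything that is currently ready, then return instead of blocking
void accept_ready_connections(int listening_socket)
{
  while (true) {
    int new_connection = accept(listening_socket, 0, 0);
    if (new_connection == -1) {
      if (errno == EAGAIN || errno == EWOULDBLOCK)
        return;   // nothing to accept right now -- go do something else
      return;     // some other error -- real code would inspect and handle it
    }
    g_new_connections.push_back(new_connection);
  }
}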

But EAGAIN can also be used on the write side of an operation. If g_new_connections were a pipe of some kind instead of the deque that is shown, then the push_back would become some kind of write call, and the front and pop_front would be replaced by read calls. In that kind of design the pipe could be set non-blocking, and since it has a fixed internal buffer size, if new_connections_loop gets too far ahead of process_connections_loop the write call will fail with errno set to EAGAIN. This can serve as a natural way for a consumer of requests to signal to the producer that it needs to slow down. The consumer doesn't need to do anything at all. The fact that it is unable to read fast enough causes the producer to be unable to continue writing.
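
As a rough sketch of that pipe-based variant -- the helper names are illustrative, not Anon's actual API, and real code would check every return value -- the deque could be replaced with a Linux pipe whose write end is set non-blocking:

#include <unistd.h>
#include <fcntl.h>
#include <errno.h>

int g_conn_pipe[2];   // [0] is the read end, [1] is the write end

void create_connection_pipe()
{
  pipe(g_conn_pipe);
  // non-blocking on the write end only: write fails with EAGAIN when the
  // pipe's fixed-size buffer is full, while the consumer can still block in read
  int flags = fcntl(g_conn_pipe[1], F_GETFL, 0);
  fcntl(g_conn_pipe[1], F_SETFL, flags | O_NONBLOCK);
}

// producer side: replaces push_back in new_connections_loop;
// returns false when the consumer has fallen too far behind
bool queue_connection(int new_connection)
{
  // a write of a few bytes to a pipe is atomic, so no partial-write handling
  ssize_t written = write(g_conn_pipe[1], &new_connection, sizeof(new_connection));
  if (written == sizeof(new_connection))
    return true;
  if (written == -1 && errno == EAGAIN)
    return false;   // pipe is full -- the consumer is telling us to slow down
  return false;     // some other error -- real code would handle it
}

// consumer side: replaces front/pop_front in process_connections_loop
bool next_connection(int* conn)
{
  ssize_t got = read(g_conn_pipe[0], conn, sizeof(*conn));
  return got == sizeof(*conn);
}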

Even without setting the pipe to be non-blocking, using a pipe with a fixed capacity will cause new_connections_loop to block inside its write call, which keeps it from calling accept again. That would then keep client machines from being able to connect and send new requests, creating a kind of speed limit that process_connections_loop can impose on the entire Service. But having client machines fail to connect without understanding why makes it hard to get those clients working correctly. So a basic principle of Anon is to propagate the EAGAIN concept through the entire Service.

In the Anon design, new_connections_loop does use a non-blocking pipe, and so can see when it has gotten too far ahead of process_connections_loop. It can then enter a state where newly accepted connections are immediately answered with a kind of EAGAIN message. That distributes the EAGAIN processing throughout the entire Service -- thus the name Anon for the project.
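
A rough sketch of what that producer-side behavior could look like, reusing the hypothetical queue_connection helper from the pipe sketch above (the "try again" reply shown is just a placeholder, not Anon's actual wire format):

#include <sys/socket.h>
#include <unistd.h>

// hypothetical helper from the earlier sketch: returns false when the
// connection pipe is full, i.e. the consumer has fallen behind
bool queue_connection(int new_connection);

void new_connections_loop(int listening_socket)
{
  while (true) {
    int new_connection = accept(listening_socket, 0, 0);
    if (new_connection == -1)
      continue;   // real code would look at errno

    if (!queue_connection(new_connection)) {
      // the consumer is behind -- answer immediately with a "try again"
      // style reply instead of letting an unbounded queue build up
      const char busy[] = "EAGAIN\n";   // placeholder for Anon's real message
      write(new_connection, busy, sizeof(busy) - 1);
      close(new_connection);
    }
  }
}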

Conveniently, EAGAIN is errno 11, nicely tying the name to the other goal of maximum efficiency ("It Goes To 11!"). A second piece of Anon is a design that makes good use of epoll, Linux's event dispatching mechanism. It provides a platform where all request processing can be done free of any thread-blocking operations. In fact, the goal is to allow Anon servers to run in a model where the number of running OS threads is equal to the number of CPU cores. In this model, each request is handled by a user-level thread (fiber), and fiber scheduling is driven by epoll.
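
As a very rough illustration of that model -- this is generic epoll usage with a hypothetical resume_fiber_for hook, not Anon's actual fiber scheduler -- each OS thread might run a loop like this:

#include <sys/epoll.h>

// hypothetical hook: resume whichever fiber was parked waiting on this fd
void resume_fiber_for(int fd);

// one of these loops per CPU core, each driving its own set of fibers;
// the caller is assumed to have created epoll_fd with epoll_create1 and
// registered the relevant sockets with epoll_ctl
void epoll_loop(int epoll_fd)
{
  epoll_event events[64];
  while (true) {
    int n = epoll_wait(epoll_fd, events, 64, -1);
    for (int i = 0; i < n; i++) {
      // the fd became ready, so the fiber that was waiting on it can
      // continue without ever having blocked an OS thread
      resume_fiber_for(events[i].data.fd);
    }
  }
}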

Technical Overview