Wednesday, January 13, 2016

Asynchronous Java webserver "Hello World" baseline

the TechEmpower plaintext benchmark (TEBM) attempts to give some rough measure of how a server (or framework) responds to the simplest of requests, and as such provides some guidance when configuring a server. however, most of the implementations are blocking (ie, use one thread per connection). i'm working on a demo for a database that i've written, and want to choose an appropriate server. the database:
  • is implemented in java and uses java methods (and lambdas) as the query language
  • uses the Kilim fiber library for high concurrency, ie imperative code is transformed into coroutines
  • is initially targeting low-end boxes, eg a VPS with limited cpu and memory resources, rentable for approximately $5 a month
  • is targeted at java developers that don't want the complexity of an enterprise environment
so i've reproduced something vaguely resembling the TEBM for a few asynchronous java servers, and a couple blocking implementations as a baseline, with a focus on simplicity and high concurrency, each implemented as an embedded server:
  • Jetty blocking
  • Undertow blocking
  • Jetty Async (servlet)
  • Kilim Async - a simple http server that's part of the library
  • Comsat Async - a Quasar (another fiber library) shim that integrates with a Jetty servlet
  • Undertow Async (native)
i also looked at spark-java, but the performance wasn't comparable so i haven't included it here

my test machine is a i3 at 3100Mhz with 16GB. memory didn't appear to be an issue. the test client is ApacheBench. the servers and client are run with "ulimit -n 102400", and i've used sysctl to get 64000-ish available ports. i would run each server and sweep the AB concurrency from 1000 to 20000, each AB test running for approximately 10 seconds, for a total of 40 million requests for each server (half timed, half warmup). i then waited for 60 seconds to account for any delayed effects. throughout, i crudely monitored cpu usage. keep-alive was used (without it, the connection time dominated the performance). the tests were run many times to eliminate problems and develop an understanding of the performance, but in the end only 3 sweeps for each server were needed to get "good" data

source code for the embedded servers and some scripts used for testing:
https://github.com/nqzero/jempower

Async:

the async servers are truly async, not solely using the async APIs. for Undertow and Jetty, the request stores the request in a queue, and a worker thread (or 3) services the queue periodically and sends the response (during extensive searching i did not find an example of this behavior online). for Comsat Jetty, Comsat provides a shim that does the equivalent of the queuing, and the userspace handler runs as a fiber, ie it yields when a blocking operation is performed. Kilim Http is entirely fiber-based, ie even the network code is non-blocking. both Comsat (Quasar) and Kilim transform imperative code into fibers / coroutines by weaving the bytecode

in all cases, the async server handlers are sleeping before responding, and a single thread is responding to 1000s of connections

Results:

the chart below shows the concurrency sweep. the left axis is median requests per second, the right is max (over 3 sweeps) failures. at 20000, undertow-async is all over the place, so don't read too much into the specific number (see the discussion below). here's an interactive version of the chart



Discussion

Undertow, both blocking and async, excelled at absolute throughput at low concurrency (consistent with the TEBM). At higher concurrencies (both sync and async), undertow exhibited 2 issues - "receive failures" and long periods (10-20 seconds) of max-cpu when the sweep finished. neither of these is mentioned in the TEBM. In addition to these problems, the documentation for undertow was limited (i'm waiting on some more feedback from the mailing list) so i'm not planning on going further with this server

Jetty performance, for both blocking and async, was a very nice middle ground - decent low-concurrency performance and a graceful falloff at higher levels. the cost of using async was about 15% across the range (at lower concurrency my async implementation sleeps a little too long to saturate the connection, but that's an artifact of the highly artificial use of async)

Comsat added a substantial overhead on top of async Jetty, and exhibited some failures at high concurrency. i've tried several different implementations based on the documentation and code from TEBM, and all performed about the same

kilim performance was competitive with Jetty, especially at high concurrency and kilim has some nice properties - true fibers (like comsat) so handlers are simple to write, the server itself is fiber-based (unlike Comsat) so blocking IO isn't a problem (async IO can be used in Jetty, but is a pain). however, the server is fairly primitive feature-wise, so at the least another server would be needed to proxy it. i have a working integration with my database that's very elegant, and i'll keep that, but want to add a more versatile server as well

Conclusions

  • Kilim offers competitive performance and the advantages of a true fiber-based server
  • Jetty Async offers a good mix of features, performance, and ease of use
  • Comsat performance was underwhelming, but the ability to shim fibers with servlets isn't (i haven't looked at the source, but my understanding is that it's a pretty simple and elegant hack) and i may investigate using Quasar in my database
  • undertow seems very capable, but isn't accessible to non-experts (or at least, not to me)
i'll be using Kilim and Jetty for my demo

Caveats

  • i'm not an expert
  • ApacheBench doesn't rate-limit connections, so the results aren't necessarily apples-to-apples (ie, faster servers get worked "harder")
  • at least the undertow handlers could saturate ApacheBench at low concurrencies, and running a second instance of AB increased throughput (120000 requests per second). not the focus of this study, so the results above use only a single AB
  • pull requests are welcome (though part of my need is for my demo to be easy for potential users to understand, so i probably won't integrate solutions that greatly increase complexity)



1 comment:

Fabio Tudone said...

Hi, thanks for your benchmark work, I think more of this is needed in general.

I started from your work and built a new set of benchmarks, you can find the post here: http://blog.paralleluniverse.co/2016/03/30/http-server-benchmark/