Resty Core – The Good, The Bad, and the JIT-y

OpenResty’s biggest selling point is its performance. Embedding Lua allows some pretty cool new features to pop into a simple Nginx proxy, and the synchronous-but-non-blocking paradigm introduced by hooking into the event loop (thanks, epoll!) is awesome. But the OpenResty stack as a whole really shines when everything is tied together under the umbrella of approaching, crossing, and then shattering C10k. Out of the box, Lua code embedded into any number of phase handlers runs at an impressive speed, and with the flick of a switch, we can really kick things into high gear.

lua-resty-core is, for the most part, a drop-in replacement for parts of the ngx.* API provided by lua-nginx-module. It’s FFI-based, so it requires the LuaJIT compiler; this is what provides the performance bump, since the ngx.* API code can now be JIT-compiled alongside the user code, whereas the existing ngx.* API is implemented entirely in C and thus cannot be compiled. The module is distributed with OpenResty; usage is dead simple:
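A minimal configuration (a sketch; the port and location here are arbitrary) just loads the library once during initialization:

```nginx
# nginx.conf sketch: load resty.core once, in the init phase
http {
    init_by_lua_block {
        -- overwrites supported ngx.* calls with FFI-based Lua versions
        require "resty.core"
    }

    server {
        listen 8080;

        location /t {
            content_by_lua_block {
                -- this now runs the JIT-compilable Lua implementation
                ngx.say(ngx.get_phase())
            }
        }
    }
}
```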

By require'ing the top-level module, we open up a number of new modules:
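At the time of writing, resty.core pulls in roughly the following submodules; treat this as a representative subset, since the exact list varies by version:

```lua
-- loaded by require "resty.core" (representative subset)
require "resty.core.var"      -- ngx.var.*
require "resty.core.ctx"      -- ngx.ctx
require "resty.core.shdict"   -- ngx.shared.DICT methods
require "resty.core.regex"    -- ngx.re.*
require "resty.core.time"     -- ngx.now, ngx.time, ngx.update_time, ...
require "resty.core.hash"     -- ngx.md5, ngx.sha1_bin, ...
require "resty.core.exit"     -- ngx.exit
require "resty.core.base64"   -- ngx.encode_base64 / ngx.decode_base64
```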

Apart from base, which provides helper functions and constants, each core module re-implements some part of the ngx API as a Lua function, making an FFI C call back into a separate function provided by lua-nginx-module (when NGX_LUA_NO_FFI_API is not defined at compile time). Each supported element of ngx.* is simply overwritten (remember that ngx is not a read-only table). Take the case of the timing functions: ngx.now(), for example, is now a pure Lua function that makes an FFI C call, so calling ngx.now() inside your OpenResty app executes this Lua function, which in turn executes tonumber on the result of calling the FFI C function. So how does this improve performance, if we’re still eventually calling into C? Let’s look at an example in ngx.get_phase(), which I’ve created and submitted upstream. The existing get_phase() call is provided as a C function, injected into the ngx namespace.

I’ve written a patch that re-implements this C function as a thin helper callable over FFI, with ngx.get_phase() overwritten on the Lua side as we saw before. The C code that can’t be JIT-ed is reduced to that small FFI-callable helper.

And the new lua-resty-core module is straightforward as well:
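In sketch form, it looks something like this (simplified from the real implementation; the phase-constant table is truncated here and error handling is abbreviated):

```lua
local ffi = require "ffi"
local base = require "resty.core.base"

local C = ffi.C
local get_request = base.get_request
local errmsg = base.get_errmsg_ptr()

ffi.cdef[[
    int ngx_http_lua_ffi_get_phase(ngx_http_request_t *r, char **err);
]]

-- maps the context flag returned from C to a phase name (truncated)
local context_names = {
    [0x001] = "set",
    [0x002] = "rewrite",
    [0x004] = "access",
    [0x008] = "content",
}

-- overwrite the C-implemented API call with a JIT-compilable Lua function
function ngx.get_phase()
    local r = get_request()
    if not r then
        -- no request object yet: we must be in the init phase
        return "init"
    end

    local phase = C.ngx_http_lua_ffi_get_phase(r, errmsg)
    if phase == base.FFI_ERROR then
        error(ffi.string(errmsg[0]), 2)
    end

    return context_names[phase]
end
```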

We can micro-benchmark both of these calls with the resty CLI utility and some of the jit libraries provided by LuaJIT. The benchmark is simple enough: just call get_phase() repeatedly in a local assignment, and use jit.dump to spew some details (the lua-resty-core unit tests use less verbose jit libraries for verifying the traces):
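The harness might look roughly like this (a sketch; the iteration count, dump flags, and output path are arbitrary):

```lua
-- bench.lua: run via `resty bench.lua`, with and without resty.core
-- (uncomment the next line for the second run)
-- require "resty.core"

local dump = require "jit.dump"
dump.on("tbi", "/tmp/jit.out")  -- record traces, bytecode, and IR

ngx.update_time()
local start = ngx.now()

local phase
for _ = 1, 1e7 do
    phase = ngx.get_phase()
end

ngx.update_time()
ngx.say((ngx.now() - start) * 1e6, " us (phase: ", phase, ")")
```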

The first half of this, sans lua-resty-core, takes about 75000 microseconds to run on my laptop, and we get expected negative results from the JIT compiler:

Pretty obvious: C code called from interpreted Lua cannot be compiled, so the trace aborts. What happens when we simply require "resty.core" in our benchmark file? The results (no JIT tracing, just timing output):

The generated bytecode for the JIT trace follows:

And the IR:

So we can clearly see the effects of the new FFI API that we can leverage simply by loading lua-resty-core during our initialization. For the vast majority of cases, this acts as a simple drop-in replacement, though in the past I’ve run into a few hiccups with very minor behavior changes (nothing that couldn’t be fixed 😉 ), so as always, test your environment when rolling this out. And, it should go without saying, micro-benchmarks like this are almost never an accurate representation of real-world use cases, and can sometimes be inaccurate to the point of producing invalid results (link below to an OpenResty mailing list entry describing this), so take this only as a laboratory example of the implementation differences.

Beyond micro-benchmarks, there are practical performance implications to bringing in resty core. Consider my previous benchmarks on using ngx.ctx and shared dictionaries; with resty core loaded, the differences are significant (see my previous post for the benchmarking guts):

We can see valuable, practical performance benefits from loading resty core with respect to accessing the per-transaction ngx.ctx table. Of course, localizing the table for a large number of accesses is still the most performant approach in this micro-benchmark, but for in-the-wild use cases, we can see a clear gain. As long as the environment in question ships LuaJIT (preferably LuaJIT 2.1), there’s really no reason not to bring in lua-resty-core.
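To make the localization point concrete (a trivial sketch, not the original benchmark code): caching ngx.ctx in a local variable pays the retrieval cost once instead of on every access:

```lua
-- each bare ngx.ctx reference re-fetches the per-request table
ngx.ctx.counter = 0

-- localize once, then index it as an ordinary Lua table
local ctx = ngx.ctx
for _ = 1, 1000 do
    ctx.counter = ctx.counter + 1
end
```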

Looking further beyond benchmarks, consider the use case of lua-resty-core in a real application like lua-resty-waf. From a high-level perspective, we can measure the performance impact of lua-resty-core by simply measuring the amount of time we spent in userspace Lua code, as measured by the stapxx toolkit. Without lua-resty-core, our runtime might look something like this, using a simple test harness designed to cover the majority of the WAF codebase. We’ll use the lj-vm-states and ngx-lua-exec-time tools to gauge the amount of JIT-compiled code, and the overall time we spent in Lua:

So half of our time was spent in C code called by interpreted Lua, which cannot be JIT compiled. Additionally, our average runtime was ~320 microseconds. What happens when we load in resty core?

Again, the only change here was the addition of require "resty.core" in the init_by_lua handler. We can see an almost 20% increase in average performance just by loading in resty core, as well as a significant jump in the amount of JIT-ed code (and as a side note, these results are a good indication that we can do more work optimizing lua-resty-waf; more on this in a later post 😉 ). This indicates that the large portion of lua-resty-waf that is dependent on the ngx.* API will see significant performance improvements from including this library. Not shown here is the result from the wrk benchmark platform, which saw a nearly 25% increase in request rate when including resty core.

So what’s the downside? Well, when your Lua code cannot be JIT compiled, you may actually see a performance decrease by loading this library. From the docs:

If the user Lua code is not JIT compiled, then use of this library may lead to performance drop in interpreted mode. You will only observe speedup when you get a good part of your user Lua code JIT compiled.

This is a result of how LuaJIT moves back and forth between compiled and interpreted Lua code: overhead is involved in switching between the two, and a large accumulation of these switches can lead to a performance drop. Personally, I have yet to see a case where this causes actual problems in production, but this is where deep application and system profiling comes into play (details on this coming in a future post).

There’s a lot of available literature on LuaJIT; Marek Vavruša wrote some very good material, agentzh (as always) has some excellent insight, and the LuaJIT mailing list is a deep source of knowledge (as it ought to be). I will never claim to be an expert on the theoretical concepts touched on here, like tracing compilers, LuaJIT, or the guts of systemtap. But from an engineering standpoint, lua-resty-core is a well-maintained library and, it seems, the likely future of OpenResty; it provides another part of that extra special sauce that makes up OpenResty.
