Benchmarking Simple String Comparison Options for OpenResty

Perhaps one of the most powerful primitives that lua-nginx-module provides out of the box is a sane, simple wrapper for regular expression operations (via PCRE). String matching, replacing (and now splitting!) via regex allows for much greater flexibility in string processing than Lua’s native string library. Recently, while cleaning up an OpenResty InfluxDB client, I needed to do some simple string comparison. My knee-jerk reaction was to use a simple expression in ngx.re.find, but I had a hunch that the overhead of the PCRE library would be a waste, and that native Lua pattern searches would be quicker. Time for a benchmark to figure out the most sane solution!

Some background: InfluxDB defines a (relatively) straightforward wire protocol for shipping time series data. Each Line Protocol entry contains a measurement, zero or more tags, one or more fields, and an optional timestamp. Tag values are always treated as strings; field value types are automagically inferred from the tokens present in the field. Strings, floats, integers, and boolean values are possible. For whatever (I’m sure not entirely) inane reason, there are five ways to represent each boolean value (when writing a data point; SELECT queries annoyingly have separate limitations, but that’s outside the scope here). Unlike strings, boolean values are not quote-encapsulated in the Line Protocol statement. Compare the following:
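As a rough illustration of the quoting difference (the measurement, tag, and field names here are made up for the example, not taken from the client), the two entries below differ only in whether the field value is quoted:

```lua
-- Hypothetical measurement, tag, and field names, for illustration only.
local as_boolean = 'cpu,host=web01 throttled=true'    -- boolean field value
local as_string  = 'cpu,host=web01 throttled="true"'  -- string field value

-- The five spellings Line Protocol accepts for each boolean when writing.
local truthy = { "t", "T", "true", "True", "TRUE" }
local falsy  = { "f", "F", "false", "False", "FALSE" }

print(as_boolean)
print(as_string)
```

The first entry is stored as a boolean, the second as the string "true".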

In building a client we need to be able to properly type user input from a variety of sources and contexts, so this means accepting both boolean and string Lua types when building the field portion of the protocol statement (we could ask a user to explicitly provide a boolean when they want a Line Protocol boolean, or we could build in some flexibility). Consider the initial approach I took to building field sets:
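A minimal sketch of a "quote everything" field set builder might look like the following. This is an assumption about the general shape, not the client's actual code; build_field_set is a hypothetical name, and a real implementation would also need to escape backslashes.

```lua
-- Hypothetical helper, not the client's actual code: every field value is
-- emitted as a quoted Line Protocol string, whatever its Lua type.
local function build_field_set(fields)
  local keys = {}
  for k in pairs(fields) do keys[#keys + 1] = k end
  table.sort(keys) -- pairs() order is unspecified; sort for stable output

  local out = {}
  for i, k in ipairs(keys) do
    -- gsub returns two values; the parentheses keep only the new string
    local v = (tostring(fields[k]):gsub('"', '\\"'))
    out[i] = k .. '="' .. v .. '"'
  end
  return table.concat(out, ",")
end

print(build_field_set({ enabled = true, name = "web01" }))
```

Note that the Lua boolean true ends up as the quoted string "true" here, which InfluxDB would store as a string rather than a boolean.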

This universally quotes a field value according to the Line Protocol reference, but leaves no room for boolean types. Adding a check to ignore this extra quoting routine for boolean field values is easy enough, but since this is part of a hot code path in the Influx client, we need to be smart about how we check for bools. Benchmarking this out is fairly straightforward:

Here we define a handful of values that look like Line Protocol booleans (plus a few that don’t), and define two regular expressions and one harness for Lua string.find. The first regular expression is essentially just a linear search using the regex engine, and the second expression tries to be a bit smarter (note the use of the negative lookbehind to prevent a match for something like “tRUE”). The Lua harness is just a straight linear search (Lua string patterns support neither alternation nor repetition of multi-character sequences). So how’d we do?
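A harness along these lines can be sketched as follows. The test values, iteration count, and regular expression are illustrative assumptions rather than the originals, and the ngx.re.find branch only runs inside OpenResty, so the block also works under a stock Lua interpreter.

```lua
-- Sketch of a timing harness; values and iteration count are assumptions.
local values = { "t", "T", "true", "True", "TRUE", "tRUE", "yes", "0", "truthy" }

-- One anchored Lua pattern per accepted spelling (no alternation available).
local bool_patterns = {
  "^t$", "^T$", "^true$", "^True$", "^TRUE$",
  "^f$", "^F$", "^false$", "^False$", "^FALSE$",
}

-- Linear search: try each pattern in turn with string.find.
local function is_bool_str(v)
  for i = 1, #bool_patterns do
    if string.find(v, bool_patterns[i]) then return true end
  end
  return false
end

-- Run fn over every test value `iterations` times and report CPU time.
local function bench(name, fn, iterations)
  local start = os.clock()
  for _ = 1, iterations do
    for i = 1, #values do fn(values[i]) end
  end
  local elapsed = os.clock() - start
  print(name .. " time: " .. elapsed)
  return elapsed
end

bench("str", is_bool_str, 10000)

-- ngx.re.find exists only under OpenResty; this expression is an assumption.
if ngx and ngx.re then
  local ex = "^(?:t|T|true|True|TRUE|f|F|false|False|FALSE)$"
  bench("re", function(v) return ngx.re.find(v, ex) end, 10000)
end
```

Timings will vary with the build, the PCRE options in use, and whether resty.core is loaded, so treat any single run as indicative only.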

Not terribly surprising. Here we clearly show that the overhead of regular expression parsing is a waste for our simple case, running an order of magnitude slower, even though we are calling the Lua string.find function an order of magnitude more often than the regular expression match. Now, we can build this logic into a Line Protocol field builder:
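One way such a field builder could look is sketched below. build_field and is_bool_str are hypothetical names, and the escaping is deliberately minimal; this is an illustration of the idea, not the client's actual implementation.

```lua
-- One anchored Lua pattern per boolean spelling accepted by Line Protocol.
local bool_patterns = {
  "^t$", "^T$", "^true$", "^True$", "^TRUE$",
  "^f$", "^F$", "^false$", "^False$", "^FALSE$",
}

local function is_bool_str(v)
  for i = 1, #bool_patterns do
    if string.find(v, bool_patterns[i]) then return true end
  end
  return false
end

-- Hypothetical builder: booleans and boolean-looking strings go out bare,
-- other strings are quoted, everything else is passed through tostring().
local function build_field(key, value)
  local t = type(value)
  if t == "boolean" then
    return key .. "=" .. tostring(value)
  elseif t == "string" then
    if is_bool_str(value) then
      return key .. "=" .. value
    end
    -- parentheses drop gsub's second return value (the match count)
    return key .. '="' .. (value:gsub('"', '\\"')) .. '"'
  end
  return key .. "=" .. tostring(value)
end

print(build_field("up", true))
print(build_field("up", "T"))
print(build_field("name", "tRUE"))
```

The trade-off in this sketch is that a caller can no longer write the literal string "true" as a string field; whether that flexibility is worth it depends on the client's users.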

Of course, it’s important to remember that this is the appropriate solution for our particular case; more complex searches, and cases with a larger search space, need to be tested in their own right, so thinking that Lua’s string matching is always the fastest solution is a naive assumption.

Brute force searches like this are never particularly elegant (which is part of what powered my initial lean toward using a “smarter” expression search), but in low-level plumbing code like this, doing your work and getting out of the way, with the correct results, should be the primary focus.

2 thoughts on “Benchmarking Simple String Comparison Options for OpenResty”

  1. Hi Robert,

    please consider enabling the regex caching options for static patterns for ngx.re.*
    This way the overhead of initializing the regex is only done once.

    Using the regex options “jo” the second test-case turns out to perform fastest:

    re_match(v, true_ex1, "jo")

    re1 time: 0.06600022315979
    re2 time: 0.030999898910522
    str time: 0.045000076293945

    Regards,
    Andreas

    1. Hey Andreas,

      You’re absolutely right! I can reproduce your results when loading in resty.core as well; otherwise, even with pattern caching enabled native string matching is the most performant approach (for this case).

      FWIW, I have a light discussion of PCRE JIT/caching in https://www.cryptobells.com/building-openresty-with-pcre-jit/, and I’m working on a separate post picking at the nitty-gritty of lua-resty-core.

      Thanks for the feedback!
