Efficient Bulk Search in High-Level Languages

Certainly an oxymoronic subject, and not without justification: any decent high-level language developer will leverage existing libraries that have been designed for optimal performance (at least for the majority of use cases). Optimization and high performance in these environments tend to be matters of architecture and engineering discipline, not a science of algorithm optimization and memory management. Any developer worth their salt should understand the basic concepts of computer science, but you're not likely to find a vast array of efficient search algorithms implemented in JavaScript or Ruby. Development platforms (and in turn, high-level languages) are built to offer the best performance possible while letting developers create and build without focusing on plumbing code. This does not, however, limit a language or environment's ability to provide efficient methods of data management, nor should it preclude a developer from leveraging the most efficient (and safest, and sanest) solution possible.

I'll use Perl as an example testbed, largely because Perl's array primitives do not include any sort of search or find method: if you want to know whether $a is in @b, you're gonna have to go index spelunking (and yes, I'm submitting that phrase to Webster). A simple loop that matches on a sentinel is easy enough to implement, and on a small data set, walking the entirety of an array likely won't be too expensive. Alternatively, converting the array to a hashmap and checking for the existence of the key is an acceptable solution, but is it really worth it? We'll look at both the cost of searching each type of data structure and the time needed to build each:
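
The original benchmark script isn't reproduced here, but a minimal sketch of the comparison might look like this (the data set size and sentinel placement are my choices), using Time::HiRes for sub-millisecond timing:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Time::HiRes qw(gettimeofday tv_interval);

    my $size   = 1_000_000;     # data set size; scale by orders of magnitude
    my $needle = $size - 1;     # worst case for the linear scan

    # Time array construction.
    my $t0   = [gettimeofday];
    my @list = (0 .. $size - 1);
    printf "array build:   %.6fs\n", tv_interval($t0);

    # Time hashmap construction from the same values.
    $t0 = [gettimeofday];
    my %map = map { $_ => 1 } @list;
    printf "hash build:    %.6fs\n", tv_interval($t0);

    # Linear search: loop until we match the sentinel.
    $t0 = [gettimeofday];
    my $found = 0;
    for my $elem (@list) {
        if ($elem == $needle) { $found = 1; last; }
    }
    printf "linear search: %.6fs (found: %d)\n", tv_interval($t0), $found;

    # Hash search: a single exists() check on the key.
    $t0 = [gettimeofday];
    $found = exists $map{$needle} ? 1 : 0;
    printf "hash search:   %.6fs (found: %d)\n", tv_interval($t0), $found;

Bumping $size by an order of magnitude per run is enough to expose the scaling behavior.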

The results tell a pretty clear story.

It doesn't take a rocket surgeon to understand how the first test scales. Increasing the search range of a linear function by an order of magnitude increases run time… by an order of magnitude. Shocker. Meanwhile, Perl's exists() function runs in near-constant time (and I say near with a nod to DVK). With regard to data structure construction, we see another story: O(n) time for both structure forms, though the hashmap's slope is greater. This follows from how Perl manages its memory structures: every hash insertion has to hash the key and allocate bucket space, which costs more per element than simply extending an array. Essentially, the choice between linear and hash searches is dictated by use case. Large data sets that are created, then searched sparingly, will perform better overall using a linear search, since the time to build the hashmap may outweigh the benefits gained from a constant search time; existing large data sets, or those that already live in the form of a hashmap, should clearly be searched via hash key checks.
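
To put a number on that break-even point, a quick sketch (where $k, the number of searches, is a knob I've added, not a figure from the tests above) can pit k bare linear scans against one hash build plus k exists() checks:

    use strict;
    use warnings;
    use Time::HiRes qw(gettimeofday tv_interval);

    my $size = 1_000_000;
    my $k    = 10;    # how many searches the data set will serve
    my @list = (0 .. $size - 1);

    # k bare linear scans: no setup cost, O(n) per search.
    my $t0 = [gettimeofday];
    for (1 .. $k) {
        my $needle = int rand $size;
        for my $elem (@list) { last if $elem == $needle; }
    }
    printf "%d linear scans:         %.4fs\n", $k, tv_interval($t0);

    # Pay the O(n) hash build once, then k O(1) exists() checks.
    $t0 = [gettimeofday];
    my %map = map { $_ => 1 } @list;
    exists $map{ int rand $size } for 1 .. $k;
    printf "hash build + %d checks: %.4fs\n", $k, tv_interval($t0);

Since the hash build is itself only O(n), it takes surprisingly few searches before the up-front cost pays for itself.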

The same linear vs. hashmap search comparison plays out in another high-level language, embedded Lua:
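
Again, the original test script isn't shown here; a minimal content_by_lua sketch along these lines illustrates the shape of it (the size and needle placement are my assumptions, and ngx.now() only offers millisecond resolution, hence the single-shot test):

    -- Sizes and needle placement are illustrative assumptions.
    local size   = 1000000
    local needle = size          -- worst case for the linear scan

    -- Build the list and the hash-style table from the same values.
    local list, map = {}, {}
    for i = 1, size do
        list[i] = i
        map[i]  = true
    end

    -- Linear search: walk the list until we hit the needle.
    ngx.update_time()
    local t0, found = ngx.now(), false
    for i = 1, #list do
        if list[i] == needle then
            found = true
            break
        end
    end
    ngx.update_time()
    ngx.say("linear: ", ngx.now() - t0, "s")

    -- Hash search: a single keyed lookup.
    ngx.update_time()
    t0 = ngx.now()
    found = map[needle] ~= nil
    ngx.update_time()
    ngx.say("hash:   ", ngx.now() - t0, "s")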

The Nginx Lua API doesn't provide the high-resolution timer checking we can get in Perl, so this test case is limited to a single example (plus the Lua GC is apparently not strong enough to handle millions of items; see my post to the OpenResty mailing list). Still, we can clearly see that a hash search is far more efficient as the data set grows, and in a web environment (picture a shared memory zone holding millions of IPs in a behavioral analysis algorithm) that receives thousands of concurrent requests, squeezing out those extra 19 milliseconds of performance can make a significant difference.
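
To make that scenario concrete, a hypothetical access-phase check might boil down to a single keyed lookup per request (the flagged_ips module name and its loading mechanism are purely illustrative, not from the tests above):

    -- Hypothetical access_by_lua_block body. flagged_ips is assumed to be
    -- a plain Lua table keyed by IP, loaded once per worker (e.g. via
    -- init_by_lua).
    local flagged_ips = require "flagged_ips"   -- illustrative module name

    -- One near-constant-time lookup per request; linearly scanning
    -- millions of entries here would spend those milliseconds on every
    -- single one of those concurrent requests.
    if flagged_ips[ngx.var.remote_addr] then
        return ngx.exit(ngx.HTTP_FORBIDDEN)
    end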
