Skip to main content
blog

Gentle Introduction to Elasticsearch

By September 9, 2024No Comments

Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. Elasticsearch is developed in Java and is dual-licensed under the source-available Server Side Public License and the Elastic license, while other parts fall under the proprietary (source-available) Elastic License. Official clients are available in Java, .NET (C#), PHP, Python, Apache Groovy, Ruby and many other languages. According to the DB-Engines ranking, Elasticsearch is the most popular enterprise search engine. Source: Wikipedia

elastic

Even with databases spanning hundreds of millions of rows, Elasticsearch was able to deliver millisecond response times that were far superior to anything their Oracle systems could ever hope to deliver.

Regardless amount of data stored, elasticsearch is able to deliver results within miliseconds! But how, let’s breakdown internal working of elasticsearch.

Inverted Indexes

Under the hood elasticsearch keep data stored in a datastructure called ‘Inverted Indexes’ with a given mapping. Working of this datastructure is fairly easy to understand, we just use terms from database, let’s say admin, contact, california, etc as a key and store their frequency, occurence on primary db etc as values, so unlike having (index -> data), it’s stored as (data -> index), hence inverted. So, it’s able to achieve fast search responses because instead of searching the text directly, it searches an index.

Read more about this here.

inverted index

APIs

Elasticsearch uses HTTP web interface, which means any interaction with elasticsearch uses the REST APIs exposed by elasticsearch itself. For example, here is the sample API for creating indexes with settings like shards and replicas(We will cover these futher in this read).

PUT /test_index?pretty
{
    "settings" : {
        "number_of_shards" : 2,
        "number_of_replicas" : 1
    },
    "mappings" : {
        "properties" : {
            "tags" : { "type" : "keyword" },
            "updated_at" : { "type" : "date" }
        }
    }
}

Scale it up.

Elasticsearch really works well for massive datastores. As we know now elasticsearch uses inverted indexes, we store inverted index broken down in multiple places that what we call shards, For availibilty these shards can be replicated according to requirement. So if demand arises, then it can scale horizontally really well.

Furthermore, these shards are stored in nodes, A node is a running instance of elasticsearch. Then we have clusters, an Elasticsearch cluster is a group of nodes that have the same cluster.name attribute. As nodes join or leave a cluster, the cluster automatically reorganizes itself to evenly distribute the data across the available nodes. If you are running a single instance of Elasticsearch, you have a cluster of one node.

cluster

Regular Database vs ElasticSearch

ElasticSearch is a great way of quick data retrieval but there is a trade-off, it requires additional disk space to store it’s own file system of inverted indexes. So, if you observe that you are experiencing delays because of there is tons of data which usually takes longer to search from, then probably you can opt for elastic search. Or you want to use features of ElasticSearch (eg: Stemming, Boosting, etc.) that might not be provided in your Regular Database.

Benchmarking along with Postgres

We’ve conducted an small head to head benchmarking test between ElasticSearch and Postgres in Ruby. Here, we ran it on Postgres using ActiveRecord ORM and used searchkick gem for ElasticSearch. Next we iteratively benchmarked queries in both cases and created 200k fake data in each iteration, query contains string match and date-time comparisions.

Queries:


# Postgres
Contact.where("(gaming_platform LIKE ? OR gaming_platform LIKE ?) AND education = ? AND last_used >= ? AND app_version <= ?", "%Wii%", "%Xbox%", "Bachelor", DateTime.parse("20220202T060000"), "0.8").limit(10000).order([:name_first, :name_last])
# ElasticSearch (searchkick gem)
Contact.search("Wii Xbox", operator: "or", fields: [:gaming_platform], where: { education: 'Bachelor', last_used: {gte: DateTime.parse("20220202T060000")}, app_version: {lte: "0.8"}}, order: { name_first: :asc, name_last: :asc})

Results:

                          user     system      total        real
SQL Queries On Postgres  0.060805   0.018111   0.078916 (  0.085860)
Elasticsearch queries  0.005498   0.000000   0.005498 (  0.010541)
↳ For 200000 records, found 94 records

                          user     system      total        real
SQL Queries On Postgres  0.082427   0.000000   0.082427 (  0.085349)
Elasticsearch queries  0.005763   0.000000   0.005763 (  0.008134)
↳ For 400000 records, found 183 records

                          user     system      total        real
SQL Queries On Postgres  0.088674   0.007989   0.096663 (  0.101067)
Elasticsearch queries  0.004568   0.000761   0.005329 (  0.005955)
↳ For 600000 records, found 274 records

                           user     system      total        real
SQL Queries On Postgres  0.086693   0.015756   0.102449 (  0.144879)
Elasticsearch queries  0.004756   0.001463   0.006219 (  0.008626)
↳ For 800000 records, found 358 records

                           user     system      total        real
SQL Queries On Postgres  0.105945   0.013521   0.119466 (  0.204582)
Elasticsearch queries  0.005394   0.001918   0.007312 (  0.105512)
↳ For 1000000 records, found 443 records

                           user     system      total        real
SQL Queries On Postgres  0.139786   0.038830   0.178616 (  0.325225)
Elasticsearch queries  0.004267   0.003694   0.007961 (  0.138759)
↳ For 1200000 records, found 529 records

when plotted:

out

You can observe here that ElasticSearch is taking almost same time even with more and more data.

Conclusion

Elasticsearch is both a simple and complex product. We’ve so far learned the basics of what it is, how to look inside of it, and how to work with it using some of the REST APIs. Hopefully this tutorial has given you a better understanding of what Elasticsearch is and more importantly, inspired you to further experiment with the rest of its great features!

Explore Other Resources

February 14, 2024 in Case Studies, Product Engineering

CLAS – A system that integrates Hubspot, Stripe, Canvas

About This Project Ziplines is a series A-funded ed-tech startup with one goal—helping students attain the real-world skills they need to thrive in careers they love by partnering with universities.…
Read More
September 9, 2024 in blog

Gentle Introduction to Elasticsearch

Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. Elasticsearch is developed…
Read More
May 17, 2024 in blog

How to fix “OAuth out-of-band (OOB) flow will be deprecated” error for Google apps API access.

Migrate your OAuth out-of-band flow to an alternative method. Google has announced that they will block the usage of OOB based OAuth starting from January 31, 2023. This has forced…
Read More