Query Elasticsearch with Python

Until now, when I needed to query an Elasticsearch cluster and retrieve data, I always ended up doing some curl-fu. That's nice and quick, but if you want to retrieve a lot of data and need something more powerful, you need another tool. Moreover, JSON is awful to deal with on the command line!

I came across these two Python libraries, which are quite nice:

- elasticsearch-py, the official low-level Python client
- elasticsearch-dsl, a higher-level library built on top of it

Combining these two libraries gives you really powerful ways to insert, query and manage your ES data. I won't describe their use in detail, but there are two main features I've used and found very nice:

Filtering

With these libraries, constructing your query is way easier than writing raw JSON:

...
import elasticsearch
from elasticsearch_dsl import Search, Q, F
...
client = elasticsearch.Elasticsearch(ES_HOST)
s = Search(using=client, index=INDEX)

# only return specific fields
s = s.fields(['field0', 'field1'])

# add some filters
s = s.filter("term", _type="SOMETYPE")
s = s.filter("range", count={'gt': "1000"})
s = s.filter(~F("term", othercount="0"))

# sort the results
s = s.sort({'count': {'order': 'desc'}})
...

As shown above, you can express any combination of filters without having to fight with JSON syntax. For more info on the different types of ES filters, see the official ES query DSL documentation.
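For comparison, here is a hand-built approximation of the JSON body those chained calls generate. The exact layout depends on your elasticsearch-dsl and ES versions; this sketch follows the old (pre-2.x) "filtered" query syntax and is for illustration only:

```python
# Rough equivalent of the chained filter calls above, written as the
# raw query body you would otherwise have to construct by hand.
# (Layout is version-dependent; this follows the old "filtered" syntax.)
body = {
    'query': {
        'filtered': {
            'filter': {
                'bool': {
                    # positive filters: .filter("term", ...) / .filter("range", ...)
                    'must': [
                        {'term': {'_type': 'SOMETYPE'}},
                        {'range': {'count': {'gt': '1000'}}},
                    ],
                    # negated filter: .filter(~F("term", othercount="0"))
                    'must_not': [
                        {'term': {'othercount': '0'}},
                    ],
                }
            }
        }
    },
    # s.fields([...]) and s.sort(...)
    'fields': ['field0', 'field1'],
    'sort': [{'count': {'order': 'desc'}}],
}
```

Chaining Python calls instead of maintaining a nested dict like this is exactly what makes the DSL pleasant.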

Query in batch

The next very interesting feature is the ability to perform scroll searches.

This is very efficient when you need to retrieve a large number of documents from a very large index. And it's quite elegant in Python:

...
import elasticsearch
from elasticsearch_dsl import Search, Q, F
...
def es_scroll(client, search, index, func):
    # Initialize the scroll: with search_type='scan', the first call
    # returns no hits, only a scroll id and the total hit count
    gotten = 0
    page = client.search(index=index, scroll=SCROLLDURATION,
                         search_type='scan', body=search.to_dict())
    sid = page['_scroll_id']
    total = int(page['hits']['total'])
    batchgot = total  # non-zero so we enter the loop (unless there are no hits)

    # scroll through the search context until a batch comes back empty
    while batchgot > 0:
        page = client.scroll(scroll_id=sid, scroll=SCROLLDURATION)
        sid = page['_scroll_id']
        batchgot = len(page['hits']['hits'])
        gotten += batchgot
        # do something with the results
        func(page['hits']['hits'])
    ...
    return gotten
...

Scroll searches provide you with a kind of snapshot of your data and allow you to retrieve it in batches. Called a search context, it can be seen as something like a view in SQL databases.

The SCROLLDURATION parameter defines how long the view of the data is kept alive (the timeout is refreshed on each scroll call). The function func is called after each batch to handle the results.
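To see the batching logic in action without a live cluster, here is a self-contained sketch: the same scroll loop, driven by a tiny fake client. FakeClient, its data, and the simplification of passing the body as a plain dict are all hypothetical, for illustration only:

```python
# A self-contained sketch of the scroll loop above, exercised against a
# fake client (FakeClient is hypothetical; a real run would use
# elasticsearch.Elasticsearch and a Search object instead).

SCROLLDURATION = '5m'

class FakeClient:
    """Mimics the two client calls the scroll loop relies on."""
    def __init__(self, hits, batch_size):
        self._hits = hits
        self._batch = batch_size
        self._pos = 0

    def search(self, index, scroll, search_type, body):
        # search_type='scan' returns no hits, only a scroll id and the total
        return {'_scroll_id': 'fake-id',
                'hits': {'total': len(self._hits), 'hits': []}}

    def scroll(self, scroll_id, scroll):
        batch = self._hits[self._pos:self._pos + self._batch]
        self._pos += self._batch
        return {'_scroll_id': scroll_id, 'hits': {'hits': batch}}

def es_scroll(client, body, index, func):
    gotten = 0
    page = client.search(index=index, scroll=SCROLLDURATION,
                         search_type='scan', body=body)
    sid = page['_scroll_id']
    batchgot = int(page['hits']['total'])

    # scroll until a batch comes back empty
    while batchgot > 0:
        page = client.scroll(scroll_id=sid, scroll=SCROLLDURATION)
        sid = page['_scroll_id']
        batchgot = len(page['hits']['hits'])
        gotten += batchgot
        func(page['hits']['hits'])
    return gotten

collected = []
docs = [{'_id': i} for i in range(10)]
total = es_scroll(FakeClient(docs, 4), {}, 'myindex', collected.extend)
# 10 documents retrieved, in batches of 4, 4 and 2
```

Here func is simply collected.extend, but in practice it is where you would write each batch to a file, a database, or another index.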

This is very convenient, and much faster than querying ES one document at a time!

Happy searching in big data!
