User Guide¶
Note
Examples will be based on IMDB dataset data. This is a work in progress. Some sections still need to be furnished.
Query¶
The Query
class allows multiple ways to declare and udpate an Elasticsearch query.
Let’s explore the multiple ways we have to declare the following query:
>>> expected_query = {'bool': {'must': [
>>> {'terms': {'genres': ['Action', 'Thriller']}},
>>> {'range': {'rank': {'gte': 7}}},
>>> {'nested': {
>>> 'path': 'roles',
>>> 'query': {'bool': {'must': [
>>> {'term': {'roles.gender': {'value': 'F'}}},
>>> {'term': {'roles.role': {'value': 'Reporter'}}}]}
>>> }
>>> }}
>>> ]}}
Pandagg DSL¶
Pandagg provides a DSL to declare this query in a quite similar fashion:
>>> from pandagg.query import Nested, Bool, Query, Range, Term, Terms
>>> q = Query(
>>> Bool(must=[
>>> Terms('genres', terms=['Action', 'Thriller']),
>>> Range('rank', gte=7),
>>> Nested(
>>> path='roles',
>>> query=Bool(must=[
>>> Term('roles.gender', value='F'),
>>> Term('roles.role', value='Reporter')
>>> ])
>>> )
>>> ])
>>>)
The serialized query is then available with query_dict method:
>>> q.query_dict() == expected_query
True
A visual representation of the query helps to have a clearer view:
>>> q
<Query>
bool
└── must
├── nested
│ ├── path="roles"
│ └── query
│ └── bool
│ └── must
│ ├── term, field=roles.gender, value="F"
│ └── term, field=roles.role, value="Reporter"
├── range, field=rank, gte=7
└── terms, field=genres, values=['Action', 'Thriller']
Chaining¶
Another way to declare this query is through chaining:
>>> from pandagg.utils import equal_queries
>>> from pandagg.query import Nested, Bool, Query, Range, Term, Terms
>>> q = Query()\
>>> .query({'terms': {'genres': ['Action', 'Thriller']}})\
>>> .nested(path='roles', _name='nested_roles', query=Term('roles.gender', value='F'))\
>>> .query(Range('rank', gte=7))\
>>> .query(Term('roles.role', value='Reporter'), parent='nested_roles')
>>> equal_queries(q.query_dict(), expected_query)
True
Note
equal_queries function won’t consider order of clauses in must/should parameters since it actually doesn’t matter in Elasticsearch execution, ie
>>> equal_queries({'must': [A, B]}, {'must': [B, A]})
True
Regular syntax¶
Eventually, you can also use regular Elasticsearch dict syntax:
>>> q = Query(expected_query)
>>> q
<Query>
bool
└── must
├── nested
│ ├── path="roles"
│ └── query
│ └── bool
│ └── must
│ ├── term, field=roles.gender, value="F"
│ └── term, field=roles.role, value="Reporter"
├── range, field=rank, gte=7
└── terms, field=genres, values=['Action', 'Thriller']
Mapping¶
Here is a portion of IMDB dataset example mapping:
>>> imdb_mapping = {
>>> 'dynamic': False,
>>> 'properties': {
>>> 'movie_id': {'type': 'integer'},
>>> 'name': {
>>> 'type': 'text',
>>> 'fields': {
>>> 'raw': {'type': 'keyword'}
>>> }
>>> },
>>> 'year': {
>>> 'type': 'date',
>>> 'format': 'yyyy'
>>> },
>>> 'rank': {'type': 'float'},
>>> 'genres': {'type': 'keyword'},
>>> 'roles': {
>>> 'type': 'nested',
>>> 'properties': {
>>> 'role': {'type': 'keyword'},
>>> 'actor_id': {'type': 'integer'},
>>> 'gender': {'type': 'keyword'},
>>> 'first_name': {
>>> 'type': 'text',
>>> 'fields': {
>>> 'raw': {'type': 'keyword'}
>>> }
>>> },
>>> 'last_name': {
>>> 'type': 'text',
>>> 'fields': {
>>> 'raw': {'type': 'keyword'}
>>> }
>>> }
>>> }
>>> }
>>> }
>>> }
Mapping DSL¶
The Mapping
class provides a more compact view, which can help when dealing with large mappings:
>>> from pandagg.mapping import Mapping
>>> m = Mapping(imdb_mapping)
<Mapping>
{Object}
├── genres Keyword
├── movie_id Integer
├── name Text
│ └── raw ~ Keyword
├── rank Float
├── roles [Nested]
│ ├── actor_id Integer
│ ├── first_name Text
│ │ └── raw ~ Keyword
│ ├── gender Keyword
│ ├── last_name Text
│ │ └── raw ~ Keyword
│ └── role Keyword
└── year Date
With pandagg DSL, an equivalent declaration would be the following:
>>> from pandagg.mapping import Mapping, Object, Nested, Float, Keyword, Date, Integer, Text
>>>
>>> dsl_mapping = Mapping(properties=[
>>> Integer('movie_id'),
>>> Text('name', fields=[
>>> Keyword('raw')
>>> ]),
>>> Date('year', format='yyyy'),
>>> Float('rank'),
>>> Keyword('genres'),
>>> Nested('roles', properties=[
>>> Keyword('role'),
>>> Integer('actor_id'),
>>> Keyword('gender'),
>>> Text('first_name', fields=[
>>> Keyword('raw')
>>> ]),
>>> Text('last_name', fields=[
>>> Keyword('raw')
>>> ])
>>> ])
>>> ])
Which is exactly equivalent to initial mapping:
>>> dsl_mapping.serialize() == imdb_mapping
True
Interactive mapping¶
In interactive context, the IMapping
class provides navigation features with autocompletion to quickly discover a large
mapping:
>>> from pandagg.mapping import IMapping
>>> m = IMapping(imdb_mapping)
>>> m.roles
<IMapping subpart: roles>
roles [Nested]
├── actor_id Integer
├── first_name Text
│ └── raw ~ Keyword
├── gender Keyword
├── last_name Text
│ └── raw ~ Keyword
└── role Keyword
>>> m.roles.first_name
<IMapping subpart: roles.first_name>
first_name Text
└── raw ~ Keyword
To get the complete field definition, just call it:
>>> m.roles.first_name()
<Mapping Field first_name> of type text:
{
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
}
A IMapping instance can be bound to an Elasticsearch client to get quick access to aggregations computation on mapping fields.
Suppose you have the following client:
>>> from elasticsearch import Elasticsearch
>>> client = Elasticsearch(hosts=['localhost:9200'])
Client can be bound either at initiation:
>>> m = IMapping(imdb_mapping, client=client, index_name='movies')
or afterwards through bind method:
>>> m = IMapping(imdb_mapping)
>>> m.bind(client=client, index_name='movies')
Doing so will generate a a attribute on mapping fields, this attribute will list all available aggregation for that field type (with autocompletion):
>>> m.roles.gender.a.terms()
[('M', {'key': 'M', 'doc_count': 2296792}),
('F', {'key': 'F', 'doc_count': 1135174})]
Note
Nested clauses will be automatically taken into account.
Cluster indices discovery¶
TODO