User Guide¶
Note
Examples are based on the IMDB dataset. This is a work in progress; some sections are still incomplete.
Query¶
The Query class provides:
- multiple syntaxes to declare and update a query
- query validation (with nested clauses validation)
- ability to insert clauses at specific points
- tree-like visual representation
Instantiation¶
From native “dict” query¶
Given the following query:
>>> expected_query = {'bool': {'must': [
>>> {'terms': {'genres': ['Action', 'Thriller']}},
>>> {'range': {'rank': {'gte': 7}}},
>>> {'nested': {
>>> 'path': 'roles',
>>> 'query': {'bool': {'must': [
>>> {'term': {'roles.gender': {'value': 'F'}}},
>>> {'term': {'roles.role': {'value': 'Reporter'}}}]}
>>> }
>>> }}
>>> ]}}
To instantiate Query, simply pass the “dict” query as argument:
>>> from pandagg.query import Query
>>> q = Query(expected_query)
A visual representation of the query is available with show():
>>> q.show()
<Query>
bool
└── must
├── nested, path="roles"
│ └── query
│ └── bool
│ └── must
│ ├── term, field=roles.gender, value="F"
│ └── term, field=roles.role, value="Reporter"
├── range, field=rank, gte=7
└── terms, genres=["Action", "Thriller"]
Call to_dict() to convert it to a native dict:
>>> q.to_dict()
{'bool': {
'must': [
{'range': {'rank': {'gte': 7}}},
{'terms': {'genres': ['Action', 'Thriller']}},
{'nested': {
'path': 'roles',
'query': {'bool': {'must': [
{'term': {'roles.role': {'value': 'Reporter'}}},
{'term': {'roles.gender': {'value': 'F'}}}]}}
}}
]
}}
>>> from pandagg.utils import equal_queries
>>> equal_queries(q.to_dict(), expected_query)
True
Note
The equal_queries function ignores the order of clauses in must/should parameters, since that order does not affect Elasticsearch execution, i.e.:
>>> equal_queries({'must': [A, B]}, {'must': [B, A]})
True
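As an illustration of what an order-insensitive comparison involves, here is a minimal sketch on plain dicts. The names `_canonical` and `same_query` are hypothetical and this is not how `equal_queries` is implemented; note also that the sketch ignores order in every list, which is broader than must/should only.

```python
import json


# Hypothetical helpers, not pandagg's implementation.
def _canonical(node):
    """Return a copy of a query fragment in which every list is sorted."""
    if isinstance(node, dict):
        return {key: _canonical(value) for key, value in node.items()}
    if isinstance(node, list):
        items = [_canonical(item) for item in node]
        # Sort on a stable JSON rendering so that dict items can be ordered.
        return sorted(items, key=lambda item: json.dumps(item, sort_keys=True))
    return node


def same_query(left, right):
    """Compare two query dicts, ignoring the order of list items."""
    return _canonical(left) == _canonical(right)
```

Sorting on a JSON rendering is only one possible choice; it requires every value in the query to be JSON-serializable.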
With DSL classes¶
pandagg provides a DSL to declare the same query in a similar fashion:
>>> from pandagg.query import Nested, Bool, Range, Term, Terms
>>> q = Bool(must=[
>>> Terms(genres=['Action', 'Thriller']),
>>> Range(rank={"gte": 7}),
>>> Nested(
>>> path='roles',
>>> query=Bool(must=[
>>> Term(roles__gender='F'),
>>> Term(roles__role='Reporter')
>>> ])
>>> )
>>> ])
All these classes inherit from Query and thus provide the same interface.
>>> from pandagg.query import Query
>>> isinstance(q, Query)
True
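Judging from the two equivalent declarations above, the double underscore in `roles__gender` appears to stand for the dot in `roles.gender` (Python keyword arguments cannot contain dots), and a bare value is expanded to `{'value': ...}`. The sketch below reproduces that mapping with a hypothetical `term_clause` helper; it is an assumption about the convention, not pandagg's code.

```python
# Hypothetical helper illustrating the keyword convention; not pandagg's code.
def term_clause(**kwargs):
    """Build a term clause dict from a single `field__subfield=value` keyword."""
    if len(kwargs) != 1:
        raise ValueError("expected exactly one field=value keyword")
    (key, value), = kwargs.items()
    field = key.replace("__", ".")  # roles__gender -> roles.gender
    return {"term": {field: {"value": value}}}
```

For example, `term_clause(roles__gender='F')` yields the same dict as the `roles.gender` term clause of the native query.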
With single clause as flattened syntax¶
In the flattened syntax, the query clause type is used as first argument:
>>> from pandagg.query import Query
>>> q = Query('terms', genres=['Action', 'Thriller'])
Query enrichment¶
All methods described below return a new Query instance and leave the initial query unchanged.
For instance:
>>> from pandagg.query import Query
>>> initial_q = Query()
>>> enriched_q = initial_q.query('terms', genres=['Comedy', 'Short'])
>>> initial_q.to_dict()
None
>>> enriched_q.to_dict()
{'terms': {'genres': ['Comedy', 'Short']}}
Note
Calling to_dict() on an empty Query returns None:
>>> from pandagg.query import Query
>>> Query().to_dict()
None
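The copy-on-write behaviour described in this section can be pictured with a small stand-in class. `MiniQuery` below is hypothetical and far simpler than pandagg's Query; it only shows the pattern of returning a new object instead of mutating the receiver.

```python
import copy


# Hypothetical stand-in class; pandagg's Query is implemented differently.
class MiniQuery:
    def __init__(self, body=None):
        self._body = body

    def query(self, clause_type, **params):
        """Return a new MiniQuery holding the added clause; self is untouched."""
        clause = {clause_type: params}
        if self._body is None:
            new_body = clause
        else:
            # Naive combination: wrap the previous body and the new clause.
            new_body = {"bool": {"must": [copy.deepcopy(self._body), clause]}}
        return MiniQuery(new_body)

    def to_dict(self):
        # Hand out a copy so callers cannot mutate the internal state.
        return copy.deepcopy(self._body)
```

A practical consequence of this design is that a shared base query can be safely reused to derive several variants.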
query() method¶
The base method to enrich a Query is query().
Considering this query:
>>> from pandagg.query import Query
>>> q = Query()
query() accepts the following syntaxes:
from dictionary:
>>> q.query({"terms": {"genres": ['Comedy', 'Short']}})
flattened syntax:
>>> q.query("terms", genres=['Comedy', 'Short'])
from Query instance (this includes DSL classes):
>>> from pandagg.query import Terms
>>> q.query(Terms(genres=['Action', 'Thriller']))
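To make the three call styles concrete, the following sketch normalises each of them to a plain dict clause. The function `to_clause` is hypothetical and only assumes that query objects expose `to_dict()`, as shown earlier; pandagg's actual dispatch logic may differ.

```python
# Hypothetical normaliser, not pandagg's code.
def to_clause(*args, **kwargs):
    """Turn any of the three call styles into a plain dict clause."""
    if len(args) != 1:
        raise TypeError("expected exactly one positional argument")
    first = args[0]
    if isinstance(first, dict) and not kwargs:
        return first                    # dict syntax
    if isinstance(first, str):
        return {first: kwargs}          # flattened syntax
    if hasattr(first, "to_dict") and not kwargs:
        return first.to_dict()          # object exposing to_dict()
    raise TypeError("unsupported clause syntax")
```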
Compound clauses specific methods¶
A Query instance also exposes the following methods for specific compound queries:
(TODO: detail allowed syntaxes)
Specific to bool queries:
Specific to other compound queries:
Inserted clause location¶
For all insertion methods detailed above, the inserted clause is placed by default at the top level of the query, and a bool clause is generated if necessary.
Considering the following query:
>>> from pandagg.query import Query
>>> q = Query('terms', genres=['Action', 'Thriller'])
>>> q.show()
<Query>
terms, genres=["Action", "Thriller"]
A bool query will be created:
>>> q = q.query('range', rank={"gte": 7})
>>> q.show()
<Query>
bool
└── must
├── range, field=rank, gte=7
└── terms, genres=["Action", "Thriller"]
The existing bool query is then reused by later insertions:
>>> q = q.must_not('range', year={"lte": 1970})
>>> q.show()
<Query>
bool
├── must
│ ├── range, field=rank, gte=7
│ └── terms, genres=["Action", "Thriller"]
└── must_not
└── range, field=year, lte=1970
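The "create a bool if needed, otherwise reuse it" behaviour can be sketched on plain dicts as follows. `add_bool_clause` is a hypothetical helper: pandagg works on a tree rather than on dicts, and the sketch assumes that each bool occurrence (must, must_not, ...) holds a list.

```python
import copy


# Hypothetical helper on plain dicts; not pandagg's tree implementation.
def add_bool_clause(query, occurrence, clause):
    """Return a copy of `query` with `clause` added under bool.<occurrence>."""
    if query is None:
        query = {"bool": {}}
    else:
        query = copy.deepcopy(query)
        if set(query) != {"bool"}:
            # Top-level clause is not a bool: wrap it in bool.must first.
            query = {"bool": {"must": [query]}}
    # Assumes the occurrence value is a list of clauses.
    query["bool"].setdefault(occurrence, []).append(clause)
    return query
```

Applied twice, first with `'must'` and then with `'must_not'`, this mirrors the two steps shown above: the first call creates the bool, the second one reuses it.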
Inserting a clause at a specific location requires naming queries:
>>> from pandagg.query import Term
>>> q = q.nested(path='roles', _name='nested_roles', query=Term('roles.gender', value='F'))
>>> q.show()
<Query>
bool
├── must
│ ├── nested, _name=nested_roles, path="roles"
│ │ └── query
│ │ └── term, field=roles.gender, value="F"
│ ├── range, field=rank, gte=7
│ └── terms, genres=["Action", "Thriller"]
└── must_not
└── range, field=year, lte=1970
Naming a clause makes it possible to insert other clauses above or below it, using the parent/child parameters:
>>> q = q.query('term', roles__role='Reporter', parent='nested_roles')
>>> q.show()
<Query>
bool
├── must
│ ├── nested, _name=nested_roles, path="roles"
│ │ └── query
│ │ └── bool
│ │ └── must
│ │ ├── term, field=roles.role, value="Reporter"
│ │ └── term, field=roles.gender, value="F"
│ ├── range, field=rank, gte=7
│ └── terms, genres=["Action", "Thriller"]
└── must_not
└── range, field=year, lte=1970
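A rough equivalent of inserting below a named clause can be written on plain dicts. The helpers `find_named` and `insert_under` are hypothetical; they assume the named clause carries its child under a `query` key (as nested does) and they do not reflect pandagg's internal tree logic.

```python
import copy


# Hypothetical helpers on plain dicts; not pandagg's implementation.
def find_named(node, name):
    """Return the body of the first clause whose `_name` equals `name`."""
    if isinstance(node, dict):
        for body in node.values():
            if isinstance(body, dict) and body.get("_name") == name:
                return body
            found = find_named(body, name)
            if found is not None:
                return found
    elif isinstance(node, list):
        for item in node:
            found = find_named(item, name)
            if found is not None:
                return found
    return None


def insert_under(query, parent_name, clause):
    """Return a copy of `query` with `clause` added below the named clause."""
    query = copy.deepcopy(query)
    parent = find_named(query, parent_name)
    if parent is None:
        raise KeyError(parent_name)
    child = parent["query"]  # assumption: the parent holds a `query` child
    if set(child) == {"bool"}:
        child["bool"].setdefault("must", []).append(clause)
    else:
        # Wrap the existing child and the new clause in a bool.must.
        parent["query"] = {"bool": {"must": [child, clause]}}
    return query
```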
TODO: explain parent_param, child_param, mode merging strategies on same named clause etc..
Aggregation¶
The Aggs class provides:
- multiple syntaxes to declare and update an aggregation
- clause validation (with nested clauses validation)
- ability to insert clauses at specific points
Aggregation declaration¶
Aggregation response¶
TODO
Search¶
TODO
Mapping¶
Interactive mapping¶
In an interactive context, the IMapping class provides navigation features with autocompletion to quickly explore a large mapping:
>>> from pandagg.mapping import IMapping
>>> from examples.imdb.load import mapping as imdb_mapping
>>> m = IMapping(imdb_mapping)
>>> m.roles
<IMapping subpart: roles>
roles [Nested]
├── actor_id Integer
├── first_name Text
│ └── raw ~ Keyword
├── gender Keyword
├── last_name Text
│ └── raw ~ Keyword
└── role Keyword
>>> m.roles.first_name
<IMapping subpart: roles.first_name>
first_name Text
└── raw ~ Keyword
To get the complete field definition, just call it:
>>> m.roles.first_name()
<Mapping Field first_name> of type text:
{
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
}
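For readers without an interactive session, the same kind of lookup can be approximated on a raw Elasticsearch mapping dict. The helper `get_field` and the trimmed `imdb_like_mapping` below are hypothetical and only illustrate how sub-fields sit under `properties` and `fields`; they are not part of pandagg.

```python
# Hypothetical helper: resolve a dotted path in a plain mapping dict.
def get_field(mapping, path):
    node = mapping
    for part in path.split("."):
        # Children live under "properties" (object/nested fields)
        # or under "fields" (multi-fields such as `raw`).
        children = {**node.get("fields", {}), **node.get("properties", {})}
        if part not in children:
            raise KeyError(path)
        node = children[part]
    return node


# Trimmed, illustrative mapping (an assumption, not the full IMDB mapping).
imdb_like_mapping = {
    "properties": {
        "roles": {
            "type": "nested",
            "properties": {
                "first_name": {
                    "type": "text",
                    "fields": {"raw": {"type": "keyword"}},
                },
                "gender": {"type": "keyword"},
            },
        }
    }
}
```

With this mapping, `get_field(imdb_like_mapping, "roles.first_name")` returns the text field definition including its `raw` keyword sub-field.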
An IMapping instance can be bound to an Elasticsearch client to get quick access to aggregation computations on mapping fields.
Suppose you have the following client:
>>> from elasticsearch import Elasticsearch
>>> client = Elasticsearch(hosts=['localhost:9200'])
Client can be bound at instantiation:
>>> m = IMapping(imdb_mapping, client=client, index_name='movies')
Doing so adds an attribute named a to mapping fields; this attribute lists all aggregations available for that field type (with autocompletion):
>>> m.roles.gender.a.terms()
[('M', {'key': 'M', 'doc_count': 2296792}),
('F', {'key': 'F', 'doc_count': 1135174})]
Note
Nested clauses will be automatically taken into account.
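In Elasticsearch, an aggregation on a field that lives under a nested path has to be wrapped in a nested aggregation. The sketch below builds such a request fragment by hand; `terms_agg` and the aggregation names `terms_agg` / `nested_agg` are hypothetical, and the body pandagg actually sends may be shaped differently.

```python
# Hypothetical helper, not pandagg's code.
def terms_agg(field, nested_path=None, size=10):
    """Build a terms aggregation, wrapped in a nested aggregation if needed."""
    agg = {"terms_agg": {"terms": {"field": field, "size": size}}}
    if nested_path is None:
        return agg
    # Fields under a nested path must be aggregated inside a nested aggregation.
    return {"nested_agg": {"nested": {"path": nested_path}, "aggs": agg}}


# Request body that could be passed to a search call (size 0: no hits needed).
body = {"size": 0, "aggs": terms_agg("roles.gender", nested_path="roles")}
```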
Cluster indices discovery¶
TODO