User Guide
The pandagg library provides interfaces to perform read operations on an Elasticsearch cluster.
Note
Examples will be based on IMDB dataset data.
The Search class is intended to perform requests (see Search).
>>> from elasticsearch import Elasticsearch
>>> from pandagg.search import Search
>>>
>>> client = Elasticsearch(hosts=['localhost:9200'])
>>> search = Search(using=client, index='movies')\
...     .size(2)\
...     .groupby('decade', 'histogram', interval=10, field='year')\
...     .groupby('genres', size=3)\
...     .agg('avg_rank', 'avg', field='rank')\
...     .aggs('avg_nb_roles', 'avg', field='nb_roles')\
...     .filter('range', year={"gte": 1990})
>>> search
{
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "year": {
              "gte": 1990
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "decade": {
      "histogram": {
        "field": "year",
        "interval": 10
      },
      "aggs": {
        "genres": {
          "terms": {
          ...
          ..truncated..
          ...
          }
        }
      },
  "size": 2
}
It relies on Query to build queries and on Aggs to build aggregations (see Aggregation).
>>> search._query.show()
<Query>
bool
└── filter
    └── range, field=year, gte=1990
>>> search._aggs.show()
<Aggregations>
decade                  <histogram, field="year", interval=10>
└── genres              <terms, field="genres", size=3>
    ├── avg_nb_roles    <avg, field="nb_roles">
    └── avg_rank        <avg, field="rank">
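For readers more used to raw Elasticsearch requests, the query and aggregation trees above can be written out by hand as a plain request body. This is a minimal sketch using only the standard library: the nesting follows the trees printed by show(), but the exact dictionary that pandagg serializes (key order, any extra default parameters) is an assumption and may differ.

```python
import json

# Hand-written request body mirroring the query and aggregation trees.
# It is NOT produced by pandagg; the real serialization may differ in detail.
body = {
    "size": 2,
    "query": {"bool": {"filter": [{"range": {"year": {"gte": 1990}}}]}},
    "aggs": {
        "decade": {
            "histogram": {"field": "year", "interval": 10},
            "aggs": {
                "genres": {
                    "terms": {"field": "genres", "size": 3},
                    # Both metrics are siblings under the "genres" buckets.
                    "aggs": {
                        "avg_nb_roles": {"avg": {"field": "nb_roles"}},
                        "avg_rank": {"avg": {"field": "rank"}},
                    },
                }
            },
        }
    },
}

print(json.dumps(body, indent=2))
```

Each groupby call adds one level of bucket nesting, while the two avg metrics sit side by side under the innermost bucket aggregation, which is why they appear as sibling leaves in the tree.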
Executing a Search request using execute() will return a Response instance (see Response).
>>> response = search.execute()
>>> response
<Response> took 58ms, success: True, total result >=10000, contains 2 hits
>>> response.hits.hits
[<Hit 640> score=0.00, <Hit 641> score=0.00]
>>> response.aggregations.to_dataframe()
                    avg_nb_roles  avg_rank  doc_count
decade genres
1990.0 Drama           18.518067  5.981429      12232
       Short            3.023284  6.311326      12197
       Documentary      3.778982  6.517093       8393
2000.0 Short            4.053082  6.836253      13451
       Drama           14.385391  6.269675      11500
       Documentary      5.581433  6.980898       8639
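To illustrate how nested buckets become the rows above, here is a minimal sketch that flattens a raw aggregation response by hand, using only the standard library. The raw dictionary is a hand-written sample in the standard Elasticsearch bucket layout, trimmed to two rows and to the fields used here, and flatten_buckets is a hypothetical helper, not part of pandagg; to_dataframe() may work differently internally.

```python
def flatten_buckets(aggs, levels, metrics, prefix=()):
    """Yield one (keys, values) pair per innermost bucket.

    aggs    -- aggregations dict at the current nesting level
    levels  -- names of the nested bucket aggregations, outermost first
    metrics -- names of the metric aggregations on the innermost buckets
    """
    name, rest = levels[0], levels[1:]
    for bucket in aggs[name]["buckets"]:
        keys = prefix + (bucket["key"],)
        if rest:
            # Descend into the next bucket aggregation.
            yield from flatten_buckets(bucket, rest, metrics, keys)
        else:
            row = {m: bucket[m]["value"] for m in metrics}
            row["doc_count"] = bucket["doc_count"]
            yield keys, row


# Hand-written sample; values copied from the first two rows of the table.
raw = {
    "decade": {
        "buckets": [
            {
                "key": 1990.0,
                "genres": {
                    "buckets": [
                        {"key": "Drama", "doc_count": 12232,
                         "avg_nb_roles": {"value": 18.518067},
                         "avg_rank": {"value": 5.981429}},
                        {"key": "Short", "doc_count": 12197,
                         "avg_nb_roles": {"value": 3.023284},
                         "avg_rank": {"value": 6.311326}},
                    ]
                },
            }
        ]
    }
}

rows = list(flatten_buckets(raw, ["decade", "genres"], ["avg_nb_roles", "avg_rank"]))
for keys, row in rows:
    print(keys, row)
```

Each tuple of bucket keys plays the role of one entry in the (decade, genres) row index, and the metric values plus doc_count form the columns.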
On top of that, some interactive features are available (see Interactive features).