pandagg

Principles

This library focuses on two principles:

  • stick to the tree structure of Elasticsearch objects
  • provide simple and flexible interfaces to make it easy and intuitive to use in an interactive usage

Elasticsearch tree structures

Many Elasticsearch objects have a tree structure, i.e. they are built from a hierarchy of nodes:

  • a mappings (tree) is a hierarchy of fields (nodes)
  • a query (tree) is a hierarchy of query clauses (nodes)
  • an aggregation (tree) is a hierarchy of aggregation clauses (nodes)
  • an aggregation response (tree) is a hierarchy of response buckets (nodes)
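
To illustrate this hierarchy without pandagg installed, here is a minimal sketch (plain dicts, standard library only) that walks the clause hierarchy of a raw query, yielding (depth, clause_name) pairs. The COMPOUND set is a simplification introduced for this sketch: only these clause/parameter names are recursed into.

```python
# Minimal sketch: treat clause names in a raw query dict as tree nodes.
COMPOUND = {'bool', 'must', 'should', 'must_not', 'filter', 'nested', 'query'}

def walk_clauses(clause, depth=0):
    if isinstance(clause, list):
        for item in clause:
            yield from walk_clauses(item, depth)
    elif isinstance(clause, dict):
        for name, body in clause.items():
            yield depth, name
            if name in COMPOUND:
                # only compound clauses contain sub-clauses
                yield from walk_clauses(body, depth + 1)

query = {'bool': {'must': [
    {'terms': {'genres': ['Action', 'Thriller']}},
    {'range': {'rank': {'gte': 7}}},
]}}

for depth, name in walk_clauses(query):
    print('  ' * depth + name)
# bool
#   must
#     terms
#     range
```

pandagg's tree classes expose this same hierarchy with insertion, validation and display methods on top.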

This library sticks to that structure by providing a flexible syntax distinguishing trees and nodes: trees all inherit from the lighttree.Tree class, whereas nodes all inherit from the lighttree.Node class.

Interactive usage

pandagg is designed both for “regular” usage in code repositories, and for “interactive” usage (IPython or Jupyter notebooks, with autocompletion features inspired by pandas' design).

Some classes are intended to be used only in interactive mode (IPython), since their sole purpose is to serve auto-completion features and convenient representations.

Namely:

  • IMapping: used to interactively navigate a mapping and run quick aggregations on some fields
  • IResponse: used to interactively navigate an aggregation response

These use cases are detailed in the following sections.

User Guide

The pandagg library provides interfaces to perform read operations on an Elasticsearch cluster.

Query

The Query class provides:

  • multiple syntaxes to declare and update a query
  • query validation (with nested clauses validation)
  • the ability to insert clauses at specific points
  • a tree-like visual representation

Declaration

From native “dict” query

Given the following query:

>>> expected_query = {'bool': {'must': [
>>>    {'terms': {'genres': ['Action', 'Thriller']}},
>>>    {'range': {'rank': {'gte': 7}}},
>>>    {'nested': {
>>>        'path': 'roles',
>>>        'query': {'bool': {'must': [
>>>            {'term': {'roles.gender': {'value': 'F'}}},
>>>            {'term': {'roles.role': {'value': 'Reporter'}}}]}
>>>         }
>>>    }}
>>> ]}}

To instantiate a Query, simply pass the “dict” query as argument:

>>> from pandagg.query import Query
>>> q = Query(expected_query)

A visual representation of the query is available with show():

>>> q.show()
<Query>
bool
└── must
    ├── nested, path="roles"
    │   └── query
    │       └── bool
    │           └── must
    │               ├── term, field=roles.gender, value="F"
    │               └── term, field=roles.role, value="Reporter"
    ├── range, field=rank, gte=7
    └── terms, genres=["Action", "Thriller"]

Call to_dict() to convert it to native dict:

>>> q.to_dict()
{'bool': {
    'must': [
        {'range': {'rank': {'gte': 7}}},
        {'terms': {'genres': ['Action', 'Thriller']}},
        {'nested': {
            'path': 'roles',
            'query': {'bool': {'must': [
                {'term': {'roles.role': {'value': 'Reporter'}}},
                {'term': {'roles.gender': {'value': 'F'}}}]}}
        }}
    ]
}}
>>> from pandagg.utils import equal_queries
>>> equal_queries(q.to_dict(), expected_query)
True

Note

The equal_queries function won’t consider the order of clauses in must/should parameters, since it actually doesn’t matter in Elasticsearch execution, i.e.

>>> equal_queries({'must': [A, B]}, {'must': [B, A]})
True
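
Although pandagg ships its own implementation, the idea can be sketched with the standard library alone: canonicalize order-insensitive clause lists by sorting them on a deterministic JSON serialization before comparing. The function and set names below are an illustrative reimplementation, not pandagg's actual code.

```python
import json

# Parameters of a bool query whose clause order does not matter.
ORDER_INSENSITIVE = {'must', 'must_not', 'should', 'filter'}

def _canonical(node):
    """Recursively sort order-insensitive clause lists into a canonical form."""
    if isinstance(node, dict):
        out = {}
        for key, value in node.items():
            value = _canonical(value)
            if key in ORDER_INSENSITIVE and isinstance(value, list):
                value = sorted(value, key=lambda v: json.dumps(v, sort_keys=True))
            out[key] = value
        return out
    if isinstance(node, list):
        return [_canonical(v) for v in node]
    return node

def queries_equal(a, b):
    return _canonical(a) == _canonical(b)

A, B = {'term': {'f': 1}}, {'term': {'g': 2}}
print(queries_equal({'must': [A, B]}, {'must': [B, A]}))  # True
```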
With DSL classes

Pandagg provides a DSL to declare this query in a quite similar fashion:

>>> from pandagg.query import Nested, Bool, Range, Term, Terms
>>> q = Bool(must=[
>>>     Terms(genres=['Action', 'Thriller']),
>>>     Range(rank={"gte": 7}),
>>>     Nested(
>>>         path='roles',
>>>         query=Bool(must=[
>>>             Term(roles__gender='F'),
>>>             Term(roles__role='Reporter')
>>>         ])
>>>     )
>>> ])

All these classes inherit from Query and thus provide the same interface.

>>> from pandagg.query import Query
>>> isinstance(q, Query)
True
With flattened syntax

In the flattened syntax, the query clause type is used as first argument:

>>> from pandagg.query import Query
>>> q = Query('terms', genres=['Action', 'Thriller'])

Query enrichment

All methods described below return a new Query instance and leave the initial query unchanged.

For instance:

>>> from pandagg.query import Query
>>> initial_q = Query()
>>> enriched_q = initial_q.query('terms', genres=['Comedy', 'Short'])
>>> initial_q.to_dict()
None
>>> enriched_q.to_dict()
{'terms': {'genres': ['Comedy', 'Short']}}

Note

Calling to_dict() on an empty Query returns None

>>> from pandagg.query import Query
>>> Query().to_dict()
None
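
The copy-on-enrich pattern itself is simple to sketch in plain Python (illustrative only, not pandagg internals): each enrichment deep-copies state into a new object, so the original is never mutated.

```python
import copy

class SketchQuery:
    """Illustrative stand-in for the copy-on-enrich behavior."""
    def __init__(self, body=None):
        self._body = body

    def query(self, clause):
        # never mutate self: build and return a fresh instance
        return SketchQuery(copy.deepcopy(clause))

    def to_dict(self):
        return copy.deepcopy(self._body)

initial = SketchQuery()
enriched = initial.query({'terms': {'genres': ['Comedy', 'Short']}})
print(initial.to_dict())   # None
print(enriched.to_dict())  # {'terms': {'genres': ['Comedy', 'Short']}}
```

This is what makes chained calls safe: every intermediate query remains usable afterwards.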
query() method

The base method to enrich a Query is query().

Considering this query:

>>> from pandagg.query import Query
>>> q = Query()

query() accepts the following syntaxes:

from a dictionary:

>>> q.query({"terms": {"genres": ['Comedy', 'Short']}})

flattened syntax:

>>> q.query("terms", genres=['Comedy', 'Short'])

from Query instance (this includes DSL classes):

>>> from pandagg.query import Terms
>>> q.query(Terms(genres=['Action', 'Thriller']))
Compound clauses specific methods

A Query instance also exposes the following methods for specific compound queries:

(TODO: detail allowed syntaxes)

Specific to bool queries:

  • bool()
  • filter()
  • must()
  • must_not()
  • should()

Specific to other compound queries:

  • nested()
  • constant_score()
  • dis_max()
  • function_score()
  • has_child()
  • has_parent()
  • parent_id()
  • pinned_query()
  • script_score()
  • boosting()
Inserted clause location

With all insertion methods detailed above, the inserted clause is by default placed at the top level of the query, and a bool clause is generated if necessary.

Considering the following query:

>>> from pandagg.query import Query
>>> q = Query('terms', genres=['Action', 'Thriller'])
>>> q.show()
<Query>
terms, genres=["Action", "Thriller"]

A bool query will be created:

>>> q = q.query('range', rank={"gte": 7})
>>> q.show()
<Query>
bool
└── must
    ├── range, field=rank, gte=7
    └── terms, genres=["Action", "Thriller"]

And reused if necessary:

>>> q = q.must_not('range', year={"lte": 1970})
>>> q.show()
<Query>
bool
├── must
│   ├── range, field=rank, gte=7
│   └── terms, genres=["Action", "Thriller"]
└── must_not
    └── range, field=year, lte=1970

Inserting a clause at a specific location requires naming clauses:

>>> from pandagg.query import Term
>>> q = q.nested(path='roles', _name='nested_roles', query=Term('roles.gender', value='F'))
>>> q.show()
<Query>
bool
├── must
│   ├── nested, _name=nested_roles, path="roles"
│   │   └── query
│   │       └── term, field=roles.gender, value="F"
│   ├── range, field=rank, gte=7
│   └── terms, genres=["Action", "Thriller"]
└── must_not
    └── range, field=year, lte=1970

Doing so allows inserting clauses above/below a given clause using the parent/child parameters:

>>> q = q.query('term', roles__role='Reporter', parent='nested_roles')
>>> q.show()
<Query>
bool
├── must
│   ├── nested, _name=nested_roles, path="roles"
│   │   └── query
│   │       └── bool
│   │           └── must
│   │               ├── term, field=roles.role, value="Reporter"
│   │               └── term, field=roles.gender, value="F"
│   ├── range, field=rank, gte=7
│   └── terms, genres=["Action", "Thriller"]
└── must_not
    └── range, field=year, lte=1970

TODO: explain parent_param, child_param, mode merging strategies on same named clause etc..

Aggregation

The Aggs class provides:

  • multiple syntaxes to declare and update an aggregation
  • aggregation clause validation
  • the ability to insert clauses at specific locations (and not just below the last manipulated clause)

Declaration

From native “dict” query

Given the following aggregation:

>>> expected_aggs = {
>>>   "decade": {
>>>     "histogram": {"field": "year", "interval": 10},
>>>     "aggs": {
>>>       "genres": {
>>>         "terms": {"field": "genres", "size": 3},
>>>         "aggs": {
>>>           "max_nb_roles": {
>>>             "max": {"field": "nb_roles"}
>>>           },
>>>           "avg_rank": {
>>>             "avg": {"field": "rank"}
>>>           }
>>>         }
>>>       }
>>>     }
>>>   }
>>> }

To declare an Aggs instance, simply pass the “dict” aggregation as argument:

>>> from pandagg.agg import Aggs
>>> a = Aggs(expected_aggs)

A visual representation of the aggregation is available with show():

>>> a.show()
<Aggregations>
decade                                         <histogram, field="year", interval=10>
└── genres                                            <terms, field="genres", size=3>
    ├── max_nb_roles                                          <max, field="nb_roles">
    └── avg_rank                                                  <avg, field="rank">

Call to_dict() to convert it to native dict:

>>> a.to_dict() == expected_aggs
True
With DSL classes

pandagg provides a DSL to declare this aggregation in a quite similar fashion:

>>> from pandagg.agg import Histogram, Terms, Max, Avg
>>>
>>> a = Histogram("decade", field='year', interval=10, aggs=[
>>>     Terms("genres", field="genres", size=3, aggs=[
>>>         Max("max_nb_roles", field="nb_roles"),
>>>         Avg("avg_rank", field="rank")
>>>     ]),
>>> ])

All these classes inherit from Aggs and thus provide the same interface.

>>> from pandagg.agg import Aggs
>>> isinstance(a, Aggs)
True
With flattened syntax

In the flattened syntax, the first argument is the aggregation name, the second argument is the aggregation type, the following keyword arguments define the aggregation body:

>>> from pandagg.agg import Aggs
>>> a = Aggs('genres', 'terms', field='genres', size=3)
>>> a.to_dict()
{'genres': {'terms': {'field': 'genres', 'size': 3}}}

Aggregations enrichment

Aggregations can be enriched using two methods:

  • aggs()
  • groupby()

Both methods return a new Aggs instance and leave the initial aggregation unchanged.

For instance:

>>> from pandagg.agg import Aggs
>>> initial_a = Aggs()
>>> enriched_a = initial_a.aggs('genres_agg', 'terms', field='genres')
>>> initial_a.to_dict()
None
>>> enriched_a.to_dict()
{'genres_agg': {'terms': {'field': 'genres'}}}

Note

Calling to_dict() on an empty Aggregation returns None

>>> from pandagg.agg import Aggs
>>> Aggs().to_dict()
None

TODO

Response

When executing a search request via the execute() method of Search, a Response instance is returned.

>>> from elasticsearch import Elasticsearch
>>> from pandagg.search import Search
>>>
>>> client = Elasticsearch(hosts=['localhost:9200'])
>>> response = Search(using=client, index='movies')\
>>>     .size(2)\
>>>     .filter('term', genres='Documentary')\
>>>     .aggs('avg_rank', 'avg', field='rank')\
>>>     .execute()
>>> response
<Response> took 9ms, success: True, total result >=10000, contains 2 hits
>>> response.__class__
pandagg.response.Response

The Elasticsearch raw dict response is available under the data attribute:

>>> response.data
{
    'took': 9, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
    'hits': {'total': {'value': 10000, 'relation': 'gte'},
    'max_score': 0.0,
    'hits': [{'_index': 'movies', ...}],
    'aggregations': {'avg_rank': {'value': 6.496829211219546}}
}

Hits

Hits are available under hits attribute:

>>> response.hits
<Hits> total: >10000, contains 2 hits
>>> response.hits.total
{'value': 10000, 'relation': 'gte'}
>>> response.hits.hits
[<Hit 642> score=0.00, <Hit 643> score=0.00]

Those hits are instances of Hit.

Directly iterating over Response will return those hits:

>>> list(response)
[<Hit 642> score=0.00, <Hit 643> score=0.00]
>>> hit = next(iter(response))

Each hit contains the raw dict under data attribute:

>>> hit.data
{'_index': 'movies',
 '_type': '_doc',
 '_id': '642',
 '_score': 0.0,
 '_source': {'movie_id': 642,
  'name': '10 Tage in Calcutta',
  'year': 1984,
  'genres': ['Documentary'],
  'roles': None,
  'nb_roles': 0,
  'directors': [{'director_id': 33096,
    'first_name': 'Reinhard',
    'last_name': 'Hauff',
    'full_name': 'Reinhard Hauff',
    'genres': ['Documentary', 'Drama', 'Musical', 'Short']}],
  'nb_directors': 1,
  'rank': None}}
>>> hit._index
'movies'
>>> hit._source
{'movie_id': 642,
 'name': '10 Tage in Calcutta',
 'year': 1984,
 'genres': ['Documentary'],
 'roles': None,
 'nb_roles': 0,
 'directors': [{'director_id': 33096,
   'first_name': 'Reinhard',
   'last_name': 'Hauff',
   'full_name': 'Reinhard Hauff',
   'genres': ['Documentary', 'Drama', 'Musical', 'Short']}],
 'nb_directors': 1,
 'rank': None}

If the pandas dependency is installed, hits can be parsed as a dataframe:

>>> response.hits.to_dataframe()
     _index  _score _type                                                                                                                                                        directors         genres  movie_id                       name  nb_directors  nb_roles  rank roles  year
_id
642  movies     0.0  _doc  [{'director_id': 33096, 'first_name': 'Reinhard', 'last_name': 'Hauff', 'full_name': 'Reinhard Hauff', 'genres': ['Documentary', 'Drama', 'Musical', 'Short']}]  [Documentary]       642        10 Tage in Calcutta             1         0  None  None  1984
643  movies     0.0  _doc                               [{'director_id': 32148, 'first_name': 'Tanja', 'last_name': 'Hamilton', 'full_name': 'Tanja Hamilton', 'genres': ['Documentary']}]  [Documentary]       643  10 Tage, ein ganzes Leben             1         0  None  None  2004

Aggregations

Aggregations are handled differently: the aggregations attribute of a Response returns an Aggregations instance, which provides specific parsing abilities in addition to exposing the raw aggregations response under its data attribute.

Let’s build a bit more complex aggregation query to showcase its functionalities:

>>> from elasticsearch import Elasticsearch
>>> from pandagg.search import Search
>>>
>>> client = Elasticsearch(hosts=['localhost:9200'])
>>> response = Search(using=client, index='movies')\
>>>     .size(0)\
>>>     .groupby('decade', 'histogram', interval=10, field='year')\
>>>     .groupby('genres', 'terms', field='genres', size=3)\
>>>     .aggs('avg_rank', 'avg', field='rank')\
>>>     .aggs('avg_nb_roles', 'avg', field='nb_roles')\
>>>     .filter('range', year={"gte": 1990})\
>>>     .execute()

Note

For more details about how to build an aggregation query, consult the Aggregation section.

Using data attribute:

>>> response.aggregations.data
{'decade': {'buckets': [{'key': 1990.0,
'doc_count': 79495,
'genres': {'doc_count_error_upper_bound': 0,
 'sum_other_doc_count': 38060,
 'buckets': [{'key': 'Drama',
   'doc_count': 12232,
   'avg_nb_roles': {'value': 18.518067364290385},
   'avg_rank': {'value': 5.981429367965072}},
  {'key': 'Short',
...
Tree serialization

Using to_normalized():

>>> response.aggregations.to_normalized()
{'level': 'root',
 'key': None,
 'value': None,
 'children': [{'level': 'decade',
   'key': 1990.0,
   'value': 79495,
   'children': [{'level': 'genres',
     'key': 'Drama',
     'value': 12232,
     'children': [{'level': 'avg_rank',
       'key': None,
       'value': 5.981429367965072},
      {'level': 'avg_nb_roles', 'key': None, 'value': 18.518067364290385}]},
    {'level': 'genres',
     'key': 'Short',
     'value': 12197,
     'children': [{'level': 'avg_rank',
       'key': None,
       'value': 6.311325829450123},
    ...

Using to_interactive_tree():

>>> response.aggregations.to_interactive_tree()
<IResponse>
root
├── decade=1990                                        79495
│   ├── genres=Documentary                              8393
│   │   ├── avg_nb_roles                  3.7789824854045038
│   │   └── avg_rank                       6.517093241977517
│   ├── genres=Drama                                   12232
│   │   ├── avg_nb_roles                  18.518067364290385
│   │   └── avg_rank                       5.981429367965072
│   └── genres=Short                                   12197
│       ├── avg_nb_roles                   3.023284414200213
│       └── avg_rank                       6.311325829450123
└── decade=2000                                        57649
    ├── genres=Documentary                              8639
    │   ├── avg_nb_roles                   5.581433036231045
    │   └── avg_rank                       6.980897812811443
    ├── genres=Drama                                   11500
    │   ├── avg_nb_roles                  14.385391304347825
    │   └── avg_rank                       6.269675415719865
    └── genres=Short                                   13451
        ├── avg_nb_roles                   4.053081555274701
        └── avg_rank                        6.83625304327684
Tabular serialization

Tabular serialization requires identifying a level that will draw the line between:

  • grouping levels: those used to identify rows (here decade and genres), providing a doc_count per row
  • column levels: those used to populate columns and cells (here avg_nb_roles and avg_rank)

The tabular format is especially well suited to aggregations with a T shape.
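
The flattening can be sketched from the to_normalized() output shown above (standard library only; the 'level'/'key'/'value'/'children' format is assumed from that example, and this is an illustration, not pandagg's implementation). Grouping levels are given as an ordered list; everything deeper becomes columns.

```python
def to_tabular_sketch(node, grouping_levels):
    """Flatten a normalized aggregation tree into {row_index: row} dicts."""
    rows = {}

    def visit(n, index):
        if len(index) == len(grouping_levels):
            # deepest grouping level reached: fill the row's columns
            row = rows.setdefault(index, {'doc_count': n['value']})
            for child in n.get('children', []):
                row[child['level']] = child['value']
            return
        for child in n.get('children', []):
            if child['level'] == grouping_levels[len(index)]:
                visit(child, index + (child['key'],))

    visit(node, ())
    return rows

tree = {'level': 'root', 'key': None, 'value': None, 'children': [
    {'level': 'decade', 'key': 1990.0, 'value': 79495, 'children': [
        {'level': 'genres', 'key': 'Drama', 'value': 12232, 'children': [
            {'level': 'avg_rank', 'key': None, 'value': 5.98},
            {'level': 'avg_nb_roles', 'key': None, 'value': 18.52}]}]}]}

print(to_tabular_sketch(tree, ['decade', 'genres']))
# {(1990.0, 'Drama'): {'doc_count': 12232, 'avg_rank': 5.98, 'avg_nb_roles': 18.52}}
```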

Using to_dataframe():

>>> response.aggregations.to_dataframe()
                        avg_nb_roles  avg_rank  doc_count
decade genres
1990.0 Drama           18.518067  5.981429      12232
       Short            3.023284  6.311326      12197
       Documentary      3.778982  6.517093       8393
2000.0 Short            4.053082  6.836253      13451
       Drama           14.385391  6.269675      11500
       Documentary      5.581433  6.980898       8639

Using to_tabular():

>>> response.aggregations.to_tabular()
(['decade', 'genres'],
 {(1990.0, 'Drama'): {'doc_count': 12232,
   'avg_rank': 5.981429367965072,
   'avg_nb_roles': 18.518067364290385},
  (1990.0, 'Short'): {'doc_count': 12197,
   'avg_rank': 6.311325829450123,
   'avg_nb_roles': 3.023284414200213},
  (1990.0, 'Documentary'): {'doc_count': 8393,
   'avg_rank': 6.517093241977517,
   'avg_nb_roles': 3.7789824854045038},
  (2000.0, 'Short'): {'doc_count': 13451,
   'avg_rank': 6.83625304327684,
   'avg_nb_roles': 4.053081555274701},
  (2000.0, 'Drama'): {'doc_count': 11500,
   'avg_rank': 6.269675415719865,
   'avg_nb_roles': 14.385391304347825},
  (2000.0, 'Documentary'): {'doc_count': 8639,
   'avg_rank': 6.980897812811443,
   'avg_nb_roles': 5.581433036231045}})

Note

TODO - explain parameters:

  • index_orient
  • grouped_by
  • expand_columns
  • expand_sep
  • normalize
  • with_single_bucket_groups

Interactive features

Features described in this module are primarily designed for interactive usage, for instance in an IPython shell (https://ipython.org/), since one of the key features is the intuitive usage provided by auto-completion.

Cluster indices discovery

The discover() function lists all indices of a cluster matching a provided pattern:

>>> from elasticsearch import Elasticsearch
>>> from pandagg.discovery import discover
>>> client = Elasticsearch(hosts=['xxx'])
>>> indices = discover(client, index='mov*')
>>> indices
<Indices> ['movies', 'movies_fake']

Each of the indices is accessible via autocompletion:

>>> indices.movies
 <Index 'movies'>

An Index exposes its settings, mapping (interactive), aliases and name:

>>> movies = indices.movies
>>> movies.settings
{'index': {'creation_date': '1591824202943',
  'number_of_shards': '1',
  'number_of_replicas': '1',
  'uuid': 'v6Amj9x1Sk-trBShI-188A',
  'version': {'created': '7070199'},
  'provided_name': 'movies'}}
>>> movies.mapping
<Mapping>
_
├── directors                                                [Nested]
│   ├── director_id                                           Keyword
│   ├── first_name                                            Text
│   │   └── raw                                             ~ Keyword
│   ├── full_name                                             Text
│   │   └── raw                                             ~ Keyword
│   ├── genres                                                Keyword
│   └── last_name                                             Text
│       └── raw                                             ~ Keyword
├── genres                                                    Keyword
├── movie_id                                                  Keyword
├── name                                                      Text
│   └── raw                                                 ~ Keyword
├── nb_directors                                              Integer
├── nb_roles                                                  Integer
├── rank                                                      Float
├── roles                                                    [Nested]
│   ├── actor_id                                              Keyword
│   ├── first_name                                            Text
│   │   └── raw                                             ~ Keyword
│   ├── full_name                                             Text
│   │   └── raw                                             ~ Keyword
│   ├── gender                                                Keyword
│   ├── last_name                                             Text
│   │   └── raw                                             ~ Keyword
│   └── role                                                  Keyword
└── year                                                      Integer

Note

Examples are based on the IMDB dataset.

The Search class is intended to perform requests (see Search):

>>> from pandagg.search import Search
>>>
>>> client = Elasticsearch(hosts=['localhost:9200'])
>>> search = Search(using=client, index='movies')\
>>>     .size(2)\
>>>     .groupby('decade', 'histogram', interval=10, field='year')\
>>>     .groupby('genres', 'terms', field='genres', size=3)\
>>>     .aggs('avg_rank', 'avg', field='rank')\
>>>     .aggs('avg_nb_roles', 'avg', field='nb_roles')\
>>>     .filter('range', year={"gte": 1990})
>>> search
{
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "year": {
              "gte": 1990
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "decade": {
      "histogram": {
        "field": "year",
        "interval": 10
      },
      "aggs": {
        "genres": {
          "terms": {
        ...
        ..truncated..
        ...
      }
    }
  },
  "size": 2
}

It relies on:

  • Query to build queries (see Query),

  • Aggs to build aggregations (see Aggregation)

    >>> search._query.show()
    <Query>
    bool
    └── filter
        └── range, field=year, gte=1990
    
    >>> search._aggs.show()
    <Aggregations>
    decade                                         <histogram, field="year", interval=10>
    └── genres                                            <terms, field="genres", size=3>
        ├── avg_nb_roles                                          <avg, field="nb_roles">
        └── avg_rank                                                  <avg, field="rank">
    

Executing a Search request using execute() will return a Response instance (see Response).

>>> response = search.execute()
>>> response
<Response> took 58ms, success: True, total result >=10000, contains 2 hits
>>> response.hits.hits
[<Hit 640> score=0.00, <Hit 641> score=0.00]
>>> response.aggregations.to_dataframe()
                        avg_nb_roles  avg_rank  doc_count
decade genres
1990.0 Drama           18.518067  5.981429      12232
       Short            3.023284  6.311326      12197
       Documentary      3.778982  6.517093       8393
2000.0 Short            4.053082  6.836253      13451
       Drama           14.385391  6.269675      11500
       Documentary      5.581433  6.980898       8639

On top of that some interactive features are available (see Interactive features).

IMDB dataset

You might know the Internet Movie Database, commonly called IMDB.

Well, it's a simple example to showcase some of Elasticsearch's capabilities.

In this case, relational databases (SQL) are a good fit to store this kind of data consistently. Yet indexing some of this data in an optimized search engine allows more powerful queries.

Query requirements

In this example, we’ll suppose most usage/queries requirements will be around the concept of movie (rather than usages focused on fetching actors or directors, even though it will still be possible with this data structure).

The index should provide good performance answering these kinds of questions (non-exhaustive):

  • in which movies did a given actor play?
  • which movie genres were most popular across decades?
  • which actors have played in the best-rated or worst-rated movies?
  • which actors do movie directors prefer to cast in their movies?
  • which are the best-ranked movies of the last decade in the Action or Documentary genres?

Data source

I exported the following SQL tables from MariaDB following these instructions.

Relational schema is the following:

[figure _images/imdb_ijs.svg: imdb tables relational schema]

Index mappings

Overview

The base unit (document) will be a movie, having a name, rank (ratings), year of release, a list of actors and a list of directors.

Schematically:

Movie:
 - name
 - year
 - rank
 - [] genres
 - [] directors
 - [] actor roles

Which fields require nesting?

Since genres contains a single keyword field, there is no case where we need it stored as a nested field. On the contrary, actor roles and directors require a nested field if we consider applying multiple simultaneous query clauses on their sub-fields (for instance, searching for movies in which an actor is a woman AND whose role is nurse). More information on the distinction between array and nested fields here.
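
For instance, finding movies with a role that is both female AND a nurse requires the two term clauses to apply to the same role object, which is exactly what a nested query guarantees. A sketch of the raw query (plain dict; the 'Nurse' value is illustrative):

```python
# With roles mapped as "nested", both term clauses must match on the SAME
# role object; as a plain object/array field they could match across
# different roles of the same movie.
nested_query = {
    "nested": {
        "path": "roles",
        "query": {"bool": {"must": [
            {"term": {"roles.gender": {"value": "F"}}},
            {"term": {"roles.role": {"value": "Nurse"}}},
        ]}},
    }
}
```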

Text or keyword fields?

Some fields are easy to choose: in no situation will gender require full-text search, so we'll store it as a keyword. On the other hand, actor and director names (first and last) will require full-text search, so we'll opt for a text field. Yet we might still want to aggregate on exact keywords, for instance to count the number of movies per actor. More information on the distinction between text and keyword fields here.
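
Both needs can be served by a single multi-field: an analyzed text field for full-text search, with a keyword sub-field (the raw sub-fields shown in the Mappings section below) for exact-match aggregations. A sketch of such a field definition (plain dict):

```python
# Multi-field definition: analyzed text for search, exact keyword
# sub-field ("raw") for aggregations and sorting.
first_name_field = {
    "type": "text",
    "fields": {
        "raw": {"type": "keyword"},
    },
}
```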

Mappings

<Mappings>
_
├── directors                                                [Nested]
│   ├── director_id                                           Keyword
│   ├── first_name                                            Text
│   │   └── raw                                             ~ Keyword
│   ├── full_name                                             Text
│   │   └── raw                                             ~ Keyword
│   ├── genres                                                Keyword
│   └── last_name                                             Text
│       └── raw                                             ~ Keyword
├── genres                                                    Keyword
├── movie_id                                                  Keyword
├── name                                                      Text
│   └── raw                                                 ~ Keyword
├── nb_directors                                              Integer
├── nb_roles                                                  Integer
├── rank                                                      Float
├── roles                                                    [Nested]
│   ├── actor_id                                              Keyword
│   ├── first_name                                            Text
│   │   └── raw                                             ~ Keyword
│   ├── full_name                                             Text
│   │   └── raw                                             ~ Keyword
│   ├── gender                                                Keyword
│   ├── last_name                                             Text
│   │   └── raw                                             ~ Keyword
│   └── role                                                  Keyword
└── year                                                      Integer

Steps to start playing with your index

You can either directly use the demo index available here, with credentials user: pandagg, password: pandagg.

Access it with the following client instantiation:

from elasticsearch import Elasticsearch
client = Elasticsearch(
    hosts=['https://beba020ee88d49488d8f30c163472151.eu-west-2.aws.cloud.es.io:9243/'],
    http_auth=('pandagg', 'pandagg')
)

Or follow the steps below to install it yourself locally. In that case, you can either generate the files yourself, or download them from here (file md5 b363dee23720052501e24d15361ed605).

Dump tables

Follow the instructions at the bottom of the https://relational.fit.cvut.cz/dataset/IMDb page and dump the following tables in a directory:

  • movies.csv
  • movies_genres.csv
  • movies_directors.csv
  • directors.csv
  • directors_genres.csv
  • roles.csv
  • actors.csv

Clone pandagg and setup environment

git clone git@github.com:alkemics/pandagg.git
cd pandagg

virtualenv env
python setup.py develop
pip install pandas simplejson jupyter seaborn

Then copy the conf.py.dist file into conf.py and edit the variables as suits you, for instance:

# your cluster address
ES_HOST = 'localhost:9200'

# where your table dumps are stored, and where serialized output will be written
DATA_DIR = '/path/to/dumps/'
OUTPUT_FILE_NAME = 'serialized.json'

Serialize movie documents and insert them

# generate serialized movies documents, ready to be inserted in ES
# can take a while
python examples/imdb/serialize.py

# create index with mappings if necessary, bulk insert documents in ES
python examples/imdb/load.py

Explore pandagg notebooks

An example notebook is available to showcase some of pandagg functionalities: here it is.

The code is in the examples/imdb/IMDB exploration.py file.

pandagg package

Subpackages

pandagg.interactive package

Submodules
pandagg.interactive.mappings module
class pandagg.interactive.mappings.IMappings(mappings, client=None, index=None, depth=1, root_path=None, initial_tree=None)[source]

Bases: pandagg.utils.DSLMixin, lighttree.interactive.TreeBasedObj

Interactive wrapper upon mappings tree, allowing field navigation and quick access to single clause aggregations computation.

pandagg.interactive.response module
class pandagg.interactive.response.IResponse(tree, search, depth, root_path=None, initial_tree=None)[source]

Bases: lighttree.interactive.TreeBasedObj

Interactive aggregation response.

get_bucket_filter()[source]

Build filters to select documents belonging to that bucket, independently from initial search query clauses.

search()[source]
Module contents

pandagg.node package

Subpackages
pandagg.node.aggs package
Submodules
pandagg.node.aggs.abstract module
pandagg.node.aggs.abstract.A(name, type_or_agg=None, **body)[source]

Accepts multiple syntaxes, returns an AggNode instance.

Parameters:
  • type_or_agg
  • body
Returns:

AggNode

class pandagg.node.aggs.abstract.AggClause(meta=None, **body)[source]

Bases: pandagg.node._node.Node

Wrapper around elasticsearch aggregation concept. https://www.elastic.co/guide/en/elasticsearch/reference/2.3/search-aggregations.html

Each aggregation can be seen as a Node that can be encapsulated in a parent agg.

Defines a method to build the aggregation request.

BLACKLISTED_MAPPING_TYPES = None
KEY = None
VALUE_ATTRS = None
WHITELISTED_MAPPING_TYPES = None
classmethod extract_bucket_value(response, value_as_dict=False)[source]
extract_buckets(response_value)[source]
get_filter(key)[source]

Return filter query to list documents having this aggregation key.

Parameters:key – string
Returns:elasticsearch filter query
line_repr(depth, **kwargs)[source]

Controls how the node is displayed in tree representation:

_
├── one                                           end
│   └── two                                     myEnd
└── three

to_dict()[source]

ElasticSearch aggregation queries follow this formatting:

{
    "<aggregation_name>" : {
        "<aggregation_type>" : {
            <aggregation_body>
        }
        [,"meta" : {  [<meta_data_body>] } ]?
    }
}

to_dict() returns the following part (without aggregation name):

{
    "<aggregation_type>" : {
        <aggregation_body>
    }
    [,"meta" : {  [<meta_data_body>] } ]?
}
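As a pure-dict illustration of the split described above (the aggregation name, field, and meta values are hypothetical examples), a named aggregation in a request body relates to its to_dict() part as follows:

```python
# A named aggregation as it appears in a search request body
# (aggregation name and field are hypothetical examples):
named_agg = {
    "genres_agg": {                        # "<aggregation_name>"
        "terms": {"field": "genres"},      # "<aggregation_type>": <aggregation_body>
        "meta": {"owner": "docs"},         # optional "meta" part
    }
}

# to_dict() returns only the unnamed part:
unnamed_part = {"terms": {"field": "genres"}, "meta": {"owner": "docs"}}

assert named_agg["genres_agg"] == unnamed_part
```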
classmethod valid_on_field_type(field_type)[source]
class pandagg.node.aggs.abstract.BucketAggClause(meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.AggClause

Bucket aggregations have a special ability: they can encapsulate other aggregations as children. Each time, the extracted value is a ‘doc_count’.

Provide methods:
  • to build aggregation request (with children aggregations)
  • to extract buckets from raw response
  • to build query to filter documents belonging to that bucket

Note: the aggs attribute’s only purpose is children initiation, with the following syntax:

>>> from pandagg.aggs import Terms, Avg
>>> agg = Terms(
>>>     field='some_path',
>>>     aggs={
>>>         'avg_agg': Avg(field='some_other_path')
>>>     }
>>> )

VALUE_ATTRS = None
extract_buckets(response_value)[source]
get_filter(key)[source]

Provide filter to get documents belonging to the bucket of given key.

class pandagg.node.aggs.abstract.FieldOrScriptMetricAgg(field=None, script=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.MetricAgg

Metric aggregation based on single field.

VALUE_ATTRS = None
class pandagg.node.aggs.abstract.MetricAgg(meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.AggClause

Metric aggregations provide a single bucket, with value attributes to be extracted.

VALUE_ATTRS = None
extract_buckets(response_value)[source]
get_filter(key)[source]

Return filter query to list documents having this aggregation key.

Parameters: key – string
Returns: elasticsearch filter query
class pandagg.node.aggs.abstract.MultipleBucketAgg(keyed=None, key_path='key', meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.BucketAggClause

IMPLICIT_KEYED = False
VALUE_ATTRS = None
extract_buckets(response_value)[source]
get_filter(key)[source]

Provide filter to get documents belonging to the bucket of given key.

class pandagg.node.aggs.abstract.Pipeline(buckets_path, gap_policy=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.UniqueBucketAgg

VALUE_ATTRS = None
get_filter(key)[source]

Provide filter to get documents belonging to the bucket of given key.

class pandagg.node.aggs.abstract.Root(meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.AggClause

Not a real aggregation. Just the initial empty dict (used as lighttree.Tree root).

KEY = '_root'
classmethod extract_bucket_value(response, value_as_dict=False)[source]
extract_buckets(response_value)[source]
line_repr(depth, **kwargs)[source]

Control how node is displayed in tree representation.

_
├── one          end
│   └── two      myEnd
└── three

class pandagg.node.aggs.abstract.ScriptPipeline(script, buckets_path, gap_policy=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.Pipeline

KEY = None
VALUE_ATTRS = 'value'
class pandagg.node.aggs.abstract.UniqueBucketAgg(meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.BucketAggClause

Aggregations providing a single bucket.

VALUE_ATTRS = None
extract_buckets(response_value)[source]
get_filter(key)[source]

Provide filter to get documents belonging to the bucket of given key.

pandagg.node.aggs.bucket module

Not implemented aggregations include:
  • children agg
  • geo-distance
  • geo-hash grid
  • ipv4
  • sampler
  • significant terms

class pandagg.node.aggs.bucket.Composite(keyed=None, key_path='key', meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.MultipleBucketAgg

KEY = 'composite'
get_filter(key)[source]

Provide filter to get documents belonging to the bucket of given key.

class pandagg.node.aggs.bucket.DateHistogram(field, interval=None, calendar_interval=None, fixed_interval=None, meta=None, keyed=False, key_as_string=True, **body)[source]

Bases: pandagg.node.aggs.abstract.MultipleBucketAgg

KEY = 'date_histogram'
VALUE_ATTRS = ['doc_count']
WHITELISTED_MAPPING_TYPES = ['date']
get_filter(key)[source]

Provide filter to get documents belonging to the bucket of given key.

class pandagg.node.aggs.bucket.DateRange(field, key_as_string=True, meta=None, **body)[source]

Bases: pandagg.node.aggs.bucket.Range

KEY = 'date_range'
KEY_SEP = '::'
VALUE_ATTRS = ['doc_count']
WHITELISTED_MAPPING_TYPES = ['date']
class pandagg.node.aggs.bucket.Filter(filter=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.UniqueBucketAgg

KEY = 'filter'
VALUE_ATTRS = ['doc_count']
get_filter(key)[source]

Provide filter to get documents belonging to the bucket of given key.

class pandagg.node.aggs.bucket.Filters(filters, other_bucket=False, other_bucket_key=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.MultipleBucketAgg

DEFAULT_OTHER_KEY = '_other_'
IMPLICIT_KEYED = True
KEY = 'filters'
VALUE_ATTRS = ['doc_count']
get_filter(key)[source]

Provide filter to get documents belonging to the bucket of given key.
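To illustrate with the native form (bucket names and clauses below are hypothetical), a keyed filters aggregation body associates each bucket key with its filter clause, which is conceptually what get_filter(key) maps back to:

```python
# Body of a "filters" aggregation clause, keyed by bucket name
# (hypothetical bucket names and clauses):
filters_agg_body = {
    "filters": {
        "errors": {"term": {"status": "error"}},
        "warnings": {"term": {"status": "warning"}},
    }
}

# Conceptually, get_filter("errors") resolves to that bucket's clause:
errors_filter = filters_agg_body["filters"]["errors"]
assert errors_filter == {"term": {"status": "error"}}
```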

class pandagg.node.aggs.bucket.Global(meta=None)[source]

Bases: pandagg.node.aggs.abstract.UniqueBucketAgg

KEY = 'global'
VALUE_ATTRS = ['doc_count']
get_filter(key)[source]

Provide filter to get documents belonging to the bucket of given key.

class pandagg.node.aggs.bucket.Histogram(field, interval, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.MultipleBucketAgg

KEY = 'histogram'
VALUE_ATTRS = ['doc_count']
WHITELISTED_MAPPING_TYPES = ['long', 'integer', 'short', 'byte', 'double', 'float', 'half_float', 'scaled_float', 'ip', 'token_count', 'date', 'boolean']
get_filter(key)[source]

Provide filter to get documents belonging to the bucket of given key.

class pandagg.node.aggs.bucket.MatchAll(meta=None, **body)[source]

Bases: pandagg.node.aggs.bucket.Filter

class pandagg.node.aggs.bucket.Missing(field, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.UniqueBucketAgg

BLACKLISTED_MAPPING_TYPES = []
KEY = 'missing'
VALUE_ATTRS = ['doc_count']
get_filter(key)[source]

Provide filter to get documents belonging to the bucket of given key.

class pandagg.node.aggs.bucket.Nested(path, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.UniqueBucketAgg

KEY = 'nested'
VALUE_ATTRS = ['doc_count']
WHITELISTED_MAPPING_TYPES = ['nested']
get_filter(key)[source]

Provide filter to get documents belonging to the bucket of given key.

class pandagg.node.aggs.bucket.Range(field, ranges, keyed=False, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.MultipleBucketAgg

KEY = 'range'
KEY_SEP = '-'
VALUE_ATTRS = ['doc_count']
WHITELISTED_MAPPING_TYPES = ['long', 'integer', 'short', 'byte', 'double', 'float', 'half_float', 'scaled_float', 'ip', 'token_count', 'date', 'boolean']
from_key
get_filter(key)[source]

Provide filter to get documents belonging to the bucket of given key.

to_key
class pandagg.node.aggs.bucket.ReverseNested(path=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.UniqueBucketAgg

KEY = 'reverse_nested'
VALUE_ATTRS = ['doc_count']
WHITELISTED_MAPPING_TYPES = ['nested']
get_filter(key)[source]

Provide filter to get documents belonging to the bucket of given key.

class pandagg.node.aggs.bucket.Terms(field, missing=None, size=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.MultipleBucketAgg

Terms aggregation.

BLACKLISTED_MAPPING_TYPES = []
KEY = 'terms'
VALUE_ATTRS = ['doc_count', 'doc_count_error_upper_bound', 'sum_other_doc_count']
get_filter(key)[source]

Provide filter to get documents belonging to the bucket of given key.

pandagg.node.aggs.composite module
class pandagg.node.aggs.composite.Composite(sources, size=None, after_key=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.BucketAggClause

KEY = 'composite'
VALUE_ATTRS = ['doc_count']
extract_buckets(response_value)[source]
get_filter(key)[source]

In a composite aggregation, the key is a map: source name -> value.
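A pure-dict sketch of that shape (source names and values below are hypothetical): each composite bucket key maps source name to value, and a document filter can conceptually be derived by combining one clause per source:

```python
# A composite bucket key: source name -> value (hypothetical sources):
bucket_key = {"genre": "Action", "decade": 1990}

# Conceptually, get_filter(key) combines one clause per source:
per_source_clauses = [
    {"term": {source: value}} for source, value in bucket_key.items()
]
composite_filter = {"bool": {"must": per_source_clauses}}

assert {"term": {"genre": "Action"}} in per_source_clauses
```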

pandagg.node.aggs.metric module
class pandagg.node.aggs.metric.Avg(field=None, script=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.FieldOrScriptMetricAgg

KEY = 'avg'
VALUE_ATTRS = ['value']
WHITELISTED_MAPPING_TYPES = ['long', 'integer', 'short', 'byte', 'double', 'float', 'half_float', 'scaled_float', 'ip', 'token_count', 'date', 'boolean']
class pandagg.node.aggs.metric.Cardinality(field=None, script=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.FieldOrScriptMetricAgg

KEY = 'cardinality'
VALUE_ATTRS = ['value']
class pandagg.node.aggs.metric.ExtendedStats(field=None, script=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.FieldOrScriptMetricAgg

KEY = 'extended_stats'
VALUE_ATTRS = ['count', 'min', 'max', 'avg', 'sum', 'sum_of_squares', 'variance', 'std_deviation', 'std_deviation_bounds']
WHITELISTED_MAPPING_TYPES = ['long', 'integer', 'short', 'byte', 'double', 'float', 'half_float', 'scaled_float', 'ip', 'token_count', 'date', 'boolean']
class pandagg.node.aggs.metric.GeoBound(field=None, script=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.FieldOrScriptMetricAgg

KEY = 'geo_bounds'
VALUE_ATTRS = ['bounds']
WHITELISTED_MAPPING_TYPES = ['geo_point']
class pandagg.node.aggs.metric.GeoCentroid(field=None, script=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.FieldOrScriptMetricAgg

KEY = 'geo_centroid'
VALUE_ATTRS = ['location']
WHITELISTED_MAPPING_TYPES = ['geo_point']
class pandagg.node.aggs.metric.Max(field=None, script=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.FieldOrScriptMetricAgg

KEY = 'max'
VALUE_ATTRS = ['value']
WHITELISTED_MAPPING_TYPES = ['long', 'integer', 'short', 'byte', 'double', 'float', 'half_float', 'scaled_float', 'ip', 'token_count', 'date', 'boolean']
class pandagg.node.aggs.metric.Min(field=None, script=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.FieldOrScriptMetricAgg

KEY = 'min'
VALUE_ATTRS = ['value']
WHITELISTED_MAPPING_TYPES = ['long', 'integer', 'short', 'byte', 'double', 'float', 'half_float', 'scaled_float', 'ip', 'token_count', 'date', 'boolean']
class pandagg.node.aggs.metric.PercentileRanks(field, values, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.FieldOrScriptMetricAgg

KEY = 'percentile_ranks'
VALUE_ATTRS = ['values']
WHITELISTED_MAPPING_TYPES = ['long', 'integer', 'short', 'byte', 'double', 'float', 'half_float', 'scaled_float', 'ip', 'token_count', 'date', 'boolean']
class pandagg.node.aggs.metric.Percentiles(field=None, script=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.FieldOrScriptMetricAgg

The percents body argument can be passed to specify which percentiles to fetch.

KEY = 'percentiles'
VALUE_ATTRS = ['values']
WHITELISTED_MAPPING_TYPES = ['long', 'integer', 'short', 'byte', 'double', 'float', 'half_float', 'scaled_float', 'ip', 'token_count', 'date', 'boolean']
class pandagg.node.aggs.metric.Stats(field=None, script=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.FieldOrScriptMetricAgg

KEY = 'stats'
VALUE_ATTRS = ['count', 'min', 'max', 'avg', 'sum']
WHITELISTED_MAPPING_TYPES = ['long', 'integer', 'short', 'byte', 'double', 'float', 'half_float', 'scaled_float', 'ip', 'token_count', 'date', 'boolean']
class pandagg.node.aggs.metric.Sum(field=None, script=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.FieldOrScriptMetricAgg

KEY = 'sum'
VALUE_ATTRS = ['value']
WHITELISTED_MAPPING_TYPES = ['long', 'integer', 'short', 'byte', 'double', 'float', 'half_float', 'scaled_float', 'ip', 'token_count', 'date', 'boolean']
class pandagg.node.aggs.metric.TopHits(meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.MetricAgg

KEY = 'top_hits'
VALUE_ATTRS = ['hits']
class pandagg.node.aggs.metric.ValueCount(field=None, script=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.FieldOrScriptMetricAgg

BLACKLISTED_MAPPING_TYPES = []
KEY = 'value_count'
VALUE_ATTRS = ['value']
pandagg.node.aggs.pipeline module

Pipeline aggregations: https://www.elastic.co/guide/en/elasticsearch/reference/2.3/search-aggregations-pipeline.html

class pandagg.node.aggs.pipeline.AvgBucket(buckets_path, gap_policy=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.Pipeline

KEY = 'avg_bucket'
VALUE_ATTRS = ['value']
class pandagg.node.aggs.pipeline.BucketScript(script, buckets_path, gap_policy=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.ScriptPipeline

KEY = 'bucket_script'
VALUE_ATTRS = ['value']
class pandagg.node.aggs.pipeline.BucketSelector(script, buckets_path, gap_policy=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.ScriptPipeline

KEY = 'bucket_selector'
VALUE_ATTRS = None
class pandagg.node.aggs.pipeline.BucketSort(script, buckets_path, gap_policy=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.ScriptPipeline

KEY = 'bucket_sort'
VALUE_ATTRS = None
class pandagg.node.aggs.pipeline.CumulativeSum(buckets_path, gap_policy=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.Pipeline

KEY = 'cumulative_sum'
VALUE_ATTRS = ['value']
class pandagg.node.aggs.pipeline.Derivative(buckets_path, gap_policy=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.Pipeline

KEY = 'derivative'
VALUE_ATTRS = ['value']
class pandagg.node.aggs.pipeline.ExtendedStatsBucket(buckets_path, gap_policy=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.Pipeline

KEY = 'extended_stats_bucket'
VALUE_ATTRS = ['count', 'min', 'max', 'avg', 'sum', 'sum_of_squares', 'variance', 'std_deviation', 'std_deviation_bounds']
class pandagg.node.aggs.pipeline.MaxBucket(buckets_path, gap_policy=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.Pipeline

KEY = 'max_bucket'
VALUE_ATTRS = ['value']
class pandagg.node.aggs.pipeline.MinBucket(buckets_path, gap_policy=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.Pipeline

KEY = 'min_bucket'
VALUE_ATTRS = ['value']
class pandagg.node.aggs.pipeline.MovingAvg(buckets_path, gap_policy=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.Pipeline

KEY = 'moving_avg'
VALUE_ATTRS = ['value']
class pandagg.node.aggs.pipeline.PercentilesBucket(buckets_path, gap_policy=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.Pipeline

KEY = 'percentiles_bucket'
VALUE_ATTRS = ['values']
class pandagg.node.aggs.pipeline.SerialDiff(buckets_path, gap_policy=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.Pipeline

KEY = 'serial_diff'
VALUE_ATTRS = ['value']
class pandagg.node.aggs.pipeline.StatsBucket(buckets_path, gap_policy=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.Pipeline

KEY = 'stats_bucket'
VALUE_ATTRS = ['count', 'min', 'max', 'avg', 'sum']
class pandagg.node.aggs.pipeline.SumBucket(buckets_path, gap_policy=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.Pipeline

KEY = 'sum_bucket'
VALUE_ATTRS = ['value']
Module contents
pandagg.node.mappings package
Submodules
pandagg.node.mappings.abstract module
class pandagg.node.mappings.abstract.ComplexField(**body)[source]

Bases: pandagg.node.mappings.abstract.Field

KEY = None
is_valid_value(v)[source]
class pandagg.node.mappings.abstract.Field(multiple=None, nullable=True, **body)[source]

Bases: pandagg.node._node.Node

KEY = None
body
is_valid_value(v)[source]
line_repr(depth, **kwargs)[source]

Control how node is displayed in tree representation.

_
├── one          end
│   └── two      myEnd
└── three

class pandagg.node.mappings.abstract.RegularField(**body)[source]

Bases: pandagg.node.mappings.abstract.Field

KEY = None
is_valid_value(v)[source]
pandagg.node.mappings.field_datatypes module

https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html

class pandagg.node.mappings.field_datatypes.Alias(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

Defines an alias to an existing field.

KEY = 'alias'
class pandagg.node.mappings.field_datatypes.Binary(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

KEY = 'binary'
class pandagg.node.mappings.field_datatypes.Boolean(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

KEY = 'boolean'
class pandagg.node.mappings.field_datatypes.Byte(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

KEY = 'byte'
class pandagg.node.mappings.field_datatypes.Completion(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

To provide auto-complete suggestions

KEY = 'completion'
class pandagg.node.mappings.field_datatypes.ConstantKeyword(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

KEY = 'constant_keyword'
class pandagg.node.mappings.field_datatypes.Date(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

KEY = 'date'
class pandagg.node.mappings.field_datatypes.DateNanos(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

KEY = 'date_nanos'
class pandagg.node.mappings.field_datatypes.DateRange(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

KEY = 'date_range'
class pandagg.node.mappings.field_datatypes.DenseVector(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

Record dense vectors of float values.

KEY = 'dense_vector'
class pandagg.node.mappings.field_datatypes.Double(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

KEY = 'double'
class pandagg.node.mappings.field_datatypes.DoubleRange(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

KEY = 'double_range'
class pandagg.node.mappings.field_datatypes.Flattened(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

Allows an entire JSON object to be indexed as a single field.

KEY = 'flattened'
class pandagg.node.mappings.field_datatypes.Float(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

KEY = 'float'
class pandagg.node.mappings.field_datatypes.FloatRange(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

KEY = 'float_range'
class pandagg.node.mappings.field_datatypes.GeoPoint(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

For lat/lon points

KEY = 'geo_point'
class pandagg.node.mappings.field_datatypes.GeoShape(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

For complex shapes like polygons

KEY = 'geo_shape'
class pandagg.node.mappings.field_datatypes.HalfFloat(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

KEY = 'half_float'
class pandagg.node.mappings.field_datatypes.Histogram(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

For pre-aggregated numerical values for percentiles aggregations.

KEY = 'histogram'
class pandagg.node.mappings.field_datatypes.IP(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

For IPv4 and IPv6 addresses

KEY = 'ip'
class pandagg.node.mappings.field_datatypes.Integer(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

KEY = 'integer'
class pandagg.node.mappings.field_datatypes.IntegerRange(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

KEY = 'integer_range'
class pandagg.node.mappings.field_datatypes.Join(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

Defines parent/child relation for documents within the same index

KEY = 'join'
class pandagg.node.mappings.field_datatypes.Keyword(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

KEY = 'keyword'
class pandagg.node.mappings.field_datatypes.Long(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

KEY = 'long'
class pandagg.node.mappings.field_datatypes.LongRange(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

KEY = 'long_range'
class pandagg.node.mappings.field_datatypes.MapperAnnotatedText(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

To index text containing special markup (typically used for identifying named entities)

KEY = 'annotated-text'
class pandagg.node.mappings.field_datatypes.MapperMurMur3(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

To compute hashes of values at index-time and store them in the index

KEY = 'murmur3'
class pandagg.node.mappings.field_datatypes.Nested(**body)[source]

Bases: pandagg.node.mappings.abstract.ComplexField

KEY = 'nested'
class pandagg.node.mappings.field_datatypes.Object(**body)[source]

Bases: pandagg.node.mappings.abstract.ComplexField

KEY = 'object'
class pandagg.node.mappings.field_datatypes.Percolator(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

Accepts queries from the query-dsl

KEY = 'percolator'
class pandagg.node.mappings.field_datatypes.RankFeature(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

Record numeric feature to boost hits at query time.

KEY = 'rank_feature'
class pandagg.node.mappings.field_datatypes.RankFeatures(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

Record numeric features to boost hits at query time.

KEY = 'rank_features'
class pandagg.node.mappings.field_datatypes.ScaledFloat(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

KEY = 'scaled_float'
class pandagg.node.mappings.field_datatypes.SearchAsYouType(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

A text-like field optimized for queries to implement as-you-type completion

KEY = 'search_as_you_type'
class pandagg.node.mappings.field_datatypes.Shape(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

For arbitrary cartesian geometries.

KEY = 'shape'
class pandagg.node.mappings.field_datatypes.Short(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

KEY = 'short'
class pandagg.node.mappings.field_datatypes.SparseVector(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

Record sparse vectors of float values.

KEY = 'sparse_vector'
class pandagg.node.mappings.field_datatypes.Text(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

KEY = 'text'
class pandagg.node.mappings.field_datatypes.TokenCount(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

To count the number of tokens in a string

KEY = 'token_count'
class pandagg.node.mappings.field_datatypes.WildCard(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

KEY = 'wildcard'
pandagg.node.mappings.meta_fields module
class pandagg.node.mappings.meta_fields.FieldNames(multiple=None, nullable=True, **body)[source]

Bases: pandagg.node.mappings.abstract.Field

All fields in the document which contain non-null values.

KEY = '_field_names'
class pandagg.node.mappings.meta_fields.Id(multiple=None, nullable=True, **body)[source]

Bases: pandagg.node.mappings.abstract.Field

The document’s ID.

KEY = '_id'
class pandagg.node.mappings.meta_fields.Ignored(multiple=None, nullable=True, **body)[source]

Bases: pandagg.node.mappings.abstract.Field

All fields in the document that have been ignored at index time because of ignore_malformed.

KEY = '_ignored'
class pandagg.node.mappings.meta_fields.Index(multiple=None, nullable=True, **body)[source]

Bases: pandagg.node.mappings.abstract.Field

The index to which the document belongs.

KEY = '_index'
class pandagg.node.mappings.meta_fields.Meta(multiple=None, nullable=True, **body)[source]

Bases: pandagg.node.mappings.abstract.Field

Application specific metadata.

KEY = '_meta'
class pandagg.node.mappings.meta_fields.Routing(multiple=None, nullable=True, **body)[source]

Bases: pandagg.node.mappings.abstract.Field

A custom routing value which routes a document to a particular shard.

KEY = '_routing'
class pandagg.node.mappings.meta_fields.Size(multiple=None, nullable=True, **body)[source]

Bases: pandagg.node.mappings.abstract.Field

The size of the _source field in bytes, provided by the mapper-size plugin.

KEY = '_size'
class pandagg.node.mappings.meta_fields.Source(multiple=None, nullable=True, **body)[source]

Bases: pandagg.node.mappings.abstract.Field

The original JSON representing the body of the document.

KEY = '_source'
class pandagg.node.mappings.meta_fields.Type(multiple=None, nullable=True, **body)[source]

Bases: pandagg.node.mappings.abstract.Field

The document’s mapping type.

KEY = '_type'
Module contents
pandagg.node.query package
Submodules
pandagg.node.query.abstract module
class pandagg.node.query.abstract.AbstractSingleFieldQueryClause(field, _name=None, **body)[source]

Bases: pandagg.node.query.abstract.LeafQueryClause

class pandagg.node.query.abstract.FlatFieldQueryClause(field, _name=None, **body)[source]

Bases: pandagg.node.query.abstract.AbstractSingleFieldQueryClause

Query clause applied on one single field. Examples:

Exists: {"exists": {"field": "user"}}
-> field = "user"
-> body = {"field": "user"}

>>> from pandagg.query import Exists
>>> q = Exists(field="user")

DistanceFeature: {"distance_feature": {"field": "production_date", "pivot": "7d", "origin": "now"}}
-> field = "production_date"
-> body = {"field": "production_date", "pivot": "7d", "origin": "now"}

>>> from pandagg.query import DistanceFeature
>>> q = DistanceFeature(field="production_date", pivot="7d", origin="now")

class pandagg.node.query.abstract.KeyFieldQueryClause(field=None, _name=None, _expand__to_dot=True, **params)[source]

Bases: pandagg.node.query.abstract.AbstractSingleFieldQueryClause

Clause with field used as key in clause body:

Term: {"term": {"user": {"value": "Kimchy", "boost": 1}}}
-> field = "user"
-> body = {"user": {"value": "Kimchy", "boost": 1}}

>>> from pandagg.query import Term
>>> q1 = Term(user={"value": "Kimchy", "boost": 1})
>>> q2 = Term(field="user", value="Kimchy", boost=1)

Can accept an "_implicit_param" attribute specifying the equivalent key when the inner body isn't a dict but a raw value. For Term: _implicit_param = "value"

>>> q = Term(user="Kimchy")
{"term": {"user": {"value": "Kimchy"}}}
-> field = "user"
-> body = {"user": {"value": "Kimchy"}}

line_repr(depth, **kwargs)[source]

Control how node is displayed in tree representation.

_
├── one          end
│   └── two      myEnd
└── three

class pandagg.node.query.abstract.LeafQueryClause(_name=None, **body)[source]

Bases: pandagg.node.query.abstract.QueryClause

class pandagg.node.query.abstract.MultiFieldsQueryClause(fields, _name=None, **body)[source]

Bases: pandagg.node.query.abstract.LeafQueryClause

line_repr(depth, **kwargs)[source]

Control how node is displayed in tree representation.

_
├── one          end
│   └── two      myEnd
└── three

class pandagg.node.query.abstract.ParentParameterClause[source]

Bases: pandagg.node.query.abstract.QueryClause

line_repr(**kwargs)[source]

Control how node is displayed in tree representation.

_
├── one          end
│   └── two      myEnd
└── three

pandagg.node.query.abstract.Q(type_or_query=None, **body)[source]

Accept multiple syntaxes, return a QueryClause node.

Parameters:
  • type_or_query
  • body
Returns:

QueryClause

class pandagg.node.query.abstract.QueryClause(_name=None, accept_children=True, keyed=True, _children=None, **body)[source]

Bases: pandagg.node._node.Node

KEY = None
line_repr(depth, **kwargs)[source]

Control how node is displayed in tree representation.

_
├── one          end
│   └── two      myEnd
└── three

name
to_dict()[source]
pandagg.node.query.compound module
class pandagg.node.query.compound.Bool(_name=None, **body)[source]

Bases: pandagg.node.query.compound.CompoundClause

>>> Bool(must=[], should=[], filter=[], must_not=[], boost=1.2)
KEY = 'bool'
class pandagg.node.query.compound.Boosting(_name=None, **body)[source]

Bases: pandagg.node.query.compound.CompoundClause

KEY = 'boosting'
class pandagg.node.query.compound.CompoundClause(_name=None, **body)[source]

Bases: pandagg.node.query.abstract.QueryClause

Compound clauses can encapsulate other query clauses.

class pandagg.node.query.compound.ConstantScore(_name=None, **body)[source]

Bases: pandagg.node.query.compound.CompoundClause

KEY = 'constant_score'
class pandagg.node.query.compound.DisMax(_name=None, **body)[source]

Bases: pandagg.node.query.compound.CompoundClause

KEY = 'dis_max'
class pandagg.node.query.compound.FunctionScore(_name=None, **body)[source]

Bases: pandagg.node.query.compound.CompoundClause

KEY = 'function_score'
pandagg.node.query.full_text module
class pandagg.node.query.full_text.Common(field=None, _name=None, _expand__to_dot=True, **params)[source]

Bases: pandagg.node.query.abstract.KeyFieldQueryClause

KEY = 'common'
class pandagg.node.query.full_text.Intervals(field=None, _name=None, _expand__to_dot=True, **params)[source]

Bases: pandagg.node.query.abstract.KeyFieldQueryClause

KEY = 'intervals'
class pandagg.node.query.full_text.Match(field=None, _name=None, _expand__to_dot=True, **params)[source]

Bases: pandagg.node.query.abstract.KeyFieldQueryClause

KEY = 'match'
class pandagg.node.query.full_text.MatchBoolPrefix(field=None, _name=None, _expand__to_dot=True, **params)[source]

Bases: pandagg.node.query.abstract.KeyFieldQueryClause

KEY = 'match_bool_prefix'
class pandagg.node.query.full_text.MatchPhrase(field=None, _name=None, _expand__to_dot=True, **params)[source]

Bases: pandagg.node.query.abstract.KeyFieldQueryClause

KEY = 'match_phrase'
class pandagg.node.query.full_text.MatchPhrasePrefix(field=None, _name=None, _expand__to_dot=True, **params)[source]

Bases: pandagg.node.query.abstract.KeyFieldQueryClause

KEY = 'match_phrase_prefix'
class pandagg.node.query.full_text.MultiMatch(fields, _name=None, **body)[source]

Bases: pandagg.node.query.abstract.MultiFieldsQueryClause

KEY = 'multi_match'
class pandagg.node.query.full_text.QueryString(_name=None, **body)[source]

Bases: pandagg.node.query.abstract.LeafQueryClause

KEY = 'query_string'
class pandagg.node.query.full_text.SimpleQueryString(_name=None, **body)[source]

Bases: pandagg.node.query.abstract.LeafQueryClause

KEY = 'simple_string'
pandagg.node.query.geo module
class pandagg.node.query.geo.GeoBoundingBox(field=None, _name=None, _expand__to_dot=True, **params)[source]

Bases: pandagg.node.query.abstract.KeyFieldQueryClause

KEY = 'geo_bounding_box'
class pandagg.node.query.geo.GeoDistance(distance, **body)[source]

Bases: pandagg.node.query.abstract.AbstractSingleFieldQueryClause

KEY = 'geo_distance'
line_repr(depth, **kwargs)[source]

Control how node is displayed in tree representation.

_
├── one          end
│   └── two      myEnd
└── three

class pandagg.node.query.geo.GeoPolygone(field=None, _name=None, _expand__to_dot=True, **params)[source]

Bases: pandagg.node.query.abstract.KeyFieldQueryClause

KEY = 'geo_polygon'
class pandagg.node.query.geo.GeoShape(field=None, _name=None, _expand__to_dot=True, **params)[source]

Bases: pandagg.node.query.abstract.KeyFieldQueryClause

KEY = 'geo_shape'
pandagg.node.query.joining module
class pandagg.node.query.joining.HasChild(_name=None, **body)[source]

Bases: pandagg.node.query.compound.CompoundClause

KEY = 'has_child'
class pandagg.node.query.joining.HasParent(_name=None, **body)[source]

Bases: pandagg.node.query.compound.CompoundClause

KEY = 'has_parent'
class pandagg.node.query.joining.Nested(path, **kwargs)[source]

Bases: pandagg.node.query.compound.CompoundClause

KEY = 'nested'
class pandagg.node.query.joining.ParentId(_name=None, **body)[source]

Bases: pandagg.node.query.abstract.LeafQueryClause

KEY = 'parent_id'
pandagg.node.query.shape module
class pandagg.node.query.shape.Shape(_name=None, **body)[source]

Bases: pandagg.node.query.abstract.LeafQueryClause

KEY = 'shape'
pandagg.node.query.span module
pandagg.node.query.specialized module
class pandagg.node.query.specialized.DistanceFeature(field, _name=None, **body)[source]

Bases: pandagg.node.query.abstract.FlatFieldQueryClause

KEY = 'distance_feature'
class pandagg.node.query.specialized.MoreLikeThis(fields, _name=None, **body)[source]

Bases: pandagg.node.query.abstract.MultiFieldsQueryClause

KEY = 'more_like_this'
class pandagg.node.query.specialized.Percolate(field, _name=None, **body)[source]

Bases: pandagg.node.query.abstract.FlatFieldQueryClause

KEY = 'percolate'
class pandagg.node.query.specialized.RankFeature(field, _name=None, **body)[source]

Bases: pandagg.node.query.abstract.FlatFieldQueryClause

KEY = 'rank_feature'
class pandagg.node.query.specialized.Script(_name=None, **body)[source]

Bases: pandagg.node.query.abstract.LeafQueryClause

KEY = 'script'
class pandagg.node.query.specialized.Wrapper(_name=None, **body)[source]

Bases: pandagg.node.query.abstract.LeafQueryClause

KEY = 'wrapper'
pandagg.node.query.specialized_compound module
class pandagg.node.query.specialized_compound.PinnedQuery(_name=None, **body)[source]

Bases: pandagg.node.query.compound.CompoundClause

KEY = 'pinned'
class pandagg.node.query.specialized_compound.ScriptScore(_name=None, **body)[source]

Bases: pandagg.node.query.compound.CompoundClause

KEY = 'script_score'
pandagg.node.query.term_level module
class pandagg.node.query.term_level.Exists(field, _name=None)[source]

Bases: pandagg.node.query.abstract.LeafQueryClause

KEY = 'exists'
line_repr(depth, **kwargs)[source]

Control how the node is displayed in the tree representation.

class pandagg.node.query.term_level.Fuzzy(field=None, _name=None, _expand__to_dot=True, **params)[source]

Bases: pandagg.node.query.abstract.KeyFieldQueryClause

KEY = 'fuzzy'
class pandagg.node.query.term_level.Ids(values, _name=None)[source]

Bases: pandagg.node.query.abstract.LeafQueryClause

KEY = 'ids'
line_repr(depth, **kwargs)[source]

Control how the node is displayed in the tree representation.

to_dict(with_name=True)[source]
class pandagg.node.query.term_level.Prefix(field=None, _name=None, _expand__to_dot=True, **params)[source]

Bases: pandagg.node.query.abstract.KeyFieldQueryClause

KEY = 'prefix'
class pandagg.node.query.term_level.Range(field=None, _name=None, _expand__to_dot=True, **params)[source]

Bases: pandagg.node.query.abstract.KeyFieldQueryClause

KEY = 'range'
class pandagg.node.query.term_level.Regexp(field=None, _name=None, _expand__to_dot=True, **params)[source]

Bases: pandagg.node.query.abstract.KeyFieldQueryClause

KEY = 'regexp'
class pandagg.node.query.term_level.Term(field=None, _name=None, _expand__to_dot=True, **params)[source]

Bases: pandagg.node.query.abstract.KeyFieldQueryClause

KEY = 'term'
class pandagg.node.query.term_level.Terms(**body)[source]

Bases: pandagg.node.query.abstract.AbstractSingleFieldQueryClause

KEY = 'terms'
class pandagg.node.query.term_level.TermsSet(field=None, _name=None, _expand__to_dot=True, **params)[source]

Bases: pandagg.node.query.abstract.KeyFieldQueryClause

KEY = 'terms_set'
class pandagg.node.query.term_level.Type(field=None, _name=None, _expand__to_dot=True, **params)[source]

Bases: pandagg.node.query.abstract.KeyFieldQueryClause

KEY = 'type'
class pandagg.node.query.term_level.Wildcard(field=None, _name=None, _expand__to_dot=True, **params)[source]

Bases: pandagg.node.query.abstract.KeyFieldQueryClause

KEY = 'wildcard'
Module contents
pandagg.node.response package
Submodules
pandagg.node.response.bucket module
class pandagg.node.response.bucket.Bucket(value, key=None, level=None)[source]

Bases: pandagg.node.response.bucket.BucketNode

attr_name

Determine under which attribute name the bucket will be available in the response tree. Dots are replaced by _ characters so that they don't prevent attribute access.

If the resulting name is still unfit for Python attribute syntax, the bucket remains accessible through item access (dict-like); see 'utils.Obj' for more details.

line_repr(**kwargs)[source]

Control how the node is displayed in the tree representation.

class pandagg.node.response.bucket.BucketNode[source]

Bases: pandagg.node._node.Node

Module contents
Submodules
pandagg.node.types module
Module contents

pandagg.tree package

Submodules
pandagg.tree.aggs module
class pandagg.tree.aggs.Aggs(aggs=None, mappings=None, nested_autocorrect=None, _groupby_ptr=None)[source]

Bases: pandagg.tree._tree.Tree

Combination of aggregation clauses. This class provides handy methods to build an aggregation (see aggs() and groupby()), and is also used to parse aggregation responses into easy-to-manipulate formats.

Mappings declaration is optional; when provided, it enables aggregation validation and automatic handling of missing nested clauses.

Accepts the following syntaxes:

from a dict: >>> Aggs({"per_user": {"terms": {"field": "user"}}})

from another Aggs instance: >>> Aggs(Aggs({"per_user": {"terms": {"field": "user"}}}))

dict with AggClause instances as values: >>> from pandagg.aggs import Terms, Avg >>> Aggs({'per_user': Terms(field='user')})

Parameters:
  • mappings – dict or pandagg.tree.mappings.Mappings. Mappings of requested indice(s); if provided, aggregation validity is checked.
  • nested_autocorrect – bool. In case of missing nested clauses in the aggregation: if True, automatically add them, else raise an error. Ignored if mappings are not provided.
  • _groupby_ptr – str. Identifier of the aggregation clause used as grouping element (used by clone method).

agg(name, type_or_agg=None, insert_below=None, at_root=False, **body)[source]

Insert provided agg clause in copy of initial Aggs.

Accepts the following syntaxes for the type_or_agg argument:

string, with body provided in kwargs: >>> Aggs().agg(name='some_agg', type_or_agg='terms', field='some_field')

python dict format: >>> Aggs().agg(name='some_agg', type_or_agg={'terms': {'field': 'some_field'}})

AggClause instance: >>> from pandagg.aggs import Terms >>> Aggs().agg(name='some_agg', type_or_agg=Terms(field='some_field'))

Parameters:
  • name – inserted agg clause name
  • type_or_agg – either agg type (str), or agg clause of dict format, or AggClause instance
  • insert_below – name of aggregation below which provided aggs should be inserted
  • at_root – if True, aggregation is inserted at root
  • body – aggregation clause body when providing string type_or_agg (remaining kwargs)
Returns:

copy of initial Aggs with provided agg inserted

aggs(aggs, insert_below=None, at_root=False)[source]

Insert provided aggs in copy of initial Aggs.

Accepts the following syntaxes for provided aggs:

python dict format: >>> Aggs().aggs({'some_agg': {'terms': {'field': 'some_field'}}, 'other_agg': {'avg': {'field': 'age'}}})

Aggs instance: >>> Aggs().aggs(Aggs({'some_agg': {'terms': {'field': 'some_field'}}, 'other_agg': {'avg': {'field': 'age'}}}))

dict with AggClause instances as values: >>> from pandagg.aggs import Terms, Avg >>> Aggs().aggs({'some_agg': Terms(field='some_field'), 'other_agg': Avg(field='age')})

Parameters:
  • aggs – aggregations to insert into existing aggregation
  • insert_below – name of aggregation below which provided aggs should be inserted
  • at_root – if True, aggregation is inserted at root
Returns:

copy of initial Aggs with provided aggs inserted

applied_nested_path_at_node(nid)[source]

Return nested path applied at a clause.

Parameters:nid – clause identifier
Returns:None if no nested is applied, else applied path (str)
apply_reverse_nested(nid=None)[source]
groupby(name, type_or_agg=None, insert_below=None, at_root=None, **body)[source]

Insert provided aggregation clause in copy of initial Aggs.

Given the initial aggregation:

A──> B
└──> C

If insert_below = ‘A’:

A──> new──> B
       └──> C
>>> Aggs().groupby('per_user_id', 'terms', field='user_id')
{"per_user_id":{"terms":{"field":"user_id"}}}
>>> Aggs().groupby('per_user_id', {'terms': {"field": "user_id"}})
{"per_user_id":{"terms":{"field":"user_id"}}}
>>> from pandagg.aggs import Terms
>>> Aggs().groupby('per_user_id', Terms(field="user_id"))
{"per_user_id":{"terms":{"field":"user_id"}}}
Return type:pandagg.aggs.Aggs
grouped_by(agg_name=None, deepest=False)[source]

Define which aggregation will be used as grouping pointer.

Either provide an aggregation name, or specify deepest=True to use the deepest linear eligible aggregation node as pointer.

node_class

alias of pandagg.node.aggs.abstract.AggClause

show(*args, line_max_length=80, **kwargs)[source]

Return compact representation of Aggs.

>>> Aggs({
>>>     "genres": {
>>>         "terms": {"field": "genres", "size": 3},
>>>         "aggs": {
>>>             "movie_decade": {
>>>                 "date_histogram": {"field": "year", "fixed_interval": "3650d"}
>>>             }
>>>         },
>>>     }
>>> }).show()
<Aggregations>
genres                                           <terms, field="genres", size=3>
└── movie_decade          <date_histogram, field="year", fixed_interval="3650d">

All *args and **kwargs are propagated to the lighttree.Tree.show method.

Returns:str

to_dict(from_=None, depth=None)[source]

Serialize Aggs as dict.

Parameters:
  • from_ – identifier of an aggregation clause; if provided, limits serialization to this clause and its children (used for recursion, shouldn't be useful)
  • depth – integer; if provided, limits the serialization to a given depth
Returns:dict

pandagg.tree.mappings module
class pandagg.tree.mappings.Mappings(properties=None, dynamic=False, **kwargs)[source]

Bases: pandagg.tree._tree.Tree

list_nesteds_at_field(field_path)[source]

List nested paths that apply at a given path.

>>> mappings = Mappings(dynamic=False, properties={
>>>     'id': {'type': 'keyword'},
>>>     'comments': {'type': 'nested', 'properties': {
>>>         'comment_text': {'type': 'text'},
>>>         'date': {'type': 'date'}
>>>     }}
>>> })
>>> mappings.list_nesteds_at_field('id')
[]
>>> mappings.list_nesteds_at_field('comments')
['comments']
>>> mappings.list_nesteds_at_field('comments.comment_text')
['comments']
mapping_type_of_field(field_path)[source]

Return field type of provided field path.

>>> mappings = Mappings(dynamic=False, properties={
>>>     'id': {'type': 'keyword'},
>>>     'comments': {'type': 'nested', 'properties': {
>>>         'comment_text': {'type': 'text'},
>>>         'date': {'type': 'date'}
>>>     }}
>>> })
>>> mappings.mapping_type_of_field('id')
'keyword'
>>> mappings.mapping_type_of_field('comments')
'nested'
>>> mappings.mapping_type_of_field('comments.comment_text')
'text'
nested_at_field(field_path)[source]

Return the nested path applied to a given path, or None if none applies.

>>> mappings = Mappings(dynamic=False, properties={
>>>     'id': {'type': 'keyword'},
>>>     'comments': {'type': 'nested', 'properties': {
>>>         'comment_text': {'type': 'text'},
>>>         'date': {'type': 'date'}
>>>     }}
>>> })
>>> mappings.nested_at_field('id')
None
>>> mappings.nested_at_field('comments')
'comments'
>>> mappings.nested_at_field('comments.comment_text')
'comments'
node_class

alias of pandagg.node.mappings.abstract.Field

to_dict(from_=None, depth=None)[source]

Serialize Mappings as dict.

Parameters:
  • from_ – identifier of a field; if provided, limits serialization to this field and its children (used for recursion, shouldn't be useful)
  • depth – integer; if provided, limits the serialization to a given depth
Returns:dict

validate_agg_clause(agg_clause, exc=True)[source]

Ensure that if the aggregation clause relates to a field (field or path), this field exists in the mappings, and that the requested aggregation type is allowed on this kind of field.

Parameters:
  • agg_clause – AggClause you want to validate on these mappings
  • exc – boolean, if set to True raise exception if invalid
Return type:

boolean

validate_document(d)[source]
pandagg.tree.query module
class pandagg.tree.query.Query(q=None, mappings=None, nested_autocorrect=False)[source]

Bases: pandagg.tree._tree.Tree

applied_nested_path_at_node(nid)[source]

Return nested path applied at a clause.

Parameters:nid – clause identifier
Returns:None if no nested is applied, else applied path (str)
bool(must=None, should=None, must_not=None, filter=None, insert_below=None, on=None, mode='add', **body)[source]
>>> Query().bool(must={"term": {"some_field": "yolo"}})
boosting(positive=None, negative=None, insert_below=None, on=None, mode='add', **body)[source]
constant_score(filter=None, boost=None, insert_below=None, on=None, mode='add', **body)[source]
dis_max(queries, insert_below=None, on=None, mode='add', **body)[source]
filter(type_or_query, insert_below=None, on=None, mode='add', bool_body=None, **body)[source]
function_score(query, insert_below=None, on=None, mode='add', **body)[source]
has_child(query, insert_below=None, on=None, mode='add', **body)[source]
has_parent(query, insert_below=None, on=None, mode='add', **body)[source]
must(type_or_query, insert_below=None, on=None, mode='add', bool_body=None, **body)[source]

Create copy of initial Query and insert provided clause under “bool” query “must”.

>>> Query().must('term', some_field=1)
>>> Query().must({'term': {'some_field': 1}})
>>> from pandagg.query import Term
>>> Query().must(Term(some_field=1))
Keyword Arguments:
 
  • insert_below (str) – named query clause under which the inserted clauses should be placed.
  • compound_param (str) – param under which inserted clause will be placed in compound query
  • on (str) – named compound query clause on which the inserted compound clause should be merged.
  • mode (str one of 'add', 'replace', 'replace_all') – merging strategy when inserting clauses on an existing compound clause.
    • 'add' (default): adds new clauses while keeping initial ones
    • 'replace': for each parameter (for instance, in the 'bool' case: 'filter', 'must', 'must_not', 'should'), replaces existing clauses under this parameter by the new ones, but only for parameters declared in the inserted compound query
    • 'replace_all': the existing compound clause is completely replaced by the new one
must_not(type_or_query, insert_below=None, on=None, mode='add', bool_body=None, **body)[source]
nested(path, query=None, insert_below=None, on=None, mode='add', **body)[source]
node_class

alias of pandagg.node.query.abstract.QueryClause

pinned_query(organic, insert_below=None, on=None, mode='add', **body)[source]
query(type_or_query, insert_below=None, on=None, mode='add', compound_param=None, **body)[source]

Insert provided clause in copy of initial Query.

>>> from pandagg.query import Query
>>> Query().query('term', some_field=23)
{'term': {'some_field': 23}}
>>> from pandagg.query import Term
>>> Query()\
>>> .query({'term': {'some_field': 23}})\
>>> .query(Term(other_field=24))
{'bool': {'must': [{'term': {'some_field': 23}}, {'term': {'other_field': 24}}]}}
Keyword Arguments:
 
  • insert_below (str) – named query clause under which the inserted clauses should be placed.
  • compound_param (str) – param under which inserted clause will be placed in compound query
  • on (str) – named compound query clause on which the inserted compound clause should be merged.
  • mode (str one of 'add', 'replace', 'replace_all') – merging strategy when inserting clauses on an existing compound clause.
    • 'add' (default): adds new clauses while keeping initial ones
    • 'replace': for each parameter (for instance, in the 'bool' case: 'filter', 'must', 'must_not', 'should'), replaces existing clauses under this parameter by the new ones, but only for parameters declared in the inserted compound query
    • 'replace_all': the existing compound clause is completely replaced by the new one
script_score(query, insert_below=None, on=None, mode='add', **body)[source]
should(type_or_query, insert_below=None, on=None, mode='add', bool_body=None, **body)[source]
show(*args, line_max_length=80, **kwargs)[source]

Return compact representation of Query.

>>> Query()\
>>> .must({"exists": {"field": "some_field"}})\
>>> .must({"term": {"other_field": {"value": 5}}})\
>>> .show()
<Query>
bool
└── must
    ├── exists                                                  field=some_field
    └── term                                          field=other_field, value=5

All *args and **kwargs are propagated to the lighttree.Tree.show method.

Returns:str

to_dict(from_=None)[source]

Serialize Query as dict.

pandagg.tree.response module
class pandagg.tree.response.AggsResponseTree(aggs, raw_response=None)[source]

Bases: pandagg.tree._tree.Tree

Tree shaped representation of an ElasticSearch aggregations response.

bucket_properties(bucket, properties=None, end_level=None, depth=None)[source]

Recursive method returning a given bucket's properties in the form of an ordered dictionary. Travels from the current bucket through all ancestors until reaching the root.

Parameters:
  • bucket – instance of pandagg.buckets.buckets.Bucket
  • properties – OrderedDict accumulator of ‘level’ -> ‘key’
  • end_level – optional parameter to specify until which level properties are fetched
  • depth – optional parameter to specify a limit number of levels which are fetched
Returns:

OrderedDict of structure ‘level’ -> ‘key’

get_bucket_filter(nid)[source]

Build query filtering documents belonging to that bucket. Suppose the following configuration:

Base                        <- filter on base
  |── Nested_A                 no filter on A (nested still must be applied for children)
  |     |── SubNested A1
  |     └── SubNested A2    <- filter on A2
  └── Nested_B              <- filter on B
node_class

alias of pandagg.node.response.bucket.BucketNode

parse(raw_response)[source]

Build response tree from ElasticSearch aggregation response

Parameters:raw_response – ElasticSearch aggregation response
Returns:self
show(**kwargs)[source]

Return tree structure in hierarchy style.

Parameters:
  • nid – node identifier from which tree traversal will start. If None, tree root will be used
  • filter_ – filter function performed on nodes. Nodes excluded by the filter function, as well as their children, won't be displayed
  • reverse – reverse parameter applied when sorting Node objects of a same level
  • display_key – boolean, if True display keyed nodes keys
  • line_type – display type choice
  • limit – int, truncate tree display to this number of lines
  • kwargs – kwargs params passed to node line_repr method

Parameters:line_max_length – maximum length of a display line
Return type:str (unicode in Python 2)

Module contents

Submodules

pandagg.aggs module

class pandagg.aggs.Aggs(aggs=None, mappings=None, nested_autocorrect=None, _groupby_ptr=None)[source]

Bases: pandagg.tree._tree.Tree

Combination of aggregation clauses. This class provides handy methods to build an aggregation (see aggs() and groupby()), and is also used to parse aggregation responses into easy-to-manipulate formats.

Mappings declaration is optional; when provided, it enables aggregation validation and automatic handling of missing nested clauses.

Accepts the following syntaxes:

from a dict: >>> Aggs({"per_user": {"terms": {"field": "user"}}})

from another Aggs instance: >>> Aggs(Aggs({"per_user": {"terms": {"field": "user"}}}))

dict with AggClause instances as values: >>> from pandagg.aggs import Terms, Avg >>> Aggs({'per_user': Terms(field='user')})

Parameters:
  • mappings – dict or pandagg.tree.mappings.Mappings. Mappings of requested indice(s); if provided, aggregation validity is checked.
  • nested_autocorrect – bool. In case of missing nested clauses in the aggregation: if True, automatically add them, else raise an error. Ignored if mappings are not provided.
  • _groupby_ptr – str. Identifier of the aggregation clause used as grouping element (used by clone method).

agg(name, type_or_agg=None, insert_below=None, at_root=False, **body)[source]

Insert provided agg clause in copy of initial Aggs.

Accepts the following syntaxes for the type_or_agg argument:

string, with body provided in kwargs: >>> Aggs().agg(name='some_agg', type_or_agg='terms', field='some_field')

python dict format: >>> Aggs().agg(name='some_agg', type_or_agg={'terms': {'field': 'some_field'}})

AggClause instance: >>> from pandagg.aggs import Terms >>> Aggs().agg(name='some_agg', type_or_agg=Terms(field='some_field'))

Parameters:
  • name – inserted agg clause name
  • type_or_agg – either agg type (str), or agg clause of dict format, or AggClause instance
  • insert_below – name of aggregation below which provided aggs should be inserted
  • at_root – if True, aggregation is inserted at root
  • body – aggregation clause body when providing string type_or_agg (remaining kwargs)
Returns:

copy of initial Aggs with provided agg inserted

aggs(aggs, insert_below=None, at_root=False)[source]

Insert provided aggs in copy of initial Aggs.

Accepts the following syntaxes for provided aggs:

python dict format: >>> Aggs().aggs({'some_agg': {'terms': {'field': 'some_field'}}, 'other_agg': {'avg': {'field': 'age'}}})

Aggs instance: >>> Aggs().aggs(Aggs({'some_agg': {'terms': {'field': 'some_field'}}, 'other_agg': {'avg': {'field': 'age'}}}))

dict with AggClause instances as values: >>> from pandagg.aggs import Terms, Avg >>> Aggs().aggs({'some_agg': Terms(field='some_field'), 'other_agg': Avg(field='age')})

Parameters:
  • aggs – aggregations to insert into existing aggregation
  • insert_below – name of aggregation below which provided aggs should be inserted
  • at_root – if True, aggregation is inserted at root
Returns:

copy of initial Aggs with provided aggs inserted

applied_nested_path_at_node(nid)[source]

Return nested path applied at a clause.

Parameters:nid – clause identifier
Returns:None if no nested is applied, else applied path (str)
apply_reverse_nested(nid=None)[source]
groupby(name, type_or_agg=None, insert_below=None, at_root=None, **body)[source]

Insert provided aggregation clause in copy of initial Aggs.

Given the initial aggregation:

A──> B
└──> C

If insert_below = ‘A’:

A──> new──> B
       └──> C
>>> Aggs().groupby('per_user_id', 'terms', field='user_id')
{"per_user_id":{"terms":{"field":"user_id"}}}
>>> Aggs().groupby('per_user_id', {'terms': {"field": "user_id"}})
{"per_user_id":{"terms":{"field":"user_id"}}}
>>> from pandagg.aggs import Terms
>>> Aggs().groupby('per_user_id', Terms(field="user_id"))
{"per_user_id":{"terms":{"field":"user_id"}}}
Return type:pandagg.aggs.Aggs
grouped_by(agg_name=None, deepest=False)[source]

Define which aggregation will be used as grouping pointer.

Either provide an aggregation name, or specify deepest=True to use the deepest linear eligible aggregation node as pointer.

node_class

alias of pandagg.node.aggs.abstract.AggClause

show(*args, line_max_length=80, **kwargs)[source]

Return compact representation of Aggs.

>>> Aggs({
>>>     "genres": {
>>>         "terms": {"field": "genres", "size": 3},
>>>         "aggs": {
>>>             "movie_decade": {
>>>                 "date_histogram": {"field": "year", "fixed_interval": "3650d"}
>>>             }
>>>         },
>>>     }
>>> }).show()
<Aggregations>
genres                                           <terms, field="genres", size=3>
└── movie_decade          <date_histogram, field="year", fixed_interval="3650d">

All *args and **kwargs are propagated to the lighttree.Tree.show method.

Returns:str

to_dict(from_=None, depth=None)[source]

Serialize Aggs as dict.

Parameters:
  • from_ – identifier of an aggregation clause; if provided, limits serialization to this clause and its children (used for recursion, shouldn't be useful)
  • depth – integer; if provided, limits the serialization to a given depth
Returns:dict

class pandagg.aggs.Terms(field, missing=None, size=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.MultipleBucketAgg

Terms aggregation.

BLACKLISTED_MAPPING_TYPES = []
KEY = 'terms'
VALUE_ATTRS = ['doc_count', 'doc_count_error_upper_bound', 'sum_other_doc_count']
get_filter(key)[source]

Provide a filter to get documents belonging to the bucket of the given key.

class pandagg.aggs.Filters(filters, other_bucket=False, other_bucket_key=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.MultipleBucketAgg

DEFAULT_OTHER_KEY = '_other_'
IMPLICIT_KEYED = True
KEY = 'filters'
VALUE_ATTRS = ['doc_count']
get_filter(key)[source]

Provide a filter to get documents belonging to the bucket of the given key.

class pandagg.aggs.Histogram(field, interval, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.MultipleBucketAgg

KEY = 'histogram'
VALUE_ATTRS = ['doc_count']
WHITELISTED_MAPPING_TYPES = ['long', 'integer', 'short', 'byte', 'double', 'float', 'half_float', 'scaled_float', 'ip', 'token_count', 'date', 'boolean']
get_filter(key)[source]

Provide a filter to get documents belonging to the bucket of the given key.

class pandagg.aggs.DateHistogram(field, interval=None, calendar_interval=None, fixed_interval=None, meta=None, keyed=False, key_as_string=True, **body)[source]

Bases: pandagg.node.aggs.abstract.MultipleBucketAgg

KEY = 'date_histogram'
VALUE_ATTRS = ['doc_count']
WHITELISTED_MAPPING_TYPES = ['date']
get_filter(key)[source]

Provide a filter to get documents belonging to the bucket of the given key.

class pandagg.aggs.Range(field, ranges, keyed=False, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.MultipleBucketAgg

KEY = 'range'
KEY_SEP = '-'
VALUE_ATTRS = ['doc_count']
WHITELISTED_MAPPING_TYPES = ['long', 'integer', 'short', 'byte', 'double', 'float', 'half_float', 'scaled_float', 'ip', 'token_count', 'date', 'boolean']
from_key
get_filter(key)[source]

Provide a filter to get documents belonging to the bucket of the given key.

to_key
class pandagg.aggs.Global(meta=None)[source]

Bases: pandagg.node.aggs.abstract.UniqueBucketAgg

KEY = 'global'
VALUE_ATTRS = ['doc_count']
get_filter(key)[source]

Provide a filter to get documents belonging to the bucket of the given key.

class pandagg.aggs.Filter(filter=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.UniqueBucketAgg

KEY = 'filter'
VALUE_ATTRS = ['doc_count']
get_filter(key)[source]

Provide a filter to get documents belonging to the bucket of the given key.

class pandagg.aggs.Missing(field, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.UniqueBucketAgg

BLACKLISTED_MAPPING_TYPES = []
KEY = 'missing'
VALUE_ATTRS = ['doc_count']
get_filter(key)[source]

Provide a filter to get documents belonging to the bucket of the given key.

class pandagg.aggs.Nested(path, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.UniqueBucketAgg

KEY = 'nested'
VALUE_ATTRS = ['doc_count']
WHITELISTED_MAPPING_TYPES = ['nested']
get_filter(key)[source]

Provide a filter to get documents belonging to the bucket of the given key.

class pandagg.aggs.ReverseNested(path=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.UniqueBucketAgg

KEY = 'reverse_nested'
VALUE_ATTRS = ['doc_count']
WHITELISTED_MAPPING_TYPES = ['nested']
get_filter(key)[source]

Provide a filter to get documents belonging to the bucket of the given key.

class pandagg.aggs.Avg(field=None, script=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.FieldOrScriptMetricAgg

KEY = 'avg'
VALUE_ATTRS = ['value']
WHITELISTED_MAPPING_TYPES = ['long', 'integer', 'short', 'byte', 'double', 'float', 'half_float', 'scaled_float', 'ip', 'token_count', 'date', 'boolean']
class pandagg.aggs.Max(field=None, script=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.FieldOrScriptMetricAgg

KEY = 'max'
VALUE_ATTRS = ['value']
WHITELISTED_MAPPING_TYPES = ['long', 'integer', 'short', 'byte', 'double', 'float', 'half_float', 'scaled_float', 'ip', 'token_count', 'date', 'boolean']
class pandagg.aggs.Sum(field=None, script=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.FieldOrScriptMetricAgg

KEY = 'sum'
VALUE_ATTRS = ['value']
WHITELISTED_MAPPING_TYPES = ['long', 'integer', 'short', 'byte', 'double', 'float', 'half_float', 'scaled_float', 'ip', 'token_count', 'date', 'boolean']
class pandagg.aggs.Min(field=None, script=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.FieldOrScriptMetricAgg

KEY = 'min'
VALUE_ATTRS = ['value']
WHITELISTED_MAPPING_TYPES = ['long', 'integer', 'short', 'byte', 'double', 'float', 'half_float', 'scaled_float', 'ip', 'token_count', 'date', 'boolean']
class pandagg.aggs.Cardinality(field=None, script=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.FieldOrScriptMetricAgg

KEY = 'cardinality'
VALUE_ATTRS = ['value']
class pandagg.aggs.Stats(field=None, script=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.FieldOrScriptMetricAgg

KEY = 'stats'
VALUE_ATTRS = ['count', 'min', 'max', 'avg', 'sum']
WHITELISTED_MAPPING_TYPES = ['long', 'integer', 'short', 'byte', 'double', 'float', 'half_float', 'scaled_float', 'ip', 'token_count', 'date', 'boolean']
class pandagg.aggs.ExtendedStats(field=None, script=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.FieldOrScriptMetricAgg

KEY = 'extended_stats'
VALUE_ATTRS = ['count', 'min', 'max', 'avg', 'sum', 'sum_of_squares', 'variance', 'std_deviation', 'std_deviation_bounds']
WHITELISTED_MAPPING_TYPES = ['long', 'integer', 'short', 'byte', 'double', 'float', 'half_float', 'scaled_float', 'ip', 'token_count', 'date', 'boolean']
class pandagg.aggs.Percentiles(field=None, script=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.FieldOrScriptMetricAgg

Percents body argument can be passed to specify which percentiles to fetch.

KEY = 'percentiles'
VALUE_ATTRS = ['values']
WHITELISTED_MAPPING_TYPES = ['long', 'integer', 'short', 'byte', 'double', 'float', 'half_float', 'scaled_float', 'ip', 'token_count', 'date', 'boolean']
class pandagg.aggs.PercentileRanks(field, values, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.FieldOrScriptMetricAgg

KEY = 'percentile_ranks'
VALUE_ATTRS = ['values']
WHITELISTED_MAPPING_TYPES = ['long', 'integer', 'short', 'byte', 'double', 'float', 'half_float', 'scaled_float', 'ip', 'token_count', 'date', 'boolean']
class pandagg.aggs.GeoBound(field=None, script=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.FieldOrScriptMetricAgg

KEY = 'geo_bounds'
VALUE_ATTRS = ['bounds']
WHITELISTED_MAPPING_TYPES = ['geo_point']
class pandagg.aggs.GeoCentroid(field=None, script=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.FieldOrScriptMetricAgg

KEY = 'geo_centroid'
VALUE_ATTRS = ['location']
WHITELISTED_MAPPING_TYPES = ['geo_point']
class pandagg.aggs.TopHits(meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.MetricAgg

KEY = 'top_hits'
VALUE_ATTRS = ['hits']
class pandagg.aggs.ValueCount(field=None, script=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.FieldOrScriptMetricAgg

BLACKLISTED_MAPPING_TYPES = []
KEY = 'value_count'
VALUE_ATTRS = ['value']
class pandagg.aggs.AvgBucket(buckets_path, gap_policy=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.Pipeline

KEY = 'avg_bucket'
VALUE_ATTRS = ['value']
class pandagg.aggs.Derivative(buckets_path, gap_policy=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.Pipeline

KEY = 'derivative'
VALUE_ATTRS = ['value']
class pandagg.aggs.MaxBucket(buckets_path, gap_policy=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.Pipeline

KEY = 'max_bucket'
VALUE_ATTRS = ['value']
class pandagg.aggs.MinBucket(buckets_path, gap_policy=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.Pipeline

KEY = 'min_bucket'
VALUE_ATTRS = ['value']
class pandagg.aggs.SumBucket(buckets_path, gap_policy=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.Pipeline

KEY = 'sum_bucket'
VALUE_ATTRS = ['value']
class pandagg.aggs.StatsBucket(buckets_path, gap_policy=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.Pipeline

KEY = 'stats_bucket'
VALUE_ATTRS = ['count', 'min', 'max', 'avg', 'sum']
class pandagg.aggs.ExtendedStatsBucket(buckets_path, gap_policy=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.Pipeline

KEY = 'extended_stats_bucket'
VALUE_ATTRS = ['count', 'min', 'max', 'avg', 'sum', 'sum_of_squares', 'variance', 'std_deviation', 'std_deviation_bounds']
class pandagg.aggs.PercentilesBucket(buckets_path, gap_policy=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.Pipeline

KEY = 'percentiles_bucket'
VALUE_ATTRS = ['values']
class pandagg.aggs.MovingAvg(buckets_path, gap_policy=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.Pipeline

KEY = 'moving_avg'
VALUE_ATTRS = ['value']
class pandagg.aggs.CumulativeSum(buckets_path, gap_policy=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.Pipeline

KEY = 'cumulative_sum'
VALUE_ATTRS = ['value']
class pandagg.aggs.BucketScript(script, buckets_path, gap_policy=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.ScriptPipeline

KEY = 'bucket_script'
VALUE_ATTRS = ['value']
class pandagg.aggs.BucketSelector(script, buckets_path, gap_policy=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.ScriptPipeline

KEY = 'bucket_selector'
VALUE_ATTRS = None
class pandagg.aggs.BucketSort(script, buckets_path, gap_policy=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.ScriptPipeline

KEY = 'bucket_sort'
VALUE_ATTRS = None
class pandagg.aggs.SerialDiff(buckets_path, gap_policy=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.Pipeline

KEY = 'serial_diff'
VALUE_ATTRS = ['value']
class pandagg.aggs.MatchAll(meta=None, **body)[source]

Bases: pandagg.node.aggs.bucket.Filter

class pandagg.aggs.Composite(sources, size=None, after_key=None, meta=None, **body)[source]

Bases: pandagg.node.aggs.abstract.BucketAggClause

KEY = 'composite'
VALUE_ATTRS = ['doc_count']
extract_buckets(response_value)[source]
get_filter(key)[source]

In a composite aggregation, each bucket key is a map: source name -> value.
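As an illustration with plain dicts (the helper below is hypothetical, sketching what a per-bucket filter may look like — it is not pandagg's actual implementation):

```python
# In a composite aggregation response, each bucket "key" maps source names
# to values (illustrative data, not tied to any real index):
bucket = {"key": {"genre": "Action", "decade": 1990}, "doc_count": 42}

def composite_bucket_filter(key):
    # Hypothetical helper: one term clause per composite source.
    return {
        "bool": {
            "filter": [
                {"term": {source: {"value": value}}}
                for source, value in key.items()
            ]
        }
    }

bucket_filter = composite_bucket_filter(bucket["key"])
```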

pandagg.connections module

class pandagg.connections.Connections[source]

Bases: object

Class responsible for holding connections to different clusters. Used as a singleton in this module.

add_connection(alias, conn)[source]

Add a connection object; it will be passed through as-is.

configure(**kwargs)[source]

Configure multiple connections at once, useful for passing in config dictionaries obtained from other sources, like Django’s settings or a configuration management tool.

Example:

connections.configure(
    default={'hosts': 'localhost'},
    dev={'hosts': ['esdev1.example.com:9200'], 'sniff_on_start': True},
)

Connections will only be constructed lazily when requested through get_connection.
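The lazy-construction behavior can be sketched with a minimal stand-in registry (illustrative only — the real class constructs and caches elasticsearch.Elasticsearch instances):

```python
class LazyRegistry:
    """Minimal stand-in for Connections, illustrating lazy construction."""

    def __init__(self):
        self._conns = {}
        self._kwargs = {}

    def configure(self, **kwargs):
        # Only store the configurations; nothing is constructed here.
        self._kwargs = kwargs
        self._conns = {}

    def get_connection(self, alias="default"):
        if alias not in self._conns:
            # Construct on first access from the stored configuration
            # (a real registry would build Elasticsearch(**self._kwargs[alias])).
            self._conns[alias] = dict(self._kwargs[alias])
        return self._conns[alias]

registry = LazyRegistry()
registry.configure(default={"hosts": "localhost"})
conn = registry.get_connection()  # constructed now, on first request
```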

create_connection(alias='default', **kwargs)[source]

Construct an instance of elasticsearch.Elasticsearch and register it under given alias.

get_connection(alias='default')[source]

Retrieve a connection, constructing it if necessary (if only its configuration was passed to us). If a non-string alias has been passed through we assume it’s already a client instance and will just return it as-is.

Raises KeyError if no client (or its definition) is registered under the alias.

remove_connection(alias)[source]

Remove connection from the registry. Raises KeyError if connection wasn’t found.

pandagg.discovery module

class pandagg.discovery.Index(name, settings, mappings, aliases, client=None)[source]

Bases: object

search(nested_autocorrect=True, repr_auto_execute=True)[source]
class pandagg.discovery.Indices(**kwargs)[source]

Bases: lighttree.interactive.Obj

pandagg.discovery.discover(using, index='*')[source]
Parameters:
  • using – Elasticsearch client
  • index – Comma-separated list or wildcard expression of index names used to limit the request.

pandagg.exceptions module

exception pandagg.exceptions.AbsentMappingFieldError[source]

Bases: pandagg.exceptions.MappingError

Field is not present in mappings.

exception pandagg.exceptions.InvalidAggregation[source]

Bases: Exception

Wrong aggregation definition

exception pandagg.exceptions.InvalidOperationMappingFieldError[source]

Bases: pandagg.exceptions.MappingError

Invalid aggregation type on this mappings field.

exception pandagg.exceptions.MappingError[source]

Bases: Exception

Basic Mappings Error

exception pandagg.exceptions.VersionIncompatibilityError[source]

Bases: Exception

Pandagg is not compatible with this Elasticsearch version.

pandagg.mappings module

class pandagg.mappings.Mappings(properties=None, dynamic=False, **kwargs)[source]

Bases: pandagg.tree._tree.Tree

list_nesteds_at_field(field_path)[source]

List nested paths that apply at a given path.

>>> mappings = Mappings(dynamic=False, properties={
>>>     'id': {'type': 'keyword'},
>>>     'comments': {'type': 'nested', 'properties': {
>>>         'comment_text': {'type': 'text'},
>>>         'date': {'type': 'date'}
>>>     }}
>>> })
>>> mappings.list_nesteds_at_field('id')
[]
>>> mappings.list_nesteds_at_field('comments')
['comments']
>>> mappings.list_nesteds_at_field('comments.comment_text')
['comments']
mapping_type_of_field(field_path)[source]

Return field type of provided field path.

>>> mappings = Mappings(dynamic=False, properties={
>>>     'id': {'type': 'keyword'},
>>>     'comments': {'type': 'nested', 'properties': {
>>>         'comment_text': {'type': 'text'},
>>>         'date': {'type': 'date'}
>>>     }}
>>> })
>>> mappings.mapping_type_of_field('id')
'keyword'
>>> mappings.mapping_type_of_field('comments')
'nested'
>>> mappings.mapping_type_of_field('comments.comment_text')
'text'
nested_at_field(field_path)[source]

Return the nested path applied on a given path, or None if none applies.

>>> mappings = Mappings(dynamic=False, properties={
>>>     'id': {'type': 'keyword'},
>>>     'comments': {'type': 'nested', 'properties': {
>>>         'comment_text': {'type': 'text'},
>>>         'date': {'type': 'date'}
>>>     }}
>>> })
>>> mappings.nested_at_field('id')
None
>>> mappings.nested_at_field('comments')
'comments'
>>> mappings.nested_at_field('comments.comment_text')
'comments'
node_class

alias of pandagg.node.mappings.abstract.Field

to_dict(from_=None, depth=None)[source]

Serialize Mappings as dict.

Parameters:
  • from_ – identifier of a field; if provided, limits serialization to this field and its children (used for recursion, shouldn’t be useful)
  • depth – integer; if provided, limits the serialization to a given depth
Returns:

dict

validate_agg_clause(agg_clause, exc=True)[source]

Ensure that if aggregation clause relates to a field (field or path) this field exists in mappings, and that required aggregation type is allowed on this kind of field.

Parameters:
  • agg_clause – AggClause you want to validate on these mappings
  • exc – boolean, if set to True raise exception if invalid
Return type:

boolean
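The validation logic can be sketched in plain Python (the field types and per-agg-type whitelists below are illustrative, not pandagg's actual tables):

```python
# Hypothetical mappings field types and per-agg-type whitelists:
FIELD_TYPES = {"genres": "keyword", "rank": "integer"}
AGG_WHITELIST = {"terms": {"keyword", "integer"}, "avg": {"integer"}}

def validate_agg_clause(agg_type, field, exc=True):
    # Valid if the field exists in mappings and the agg type is allowed
    # on that field's mapping type.
    valid = FIELD_TYPES.get(field) in AGG_WHITELIST.get(agg_type, set())
    if not valid and exc:
        raise ValueError("invalid %r aggregation on field %r" % (agg_type, field))
    return valid

ok = validate_agg_clause("terms", "genres")
bad = validate_agg_clause("avg", "genres", exc=False)
```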

validate_document(d)[source]
class pandagg.mappings.IMappings(mappings, client=None, index=None, depth=1, root_path=None, initial_tree=None)[source]

Bases: pandagg.utils.DSLMixin, lighttree.interactive.TreeBasedObj

Interactive wrapper upon mappings tree, allowing field navigation and quick access to single clause aggregations computation.

class pandagg.mappings.Text(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

KEY = 'text'
class pandagg.mappings.Keyword(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

KEY = 'keyword'
class pandagg.mappings.ConstantKeyword(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

KEY = 'constant_keyword'
class pandagg.mappings.WildCard(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

KEY = 'wildcard'
class pandagg.mappings.Long(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

KEY = 'long'
class pandagg.mappings.Integer(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

KEY = 'integer'
class pandagg.mappings.Short(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

KEY = 'short'
class pandagg.mappings.Byte(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

KEY = 'byte'
class pandagg.mappings.Double(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

KEY = 'double'
class pandagg.mappings.HalfFloat(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

KEY = 'half_float'
class pandagg.mappings.ScaledFloat(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

KEY = 'scaled_float'
class pandagg.mappings.Date(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

KEY = 'date'
class pandagg.mappings.DateNanos(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

KEY = 'date_nanos'
class pandagg.mappings.Boolean(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

KEY = 'boolean'
class pandagg.mappings.Binary(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

KEY = 'binary'
class pandagg.mappings.IntegerRange(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

KEY = 'integer_range'
class pandagg.mappings.Float(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

KEY = 'float'
class pandagg.mappings.FloatRange(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

KEY = 'float_range'
class pandagg.mappings.LongRange(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

KEY = 'long_range'
class pandagg.mappings.DoubleRange(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

KEY = 'double_range'
class pandagg.mappings.DateRange(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

KEY = 'date_range'
class pandagg.mappings.Object(**body)[source]

Bases: pandagg.node.mappings.abstract.ComplexField

KEY = 'object'
class pandagg.mappings.Nested(**body)[source]

Bases: pandagg.node.mappings.abstract.ComplexField

KEY = 'nested'
class pandagg.mappings.GeoPoint(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

For lat/lon points

KEY = 'geo_point'
class pandagg.mappings.GeoShape(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

For complex shapes like polygons

KEY = 'geo_shape'
class pandagg.mappings.IP(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

For IPv4 and IPv6 addresses

KEY = 'ip'
class pandagg.mappings.Completion(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

To provide auto-complete suggestions

KEY = 'completion'
class pandagg.mappings.TokenCount(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

To count the number of tokens in a string

KEY = 'token_count'
class pandagg.mappings.MapperMurMur3(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

To compute hashes of values at index-time and store them in the index

KEY = 'murmur3'
class pandagg.mappings.MapperAnnotatedText(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

To index text containing special markup (typically used for identifying named entities)

KEY = 'annotated-text'
class pandagg.mappings.Percolator(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

Accepts queries from the query-dsl

KEY = 'percolator'
class pandagg.mappings.Join(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

Defines parent/child relation for documents within the same index

KEY = 'join'
class pandagg.mappings.RankFeature(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

Record numeric feature to boost hits at query time.

KEY = 'rank_feature'
class pandagg.mappings.RankFeatures(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

Record numeric features to boost hits at query time.

KEY = 'rank_features'
class pandagg.mappings.DenseVector(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

Record dense vectors of float values.

KEY = 'dense_vector'
class pandagg.mappings.SparseVector(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

Record sparse vectors of float values.

KEY = 'sparse_vector'
class pandagg.mappings.SearchAsYouType(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

A text-like field optimized for queries to implement as-you-type completion

KEY = 'search_as_you_type'
class pandagg.mappings.Alias(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

Defines an alias to an existing field.

KEY = 'alias'
class pandagg.mappings.Flattened(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

Allows an entire JSON object to be indexed as a single field.

KEY = 'flattened'
class pandagg.mappings.Shape(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

For arbitrary cartesian geometries.

KEY = 'shape'
class pandagg.mappings.Histogram(**body)[source]

Bases: pandagg.node.mappings.abstract.RegularField

For pre-aggregated numerical values for percentiles aggregations.

KEY = 'histogram'
class pandagg.mappings.Index(multiple=None, nullable=True, **body)[source]

Bases: pandagg.node.mappings.abstract.Field

The index to which the document belongs.

KEY = '_index'
class pandagg.mappings.Type(multiple=None, nullable=True, **body)[source]

Bases: pandagg.node.mappings.abstract.Field

The document’s mapping type.

KEY = '_type'
class pandagg.mappings.Id(multiple=None, nullable=True, **body)[source]

Bases: pandagg.node.mappings.abstract.Field

The document’s ID.

KEY = '_id'
class pandagg.mappings.FieldNames(multiple=None, nullable=True, **body)[source]

Bases: pandagg.node.mappings.abstract.Field

All fields in the document which contain non-null values.

KEY = '_field_names'
class pandagg.mappings.Source(multiple=None, nullable=True, **body)[source]

Bases: pandagg.node.mappings.abstract.Field

The original JSON representing the body of the document.

KEY = '_source'
class pandagg.mappings.Size(multiple=None, nullable=True, **body)[source]

Bases: pandagg.node.mappings.abstract.Field

The size of the _source field in bytes, provided by the mapper-size plugin.

KEY = '_size'
class pandagg.mappings.Ignored(multiple=None, nullable=True, **body)[source]

Bases: pandagg.node.mappings.abstract.Field

All fields in the document that have been ignored at index time because of ignore_malformed.

KEY = '_ignored'
class pandagg.mappings.Routing(multiple=None, nullable=True, **body)[source]

Bases: pandagg.node.mappings.abstract.Field

A custom routing value which routes a document to a particular shard.

KEY = '_routing'
class pandagg.mappings.Meta(multiple=None, nullable=True, **body)[source]

Bases: pandagg.node.mappings.abstract.Field

Application specific metadata.

KEY = '_meta'

pandagg.query module

class pandagg.query.Query(q=None, mappings=None, nested_autocorrect=False)[source]

Bases: pandagg.tree._tree.Tree

applied_nested_path_at_node(nid)[source]

Return nested path applied at a clause.

Parameters:nid – clause identifier
Returns:None if no nested is applied, else applied path (str)
bool(must=None, should=None, must_not=None, filter=None, insert_below=None, on=None, mode='add', **body)[source]
>>> Query().bool(must={"term": {"some_field": "yolo"}})
boosting(positive=None, negative=None, insert_below=None, on=None, mode='add', **body)[source]
constant_score(filter=None, boost=None, insert_below=None, on=None, mode='add', **body)[source]
dis_max(queries, insert_below=None, on=None, mode='add', **body)[source]
filter(type_or_query, insert_below=None, on=None, mode='add', bool_body=None, **body)[source]
function_score(query, insert_below=None, on=None, mode='add', **body)[source]
has_child(query, insert_below=None, on=None, mode='add', **body)[source]
has_parent(query, insert_below=None, on=None, mode='add', **body)[source]
must(type_or_query, insert_below=None, on=None, mode='add', bool_body=None, **body)[source]

Create copy of initial Query and insert provided clause under “bool” query “must”.

>>> Query().must('term', some_field=1)
>>> Query().must({'term': {'some_field': 1}})
>>> from pandagg.query import Term
>>> Query().must(Term(some_field=1))
Keyword Arguments:
 
  • insert_below (str) – named query clause under which the inserted clauses should be placed.
  • compound_param (str) – param under which inserted clause will be placed in compound query
  • on (str) – named compound query clause on which the inserted compound clause should be merged.
  • mode (str one of ‘add’, ‘replace’, ‘replace_all’) – merging strategy when inserting clauses on an existing compound clause.
    • ‘add’ (default) : adds new clauses while keeping initial ones
    • ‘replace’ : for each parameter (for instance, in the ‘bool’ case: ‘filter’, ‘must’, ‘must_not’, ‘should’), existing clauses under this parameter are replaced by new ones, but only for parameters declared in the inserted compound query
    • ‘replace_all’ : the existing compound clause is completely replaced by the new one
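The three merge modes can be sketched on a plain ‘bool’ body (the helper below is hypothetical and works on dicts; pandagg applies this strategy on its query tree):

```python
def merge_bool(existing, inserted, mode="add"):
    if mode == "replace_all":
        # The existing compound clause is discarded entirely.
        return {k: list(v) for k, v in inserted.items()}
    merged = {k: list(v) for k, v in existing.items()}
    for param, clauses in inserted.items():
        if mode == "add":
            # New clauses are appended, initial ones are kept.
            merged.setdefault(param, []).extend(clauses)
        elif mode == "replace":
            # Only parameters declared in the inserted clause are replaced.
            merged[param] = list(clauses)
    return merged

existing = {"must": [{"term": {"a": 1}}], "should": [{"term": {"c": 3}}]}
inserted = {"must": [{"term": {"b": 2}}]}
added = merge_bool(existing, inserted, "add")
replaced = merge_bool(existing, inserted, "replace")
```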
must_not(type_or_query, insert_below=None, on=None, mode='add', bool_body=None, **body)[source]
nested(path, query=None, insert_below=None, on=None, mode='add', **body)[source]
node_class

alias of pandagg.node.query.abstract.QueryClause

pinned_query(organic, insert_below=None, on=None, mode='add', **body)[source]
query(type_or_query, insert_below=None, on=None, mode='add', compound_param=None, **body)[source]

Insert provided clause in copy of initial Query.

>>> from pandagg.query import Query
>>> Query().query('term', some_field=23)
{'term': {'some_field': 23}}
>>> from pandagg.query import Term
>>> Query() \
>>>     .query({'term': {'some_field': 23}}) \
>>>     .query(Term(other_field=24))
{'bool': {'must': [{'term': {'some_field': 23}}, {'term': {'other_field': 24}}]}}
Keyword Arguments:
 
  • insert_below (str) – named query clause under which the inserted clauses should be placed.
  • compound_param (str) – param under which inserted clause will be placed in compound query
  • on (str) – named compound query clause on which the inserted compound clause should be merged.
  • mode (str one of ‘add’, ‘replace’, ‘replace_all’) – merging strategy when inserting clauses on an existing compound clause.
    • ‘add’ (default) : adds new clauses while keeping initial ones
    • ‘replace’ : for each parameter (for instance, in the ‘bool’ case: ‘filter’, ‘must’, ‘must_not’, ‘should’), existing clauses under this parameter are replaced by new ones, but only for parameters declared in the inserted compound query
    • ‘replace_all’ : the existing compound clause is completely replaced by the new one
script_score(query, insert_below=None, on=None, mode='add', **body)[source]
should(type_or_query, insert_below=None, on=None, mode='add', bool_body=None, **body)[source]
show(*args, line_max_length=80, **kwargs)[source]

Return compact representation of Query.

>>> Query() \
>>>     .must({"exists": {"field": "some_field"}}) \
>>>     .must({"term": {"other_field": {"value": 5}}}) \
>>>     .show()
<Query>
bool
└── must
    ├── exists                                                  field=some_field
    └── term                                          field=other_field, value=5

All *args and **kwargs are propagated to the lighttree.Tree.show method.

Returns:

str

to_dict(from_=None)[source]

Serialize Query as dict.

class pandagg.query.Exists(field, _name=None)[source]

Bases: pandagg.node.query.abstract.LeafQueryClause

KEY = 'exists'
line_repr(depth, **kwargs)[source]

Control how the node is displayed in the tree representation, e.g.:

_
├── one                                                  end
│   └── two                                            myEnd
└── three

class pandagg.query.Fuzzy(field=None, _name=None, _expand__to_dot=True, **params)[source]

Bases: pandagg.node.query.abstract.KeyFieldQueryClause

KEY = 'fuzzy'
class pandagg.query.Ids(values, _name=None)[source]

Bases: pandagg.node.query.abstract.LeafQueryClause

KEY = 'ids'
line_repr(depth, **kwargs)[source]

Control how the node is displayed in the tree representation, e.g.:

_
├── one                                                  end
│   └── two                                            myEnd
└── three

to_dict(with_name=True)[source]
class pandagg.query.Prefix(field=None, _name=None, _expand__to_dot=True, **params)[source]

Bases: pandagg.node.query.abstract.KeyFieldQueryClause

KEY = 'prefix'
class pandagg.query.Range(field=None, _name=None, _expand__to_dot=True, **params)[source]

Bases: pandagg.node.query.abstract.KeyFieldQueryClause

KEY = 'range'
class pandagg.query.Regexp(field=None, _name=None, _expand__to_dot=True, **params)[source]

Bases: pandagg.node.query.abstract.KeyFieldQueryClause

KEY = 'regexp'
class pandagg.query.Term(field=None, _name=None, _expand__to_dot=True, **params)[source]

Bases: pandagg.node.query.abstract.KeyFieldQueryClause

KEY = 'term'
class pandagg.query.Terms(**body)[source]

Bases: pandagg.node.query.abstract.AbstractSingleFieldQueryClause

KEY = 'terms'
class pandagg.query.TermsSet(field=None, _name=None, _expand__to_dot=True, **params)[source]

Bases: pandagg.node.query.abstract.KeyFieldQueryClause

KEY = 'terms_set'
class pandagg.query.Type(field=None, _name=None, _expand__to_dot=True, **params)[source]

Bases: pandagg.node.query.abstract.KeyFieldQueryClause

KEY = 'type'
class pandagg.query.Wildcard(field=None, _name=None, _expand__to_dot=True, **params)[source]

Bases: pandagg.node.query.abstract.KeyFieldQueryClause

KEY = 'wildcard'
class pandagg.query.Intervals(field=None, _name=None, _expand__to_dot=True, **params)[source]

Bases: pandagg.node.query.abstract.KeyFieldQueryClause

KEY = 'intervals'
class pandagg.query.Match(field=None, _name=None, _expand__to_dot=True, **params)[source]

Bases: pandagg.node.query.abstract.KeyFieldQueryClause

KEY = 'match'
class pandagg.query.MatchBoolPrefix(field=None, _name=None, _expand__to_dot=True, **params)[source]

Bases: pandagg.node.query.abstract.KeyFieldQueryClause

KEY = 'match_bool_prefix'
class pandagg.query.MatchPhrase(field=None, _name=None, _expand__to_dot=True, **params)[source]

Bases: pandagg.node.query.abstract.KeyFieldQueryClause

KEY = 'match_phrase'
class pandagg.query.MatchPhrasePrefix(field=None, _name=None, _expand__to_dot=True, **params)[source]

Bases: pandagg.node.query.abstract.KeyFieldQueryClause

KEY = 'match_phrase_prefix'
class pandagg.query.MultiMatch(fields, _name=None, **body)[source]

Bases: pandagg.node.query.abstract.MultiFieldsQueryClause

KEY = 'multi_match'
class pandagg.query.Common(field=None, _name=None, _expand__to_dot=True, **params)[source]

Bases: pandagg.node.query.abstract.KeyFieldQueryClause

KEY = 'common'
class pandagg.query.QueryString(_name=None, **body)[source]

Bases: pandagg.node.query.abstract.LeafQueryClause

KEY = 'query_string'
class pandagg.query.SimpleQueryString(_name=None, **body)[source]

Bases: pandagg.node.query.abstract.LeafQueryClause

KEY = 'simple_string'
class pandagg.query.Bool(_name=None, **body)[source]

Bases: pandagg.node.query.compound.CompoundClause

>>> Bool(must=[], should=[], filter=[], must_not=[], boost=1.2)
KEY = 'bool'
class pandagg.query.Boosting(_name=None, **body)[source]

Bases: pandagg.node.query.compound.CompoundClause

KEY = 'boosting'
class pandagg.query.ConstantScore(_name=None, **body)[source]

Bases: pandagg.node.query.compound.CompoundClause

KEY = 'constant_score'
class pandagg.query.FunctionScore(_name=None, **body)[source]

Bases: pandagg.node.query.compound.CompoundClause

KEY = 'function_score'
class pandagg.query.DisMax(_name=None, **body)[source]

Bases: pandagg.node.query.compound.CompoundClause

KEY = 'dis_max'
class pandagg.query.Nested(path, **kwargs)[source]

Bases: pandagg.node.query.compound.CompoundClause

KEY = 'nested'
class pandagg.query.HasParent(_name=None, **body)[source]

Bases: pandagg.node.query.compound.CompoundClause

KEY = 'has_parent'
class pandagg.query.HasChild(_name=None, **body)[source]

Bases: pandagg.node.query.compound.CompoundClause

KEY = 'has_child'
class pandagg.query.ParentId(_name=None, **body)[source]

Bases: pandagg.node.query.abstract.LeafQueryClause

KEY = 'parent_id'
class pandagg.query.Shape(_name=None, **body)[source]

Bases: pandagg.node.query.abstract.LeafQueryClause

KEY = 'shape'
class pandagg.query.GeoShape(field=None, _name=None, _expand__to_dot=True, **params)[source]

Bases: pandagg.node.query.abstract.KeyFieldQueryClause

KEY = 'geo_shape'
class pandagg.query.GeoPolygone(field=None, _name=None, _expand__to_dot=True, **params)[source]

Bases: pandagg.node.query.abstract.KeyFieldQueryClause

KEY = 'geo_polygon'
class pandagg.query.GeoDistance(distance, **body)[source]

Bases: pandagg.node.query.abstract.AbstractSingleFieldQueryClause

KEY = 'geo_distance'
line_repr(depth, **kwargs)[source]

Control how the node is displayed in the tree representation, e.g.:

_
├── one                                                  end
│   └── two                                            myEnd
└── three

class pandagg.query.GeoBoundingBox(field=None, _name=None, _expand__to_dot=True, **params)[source]

Bases: pandagg.node.query.abstract.KeyFieldQueryClause

KEY = 'geo_bounding_box'
class pandagg.query.DistanceFeature(field, _name=None, **body)[source]

Bases: pandagg.node.query.abstract.FlatFieldQueryClause

KEY = 'distance_feature'
class pandagg.query.MoreLikeThis(fields, _name=None, **body)[source]

Bases: pandagg.node.query.abstract.MultiFieldsQueryClause

KEY = 'more_like_this'
class pandagg.query.Percolate(field, _name=None, **body)[source]

Bases: pandagg.node.query.abstract.FlatFieldQueryClause

KEY = 'percolate'
class pandagg.query.RankFeature(field, _name=None, **body)[source]

Bases: pandagg.node.query.abstract.FlatFieldQueryClause

KEY = 'rank_feature'
class pandagg.query.Script(_name=None, **body)[source]

Bases: pandagg.node.query.abstract.LeafQueryClause

KEY = 'script'
class pandagg.query.Wrapper(_name=None, **body)[source]

Bases: pandagg.node.query.abstract.LeafQueryClause

KEY = 'wrapper'
class pandagg.query.ScriptScore(_name=None, **body)[source]

Bases: pandagg.node.query.compound.CompoundClause

KEY = 'script_score'
class pandagg.query.PinnedQuery(_name=None, **body)[source]

Bases: pandagg.node.query.compound.CompoundClause

KEY = 'pinned'

pandagg.response module

class pandagg.response.Aggregations(data, search)[source]

Bases: object

get(key)[source]
keys()[source]
serialize(output='tabular', **kwargs)[source]
Parameters:
  • output – output format, one of “raw”, “tree”, “interactive_tree”, “normalized”, “tabular”, “dataframe”
  • kwargs – tabular serialization kwargs
Returns:

to_dataframe(grouped_by=None, normalize_children=True, with_single_bucket_groups=False)[source]
to_interactive_tree()[source]
to_normalized()[source]
to_tabular(index_orient=True, grouped_by=None, expand_columns=True, expand_sep='|', normalize=True, with_single_bucket_groups=False)[source]

Build a tabular view of the ES response: rows are generated by grouping levels down to (and including) the ‘grouped_by’ aggregation node, and the children aggregations of that grouping level provide the values (columns) for each generated group.

Suppose an aggregation of this shape (A & B bucket aggregations):

A──> B──> C1
     ├──> C2
     └──> C3

With grouped_by=’B’, the Elasticsearch response (a tree structure) is broken down into a tabular structure of this shape:

                      C1     C2    C3
A           B
wood        blue      10     4     0
            red       7      5     2
steel       blue      1      9     0
            red       23     4     2
Parameters:
  • index_orient – if True, level-key samples are returned as tuples, else in a dictionary
  • grouped_by – name of the aggregation node used as last grouping level
  • normalize – if True, normalize columns buckets
Returns:

index_names, values
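The index_orient parameter can be illustrated on plain data (fabricated values matching the table above; no ES call involved):

```python
# index_orient=True: rows keyed by tuples of grouping-level values
index_oriented = {
    ("wood", "blue"): {"C1": 10, "C2": 4, "C3": 0},
    ("wood", "red"): {"C1": 7, "C2": 5, "C3": 2},
    ("steel", "blue"): {"C1": 1, "C2": 9, "C3": 0},
    ("steel", "red"): {"C1": 23, "C2": 4, "C3": 2},
}

# index_orient=False: each row carries its grouping keys explicitly
index_names = ("A", "B")
rows = [
    {**dict(zip(index_names, key)), **values}
    for key, values in index_oriented.items()
]
```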

to_tree()[source]
class pandagg.response.Hit(data)[source]

Bases: object

class pandagg.response.Hits(hits)[source]

Bases: object

to_dataframe(expand_source=True, source_only=True)[source]

Return hits as a pandas dataframe. Requires the pandas dependency.

Parameters:
  • expand_source – if True, _source sub-fields are expanded as columns
  • source_only – if True, doesn’t include hit metadata (except id, which is used as the dataframe index)
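For illustration, a sketch of the expansion on fabricated hits (plain Python; the real method delegates to pandas):

```python
# Fabricated hits illustrating expand_source / source_only behavior:
hits = [
    {"_id": "1", "_score": 1.2, "_source": {"title": "It", "rank": 8}},
    {"_id": "2", "_score": 1.0, "_source": {"title": "Up", "rank": 7}},
]

# expand_source=True, source_only=True: one column per _source sub-field,
# hit metadata dropped, _id kept as the index.
records = {hit["_id"]: hit["_source"] for hit in hits}
```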

class pandagg.response.Response(data, search)[source]

Bases: object

success

pandagg.search module

class pandagg.search.MultiSearch(**kwargs)[source]

Bases: pandagg.search.Request

Combine multiple Search objects into a single request.

add(search)[source]

Adds a new Search object to the request:

ms = MultiSearch(index='my-index')
ms = ms.add(Search(doc_type=Category).filter('term', category='python'))
ms = ms.add(Search(doc_type=Blog))
execute()[source]

Execute the multi search request and return a list of search results.

to_dict()[source]
class pandagg.search.Request(using, index=None)[source]

Bases: object

index(*index)[source]

Set the index for the search. If called empty it will remove all information.

Example:

s = Search()
s = s.index('twitter-2015.01.01', 'twitter-2015.01.02')
s = s.index(['twitter-2015.01.01', 'twitter-2015.01.02'])
params(**kwargs)[source]

Specify query params to be used when executing the search. All the keyword arguments will override the current values. See https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.Elasticsearch.search for all available parameters.

Example:

s = Search()
s = s.params(routing='user-1', preference='local')
using(client)[source]

Associate the search request with an elasticsearch client. A fresh copy will be returned, with the current instance remaining unchanged.

Parameters:client – an instance of elasticsearch.Elasticsearch to use or an alias to look up in elasticsearch_dsl.connections
class pandagg.search.Search(using=None, index=None, mappings=None, nested_autocorrect=False, repr_auto_execute=False)[source]

Bases: pandagg.utils.DSLMixin, pandagg.search.Request

agg(name, type_or_agg=None, insert_below=None, at_root=False, **body)[source]

Insert provided agg clause in copy of initial Aggs.

Accepts the following syntaxes for the type_or_agg argument:

string, with body provided in kwargs:

>>> Aggs().agg(name='some_agg', type_or_agg='terms', field='some_field')

python dict format:

>>> Aggs().agg(name='some_agg', type_or_agg={'terms': {'field': 'some_field'}})

AggClause instance:

>>> from pandagg.aggs import Terms
>>> Aggs().agg(name='some_agg', type_or_agg=Terms(field='some_field'))

Parameters:
  • name – inserted agg clause name
  • type_or_agg – either agg type (str), or agg clause of dict format, or AggClause instance
  • insert_below – name of aggregation below which provided aggs should be inserted
  • at_root – if True, aggregation is inserted at root
  • body – aggregation clause body when providing string type_of_agg (remaining kwargs)
Returns:

copy of initial Aggs with provided agg inserted

aggs(aggs, insert_below=None, at_root=False)[source]

Insert provided aggs in copy of initial Aggs.

Accepts the following syntaxes for provided aggs:

python dict format:

>>> Aggs().aggs({'some_agg': {'terms': {'field': 'some_field'}}, 'other_agg': {'avg': {'field': 'age'}}})

Aggs instance:

>>> Aggs().aggs(Aggs({'some_agg': {'terms': {'field': 'some_field'}}, 'other_agg': {'avg': {'field': 'age'}}}))

dict with AggClause values:

>>> from pandagg.aggs import Terms, Avg
>>> Aggs().aggs({'some_agg': Terms(field='some_field'), 'other_agg': Avg(field='age')})

Parameters:
  • aggs – aggregations to insert into existing aggregation
  • insert_below – name of aggregation below which provided aggs should be inserted
  • at_root – if True, aggregation is inserted at root
Returns:

copy of initial Aggs with provided aggs inserted

bool(must=None, should=None, must_not=None, filter=None, insert_below=None, on=None, mode='add', **body)[source]
>>> Query().bool(must={"term": {"some_field": "yolo"}})
count()[source]

Return the number of hits matching the query and filters. Note that only the actual number is returned.

delete()[source]

Execute the query by delegating to delete_by_query().
exclude(type_or_query, insert_below=None, on=None, mode='add', **body)[source]

Insert provided clause as a must_not clause, wrapped in a filter context (so it does not contribute to scoring).
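For illustration, the kind of query shape exclude() produces (assumed shape, shown with plain dicts — a must_not inside a filtered bool):

```python
# Assumed resulting shape when excluding a clause (illustrative only):
clause = {"term": {"category": {"value": "spam"}}}
excluded = {"bool": {"filter": [{"bool": {"must_not": [clause]}}]}}
```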

execute()[source]

Execute the search and return an instance of Response wrapping all the data.

filter(type_or_query, insert_below=None, on=None, mode='add', bool_body=None, **body)[source]
classmethod from_dict(d)[source]

Construct a new Search instance from a raw dict containing the search body. Useful when migrating from raw dictionaries.

Example:

s = Search.from_dict({
    "query": {
        "bool": {
            "must": [...]
        }
    },
    "aggs": {...}
})
s = s.filter('term', published=True)
groupby(name, type_or_agg=None, insert_below=None, at_root=None, **body)[source]

Insert provided aggregation clause in a copy of the initial Aggs; children previously at the insertion point are moved below the inserted clause.

Given the initial aggregation:

A──> B
└──> C

If insert_below = ‘A’:

A──> new──> B
       └──> C
>>> Aggs().groupby('per_user_id', 'terms', field='user_id')
{"per_user_id":{"terms":{"field":"user_id"}}}
>>> Aggs().groupby('per_user_id', {'terms': {"field": "user_id"}})
{"per_user_id":{"terms":{"field":"user_id"}}}
>>> from pandagg.aggs import Terms
>>> Aggs().groupby('per_user_id', Terms(field="user_id"))
{"per_user_id":{"terms":{"field":"user_id"}}}
Return type: pandagg.aggs.Aggs
highlight(*fields, **kwargs)[source]

Request highlighting of some fields. All keyword arguments passed in will be used as parameters for all the fields in the fields parameter. Example:

Search().highlight('title', 'body', fragment_size=50)

will produce the equivalent of:

{
    "highlight": {
        "fields": {
            "body": {"fragment_size": 50},
            "title": {"fragment_size": 50}
        }
    }
}

If you want to have different options for different fields you can call highlight twice:

Search().highlight('title', fragment_size=50).highlight('body', fragment_size=100)

which will produce:

{
    "highlight": {
        "fields": {
            "body": {"fragment_size": 100},
            "title": {"fragment_size": 50}
        }
    }
}
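The per-field merging behaviour described above can be modelled in plain Python. This is an illustrative sketch of the convention, not the library's internals; merge_highlight is a hypothetical helper:

```python
# Illustrative: each highlight() call attaches the same options to every
# field it names; later calls add or override per-field entries.
def merge_highlight(current_fields, fields, **options):
    updated = dict(current_fields)
    for field in fields:
        updated[field] = dict(options)
    return updated

# mirrors: Search().highlight('title', fragment_size=50)
#                  .highlight('body', fragment_size=100)
fields = merge_highlight({}, ["title"], fragment_size=50)
fields = merge_highlight(fields, ["body"], fragment_size=100)
# fields == {"title": {"fragment_size": 50}, "body": {"fragment_size": 100}}
```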
highlight_options(**kwargs)[source]

Update the global highlighting options used for this request. For example:

s = Search()
s = s.highlight_options(order='score')
must(type_or_query, insert_below=None, on=None, mode='add', bool_body=None, **body)[source]

Create a copy of the initial Query and insert the provided clause under the "bool" query's "must" parameter.

>>> Query().must('term', some_field=1)
>>> Query().must({'term': {'some_field': 1}})
>>> from pandagg.query import Term
>>> Query().must(Term(some_field=1))
Keyword Arguments:
 
  • insert_below (str) – named query clause under which the inserted clauses should be placed.
  • compound_param (str) – param under which inserted clause will be placed in compound query
  • on (str) – named compound query clause on which the inserted compound clause should be merged.
  • mode (str, one of 'add', 'replace', 'replace_all') – merging strategy when inserting clauses on an existing compound clause.
    • 'add' (default): adds new clauses, keeping initial ones
    • 'replace': for each parameter (for instance, in the 'bool' case: 'filter', 'must', 'must_not', 'should'), replaces existing clauses under this parameter with the new ones, but only for parameters declared in the inserted compound query
    • 'replace_all': the existing compound clause is completely replaced by the new one
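For intuition, the three merging strategies can be sketched in plain Python on raw "bool" bodies. This is an illustrative model only, not pandagg's actual implementation; merge_bool is a hypothetical helper:

```python
# Illustrative sketch of the 'add' / 'replace' / 'replace_all' merge modes
# on raw "bool" bodies (hypothetical helper, not part of pandagg's API).
def merge_bool(existing, inserted, mode="add"):
    if mode == "replace_all":
        # existing compound clause is completely replaced by the new one
        return dict(inserted)
    merged = {param: list(clauses) for param, clauses in existing.items()}
    for param, clauses in inserted.items():
        if mode == "add":
            # keep initial clauses, append the new ones
            merged[param] = merged.get(param, []) + list(clauses)
        elif mode == "replace":
            # replace only parameters declared in the inserted query
            merged[param] = list(clauses)
    return merged

existing = {"must": [{"term": {"a": 1}}], "filter": [{"term": {"b": 2}}]}
inserted = {"must": [{"term": {"c": 3}}]}
# with mode="add", "must" holds both terms and "filter" is untouched
result = merge_bool(existing, inserted, "add")
```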
must_not(type_or_query, insert_below=None, on=None, mode='add', bool_body=None, **body)[source]
post_filter(*args, **kwargs)[source]
query(type_or_query, insert_below=None, on=None, mode='add', **body)[source]

Insert provided clause in copy of initial Query.

>>> from pandagg.query import Query
>>> Query().query('term', some_field=23)
{'term': {'some_field': 23}}
>>> from pandagg.query import Term
>>> Query()\
... .query({'term': {'some_field': 23}})\
... .query(Term(other_field=24))
{'bool': {'must': [{'term': {'some_field': 23}}, {'term': {'other_field': 24}}]}}
Keyword Arguments:
 
  • insert_below (str) – named query clause under which the inserted clauses should be placed.
  • compound_param (str) – param under which inserted clause will be placed in compound query
  • on (str) – named compound query clause on which the inserted compound clause should be merged.
  • mode (str, one of 'add', 'replace', 'replace_all') – merging strategy when inserting clauses on an existing compound clause.
    • 'add' (default): adds new clauses, keeping initial ones
    • 'replace': for each parameter (for instance, in the 'bool' case: 'filter', 'must', 'must_not', 'should'), replaces existing clauses under this parameter with the new ones, but only for parameters declared in the inserted compound query
    • 'replace_all': the existing compound clause is completely replaced by the new one
scan()[source]

Turn the search into a scan search and return a generator that will iterate over all the documents matching the query.

Use the params method to specify any additional arguments you wish to pass to the underlying scan helper from elasticsearch-py - https://elasticsearch-py.readthedocs.io/en/master/helpers.html#elasticsearch.helpers.scan

script_fields(**kwargs)[source]

Define script fields to be calculated on hits. See https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-script-fields.html for more details.

Example:

s = Search()
s = s.script_fields(times_two="doc['field'].value * 2")
s = s.script_fields(
    times_three={
        'script': {
            'inline': "doc['field'].value * params.n",
            'params': {'n': 3}
        }
    }
)
should(type_or_query, insert_below=None, on=None, mode='add', bool_body=None, **body)[source]
size(size)[source]

Equivalent to:

s = Search().params(size=size)
sort(*keys)[source]

Add sorting information to the search request. If called without arguments it will remove all sort requirements. Otherwise it will replace them. Acceptable arguments are:

'some.field'
'-some.other.field'
{'different.field': {'any': 'dict'}}

so for example:

s = Search().sort(
    'category',
    '-title',
    {"price" : {"order" : "asc", "mode" : "avg"}}
)

will sort by category, title (in descending order) and price in ascending order using the avg mode.

The API returns a copy of the Search object and can thus be chained.
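The string shorthands can be sketched as follows (an illustrative model of the convention, not pandagg's code): a leading '-' requests descending order, and a dict is passed through unchanged.

```python
# Illustrative sketch of how sort keys map to Elasticsearch sort clauses
# (hypothetical helper, not part of pandagg's API).
def sort_clause(key):
    # dicts are passed through as-is
    if isinstance(key, dict):
        return key
    # a leading '-' requests descending order
    if key.startswith("-"):
        return {key[1:]: {"order": "desc"}}
    return {key: {"order": "asc"}}

# mirrors: Search().sort('category', '-title', {"price": {"order": "asc", "mode": "avg"}})
body = [
    sort_clause(k)
    for k in ("category", "-title", {"price": {"order": "asc", "mode": "avg"}})
]
```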

source(fields=None, **kwargs)[source]

Selectively control how the _source field is returned.

Parameters: fields – wildcard string, array of wildcards, or dictionary of includes and excludes

If fields is None, the entire document will be returned for each hit. If fields is a dictionary with keys of ‘includes’ and/or ‘excludes’ the fields will be either included or excluded appropriately.

Calling this multiple times with the same named parameter will override the previous values with the new ones.

Example:

s = Search()
s = s.source(includes=['obj1.*'], excludes=["*.description"])

s = Search()
s = s.source(includes=['obj1.*']).source(excludes=["*.description"])
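The resulting _source body can be sketched in plain Python, assuming the merge-by-keyword behaviour described above (an illustrative model with a hypothetical source_body helper, not pandagg's implementation):

```python
# Illustrative sketch: a direct `fields` value wins as-is; keyword form
# merges includes/excludes, later calls overriding the same key.
def source_body(current=None, fields=None, **kwargs):
    if fields is not None:
        # wildcard string or list of wildcards, returned unchanged
        return fields
    body = dict(current or {})
    body.update(kwargs)
    return body

# mirrors: Search().source(includes=['obj1.*']).source(excludes=["*.description"])
body = source_body(None, includes=["obj1.*"])
body = source_body(body, excludes=["*.description"])
# body == {"includes": ["obj1.*"], "excludes": ["*.description"]}
```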
suggest(name, text, **kwargs)[source]

Add a suggestions request to the search.

Parameters:
  • name – name of the suggestion
  • text – text to suggest on

All keyword arguments will be added to the suggestions body. For example:

s = Search()
s = s.suggest('suggestion-1', 'Elasticsearch', term={'field': 'body'})
to_dict(count=False, **kwargs)[source]

Serialize the search into the dictionary that will be sent over as the request’s body.

Parameters: count – a flag to specify if we are interested in a body for count - no aggregations, no pagination bounds etc.

All additional keyword arguments will be included into the dictionary.

update_from_dict(d)[source]

Apply options from a serialized body to the current instance. Modifies the object in-place. Used mostly by from_dict.

pandagg.utils module

class pandagg.utils.DSLMixin[source]

Bases: object

Base class for all DSL objects - queries, filters, aggregations etc. Wraps a dictionary representing the object’s json.

class pandagg.utils.DslMeta(name, bases, attrs)[source]

Bases: type

Base Metaclass for DslBase subclasses that builds a registry of all classes for given DslBase subclass (== all the query types for the Query subclass of DslBase).

It then uses the information from that registry (as well as the name and deserializer attributes from the base class) to construct any subclass based on its name.

pandagg.utils.equal_queries(d1, d2)[source]

Compare whether two queries are equivalent, ignoring the order of elements in nested lists.

pandagg.utils.ordered(obj)[source]
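A minimal sketch of how such an order-insensitive comparison can be implemented (not necessarily pandagg's exact code): recursively sort dict items and list elements before comparing.

```python
# Illustrative sketch: normalize nested dicts/lists into sorted structures,
# then compare the normalized forms.
def ordered(obj):
    if isinstance(obj, dict):
        return sorted((key, ordered(value)) for key, value in obj.items())
    if isinstance(obj, list):
        return sorted(ordered(item) for item in obj)
    return obj

def equal_queries(d1, d2):
    return ordered(d1) == ordered(d2)

# clause order inside the "must" list does not matter
equal_queries(
    {"bool": {"must": [{"term": {"a": 1}}, {"term": {"b": 2}}]}},
    {"bool": {"must": [{"term": {"b": 2}}, {"term": {"a": 1}}]}},
)  # -> True
```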

Module contents

Contributing to Pandagg

We want to make contributing to this project as easy and transparent as possible.

Our Development Process

We use github to host code, to track issues and feature requests, as well as accept pull requests.

Pull Requests

We actively welcome your pull requests.

  1. Fork the repo and create your branch from master.
  2. If you’ve added code that should be tested, add tests.
  3. If you’ve changed APIs, update the documentation.
  4. Ensure the test suite passes.
  5. Make sure your code lints.

Any contributions you make will be under the MIT Software License

In short, when you submit code changes, your submissions are understood to be under the same MIT License that covers the project. Feel free to contact the maintainers if that’s a concern.

Issues

We use GitHub issues to track public bugs. Please ensure your description is clear and has sufficient instructions to be able to reproduce the issue.

Report bugs using Github’s issues

We use GitHub issues to track public bugs. Report a bug by opening a new issue; it’s that easy!

Write bug reports with detail, background, and sample code

Great Bug Reports tend to have:

  • A quick summary and/or background
  • Steps to reproduce
    • Be specific!
    • Give sample code if you can.
  • What you expected would happen
  • What actually happens
  • Notes (possibly including why you think this might be happening, or stuff you tried that didn’t work)

License

By contributing, you agree that your contributions will be licensed under its MIT License.

References

This document was adapted from the open-source contribution guidelines of briandk’s gist

pandagg is a Python package providing a simple interface to manipulate Elasticsearch queries and aggregations. It brings the following features:

  • flexible aggregation and search queries declaration
  • query validation based on provided mapping
  • parsing of aggregation results in handy format: interactive bucket tree, normalized tree or tabular breakdown
  • mapping interactive navigation

Installing

pandagg can be installed with pip:

$ pip install pandagg

Alternatively, you can grab the latest source code from GitHub:

$ git clone git://github.com/alkemics/pandagg.git
$ python setup.py install

Usage

The User Guide is the place to go to learn how to use the library.

An example based on publicly available IMDB data is documented in the repository's examples/imdb directory, along with a jupyter notebook showcasing some of pandagg's functionalities.

The pandagg package documentation provides API-level documentation.

License

pandagg is made available under the Apache 2.0 License. For more details, see LICENSE.txt.

Contributing

We happily welcome contributions, please see Contributing to Pandagg for details.