IMDB dataset

You might know the Internet Movie Database, commonly called IMDB.

Well it’s a simple example to showcase some of Elasticsearch capabilities.

In this case, relational databases (SQL) are a good fit to store with consistence this kind of data. Yet indexing some of this data in a optimized search engine will allow more powerful queries.

Query requirements

In this example, we’ll suppose most usage/queries requirements will be around the concept of movie (rather than usages focused on fetching actors or directors, even though it will still be possible with this data structure).

The index should provide good performances trying to answer these kind question (non-exhaustive):

  • in which movies this actor played?
  • what movies genres were most popular among decades?
  • which actors have played in best-rated movies, or worst-rated movies?
  • which actors movies directors prefer to cast in their movies?
  • which are best ranked movies of last decade in Action or Documentary genres?

Data source

I exported following SQL tables from MariaDB following these instructions.

Relational schema is the following:

_images/imdb_ijs.svgimdb tables

Index mappings

Overview

The base unit (document) will be a movie, having a name, rank (ratings), year of release, a list of actors and a list of directors.

Schematically:

Movie:
 - name
 - year
 - rank
 - [] genres
 - [] directors
 - [] actor roles

Which fields require nesting?

Since genres contain a single keyword field, in no case we need it to be stored as a nested field. On the contrary, actor roles and directors require a nested field if we consider applying multiple simultanous query clauses on their sub-fields (for instance search movie in which actor is a woman AND whose role is nurse). More information on distinction between array and nested fields here.

Text or keyword fields?

Some fields are easy to choose, in no situation gender will require a full text search, thus we’ll store it as a keyword. On the other hand actors and directors names (first and last) will require full-text search, we’ll thus opt for a text field. Yet we might want to aggregate on exact keywords to count number of movies per actor for instance. More inforamtion on distinction between text and keyword fields here

Mappings

<Mappings>
_
├── directors                                                [Nested]
│   ├── director_id                                           Keyword
│   ├── first_name                                            Text
│   │   └── raw                                             ~ Keyword
│   ├── full_name                                             Text
│   │   └── raw                                             ~ Keyword
│   ├── genres                                                Keyword
│   └── last_name                                             Text
│       └── raw                                             ~ Keyword
├── genres                                                    Keyword
├── movie_id                                                  Keyword
├── name                                                      Text
│   └── raw                                                 ~ Keyword
├── nb_directors                                              Integer
├── nb_roles                                                  Integer
├── rank                                                      Float
├── roles                                                    [Nested]
│   ├── actor_id                                              Keyword
│   ├── first_name                                            Text
│   │   └── raw                                             ~ Keyword
│   ├── full_name                                             Text
│   │   └── raw                                             ~ Keyword
│   ├── gender                                                Keyword
│   ├── last_name                                             Text
│   │   └── raw                                             ~ Keyword
│   └── role                                                  Keyword
└── year                                                      Integer

Steps to start playing with your index

You can either directly use the demo index available here with credentials user: pandagg, password: pandagg:

Access it with following client instantiation:

from elasticsearch import Elasticsearch
client = Elasticsearch(
    hosts=['https://beba020ee88d49488d8f30c163472151.eu-west-2.aws.cloud.es.io:9243/'],
    http_auth=('pandagg', 'pandagg')
)

Or follow below steps to install it yourself locally. In this case, you can either generate yourself the files, or download them from here (file md5 b363dee23720052501e24d15361ed605).

Dump tables

Follow instruction on bottom of https://relational.fit.cvut.cz/dataset/IMDb page and dump following tables in a directory:

  • movies.csv
  • movies_genres.csv
  • movies_directors.csv
  • directors.csv
  • directors_genres.csv
  • roles.csv
  • actors.csv

Clone pandagg and setup environment

git clone git@github.com:alkemics/pandagg.git
cd pandagg

virtualenv env
python setup.py develop
pip install pandas simplejson jupyter seaborn

Then copy conf.py.dist file into conf.py and edit variables as suits you, for instance:

# your cluster address
ES_HOST = 'localhost:9200'

# where your table dumps are stored, and where serialized output will be written
DATA_DIR = '/path/to/dumps/'
OUTPUT_FILE_NAME = 'serialized.json'

Serialize movie documents and insert them

# generate serialized movies documents, ready to be inserted in ES
# can take a while
python examples/imdb/serialize.py

# create index with mappings if necessary, bulk insert documents in ES
python examples/imdb/load.py

Explore pandagg notebooks

An example notebook is available to showcase some of pandagg functionalities: here it is.

Code is present in examples/imdb/IMDB exploration.py file.