Database

By default, Prodigy uses SQLite to store annotations in a simple database file in your Prodigy home directory. If you want to use the default database with its default settings, no further configuration is required and you can start using Prodigy straight away. Alternatively, you can choose to use Prodigy with MySQL or PostgreSQL, or write your own custom recipe to plug in any other storage solution.

Prodigy uses the peewee package to manage the database integration. This gives you a lot of flexibility in terms of setup and debugging, and allows you to use more advanced features via the Playhouse extension.

Why do I need a database?

Prodigy uses the database to store all annotations by project and annotation session. Even if you’re only using Prodigy with a model in the loop, you’ll usually want a record of the collected annotations as a backup, or to use them as evaluation data. Prodigy is also very powerful without a model in the loop – for example to bootstrap word lists or collect feedback on the output of two models and generate evaluation data for machine translation or image classification.

Do I need to upload data to the database before I can annotate?

No! The database only exists to store collected annotations. You can read in the raw data straight from a file or a custom source. As your data comes in, Prodigy will assign hashes to the examples, so it’ll always be able to tell whether an example has been annotated already or not.

If you have existing annotations, you can convert them to Prodigy’s format for the given task and use the db-in command to import them to the database.

Setup

SQLite

The default database option that stores all annotations in a flat SQLite database file. Unless otherwise specified, the database is created in the Prodigy home directory.

prodigy.json
{
  "db": "sqlite",
  "db_settings": {
    "sqlite": {
      "name": "prodigy.db",
      "path": "/custom/path"
    }
  }
}

Setting	Description	Default
`name`	Database file name. To only store the database in memory, use `:memory:`.	`prodigy.db`
`path`	Path to directory containing database file.	`PRODIGY_HOME`
…	`sqlite3` connection parameters.	-

MySQL

The settings for a MySQL database can take any MySQLdb or PyMySQL connection parameters, depending on your database driver.

prodigy.json
{
  "db": "mysql",
  "db_settings": {
    "mysql": {
      "host": "localhost",
      "user": "username",
      "passwd": "xxx",
      "db": "prodigy"
    }
  }
}

PostgreSQL

The settings for a PostgreSQL database can take any psycopg2 connection parameters. Instead of providing the settings in your prodigy.json, you can also set environment variables.

prodigy.json
{
  "db": "postgresql",
  "db_settings": {
    "postgresql": {
      "dbname": "prodigy",
      "user": "username",
      "password": "xxx"
    }
  }
}

Permissions and user privileges

When setting up your database integration, you might not want to give your user permission to perform all operations. Prodigy uses the following operations, some of which are optional and only required for certain commands or initial setup:

Operation	Required	Details
`SELECT`		Retrieving datasets and annotations.
`INSERT`		Adding datasets and annotations.
`UPDATE`		Updating datasets and annotations.
`DELETE`		Deleting datasets. Only used for the `prodigy drop` command, so permission can be omitted if you don’t need this feature or prefer to delete records manually.
`CREATE`		Creation of tables `Dataset`, `Example` and `Link`. Not required if you create them manually.

To test your database connection, you can also write a simple Python script that connects to Prodigy’s database and performs the most important operations. For more details, check out this thread on the forum.

test_database.py
from prodigy.components.db import connect

examples = [{"text": "hello world", "_task_hash": 123, "_input_hash": 456}]

db = connect()                                     # uses settings from prodigy.json
db.add_dataset("test_dataset")                     # add dataset
assert "test_dataset" in db                        # check that dataset was added
db.add_examples(examples, ["test_dataset"])        # add examples to dataset
examples = db.get_dataset_examples("test_dataset") # retrieve a dataset's examples
assert len(examples) == 1                          # check that examples were added

Tables

Here are the tables Prodigy creates and how they map to the annotations you collect. You typically shouldn’t have to interact with the database and its tables directly.

Table	Description
`Dataset`	The dataset / session IDs and meta information.
`Example`	The individual annotation examples. Each example is only added once, so if you add the same annotation to multiple datasets, it’ll only have one record here.
`Link`	Example IDs linked to datasets. This is how Prodigy knows which examples belong to which datasets and sessions.

Importing existing data

To use existing annotations collected with other tools in Prodigy, you can import them via the db-in command. You can import data of all file formats supported by Prodigy. However, JSON or JSONL is usually recommended, as it gives you more flexibility. By default, all examples will be set to "answer": "accept". You can specify a different answer using the --answer argument on the command line.

prodigydb-innew_dataset/path/to/data.jsonlImported 1550 annotations to ‘new_dataset’.

Database class

Prodigy’s database model is included with the library and is available via prodigy.components.db.

Example
from prodigy.components.db import connect

db = connect()
all_dataset_names = db.datasets
examples = db.get_dataset_examples("my_dataset")

connect function

Prodigy provides a simple connection helper that takes care of connecting to one of the built-in database options using the database ID and database settings. If no database config is provided, it will be read off the prodigy.json settings, and default to 'sqlite' with the standard settings.

Argument	Type	Description
`db_id`	str	ID of database, i.e. `'sqlite'`, `'postgresql'` or `'mysql'`. Defaults to `'sqlite'`.
`db_settings`	dict	Database-specific settings. If not provided, settings will be read off the `prodigy.json`.
RETURNS	`Database`	The database.

from prodigy.components.db import connect
db = connect("sqlite", {"name": "my_db.db"})

Database.init method

Argument	Type	Description
`db`	`peewee.Database`	peewee database. Will be available as the `db` attribute of the database.
`display_id`	str	ID of database, e.g. `'sqlite'`. Will be available as the `db_id` attribute of the database. For custom databases plugged in by the user, the ID will default to `'custom'`.
`display_name`	str	Human-readable name of the database, e.g. `'SQLite'`. Will be available as the `db_name` attribute. For custom database module, the display name will defaults to the function name, class name, or `repr(db)`.
RETURNS	`Database`	The database.

To plug in custom database, you can initialize the Database class with a custom instance of peewee.Database or its extension package Playhouse, for example:

import prodigy
from prodigy.components.db import Database
from playhouse.postgres_ext import PostgresqlExtDatabase

psql_db = PostgresqlExtDatabase("my_database", user="postgres")
db = Database(psql_db, "postgresql", "Custom PostgreSQL Database")

@prodigy.recipe("recipe-with-custom-db")
def recipe_with_custom_db():
    return {"db": db}

Database.len method

Argument	Type	Description
RETURNS	int	The number of datasets in the database.

Database.contains method

Argument	Type	Description
`name`	str	Dataset ID.
RETURNS	bool	Whether the dataset exists in the database.

Database.datasets property

Argument	Type	Description
RETURNS	list	A list of all dataset IDs in the database (excluding sessions).

Database.sessions property

Argument	Type	Description
RETURNS	list	A list of all session IDs in the database (excluding user-created datasets).

Database.reconnect method

Reconnect to the database. Called on API requests to avoid timeout issues, especially with MySQL. If the database connection is still open, it will be closed before reconnecting.

Database.close method

Close the database connection (if not already closed). Called after API requests to avoid timeout issues, especially with MySQL.

Database.get_dataset method

Argument	Type	Description
`name`	str	The dataset ID.
RETURNS	list	A list of all examples in the dataset.

Database.get_dataset_examples methodNew: 1.11.8

Argument	Type	Description
`name`	str	The dataset ID.
`session`	bool	Whether the Dataset is a session or not.
RETURNS	List[Dict[str, Any]]	A list of all examples in the dataset serialized to the Prodigy Task format.

Database.iter_dataset_examples methodNew: 1.12.0

Argument	Type	Description
`name`	str	The dataset ID.
`session`	bool	Whether the Dataset is a session or not.
RETURNS	Iterator[Dict[str, Any]]	An iterator of examples in the dataset serialized to the Prodigy Task format.

Database.get_meta method

Argument	Type	Description
`name`	str	The dataset ID.
RETURNS	dict	The meta data associated with the dataset.

Database.get_examples method

Argument	Type	Description
`ids`	list	A list of task IDs.
`by`	str	Select by field, defaults to `task_hash`.
RETURNS	list	The examples with the given IDs.

Database.get_input_hashes method

Argument	Type	Description
`*names`	list	Dataset IDs.
RETURNS	set	A set of input hashes for examples of the given datasets.

Database.get_task_hashes method

Argument	Type	Description
`*names`	list	Dataset IDs.
RETURNS	set	A set of task hashes for examples of the given datasets.

Database.get_hashes methodNew: 1.12

Argument	Type	Description
`*names`	list	The dataset names.
`kind`	str	The kind of hash to check, can be “input” or “task” (default)
RETURNS	int	Set of the hashes in the provided dataset names.

Database.count_dataset methodNew: 1.10

Argument	Type	Description
`name`	str	The dataset name.
`session`	bool	Use only session datasets.
RETURNS	int	The number of examples in the dataset.

Database.add_dataset method

Argument	Type	Description
`name`	str	The dataset ID.
`meta`	dict	The dataset meta.
`session`	bool	Whether dataset is a session.
RETURNS	list	The dataset.

Database.add_examples method

Argument	Type	Description
`examples`	list	A list of annotated examples.
`datasets`	list	A list of datasets to add the examples to.

Database.link method

Argument	Type	Description
`name`	str	The dataset ID.
`example_ids`	list	IDs of examples to link to the dataset.

Database.unlink method

Argument	Type	Description
`example_ids`	list	IDs of examples to unlink.

Database.drop_dataset method

Argument	Type	Description
`name`	list	The dataset ID.
`batch_size`	int	Delete examples in batches of the given size.
RETURNS	bool	Whether the dataset was successfully deleted.

Database.get_hash_count methodNew: 1.12

Argument	Type	Description
`*names`	list	Dataset IDs.
`hash`	int	Hash value to check
`kind`	str	The kind of hash to check, can be “input” or “task” (default)
RETURNS	int	The number of times a task hash appears in the dataset

Database.get_hashes_min_cardinality methodNew: 1.12

Argument	Type	Description
`*names`	list	Dataset IDs.
`n`	int	The minimum number of times that the task hash needs to appear in the datasets
`kind`	str	The kind of hash to check, can be “input” or “task” (default)
RETURNS	int	Set of hashes that appear at least `n` times in the datasets

Database.get_dataset_sessions methodNew: 1.12

Get all session datasets associated with a parent dataset. Finds all the session datasets that have examples also associated with the parent dataset. Can be an expensive query for large datasets.

Argument	Type	Description
`dataset_name`	str	The parent dataset name
RETURNS	List[str]	The list of session dataset names

Implementing a custom database

Custom recipes let you return an optional "db" component. If it’s not set, it will default to the database specified in your prodigy.json or to "sqlite". The database plugged in via a custom recipe can also be False (to not use any DB) or a custom class that follows Prodigy’s Database API.

@prodigy.recipe('custom-recipe')
def custom_recipe():
    return {'db': YourCustomDB}  # etc.

Essentially, all your custom class needs to do is expose methods to add and retrieve datasets and annotated examples. For instance:

class YourCustomDB:
    def __init__(self, *args, **kwargs):
        # initialize your custom database

    def get_dataset_examples(self, name, default=None):
        # get examples for a given dataset name

    # other methods and properties

How your database handler resolves those queries is entirely up to you. If your database class implements the methods reconnect and close, Prodigy will call those on each request to the REST API, allowing you to explicitly manage the connection and prevent timeouts between requests.