Prodigy Plugins

Some Prodigy recipes require a 3rd party library in order to work. To keep Prodigy lightweight we’ve separated some of these recipes out into their own packages so that you may install them as a plugin. These plugins always target the most recent version of Prodigy with regards to compatibility.

This section of the docs showscases such plugins. Note that you can also explore these recipes on Github to serve as a source of inspiration to customise further.


🤗 Prodigy HF	Recipes that interact with the Huggingface stack. Recipes: `hf.train.ner`, `hf.correct.ner`, `hf.upload` and more	Repo
📄 Prodigy PDF	Recipes that help with the annotation of PDF files. Recipes: `pdf.image.manual`, `pdf.ocr.correct`, `pdf.spans.manual`, `pdf.layout.fetch`	Repo
🤫 Prodigy Whisper	Recipes that leverage OpenAI’s Whisper model for audio transcription. Recipes: `whisper.audio.annotate`	Repo
🍰 Prodigy Segment	Recipes that leverage Meta’s Segment-Anything model for image segmentation. Recipes: `segment.image.manual` and more	Repo
🏘 Prodigy ANN	Recipes that allow you to use approximate nearest neighbor techniques to help you annotate. Recipes: `ann.text.index`, `ann.image.index`, `ann.text.fetch` and more	Repo
🌕 Prodigy Lunr	Recipes that allow you to use old-school string matching techniques to help you annotate. Recipes: `lunr.text.index`, `lunr.text.fetch` and more	Repo
🦆 sense2vec	Recipes that allow to fetch terms using phrase embeddings trained on Reddit. Recipes: `sense2vec.teach`, `sense2vec.to-patterns` and more	Repo
🔎 Prodigy Evaluate	Recipes that compute evaluation metrics for spaCy pipelines. Recipes: `evaluate.evaluate`, `evaluate.evaluate-example` and more	Repo

🤗 Prodigy-HF

This plugin contains recipes that interact with the Hugging Face stack. Some recipes will allow you to directly train transformer models on top of your annotations while other recipes allow you to upload artifacts to HF cloud environment.

To use these recipes, you’ll first need to install the plugin.

Install prodigy-hf
pip install "prodigy-hf @ git+https://github.com/explosion/prodigy-hf"

Once it is installed you can explore some of the new recipes.

Training Hugging Face models

The first recipe that you may enjoy from this plugin is the recipe to train custom NER models.

prodigyhf.train.nerfashion,eval:fashion-evalhf-model-dir--epochs 10--model-name distilbert-base-uncased

Once the model is done training you’ll be able to inspect the hf-model-dir folder to find all the trained state.

You can also choose to re-use this trained model to help you annotate data. The plugin features a hf.ner.correct recipe that works similarily to ner.correct except here we get to use a Hugging Face model. This means that you can also use models from the Hugging Face Hub. This recipe will internally map the predictions from the transformer model to spaCy tokens.

prodigyhf.ner.correctfashionhf-model-dir/checkpoint-20examples.jsonl--lang en

Note that this plugin also offers variants of these recipes for text classification. Check out the API docs for hf.train.textcat and hf.correct.textcat for more details.

Interacting with Hugging Face Hub

Alternatively, you may also use these plugin to upload your annotated datasets to Hugging Face Hub.

Logging in

All the recipes below assume that you’ve authenticated beforehand via the following command:

Logging into Huggingface from the CLI
huggingface-cli login

prodigyhf.uploadfashion,eval:fashion-evalusername/reponame✔ Upload completed! You should be able to view repo at
https://huggingface.co/datasets/username/reponame.

Internally this recipe will validate the dataset for consistency and will attempt to anonymise the annotators before uploading. You can turn this behavior off with flags and you can also specify that you want the dataset not to appear publicly.

Storing keyed datasets keyed-datasets

You can choose to prefix a dataset name, like eval:fashion-eval to have the examples appear in a different key on the DataSetDict. You can confirm this locally when downloading the dataset via the datasets library.

from datasets import load_dataset

load_dataset("username/reponame")

Notice how the eval key appears in the output.

DatasetDict({
    train: Dataset({
        features: ['text', 'meta', '_input_hash', '_task_hash', 'tokens', 'spans', '_session_id', '_view_id', 'answer'],
        num_rows: 1235
    })
    eval: Dataset({
        features: ['text', 'meta', '_input_hash', '_task_hash', 'tokens', 'spans', '_session_id', '_view_id', 'answer'],
        num_rows: 500
    })
})

API

`hf.train.ner` command

Interface: terminal only
Use case: train Hugging Face models directly

Trains a Hugging Face model for NER directly on your annotated datasets.

prodigyhf.train.nerdatasetsout_dir--model-name--batch-size--eval-split--learning-rate--verbose

Argument	Type	Description	Default
`datasets`	positional	One or more (comma-separated) datasets for the named entity recognizer. Use the `eval:` prefix for evaluation
`out_dir`	positional	Folder to store trained model and checkpoints.
`--model-name`, `-mn`	option	Pick the model you’d like to use as a starting point for training.	”distilbert-base-uncased”
`--batch-size`, `-bs`	option	Batch size for training.	8
`--eval-split`, `-es`	option	If no evaluation sets are provided for a component, this setting can be used to split off a a percentage of the training examples for evaluation. If no evaluation splits are given the train set performance will be reported.
`--learning-rate`, `-lr`	option	Learning rate.	2e-5
`--verbose`, `-v`	flag	Output all the logs/warnings from Hugging Face libraries.	False

`hf.correct.ner` manual

Interface: ner_manual
Use case: Annotate NER with a model in the loop

Annotate NER data with a transformer model in the loop.

prodigyhf.correct.nerdataset--model-namesource--lang

Argument	Type	Description	Default
`dataset`	positional	Dataset to save annotations into
`out_dir`	positional	Path to transformer model. Can also point to a model on Hugging Face Hub.
`source`	positional	Source file to annotate
`--lang`, `-l`	option	Language to assume for the spaCy tokeniser	”en”

`hf.train.textcat` command

Interface: terminal only
Use case: train Hugging Face models directly

Trains a Hugging Face model for text classification directly on your annotated datasets.

Warning: Exclusive categories

Note that hf.correct.textcat and hf.train.textcat assume mutually exclusive textcat labels. You cannot train a pipeline where multiple classes can be detected for a single example.

While you can use this recipe to train models on binary datasets, you should be aware that the resulting model will only be capable of emitting “accept” or “reject” as labels.

If you use textcat.manual to annotate data for a transformer pipeline that you should also be mindful when you’re doing binary annotations that you may loose important information. So in general we recommend doing this:

prodigytextcat.manualdatasetexamples.jsonl--labels pos,not-pos--exclusive

By using --labels pos,not-pos instead of --labels pos you are making sure that both classes are available to the pipeline after it is training.

prodigyhf.train.textcatdatasetsout_dir--model-name--batch-size--eval-split--learning-rate--verbose

Argument	Type	Description	Default
`datasets`	positional	One or more (comma-separated) datasets for the named entity recognizer. Use the `eval:` prefix for evaluation
`out_dir`	positional	Folder to store trained model and checkpoints.
`--model-name`, `-mn`	option	The name of the model to be used as a starting point for training.	”distilbert-base-uncased”
`--batch-size`, `-bs`	option	Batch size for training.	8
`--eval-split`, `-es`	option	If no evaluation sets are provided for a component, this setting can be used to split off a a percentage of the training examples for evaluation. If no evaluation splits are given the train set performance will be reported.
`--learning-rate`, `-lr`	option	Learning rate.	2e-5
`--verbose`, `-v`	flag	Output all the logs/warnings from Huggingface libraries.	False

`hf.correct.textcat` manual

Interface: choice
Use case: Annotate textcat data with a model in the loop

Annotate data for text classification with a transformer model in the loop.

prodigyhf.correct.textcatdataset--model-namesource

Argument	Type	Description
`dataset`	positional	Dataset to save annotations into
`out_dir`	positional	Path to transformer model. Can also point to a model on Hugging Face Hub.
`source`	positional	Source file to annotate

`hf.upload` command

Interface: terminal only
Use case: upload annotations to Hugginface Hub

Upload your annotations to Hugging Face Hub.

This recipe assumes that you’ve authenticated beforehand via the following command:

Logging into Hugging Face from the CLI
huggingface-cli login

You can use the same command multiple times to upload the most recent version of your data to the hub.

prodigydatasetsrepo_id--keep-annotator-ids--patch_values--private

Argument	Type	Description	Default
`datasets`	positional	One or more (comma-separated) datasets to upload. Use the `name:` prefix to add keys to the dataset.
`repo_id`	positional	Name of the repo to upload to. Should be formatted as `<username>/<reponame>`.
`--keep-annotator-ids`, `-k`	flag	Don’t anonymize the annotators.	False
`--patch_values`, `-nv`	flag	If keys are missing between datasets, patch them with `None` values.	False
`--private`, `-p`	flag	Upload dataset as a private repository.	False

Prodigy-PDF

This plugin contains recipes for annotating PDF files using the familiar image-based image_manual interface, as well as recipes for OCR (Optical Character Recognition) to extract text-based content from documents.

To use these recipes, you’ll first need to install the plugin. In order for the recipes to work, you may also need to install system dependencies for tesseract.

Install prodigy-pdf
pip install "prodigy-pdf @ git+https://github.com/explosion/prodigy-pdf"

brew install tesseract  # macOs
sudo apt install tesseract-ocr  # Linux

Once it is installed, you can start annotating PDFs as images via pdf.image.manual:

Example

prodigypdf.image.manualpapers./pdfs--labels FIGURE,FOOTNOTE,PARAGRAPH

Prodigy

This live demo requires JavaScript to be enabled.

If you like, you can re-use the pdf annotations with the pdf.ocr.correct recipe to apply OCR to the annotated segments. This recipe uses pytessaract under the hood to give suggestions that you can correct.

Example

prodigypdf.ocr.correctocr_imagesdataset:papers--labels PARAGRAPH--fold-dashes

Prodigy

This live demo requires JavaScript to be enabled.

API

`pdf.image.manual` manual

Interface: image_manual
Use case: Add annotations to PDFs

Add layout annotations to a PDF.

prodigypdf.image.manualdatasetpdf_folder--labels--remove-base64--split-pages

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset to save annotations to.
`pdf_folder`	str	Folder that contains your PDF files.
`--labels`, `-l`	str	Comma-separated labels to annotate.
`--remove-base64`, `-R`	bool	Don’t save the base64 images of the PDF.	`False`
`--split-pages`, `-S`	bool	New: 0.3 View each page as a separate task. By default, multi-page documents are grouped together using the `pages` interface.	`False`

`pdf.ocr.correct` manual

Interface: image/ text_input
Use case: Add OCR annotations to PDF segments

Applies Optical Character Recognition (OCR) to annotated segments from pdf.image.manual and gives a textbox for corrections.

prodigypdf.ocr.correctdatasetsource--labels--scale--fold-dashes--remove-base64--autofocus

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset to save annotations to.
`source`	str	Source with PDF annotations.
`--labels`, `-l`	str	Labels to consider.
`--scale`, `-s`	int	Zoom scale. Increase above 3 to upscale the image for OCR.	3
`--remove-base64`, `-R`	bool	Don’t save the base64 images of the PDFs.	False
`--fold-dashes`, `-f`	bool	Removes dashes at the end of a text line and folds them with the next term.	False
`--autofocus`, `-af`	bool	Autofocus on the transcript UI.	False

`pdf.spans.manual` manualNew: 0.4.0

Interface: spans_manual
Use case: Add span annotation to document contents

This recipe lets you apply span annotations to text-based document contents extracted with spacy-layout and Docling. For higher annotation speed and efficiency, you can set --focus text,list_item or similar to walk through individual text blocks of the given layout labels, which are highlighted in a visual preview of the document page. When annotating in focus mode, the span of text you’re working on is preserved as "text_span" in the JSON data, so you’ll always be able to relate it back into the original full document.

To speed up loading times, you can optionally use pdf.layout.fetch to pre-fetch the data extracted from the PDFs. You can then run this recipe with the name of a dataset using the dataset: prefix, or the path to the created JSONL file.

prodigypdf.spans.manualdatasetspacy_modelsource--label--add-ents--focus--disable--split-pages--hide-preview

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset to save annotations to.
`spacy_model`	str	Loadable spaCy model or `blank:en` or similar for tokenization, and optional named entities.
`source`	str	Path to directory of PDF files, or dataset or JSONL file created with `pdf.layout.fetch`.
`--label`, `-l`	str	Comma-separated span labels to annotate. If named entities are added, those will be filtered by the provided labels.	`None`
`--add-ents`, `-E`	bool	Add named entities predicted by the model.	`False`
`--focus`, `-f`	str	Annotate in focus mode: comma-separated list of layout span labels to focus on and annotate section-by-section, e.g. `text`, `list_item` or `section_header`. The current section will be highlighted in the visual preview.	`None`
`--disable`, `-d`	str	Comma-separated layout span labels to disable and make unselectable in the UI.	`None`
`--split-pages`, `-S`	bool	View pages as separate tasks instead of grouped together using the `pages` interface.	`False`
`--hide-preview`, `-HP`	bool	Don’t show side-by-side preview of document layout next to the extracted text.	`False`

Example

prodigypdf.spans.manualpapersblank:en./pdfs--label EVENT,PLACE--focus text,list_item

This live demo requires JavaScript to be enabled.

`pdf.layout.fetch` commandNew: 0.4.0

Interface: terminal only
Use case: Pre-process PDFs

This recipe lets you preprocess your PDFs to speed up loading times during annotation with pdf.spans.manual. The data can be saved to a JSONL file, or a Prodigy dataset using the dataset: prefix. The data includes everythig needed to render the examples with the different layout configurations. You only need to decide upfront if you want the data to be paginated or split by sections, since this impacts the JSON structure that will be created.

prodigypdf.layout.fetchoutputspacy_modelsource--focus--split-pages

Argument	Type	Description	Default
`output`	str	Path to JSONL file or name of dataset using the `dataset:` prefix.
`spacy_model`	str	Loadable spaCy model or `blank:en` or similar for tokenization.
`source`	str	Path to directory of PDF files.
`--focus`, `-f`	str	Annotate in focus mode: comma-separated list of layout span labels to focus on and annotate section-by-section, e.g. `text`, `list_item` or `section_header`. The current section will be highlighted in the visual preview.	`None`
`--split-pages`, `-S`	bool	View pages as separate tasks instead of grouped together using the `pages` interface.	`False`

Example

prodigypdf.layout.fetchdataset:papers_layoutblank:en./pdfsℹ Creating preprocessed PDFs✔ Saved fetched data to dataset papers_layout

🤫 Prodigy-Whisper

OpenAI released an open model for audio annotation called Whisper. It’s a model that can be downloaded locally, it has support for multiple languages and you’re even able to pick from a selection of models. The model isn’t perfect, but when you’re transcribing text, it can really help to have such a model provide a starting point. The goal of this plugin is to help you get started with this right away.

To use this plugin, you’ll need to install it first.

Install prodigy-whisper
pip install "prodigy-whisper @ git+https://github.com/explosion/prodigy-whisper"

In order to use the plugin you’ll also need to have ffmpeg installed. Most package managers should have these available so you should be able to use one of the following commands.

Install ffmpeg
# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg
# on Arch Linux
sudo pacman -S ffmpeg
# on MacOS using Homebrew (https://brew.sh/)
brew install ffmpeg
# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg
# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg

Once the plugin is installed you can use the whisper.audio.transcribe recipe. It is very similar to audio.transcribe recipe that Prodigy provides, but this recipe uses Whisper to provide an initial transcription.

Example

prodigywhisper.audio.transcribetranscripts./recordings--model base

This live demo requires JavaScript to be enabled.

In the base form you can already see that Whisper does a pretty good job at transcription. But it may be easier to correct short pieces of audio instead of a long one. This is where Wishper can help out as well. It is able to segment a long audio clip into shorter segments and each of these segments can then be annotated in Prodigy.

To use this feature, you can add the --segment flag to the recipe call.

Example

prodigywhisper.audio.transcribetranscripts./recordings--model base--segment

Now, you can go through the segments one by one and each segment will have metadata attached so that you can link it back to the timestamps in the original file. This is what the first segment would look like.

This live demo requires JavaScript to be enabled.

This is what the second segment would look like.

This live demo requires JavaScript to be enabled.

API

`whisper.audio.transcribe` manual

Interface: blocks/ audio/ text_input
Saves: annotations to the database
Use case: Manually create transcriptions for audio with a Whisper model in the loop

Manually transcribe audio files by typing the transcript into a text field with the help of Whisper. The API is built on top of audio.transcribe and will allow you to configure everything that the original recipe can. The only input addition is that this recipe also allows you to select a Whisper model. The recipe uses the "base" model by default, but you should be able to pick any of the models shown on here.

prodigywhisper.audio.transcribedatasetsource--loader--autoplay--keep-base64--fetch-media--playpause-key--text-rows--text-rows--exclude

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset to save annotations to.
`source`	str	Path to a directory containing audio files or pre-formatted JSONL file if `--loader jsonl` is set.
`--model`, `-m`	str	Name of OpenAI Whisper model to use.	`base`
`--loader`, `-lo`	str	Optional ID of source loader, e.g. `audio` or `video`.	`audio`
`--autoplay`, `-A`	bool	Autoplay the audio when a new task loads.	`False`
`--keep-base64`, `-B`	bool	If `audio` loader is used: don’t remove the base64-encoded audio data from the task before it’s saved to the database.	`False`
`--fetch-media`, `-FM`	bool	Convert local paths and URLs to base64. Can be enabled if you’re annotating a JSONL file with paths or for re-annotating an existing dataset.	`False`
`--playpause-key`, `-pk`	str	Alternative keyboard shortcuts to toggle play/pause so it doesn’t conflict with text input field.	`"command+enter, option+enter, ctrl+enter"`
`--text-rows`, `-tr`	int	Height of the text input field, in rows.	`6`
`--field-id`, `-fi`	str	Add the transcript text to the data using this key, e.g. `"transcript": "Text here"`.	`"transcript"`
`--exclude`, `-e`	str	Comma-separated list of dataset IDs containing annotations to exclude.	`None`

Prodigy-Segment

Sometimes you’re interested in selecting pixels from an image, as opposed to merely selecting a bounding box. Selecting the right pixels can be tedious work so you may want to use a model in the loop to help you. A good choice for such a model is Meta’s Segment Anything model, which we’ve integrated into Prodigy via the prodigy-segment plugin.

This model is able to take bounding box annotations from Prodigy to construct a pixel segmentation map under the hood. From the UI, that might look like this:

Using Prodigy-Segment

For a quick overview of the features, you may also enjoy this Youtube tutorial.

Before you’ll be able to use recipes, you’ll want to make sure you’ve downloaded the appropriate model checkpoint beforehand. You can check the available models here but this tutorial will assume the “default” model-type. The weights for this model can be downloaded via:

Download the weights for the `default` model-type
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth

Once the model is downloaded you can get started by running the segment.image.manual recipe.

prodigysegment.image.manualsegment-cat-dogimagessam_vit_h_4b8939.pth--model-type default--labels cat,dog

When you run this model, you may notice that it’s fairly slow. This isn’t a big suprise given the size of the model but it can be a serious burden, especially if your machine does not have a GPU. For a better experience, you may want to pre-compute the features ahead of annotation time and cache those results to disk. It may take a while to precompute all the images, but once they are done the annotation experience feels seamless and realtime again.

To precompute a cache, you can use the segment.fill-cache recipe.

prodigysegment.fill-cacheimagessam_vit_h_4b8939.pth--model-type default--cache segment-anything-cache

This will store all the features in a folder (configurable via the --cache flag) which the segment.image.manual recipe can immediately pick up.

prodigysegment.image.manualsegment-cat-dogimagessam_vit_h_4b8939.pth--model-type default--label cat,dog--cache segment-anything-cache

The pixel maps, once annotated, are stored under the spans key in your examples. You can explore these maps one by one in a Jupyter notebook using the script shown below.

Script to loop over all annotated examples
import base64
from io import BytesIO
from PIL import Image

from prodigy.components.db import connect

db = connect()
examples = db.get_dataset_examples("<dataset-name>")

def mask_to_pil(mask_str):
    indicator = "base64,"
    mask_str = mask_str[mask_str.find(indicator) + len(indicator):]
    bytes = BytesIO(base64.b64decode(mask_str))
    return Image.open(bytes)

# Loop over all the examples and display them.
for ex in examples:
    print(ex['path'])
    for span in ex.get("spans", []):
        # Use builtin `display` to view pixel map
        display(mask_to_pil(span['mask']))

From here you can re-use the Pillow library to either store these pixel maps into the required format for your pipeline or you can stream them directly into a learning algorithm from Python.

API

`segment.image.manual` manual

Interface: blocks/ image_manual
Saves: annotations to the database
Use case: Annotate pixels by drawing bounding boxes

Manually transcribe pixels in images with Meta’s segment anything model under the hood.

prodigysegment.image.manualdatasetsourcecheckpoint--label--loader--exclude--width--darken--no-fetch--remove-base64--model-type--cache

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset to save annotations to.
`source`	str	Path to a directory containing audio files or pre-formatted JSONL file if `--loader jsonl` is set.
`checkpoint`	Path	Path to a model checkpoint.
`--label`, `-l`	str / `Path`	One or more labels to annotate. Supports a comma-separated list or a path to a file with one label per line.
`--loader`, `-lo`	str	Optional ID of source loader.	`images`
`--exclude`, `-e`	str	Comma-separated list of dataset IDs containing annotations to exclude.	`None`
`--width`, `-w`	int	Width of card and maximum image width in pixels.	`675`
`--darken`, `-D`	bool	Darken image to make boxes stand out more.	`False`
`--no-fetch`, `-NF`	bool	Don’t fetch images as base64. Ideally requires a JSONL file as input, with `--loader jsonl` set and all images available as URLs.	`False`
`--remove-base64`, `-R`	bool	Remove base64-encoded image data before storing example in the database and only keep the reference to the local file path. Caution: If enabled, make sure to keep original files!	`False`
`--model-type`, `-mt`	str	Type of model to use.	`default`
`--cache`, `-c`	`Path`	Path to feature cache to speed up inference.	`segment-anything-cache`

`segment.fill-cache` command

Interface: terminal only
Saves: inference features into disk cache
Use case: Prepare images for segmented annotation

Prepares a local disk cache to speed up inference for segment.image.manual. This can cause a huge speedup if you’re running on a non-GPU device.

prodigysegment.fill-cachesource--loadercheckpoint--model-type--cache

Argument	Type	Description	Default
`source`	str	Path to a directory containing audio files or pre-formatted JSONL file if `--loader jsonl` is set.
`checkpoint`	Path	Path to a model checkpoint.
`--loader`, `-lo`	str	Optional ID of source loader.	`images`
`--cache`, `-c`	`Path`	Path to feature cache to speed up inference.	`segment-anything-cache`

Prodigy-ANN

Sometimes you may want to query your examples to find a relevant subset for annotation. A modern method for doing this is to use numeric vectors to represent text and you can use approximate neighest neighbor (ANN) techniques to fetch relevant examples. The goal is to spend more time looking at examples that matter, like examples similar to items that the model gets wrong. Curating these examples first might be a pragmatic method to steer the model in the right direction.

This is the general approach for the ANN recipes

If you’re interested to see a quick demo for Prodigy-ANN applied to a text dataset, you may appreciate this Prodigy short on Youtube.

To use this plugin, you’ll need to install it first.

Install prodigy-ann
pip install "prodigy-ann @ git+https://github.com/explosion/prodigy-ann"

As a first step for this approach you’ll first need to generate an index with vector representations of your text. To encode the text this library uses sentence-transformers and it uses hnswlib as an index for these vectors.

To index your documents, you can run the ann.text.index recipe.

prodigyann.text.indexexamples.jsonlexamples.indexindexing: 100%|███████████████████████████| 2210/2210 [00:09<00:00, 243.64it/s]

Once it is indexed you can use text queries find and curate interesting subsets. A general method to prepare these subsets is to use ann.text.fetch. This will fetch a subset of vectors that are close in vector space and save the associated examples on disk. From there you can use any Prodigy recipe you like.

prodigyann.text.fetchexamples.jsonlexamples.indexsubset.jsonl--query “this is an outrage!”

More interfaces

As a convenience this plugin also provides the textcat.ann.manual, ner.ann.manual and spans.ann.manual so that you may query and annotate directly. These recipes have the same arguments as their native Prodigy textcat.manual, ner.manual and spans.manual counterparts but add a --query parameter so that you may pass your query.

Interactive Queries

Sometimes you may want to update the stream while you’re annotating. You can do that without restarting the server by using the --allow-reset flag when you’re starting the textcat.ann.manual, ner.ann.manual or spans.ann.manual recipes.

prodigytextcat.ann.manualexamples.jsonlexamples.index--query “new academic dataset”--allow-reset

Here’s an example of what the experience might look like from the UI.

Retreiving Images

You can use these embedding retreival techniques for images too. Models like CLIP allow you to embed images and text in the same space, which means that you can query the images by using text.

The approach for images is very similar to the approach for text too. To get started you’ll first want to run an indexing recipe over a folder of images via the ann.image.index recipe.

prodigyann.image.indexpath/to/image_folderimage.indexindexing: 100%|███████████████████████████| 210/210 [01:49<00:00]

Once the index is built, you can query it. You can choose to query it to prepare a .jsonl file to re-use later via the ann.image.fetch recipe.

prodigyann.image.fetchpath/to/image_folderexamples.indexout.jsonl--query “laptops”--remove-base64--n 100

Alternatively the plugin also provides a wrapper around the familiar image.manual recipe. This will retreive the images before passing it on to the image_manual interface. This interface also allows you to reset the stream via the --allow-reset flag.

prodigyimage.ann.manualannotated_laptopspath/to/image_folderexamples.index--query “laptops”--remove-base64--n 100--labels laptop,phone--allow-reset

Here’s an example of what the experience might look like from the UI.

API

`ann.text.index` command

Interface: terminal only
Use case: Prepare an HNSWlib index

Builds an HSNWLIB index on example text data.

prodigyann.text.indexsourceexamples.index

Argument	Type	Description	Default
`source`	Path	Path to source to index.
`index_path`	Path	Path of trained index

`ann.text.fetch` command

Interface: terminal only
Use case: Query to get a subset of interest.

Fetch a relevant subset using a HNSWlib index.

prodigyann.text.fetchsourceindex_pathout_path--query--n

Argument	Type	Description	Default
`source`	Path	Path to source to index.
`index_path`	Path	Path of trained index
`out_path`	Path	Path to stored subset of interest
`--query`, `-q`	str	Query to encode and pass to index
`--n`, `-n`	str	Number of results to return from index	200

`ann.image.index` command

Interface: terminal only
Use case: Prepare an HNSWlib index.

Builds an HSNWLIB index on example image data.

prodigyann.image.indexsourceexamples.index

Argument	Type	Description	Default
`source`	Path	Path to source folder of images to index.
`index_path`	Path	Path of trained index

`ann.image.fetch` command

Interface: terminal only
Use case: Query to get a subset of interest

Fetch a relevant subset of images using a HNSWlib index.

prodigyann.image.fetchsourceindex_pathout_path--query--query--remove-base64

Argument	Type	Description	Default
`source`	Path	Path to source folder of images for index.
`index_path`	Path	Path of trained index
`out_path`	Path	Path to stored subset of interest
`--query`, `-q`	str	Query to encode and pass to index
`-n`	int	Number of items to retreive	200
`remove-base64`, `-R`	bool	Don’t save the base64 images on disk	False

Prodigy-Lunr

Instead of using semantic vectors with approximate nearest neighbors to find relevant subsets you can also resort to the “regular” search techniques. To accomodate these techniques we’ve added support for recipes that use lunr. These recipes are very similar to their ann.* counterparts but will rely on string matching techniques to retreive relevant examples.

To use this plugin, you’ll need to install it first.

Install prodigy-lunr
pip install "prodigy-lunr @ git+https://github.com/explosion/prodigy-lunr"

To index your documents, you can run the ann.text.index recipe. This will generate an index and serialize it to disk by writing it into a gzipped json file.

prodigylunr.text.indexexamples.jsonlindex.gz.jsonindexing: 100%|███████████████████████████| 2210/2210 [00:09<00:00, 243.64it/s]

Once it is indexed you can use text queries find and curate interesting subsets. A general method to prepare these subsets is to use lunr.text.fetch. This will fetch a subset of vectors that are close in vector space and save the associated examples on disk. From there you can use any Prodigy recipe you like.

prodigylunr.text.fetchexamples.jsonlindex.gz.jsonsubset.jsonl--query “outrage better service unhappy”

More interfaces

As a convenience this plugin also provides the textcat.lunr.manual, ner.lunr.manual and spans.lunr.manual so that you may query and annotate directly. These recipes have the same arguments as their native Prodigy textcat.manual, ner.manual and spans.manual counterparts but add a --query parameter so that you may pass your query.

Interactive Queries

Sometimes you may want to update the stream while you’re annotating. You can do that without restarting the server by using the --allow-reset flag when you’re starting the textcat.lunr.manual, ner.lunr.manual or spans.lunr.manual recipes.

prodigytextcat.lunr.manualexamples.jsonlindex.gz.json--query “outrage better service unhappy”--allow-reset

API

`lunr.text.index` command

Interface: terminal only
Use case: Prepare an HNSWlib index.

Builds an HSNWLIB index on example text data.

prodigylunr.text.indexsourceexamples.index

Argument	Type	Description	Default
`source`	Path	Path to source to index.
`index_path`	Path	Path to stored lunr index

`lunr.text.fetch` command

Interface: terminal only
Use case: Query to get a subset of interest.

Fetch a relevant subset using a HNSWlib index.

prodigylunr.text.fetchsourceindex_pathout_path--query

Argument	Type	Description
`source`	Path	Path to source to index.
`index_path`	Path	Path to stored lunr index
`out_path`	Path	Path to stored subset of interest
`--query`, `-q`	str	Query to encode and pass to index

Sense2vec

sense2vec (Trask et. al, 2015) is a nice twist on word2vec that lets you learn more interesting and detailed word vectors. This library is a simple Python implementation for loading, querying and training sense2vec models. To explore the semantic similarities across all Reddit comments of 2015 and 2019, see the interactive demo. There are also more details in this blogpost.

To see a demo on how to use this tool with Prodigy, you may enjoy this Youtube video where we use it to detect video games in text.

To use sense2vec, you’ll first need to install it.

python -m pip install sense2vec

To use the pre-trained vectors in Prodigy you’ll need to download the archive(s) and extract them. Large files have been split into multi-part downloads. All the available versions can be found below.

Vectors	Size	Description	Download Link (zipped)
`s2v_reddit_2019_lg`	4 GB	Reddit comments 2019 (01-07)	part 1, part 2, part 3
`s2v_reddit_2015_md`	573 MB	Reddit comments 2015	part 1

To merge the multi-part archives, you can run the following:

cat s2v_reddit_2019_lg.tar.gz.* > s2v_reddit_2019_lg.tar.gz

Once downloaded (and merged) you should be able to unarchive via:

tar -xvf s2v_reddit_lg.tar.gz

Now that the archive is extracted you can point the sense2vec.teach recipe to it. This will allow Prodigy to suggest similar terms based on the most similar phrases from sense2vec, and the suggestions will be adjusted as you annotate and accept similar phrases. For each seed term, the best matching sense according to the sense2vec vectors will be used.

prodigysense2vec.teachvideo_game_yesno/path/to/s2v_reddit_2019_lg--seeds “mass effect,knights of the old republic,halo 3”--resume

This live demo requires JavaScript to be enabled.

After curating the generated examples you can choose to export the collected phrases as pattern files which can be used with spaCy’s EntityRuler or recipes like ner.manual by using the sense2vec.to-patterns recipe.

prodigysense2vec.to-patternsvideo_game_yesnoblank:enVIDEO_GAMEpatterns.jsonl

This will generate a patterns.jsonl file locally that has contents that may look like:

{"label": "VIDEO_GAME", "pattern": [{"LOWER": "mass"}, {"LOWER": "effect"}]}
{"label": "VIDEO_GAME", "pattern": [{"LOWER": "knights"}, {"LOWER": "of"}, {"LOWER": "the"}, {"LOWER": "old"}, {"LOWER": "republic"}]}
{"label": "VIDEO_GAME", "pattern": [{"LOWER": "halo"}, {"LOWER": "3"}]}
{"label": "VIDEO_GAME", "pattern": [{"LOWER": "jade"}, {"LOWER": "empire"}]}

More recipes

Sense2vec also has the sense2vec.eval, sense2vec.eval-most-similar and sense2vec.eval-ab recipes. These may be interesting if you’re interested in evaluating a sense2vec model. For more information on those, you can check the README on the Github repository.

`sense2vec.teach` binary

Interface: html
Saves: annotations to the database
Use case: curate terminology phrases via sense2vec

Bootstrap a terminology list using sense2vec.

prodigysense2vec.teachdatasetvectors_path--seeds--threshold--n-similar--batch-size--case-sensitive--resume

Argument	Type	Description	Default
`dataset`	positional	Dataset to save annotations to.
`vectors_path`	positional	Path to pretrained sense2vec vectors.
`--seeds`, `-s`	option	One or more comma-separated seed phrases.
`--threshold`, `-t`	option	Similarity threshold.	0.85
`--n-similar`, `-n`	option	Number of similar items to get at once.	100
`--batch-size`, `-b`	option	Batch size for submitting annotations.	5
`--case-sensitive`, `-CS`	option	Show the same terms with different casing.	False
`--resume`, `-R`	flag	Resume from an existing phrases dataset.	False

`sense2vec.to-patterns` command

Interface: terminal only
Use case: generate pattern files

Convert a dataset of phrases collected with sense2vec.teach to token-based match patterns.

prodigysense2vec.to-patternsdatasetspacy_modellabel--output-file--case-sensitive--dry

Argument	Type	Description	Default
`dataset`	positional	Phrase dataset to convert.
`spacy_model`	positional	spaCy model for tokenization.
`label`	positional	Label to apply to all patterns.
`--output-file`, `-o`	option	Optional output file. Defaults to stdout.
`--case-sensitive`, `-CS`	flag	Make patterns case-sensitive.	False
`--dry`, `-D`	flag	Perform a dry run and don’t output anything.	False

🔎 Prodigy-evaluate

This Prodigy plugin allows you to evaluate your spaCy pipeline overall or on a per-example basis. To use these recipes, you’ll first need to install the plugin.

Install prodigy-evaluate
pip install "prodigy-evaluate @ git+https://github.com/explosion/prodigy-evaluate"

Once installed, you can make use of the two main recipes in this plugin: evaluate.evaluate and evaluate.evaluate-example.

`evaluate.evaluate` command

Interface: terminal only
Use case: evaluate a spaCy pipeline on one or more datasets

This recipe allows you to evaluate a spaCy pipeline on one or more datasets for different components. Per-component datasets can be passed in the same way as in the case of train recipe only all datasets will be considered evaluation sets.

The --label-stats flag lets you investigate per-label scores like precision, recall and F1 scores for NER and textcat components. The --confusion-matrix flag will output a confusion matrix for the NER and textcat components. If you’d like to customize how the confusion matrix is rendered, you can save the an array of the confusion matrix by passing an output path via the --cf-path argument and use it with your favourite data visualization library. Please note that a separate inference is run to obtain the confusion matrix and as results are not deterministic, there may be slight variations in evaluation and confusion matrix results.

Example evaluate.evaluate output

prodigyevaluate.evaluatemy_custom_ner_model--ner ner_dataset--label-statsℹ Using CPU================================= Results =================================TOK     100.00
NER P   92.80
NER R   99.58
NER F   96.07
SPEED   26868============================== NER (per type) ==============================                 P        R       F
SKILL        92.53    99.55   95.91
EXPERIENCE   96.88   100.00   98.41

Argument	Type	Description	Default
`model`	str	Name of spaCy pipeline to evaluate.
`--ner`	str	One or more (comma-separated) datasets for the named entity recognizer.	`None`
`--textcat`	str	One or more (comma-separated) datasets for the text classifier (exclusive categories).	`None`
`--textcat-multilabel`	str	One or more (comma-separated) datasets for the text classifier (non-exclusive categories).	`None`
`--senter`	str	One or more (comma-separated) datasets for the sentence recognizer.	`None`
`--parser`	str	One or more (comma-separated) datasets for the dependency parser.	`None`
`--tagger`	str	One or more (comma-separated) datasets for the part-of-speech tagger.	`None`
`--spancat`	str	One or more (comma-separated) datasets for the span categorizer.	`None`
`--coref`	str	One or more (comma-separated) datasets for the coreference model. Requires spacy-experimental.	`None`
`--label-stats`, `-LS`	bool	Compute per-label statistics for NER and textcat components.	`False`
`--gpu_id`	int	ID of the GPU to use.	`-1`
`--verbose`	bool	Print detailed information about the evaluation.	`False`
`--confusion-matrix`, `-CF`	bool	Compute confusion matrix for NER, textcat and textcat-multilabel components.	`False`
`--cf-path`, `-CP`	str	Local path to save the confusion matrix to. Available for NER, textcat and textcat-multilabel components.	`None`
`--spans-key`	str	Key to use for spans in the evaluation data.	`sc`

`evaluate.evaluate-example` command

Interface: terminal only
Use case: evaluate a spaCy pipeline on one or more datasets on a per-example basis

Evaluate a spaCy pipeline on one or more datasets for different components on a per-example basis. Datasets are provided in the same per-component format as the prodigy evaluate command e.g. --ner my_eval_dataset_1,my_eval_dataset_2. This command will run an evaluation on each example individually and then sort by the desired --metric argument.

This is helpful for debugging and for understanding the hardest or easiest examples for your model. The example below shows how to evaluate a model on a dataset on a per-example basis and sort by the lowest NER F1 score.

If you would like to save the top examples sorted by your metric, you can use the --output-path argument to save the examples in .jsonl format to file. If you’re evaluating NER, spancat or textcat pipeline, this .jsonl file could then be used as input to Prodigy correct ( ner.correct, spans.correct, textcat.correct) or model-annotate ( ner.model-annotate, spans.model-annotate, textcat.model-annotate) workflows to quickly inspect your model’s predictions on hardest examples.

Example evaluate.evaluate-example output

prodigyevaluate.evaluate-examplemy_custom_ner_model--ner ner_dataset--metric ents_f--n-results 3ℹ Using CPU============================= Scored Examples =============================Example                     ents_f
-----------------           ------
I live in london.           0.0
My name is Freya.           0.0
Where is Antonia?           0.0

Argument	Type	Description	Default
`model`	str	Name of spaCy pipeline to evaluate.
`--ner`	str	One or more (comma-separated) datasets for the named entity recognizer.	`None`
`--textcat`	str	One or more (comma-separated) datasets for the text classifier (exclusive categories).	`None`
`--textcat-multilabel`	str	One or more (comma-separated) datasets for the text classifier (non-exclusive categories).	`None`
`--senter`	str	One or more (comma-separated) datasets for the sentence recognizer.	`None`
`--parser`	str	One or more (comma-separated) datasets for the dependency parser.	`None`
`--tagger`	str	One or more (comma-separated) datasets for the part-of-speech tagger.	`None`
`--spancat`	str	One or more (comma-separated) datasets for the span categorizer.	`None`
`--coref`	str	One or more (comma-separated) datasets for the coreference model. Requires `spacy-experimental`.	`None`
`--metric`	str	The metric to sort the examples by. The following metrics are supported: `token_acc`, `tag_acc`, `pos_acc`, `morph_acc`, `lemma_acc`, `dep_uas`, `dep_las`, `ents_p`, `ents_r`, `ents_f`, `cats_score`, `sents_p`, `sents_r`, `sents_f`, `spans_sc_p`, `spans_sc_r`, `spans_sc_f`, `speed`. Please choose a metric most appropriate to your model.	`None`
`--n-results`, `-NR`	int	Number of top examples to display.	`10`
`--gpu_id`	int	ID of the GPU to use.	`-1`
`--verbose`	bool	Print detailed information about the evaluation.	`False`
`--output-path`, `-OP`	str	Path to a `jsonl` file to save the scored examples to.	`None`

`evaluate.nervaluate` command

Interface: terminal only
Use case: evaluate a spaCy NER component using full named-entity evaluation metrics based on SemEval '13

Evaluate a spaCy NER component using full named-entity evaluation metrics based on SemEval ‘13. Datasets are provided in the same per-component format as the prodigy evaluate command e.g. --ner my_eval_dataset_1,my_eval_dataset_2.

This command leverages the nervaluate Python library to “go beyond a simple token/tag based schema, and consider different scenarios based on weather all the tokens that belong to a named entity were classified or not, and also whether the correct entity type was assigned.”

This is helpful if you are interested in partial matches as part of your NER evaluation use-case. If you are interested in per-label evaluation metrics, you can pass the --per-label flag to the command.

prodigyevaluate.nervaluatemy_custom_ner_model--ner ner_dataset--per-label

Argument	Type	Description	Default
`model`	str	Name of spaCy pipeline to evaluate. Must have a trained NER model to evaluate.
`--ner`	str	One or more (comma-separated) datasets for the named entity recognizer. Use the eval: prefix for evaluation sets.	`None`
`--gpu_id`	int	ID of the GPU to use.	`-1`
`--verbose`	bool	Print detailed information about the evaluation.	`False`
`--per-label`	bool	print per-label NER metrics to the terminal.	`False`