Prodigy Plugins
Some Prodigy recipes require a 3rd party library in order to work. To keep Prodigy lightweight we’ve separated some of these recipes out into their own packages so that you can install them as plugins. In terms of compatibility, these plugins always target the most recent version of Prodigy.
This section of the docs showcases such plugins. Note that you can also explore these recipes on GitHub to serve as a source of inspiration for further customisation.
Plugin | Description
---|---
🤗 Prodigy HF | Recipes that interact with the Hugging Face stack. Recipes: hf.train.ner , hf.correct.ner , hf.upload and more
📄 Prodigy PDF | Recipes that help with the annotation of PDF files. Recipes: pdf.image.manual , pdf.ocr.correct , pdf.spans.manual , pdf.layout.fetch
🤫 Prodigy Whisper | Recipes that leverage OpenAI’s Whisper model for audio transcription. Recipes: whisper.audio.annotate
🍰 Prodigy Segment | Recipes that leverage Meta’s Segment Anything model for image segmentation. Recipes: segment.image.manual and more
🏘 Prodigy ANN | Recipes that allow you to use approximate nearest neighbor techniques to help you annotate. Recipes: ann.text.index , ann.image.index , ann.text.fetch and more
🌕 Prodigy Lunr | Recipes that allow you to use old-school string matching techniques to help you annotate. Recipes: lunr.text.index , lunr.text.fetch and more
🦆 sense2vec | Recipes that allow you to fetch terms using phrase embeddings trained on Reddit. Recipes: sense2vec.teach , sense2vec.to-patterns and more
🔎 Prodigy Evaluate | Recipes that compute evaluation metrics for spaCy pipelines. Recipes: evaluate.evaluate , evaluate.evaluate-example and more
🤗 Prodigy-HF
This plugin contains recipes that interact with the Hugging Face stack. Some recipes let you train transformer models directly on top of your annotations, while others let you upload artifacts to the Hugging Face cloud environment.
To use these recipes, you’ll first need to install the plugin.
Install prodigy-hf
pip install "prodigy-hf @ git+https://github.com/explosion/prodigy-hf"
Once it is installed you can explore some of the new recipes.
Training Hugging Face models
The first recipe you may enjoy from this plugin is hf.train.ner, which trains custom NER models directly on your annotations.
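A minimal sketch of such a training run, assuming a dataset called my_ner_data and an evaluation set my_ner_eval (both placeholder names):
prodigy hf.train.ner my_ner_data,eval:my_ner_eval hf-model-dir --model-name distilbert-base-uncased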
Once the model is done training you’ll be able to inspect the hf-model-dir
folder to find all the trained state.
You can also choose to re-use this trained model to help you annotate data. The
plugin features a hf.correct.ner
recipe that works similarly to
ner.correct
, except here we get to use a Hugging Face model. This means
that you can also use models from the
Hugging Face Hub.
This recipe will internally map the predictions from the transformer model to
spaCy tokens.
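A sketch of such a call, re-using the hf-model-dir folder from above and a placeholder source file:
prodigy hf.correct.ner my_ner_data hf-model-dir ./examples.jsonl --lang en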
Note that this plugin also offers variants of these recipes for text
classification. Check out the API docs for hf.train.textcat
and
hf.correct.textcat
for more details.
Interacting with Hugging Face Hub
Alternatively, you may also use this plugin to upload your annotated datasets to the Hugging Face Hub.
Internally this recipe will validate the dataset for consistency and will attempt to anonymise the annotators before uploading. You can turn this behavior off with flags and you can also specify that you want the dataset not to appear publicly.
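A sketch of an upload call, with placeholder dataset and repository names:
prodigy hf.upload my_ner_data my-username/my-annotations --private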
API
hf.train.ner
command
Trains a Hugging Face model for NER directly on your annotated datasets.
Argument | Type | Description | Default |
---|---|---|---|
datasets | positional | One or more (comma-separated) datasets for the named entity recognizer. Use the eval: prefix for evaluation | |
out_dir | positional | Folder to store trained model and checkpoints. | |
--model-name , -mn | option | Pick the model you’d like to use as a starting point for training. | "distilbert-base-uncased" |
--batch-size , -bs | option | Batch size for training. | 8 |
--eval-split , -es | option | If no evaluation sets are provided for a component, this setting can be used to split off a percentage of the training examples for evaluation. If no evaluation splits are given the train set performance will be reported. | |
--learning-rate , -lr | option | Learning rate. | 2e-5 |
--verbose , -v | flag | Output all the logs/warnings from Hugging Face libraries. | False |
hf.correct.ner
manual
Annotate NER data with a transformer model in the loop.
Argument | Type | Description | Default |
---|---|---|---|
dataset | positional | Dataset to save annotations into | |
out_dir | positional | Path to transformer model. Can also point to a model on Hugging Face Hub. | |
source | positional | Source file to annotate | |
--lang , -l | option | Language to assume for the spaCy tokeniser | "en" |
hf.train.textcat
command
Trains a Hugging Face model for text classification directly on your annotated datasets.
Argument | Type | Description | Default |
---|---|---|---|
datasets | positional | One or more (comma-separated) datasets for the text classifier. Use the eval: prefix for evaluation | |
out_dir | positional | Folder to store trained model and checkpoints. | |
--model-name , -mn | option | The name of the model to be used as a starting point for training. | "distilbert-base-uncased" |
--batch-size , -bs | option | Batch size for training. | 8 |
--eval-split , -es | option | If no evaluation sets are provided for a component, this setting can be used to split off a percentage of the training examples for evaluation. If no evaluation splits are given the train set performance will be reported. | |
--learning-rate , -lr | option | Learning rate. | 2e-5 |
--verbose , -v | flag | Output all the logs/warnings from Hugging Face libraries. | False |
hf.correct.textcat
manual
Annotate data for text classification with a transformer model in the loop.
Argument | Type | Description | Default |
---|---|---|---|
dataset | positional | Dataset to save annotations into | |
out_dir | positional | Path to transformer model. Can also point to a model on Hugging Face Hub. | |
source | positional | Source file to annotate |
hf.upload
command
Upload your annotations to Hugging Face Hub.
You can use the same command multiple times to upload the most recent version of your data to the hub.
Argument | Type | Description | Default |
---|---|---|---|
datasets | positional | One or more (comma-separated) datasets to upload. Use the name: prefix to add keys to the dataset. | |
repo_id | positional | Name of the repo to upload to. Should be formatted as <username>/<reponame> . | |
--keep-annotator-ids , -k | flag | Don’t anonymize the annotators. | False |
--patch_values , -nv | flag | If keys are missing between datasets, patch them with None values. | False |
--private , -p | flag | Upload dataset as a private repository. | False |
Prodigy-PDF
This plugin contains recipes for annotating PDF files using the familiar image-based image_manual
interface, as well as recipes for OCR (Optical Character Recognition) to extract text-based content from documents.
To use these recipes, you’ll first need to install the plugin. In order for the recipes to work, you may also need to install system dependencies
for tesseract
.
Install prodigy-pdf
pip install "prodigy-pdf @ git+https://github.com/explosion/prodigy-pdf"brew install tesseract # macOssudo apt install tesseract-ocr # Linux
Once it is installed, you can start annotating PDFs as images via
pdf.image.manual
:
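For example, with a placeholder dataset name, PDF folder and label set:
prodigy pdf.image.manual pdf_images ./pdfs --labels TEXT,FIGURE,TABLE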
If you like, you can re-use the pdf annotations with the pdf.ocr.correct
recipe to apply OCR to the annotated segments. This recipe uses
pytesseract under the hood to give
suggestions that you can correct.
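A sketch of such a call, assuming the pdf_images dataset annotated above is passed as the source via the dataset: prefix:
prodigy pdf.ocr.correct pdf_ocr dataset:pdf_images --labels TEXT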
API
pdf.image.manual
manual
Add layout annotations to a PDF.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
pdf_folder | str | Folder that contains your PDF files. | |
--labels , -l | str | Comma-separated labels to annotate. | |
--remove-base64 , -R | bool | Don’t save the base64 images of the PDF. | False |
--split-pages , -S | bool | New: 0.3 View each page as a separate task. By default, multi-page documents are grouped together using the pages interface. | False |
pdf.ocr.correct
manual
Applies Optical Character Recognition (OCR) to annotated segments from pdf.image.manual
and gives a textbox
for corrections.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
source | str | Source with PDF annotations. | |
--labels , -l | str | Labels to consider. | |
--scale , -s | int | Zoom scale. Increase above 3 to upscale the image for OCR. | 3 |
--remove-base64 , -R | bool | Don’t save the base64 images of the PDFs. | False |
--fold-dashes , -f | bool | Removes dashes at the end of a text line and folds them with the next term. | False |
--autofocus , -af | bool | Autofocus on the transcript UI. | False |
pdf.spans.manual
manual New: 0.4.0
This recipe lets you apply span annotations to text-based document contents extracted with spacy-layout
and Docling. For higher annotation speed and efficiency, you can set --focus text,list_item
or similar to walk through individual text blocks of the given layout labels, which are highlighted in a visual preview of the document page. When annotating in focus mode, the span of text you’re working on is preserved as "text_span"
in the JSON data, so you’ll always be able to relate it back to the original full document.
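A sketch of a focus-mode call, with placeholder dataset, folder and label names:
prodigy pdf.spans.manual pdf_spans blank:en ./pdfs --label SKILL,QUALIFICATION --focus text,list_item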
To speed up loading times, you can optionally use pdf.layout.fetch
to pre-fetch the data extracted from the PDFs. You can then run this recipe with the name of a dataset using the dataset:
prefix, or the path to the created JSONL file.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
spacy_model | str | Loadable spaCy model or blank:en or similar for tokenization, and optional named entities. | |
source | str | Path to directory of PDF files, or dataset or JSONL file created with pdf.layout.fetch . | |
--label , -l | str | Comma-separated span labels to annotate. If named entities are added, those will be filtered by the provided labels. | None |
--add-ents , -E | bool | Add named entities predicted by the model. | False |
--focus , -f | str | Annotate in focus mode: comma-separated list of layout span labels to focus on and annotate section-by-section, e.g. text , list_item or section_header . The current section will be highlighted in the visual preview. | None |
--disable , -d | str | Comma-separated layout span labels to disable and make unselectable in the UI. | None |
--split-pages , -S | bool | View pages as separate tasks instead of grouped together using the pages interface. | False |
--hide-preview , -HP | bool | Don’t show side-by-side preview of document layout next to the extracted text. | False |
pdf.layout.fetch
command New: 0.4.0
This recipe lets you preprocess your PDFs to speed up loading times during annotation with pdf.spans.manual
. The data can be saved to a JSONL file, or a Prodigy dataset using the dataset:
prefix. The data includes everything needed to render the examples with the different layout configurations. You only need to decide upfront if you want the data to be paginated or split by sections, since this impacts the JSON structure that will be created.
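A sketch of a pre-fetching call, with placeholder paths; the resulting JSONL file can then be passed to pdf.spans.manual as the source:
prodigy pdf.layout.fetch ./prefetched.jsonl blank:en ./pdfs --focus text,list_item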
Argument | Type | Description | Default |
---|---|---|---|
output | str | Path to JSONL file or name of dataset using the dataset: prefix. | |
spacy_model | str | Loadable spaCy model or blank:en or similar for tokenization. | |
source | str | Path to directory of PDF files. | |
--focus , -f | str | Annotate in focus mode: comma-separated list of layout span labels to focus on and annotate section-by-section, e.g. text , list_item or section_header . The current section will be highlighted in the visual preview. | None |
--split-pages , -S | bool | View pages as separate tasks instead of grouped together using the pages interface. | False |
🤫 Prodigy-Whisper
OpenAI released an open model for audio transcription called Whisper. It can be downloaded and run locally, it supports multiple languages and you can even pick from a selection of model sizes. The model isn’t perfect, but when you’re transcribing audio it can really help to have such a model provide a starting point. The goal of this plugin is to help you get started with this right away.
To use this plugin, you’ll need to install it first.
Install prodigy-whisper
pip install "prodigy-whisper @ git+https://github.com/explosion/prodigy-whisper"
In order to use the plugin you’ll also need to have ffmpeg
installed. Most
package managers should have it available, so you should be able to use one of
the following commands.
Install ffmpeg
# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg
# on Arch Linux
sudo pacman -S ffmpeg
# on MacOS using Homebrew (https://brew.sh/)
brew install ffmpeg
# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg
# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg
Once the plugin is installed you can use the whisper.audio.transcribe
recipe.
It is very similar to the audio.transcribe
recipe that Prodigy provides, but
this recipe uses Whisper to provide an initial transcription.
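A minimal call might look like this, with placeholder dataset and folder names:
prodigy whisper.audio.transcribe speech_transcripts ./recordings --model base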
In its base form you can already see that Whisper does a pretty good job at transcription. But it may be easier to correct short pieces of audio instead of one long recording. This is where Whisper can help out as well. It is able to segment a long audio clip into shorter segments, and each of these segments can then be annotated in Prodigy.
To use this feature, you can add the --segment
flag to the recipe call.
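For example, re-using the placeholder names from above:
prodigy whisper.audio.transcribe speech_transcripts ./recordings --model base --segment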
Now, you can go through the segments one by one and each segment will have metadata attached so that you can link it back to the timestamps in the original file. This is what the first segment would look like.
This is what the second segment would look like.
API
whisper.audio.transcribe
manual
Manually transcribe audio files by typing the transcript into a text field with
the help of Whisper. The API is built on top of audio.transcribe
and will
allow you to configure everything that the original recipe can. The only input
addition is that this recipe also allows you to select a Whisper model. The
recipe uses the "base"
model by default, but you should be able to pick any of
the models listed
here.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
source | str | Path to a directory containing audio files or pre-formatted JSONL file if --loader jsonl is set. | |
--model , -m | str | Name of OpenAI Whisper model to use. | base |
--loader , -lo | str | Optional ID of source loader, e.g. audio or video . | audio |
--autoplay , -A | bool | Autoplay the audio when a new task loads. | False |
--keep-base64 , -B | bool | If audio loader is used: don’t remove the base64-encoded audio data from the task before it’s saved to the database. | False |
--fetch-media , -FM | bool | Convert local paths and URLs to base64. Can be enabled if you’re annotating a JSONL file with paths or for re-annotating an existing dataset. | False |
--playpause-key , -pk | str | Alternative keyboard shortcuts to toggle play/pause so it doesn’t conflict with text input field. | "command+enter, option+enter, ctrl+enter" |
--text-rows , -tr | int | Height of the text input field, in rows. | 6 |
--field-id , -fi | str | Add the transcript text to the data using this key, e.g. "transcript": "Text here" . | "transcript" |
--exclude , -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |
Prodigy-Segment
Sometimes you’re interested in selecting pixels from an image, as opposed to
merely selecting a bounding box. Selecting the right pixels can be tedious work
so you may want to use a model in the loop to help you. A good choice for such a
model is Meta’s Segment Anything model, which
we’ve integrated into Prodigy via the
prodigy-segment
plugin.
This model is able to take bounding box annotations from Prodigy to construct a pixel segmentation map under the hood. From the UI, that might look like this:
Using Prodigy-Segment
For a quick overview of the features, you may also enjoy this YouTube tutorial.
Before you can use the recipes, you’ll want to make sure you’ve downloaded the appropriate model checkpoint. You can check the available models here, but this tutorial will assume the “default” model-type. The weights for this model can be downloaded via:
Download the weights for the `default` model-type
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
Once the model is downloaded you can get started by running the
segment.image.manual
recipe.
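A sketch of such a call, assuming the checkpoint downloaded above, a placeholder image folder and placeholder labels:
prodigy segment.image.manual image_segments ./images sam_vit_h_4b8939.pth --label CAR,BICYCLE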
When you run this model, you may notice that it’s fairly slow. This isn’t a big surprise given the size of the model, but it can be a serious burden, especially if your machine does not have a GPU. For a better experience, you may want to precompute the features ahead of annotation time and cache those results to disk. It may take a while to precompute all the images, but once that is done the annotation experience feels seamless and realtime again.
To precompute a cache, you can use the segment.fill-cache
recipe.
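For example, re-using the placeholder paths from above:
prodigy segment.fill-cache ./images sam_vit_h_4b8939.pth --cache segment-anything-cache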
This will store all the features in a folder (configurable via the --cache
flag) which the segment.image.manual
recipe can immediately pick up.
The pixel maps, once annotated, are stored under the spans
key in your
examples. You can explore these maps one by one in a Jupyter notebook using the
script shown below.
Script to loop over all annotated examples
import base64
from io import BytesIO

from PIL import Image
from prodigy.components.db import connect

db = connect()
examples = db.get_dataset_examples("<dataset-name>")

def mask_to_pil(mask_str):
    indicator = "base64,"
    mask_str = mask_str[mask_str.find(indicator) + len(indicator):]
    buffer = BytesIO(base64.b64decode(mask_str))
    return Image.open(buffer)

# Loop over all the examples and display them.
for ex in examples:
    print(ex['path'])
    for span in ex.get("spans", []):
        # Use builtin `display` to view pixel map
        display(mask_to_pil(span['mask']))
From here you can re-use the Pillow library to either store these pixel maps into the required format for your pipeline or you can stream them directly into a learning algorithm from Python.
API
segment.image.manual
manual
Manually annotate pixels in images with Meta’s Segment Anything model under the hood.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
source | str | Path to a directory containing image files or pre-formatted JSONL file if --loader jsonl is set. | |
checkpoint | Path | Path to a model checkpoint. | |
--label , -l | str / Path | One or more labels to annotate. Supports a comma-separated list or a path to a file with one label per line. | |
--loader , -lo | str | Optional ID of source loader. | images |
--exclude , -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |
--width , -w | int | Width of card and maximum image width in pixels. | 675 |
--darken , -D | bool | Darken image to make boxes stand out more. | False |
--no-fetch , -NF | bool | Don’t fetch images as base64. Ideally requires a JSONL file as input, with --loader jsonl set and all images available as URLs. | False |
--remove-base64 , -R | bool | Remove base64-encoded image data before storing example in the database and only keep the reference to the local file path. Caution: If enabled, make sure to keep original files! | False |
--model-type , -mt | str | Type of model to use. | default |
--cache , -c | Path | Path to feature cache to speed up inference. | segment-anything-cache |
segment.fill-cache
command
Prepares a local disk cache to speed up inference for segment.image.manual
.
This can cause a huge speedup if you’re running on a non-GPU device.
Argument | Type | Description | Default |
---|---|---|---|
source | str | Path to a directory containing image files or pre-formatted JSONL file if --loader jsonl is set. | |
checkpoint | Path | Path to a model checkpoint. | |
--loader , -lo | str | Optional ID of source loader. | images |
--cache , -c | Path | Path to feature cache to speed up inference. | segment-anything-cache |
Prodigy-ANN
Sometimes you may want to query your examples to find a relevant subset for annotation. A modern method for doing this is to use numeric vectors to represent text, and you can use approximate nearest neighbor (ANN) techniques to fetch relevant examples. The goal is to spend more time looking at examples that matter, like examples similar to items that the model gets wrong. Curating these examples first might be a pragmatic method to steer the model in the right direction.
If you’d like to see a quick demo of Prodigy-ANN applied to a text dataset, you may appreciate this Prodigy short on YouTube.
To use this plugin, you’ll need to install it first.
Install prodigy-ann
pip install "prodigy-ann @ git+https://github.com/explosion/prodigy-ann"
As a first step, you’ll need to generate an index with vector representations of your text. To encode the text, this library uses sentence-transformers, and it uses hnswlib as an index for these vectors.
To index your documents, you can run the ann.text.index
recipe.
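For example, with placeholder file names:
prodigy ann.text.index ./examples.jsonl ./news.index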
Once it is indexed you can use text queries to find and curate interesting subsets.
A general method to prepare these subsets is to use ann.text.fetch
. This will
fetch a subset of vectors that are close in vector space and save the associated
examples on disk. From there you can use any Prodigy recipe you like.
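A sketch of such a fetch, with a placeholder query:
prodigy ann.text.fetch ./examples.jsonl ./news.index ./subset.jsonl --query "returns and refunds"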
More interfaces
As a convenience this plugin also provides the textcat.ann.manual
,
ner.ann.manual
and spans.ann.manual
recipes so that you may query and annotate
directly. These recipes have the same arguments as their native Prodigy
textcat.manual
, ner.manual
and spans.manual
counterparts
but add a --query
parameter so that you may pass your query.
Interactive Queries
Sometimes you may want to update the stream while you’re annotating. You can do
that without restarting the server by using the --allow-reset
flag when you’re
starting the textcat.ann.manual
, ner.ann.manual
or spans.ann.manual
recipes.
Here’s an example of what the experience might look like from the UI.
Retrieving Images
You can use these embedding retrieval techniques for images too. Models like CLIP allow you to embed images and text in the same space, which means that you can query the images by using text.
The approach for images is very similar to the approach for text too. To get
started you’ll first want to run an indexing recipe over a folder of images via
the ann.image.index
recipe.
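For example, with placeholder paths:
prodigy ann.image.index ./images ./img.index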
Once the index is built, you can query it. You can choose to prepare
a .jsonl
file to re-use later via the ann.image.fetch
recipe.
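A sketch of such a fetch, with a placeholder query:
prodigy ann.image.fetch ./images ./img.index ./subset.jsonl --query "a dog playing in the snow" -n 100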
Alternatively the plugin also provides a wrapper around the familiar
image.manual
recipe. This will retreive the images before passing it on
to the image_manual
interface. This interface also allows you to reset
the stream via the --allow-reset
flag.
Here’s an example of what the experience might look like from the UI.
API
ann.text.index
command
Builds an HNSWlib index on example text data.
Argument | Type | Description | Default |
---|---|---|---|
source | Path | Path to source to index. | |
index_path | Path | Path of trained index |
ann.text.fetch
command
Fetch a relevant subset using a HNSWlib index.
Argument | Type | Description | Default |
---|---|---|---|
source | Path | Path to source to index. | |
index_path | Path | Path of trained index | |
out_path | Path | Path to stored subset of interest | |
--query , -q | str | Query to encode and pass to index | |
--n , -n | int | Number of results to return from index | 200
ann.image.index
command
Builds an HNSWlib index on example image data.
Argument | Type | Description | Default |
---|---|---|---|
source | Path | Path to source folder of images to index. | |
index_path | Path | Path of trained index |
ann.image.fetch
command
Fetch a relevant subset of images using a HNSWlib index.
Argument | Type | Description | Default |
---|---|---|---|
source | Path | Path to source folder of images for index. | |
index_path | Path | Path of trained index | |
out_path | Path | Path to stored subset of interest | |
--query , -q | str | Query to encode and pass to index | |
-n | int | Number of items to retrieve | 200 |
--remove-base64 , -R | bool | Don’t save the base64 images on disk | False |
Prodigy-Lunr
Instead of using semantic vectors with approximate nearest neighbors to find
relevant subsets you can also resort to the “regular” search techniques. To
accommodate these techniques we’ve added support for recipes that use
lunr. These recipes are very similar
to their ann.*
counterparts but will rely on string matching techniques to
retrieve relevant examples.
To use this plugin, you’ll need to install it first.
Install prodigy-lunr
pip install "prodigy-lunr @ git+https://github.com/explosion/prodigy-lunr"
To index your documents, you can run the lunr.text.index
recipe. This will
generate an index and serialize it to disk by writing it into a gzipped json
file.
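For example, with placeholder file names:
prodigy lunr.text.index ./examples.jsonl ./lunr_index.gz.json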
Once it is indexed you can use text queries to find and curate interesting subsets.
A general method to prepare these subsets is to use lunr.text.fetch
. This will
fetch a subset of examples that match the query and save the associated
examples on disk. From there you can use any Prodigy recipe you like.
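A sketch of such a fetch, with a placeholder query:
prodigy lunr.text.fetch ./examples.jsonl ./lunr_index.gz.json ./subset.jsonl --query "insurance claim"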
More interfaces
As a convenience this plugin also provides the textcat.lunr.manual
,
ner.lunr.manual
and spans.lunr.manual
recipes so that you may query and annotate
directly. These recipes have the same arguments as their native Prodigy
textcat.manual
, ner.manual
and spans.manual
counterparts
but add a --query
parameter so that you may pass your query.
Interactive Queries
Sometimes you may want to update the stream while you’re annotating. You can do
that without restarting the server by using the --allow-reset
flag when you’re
starting the textcat.lunr.manual
, ner.lunr.manual
or spans.lunr.manual
recipes.
API
lunr.text.index
command
Builds a lunr index on example text data.
Argument | Type | Description | Default |
---|---|---|---|
source | Path | Path to source to index. | |
index_path | Path | Path to stored lunr index |
lunr.text.fetch
command
Fetch a relevant subset using a lunr index.
Argument | Type | Description | Default |
---|---|---|---|
source | Path | Path to source to index. | |
index_path | Path | Path to stored lunr index | |
out_path | Path | Path to stored subset of interest | |
--query , -q | str | Query to encode and pass to index |
Sense2vec
sense2vec (Trask et al., 2015) is a nice twist on word2vec that lets you learn more interesting and detailed word vectors. This library is a simple Python implementation for loading, querying and training sense2vec models. To explore the semantic similarities across all Reddit comments of 2015 and 2019, see the interactive demo. There are also more details in this blog post.
To see a demo on how to use this tool with Prodigy, you may enjoy this YouTube video where we use it to detect video games in text.
To use sense2vec, you’ll first need to install it.
python -m pip install sense2vec
To use the pre-trained vectors in Prodigy you’ll need to download the archive(s) and extract them. Large files have been split into multi-part downloads. All the available versions can be found below.
Vectors | Size | Description | Download Link (zipped) |
---|---|---|---|
s2v_reddit_2019_lg | 4 GB | Reddit comments 2019 (01-07) | part 1, part 2, part 3 |
s2v_reddit_2015_md | 573 MB | Reddit comments 2015 | part 1 |
To merge the multi-part archives, you can run the following:
cat s2v_reddit_2019_lg.tar.gz.* > s2v_reddit_2019_lg.tar.gz
Once downloaded (and merged) you should be able to unarchive via:
tar -xvf s2v_reddit_2019_lg.tar.gz
Now that the archive is extracted you can point the sense2vec.teach
recipe to
it. This will allow Prodigy to suggest similar terms based on the most similar
phrases from sense2vec, and the suggestions will be adjusted as you annotate and
accept similar phrases. For each seed term, the best matching sense according to
the sense2vec vectors will be used.
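A sketch of such a call, assuming the extracted s2v_reddit_2019_lg vectors and placeholder dataset and seed terms:
prodigy sense2vec.teach video_game_terms ./s2v_reddit_2019_lg --seeds "mass effect,halo 3,jade empire"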
After curating the generated examples you can choose to export the collected
phrases as pattern files which can be used with
spaCy’s EntityRuler
or recipes like ner.manual
by using the sense2vec.to-patterns
recipe.
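A sketch of such a conversion, re-using the placeholder dataset from above (any spaCy model can be used for tokenization):
prodigy sense2vec.to-patterns video_game_terms en_core_web_sm VIDEO_GAME --output-file patterns.jsonl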
This will generate a patterns.jsonl
file locally that has contents that may
look like:
{"label": "VIDEO_GAME", "pattern": [{"LOWER": "mass"}, {"LOWER": "effect"}]}{"label": "VIDEO_GAME", "pattern": [{"LOWER": "knights"}, {"LOWER": "of"}, {"LOWER": "the"}, {"LOWER": "old"}, {"LOWER": "republic"}]}{"label": "VIDEO_GAME", "pattern": [{"LOWER": "halo"}, {"LOWER": "3"}]}{"label": "VIDEO_GAME", "pattern": [{"LOWER": "jade"}, {"LOWER": "empire"}]}
More recipes
Sense2vec also has the
sense2vec.eval
,
sense2vec.eval-most-similar
and
sense2vec.eval-ab
recipes. These may be useful if you’re interested in evaluating a sense2vec
model. For more information on those, you can check the
README on
the GitHub repository.
sense2vec.teach
binary
Bootstrap a terminology list using sense2vec.
Argument | Type | Description | Default |
---|---|---|---|
dataset | positional | Dataset to save annotations to. | |
vectors_path | positional | Path to pretrained sense2vec vectors. | |
--seeds , -s | option | One or more comma-separated seed phrases. | |
--threshold , -t | option | Similarity threshold. | 0.85 |
--n-similar , -n | option | Number of similar items to get at once. | 100 |
--batch-size , -b | option | Batch size for submitting annotations. | 5 |
--case-sensitive , -CS | option | Show the same terms with different casing. | False |
--resume , -R | flag | Resume from an existing phrases dataset. | False |
sense2vec.to-patterns
command
Convert a dataset of phrases collected with sense2vec.teach to token-based match patterns.
Argument | Type | Description | Default |
---|---|---|---|
dataset | positional | Phrase dataset to convert. | |
spacy_model | positional | spaCy model for tokenization. | |
label | positional | Label to apply to all patterns. | |
--output-file , -o | option | Optional output file. Defaults to stdout. | |
--case-sensitive , -CS | flag | Make patterns case-sensitive. | False |
--dry , -D | flag | Perform a dry run and don’t output anything. | False |
🔎 Prodigy-Evaluate
This Prodigy plugin allows you to evaluate your spaCy pipeline overall or on a per-example basis. To use these recipes, you’ll first need to install the plugin.
Install prodigy-evaluate
pip install "prodigy-evaluate @ git+https://github.com/explosion/prodigy-evaluate"
Once installed, you can make use of the two main recipes in this plugin:
evaluate.evaluate
and evaluate.evaluate-example
.
evaluate.evaluate
command
This recipe allows you to evaluate a spaCy pipeline on one or more datasets for
different components. Per-component datasets can be passed in the same way as in
the case of Prodigy’s train recipe, e.g. --ner my_eval_dataset_1,my_eval_dataset_2 .
The --label-stats
flag lets you investigate per-label scores like precision,
recall and F1 scores for NER
and textcat
components. The
--confusion-matrix
flag will output a confusion matrix for the NER
and
textcat
components. If you’d like to customize how the confusion matrix is
rendered, you can save an array of the confusion matrix by passing an output
path via the --cf-path
argument and use it with your favourite data
visualization library. Please note that a separate inference is run to obtain the confusion matrix and as results are not deterministic, there may be slight variations in evaluation and confusion matrix results.
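A sketch of such an evaluation, with placeholder pipeline and dataset names:
prodigy evaluate.evaluate en_core_web_sm --ner my_eval_dataset --label-stats --confusion-matrix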
Argument | Type | Description | Default |
---|---|---|---|
model | str | Name of spaCy pipeline to evaluate. | |
--ner | str | One or more (comma-separated) datasets for the named entity recognizer. | None |
--textcat | str | One or more (comma-separated) datasets for the text classifier (exclusive categories). | None |
--textcat-multilabel | str | One or more (comma-separated) datasets for the text classifier (non-exclusive categories). | None |
--senter | str | One or more (comma-separated) datasets for the sentence recognizer. | None |
--parser | str | One or more (comma-separated) datasets for the dependency parser. | None |
--tagger | str | One or more (comma-separated) datasets for the part-of-speech tagger. | None |
--spancat | str | One or more (comma-separated) datasets for the span categorizer. | None |
--coref | str | One or more (comma-separated) datasets for the coreference model. Requires spacy-experimental. | None |
--label-stats , -LS | bool | Compute per-label statistics for NER and textcat components. | False |
--gpu_id | int | ID of the GPU to use. | -1 |
--verbose | bool | Print detailed information about the evaluation. | False |
--confusion-matrix , -CF | bool | Compute confusion matrix for NER, textcat and textcat-multilabel components. | False |
--cf-path , -CP | str | Local path to save the confusion matrix to. Available for NER, textcat and textcat-multilabel components. | None |
--spans-key | str | Key to use for spans in the evaluation data. | sc |
evaluate.evaluate-example
command
Evaluate a spaCy pipeline on one or more datasets for different components on a
per-example basis. Datasets are provided in the same per-component format as
the prodigy evaluate command e.g. --ner my_eval_dataset_1,my_eval_dataset_2
.
This command will run an evaluation on each example individually and then sort
by the desired --metric
argument.
This is helpful for debugging and for understanding the hardest or easiest examples for your model. The example below shows how to evaluate a model on a dataset on a per-example basis and sort by the lowest NER F1 score.
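A sketch of such a call, with placeholder pipeline and dataset names; ents_f is the NER F1 metric from the list below:
prodigy evaluate.evaluate-example en_core_web_sm --ner my_eval_dataset --metric ents_f --n-results 10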
If you would like to save the top examples sorted by your metric, you can use
the --output-path
argument to save the examples in .jsonl
format to file. If
you’re evaluating an NER, spancat or textcat pipeline, this .jsonl file could
file could
then be used as input to Prodigy correct
( ner.correct
,
spans.correct
, textcat.correct
) or model-annotate
( ner.model-annotate
, spans.model-annotate
,
textcat.model-annotate
) workflows to quickly inspect your model’s
predictions on the hardest examples.
Argument | Type | Description | Default |
---|---|---|---|
model | str | Name of spaCy pipeline to evaluate. | |
--ner | str | One or more (comma-separated) datasets for the named entity recognizer. | None |
--textcat | str | One or more (comma-separated) datasets for the text classifier (exclusive categories). | None |
--textcat-multilabel | str | One or more (comma-separated) datasets for the text classifier (non-exclusive categories). | None |
--senter | str | One or more (comma-separated) datasets for the sentence recognizer. | None |
--parser | str | One or more (comma-separated) datasets for the dependency parser. | None |
--tagger | str | One or more (comma-separated) datasets for the part-of-speech tagger. | None |
--spancat | str | One or more (comma-separated) datasets for the span categorizer. | None |
--coref | str | One or more (comma-separated) datasets for the coreference model. Requires spacy-experimental . | None |
--metric | str | The metric to sort the examples by. The following metrics are supported: token_acc , tag_acc , pos_acc , morph_acc , lemma_acc , dep_uas , dep_las , ents_p , ents_r , ents_f , cats_score , sents_p , sents_r , sents_f , spans_sc_p , spans_sc_r , spans_sc_f , speed . Please choose a metric most appropriate to your model. | None |
--n-results , -NR | int | Number of top examples to display. | 10 |
--gpu_id | int | ID of the GPU to use. | -1 |
--verbose | bool | Print detailed information about the evaluation. | False |
--output-path , -OP | str | Path to a jsonl file to save the scored examples to. | None |
evaluate.nervaluate
command
Evaluate a spaCy NER component using full named-entity evaluation metrics based
on SemEval ‘13. Datasets are provided in the same per-component format as the
prodigy evaluate command e.g. --ner my_eval_dataset_1,my_eval_dataset_2
.
This command leverages the nervaluate
Python library to “go beyond a simple
token/tag based schema, and consider different scenarios based on whether all
the tokens that belong to a named entity were classified or not, and also
whether the correct entity type was assigned.”
This is helpful if you are interested in partial matches as part of your NER
evaluation use-case. If you are interested in per-label evaluation metrics, you
can pass the --per-label
flag to the command.
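A sketch of such a call, with placeholder pipeline and dataset names:
prodigy evaluate.nervaluate en_core_web_sm --ner my_eval_dataset --per-label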
Argument | Type | Description | Default |
---|---|---|---|
model | str | Name of spaCy pipeline to evaluate. Must have a trained NER model to evaluate. | |
--ner | str | One or more (comma-separated) datasets for the named entity recognizer. Use the eval: prefix for evaluation sets. | None |
--gpu_id | int | ID of the GPU to use. | -1 |
--verbose | bool | Print detailed information about the evaluation. | False |
--per-label | bool | Print per-label NER metrics to the terminal. | False