Built-in Recipes
A Prodigy recipe is a Python function that can be run via the command line. Prodigy comes with lots of useful recipes, and it’s very easy to write your own. Recipes don’t have to start the web server – you can also use the recipe decorator as a quick way to make your Python function into a command-line utility. To view the recipe arguments and documentation on the command line, run the command with --help, for example prodigy ner.manual --help.
Named Entity Recognition | Tag names and concepts as spans in text. |
Span Categorization | Label arbitrary and potentially overlapping spans in text. |
Text Classification | Assign one or more categories to whole texts. |
Part-of-speech Tagging | Assign part-of-speech tags to tokens. |
Sentence Segmentation | Assign sentence boundaries. |
Dependency Parsing | Assign and correct syntactic dependency attachments in text. |
Coreference Resolution | Resolve mentions and references to the same words in text. |
Relations | Annotate any relations between words and phrases. |
Computer Vision | Annotate images and image segments. |
Audio & Video | Annotate and segment audio and video files. |
Large Language Models | Perform zero or few-shot annotation using large-language models. |
Training | Train models and export training corpora. |
Vectors & Terminology | Create patterns and terminology lists from word vectors. |
Review & Evaluate | Review data, resolve conflicts and compute inter-annotator agreement. |
Utilities & Commands | Manage datasets, view data and streams, and more. |
Plugins | Extend Prodigy with more workflows, e.g. PDFs, Hugging Face and more. |
Deprecated Recipes | Recipes that have already been replaced by better alternatives. |
Named Entity Recognition
ner.manual
manual
Mark entity spans in a text by highlighting them and selecting the respective labels. The model is used to tokenize the text to allow less sensitive highlighting, since the token boundaries are used to set the entity spans. The label set can be defined as a comma-separated list on the command line or as a path to a text file with one label per line. If no labels are specified, Prodigy will check if labels are present in the model. This recipe does not require an entity recognizer, and doesn’t do any active learning.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
spacy_model | str | Loadable spaCy pipeline for tokenization or blank:lang for a blank model (e.g. blank:en for English). | |
source | str | Path to text source or - to read from standard input. | |
--loader , -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
--label , -l | str | One or more labels to annotate. Supports a comma-separated list or a path to a file with one label per line. If no labels are set, Prodigy will check the model for available labels. | |
--patterns , -pt | str | New: 1.9 Optional path to match patterns file to pre-highlight entity spans. | None |
--exclude , -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |
--highlight-chars , -C | bool | New: 1.14.5 Allow switching between highlighting individual characters and tokens. If set, character highlighting is enabled by default and no "tokens" information will be saved with the example. |
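For example, a first manual NER session might be started like this (the dataset name and source file are illustrative):
prodigy ner.manual ner_news blank:en ./news_headlines.jsonl --label PERSON,ORG,PRODUCT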
ner.correct
manual
Create gold-standard data for NER by correcting the model’s suggestions. The spaCy pipeline will be used to predict entities contained in the text, which the annotator can remove and correct if necessary.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
spacy_model | str | Loadable spaCy pipeline. | |
source | str | Path to text source or - to read from standard input. | |
--loader , -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
--label , -l | str | One or more labels to annotate. Supports a comma-separated list or a path to a file with one label per line. If no labels are set, Prodigy will check the model for available labels. | |
--update , -UP | bool | New: 1.11 Update the model in the loop with the received annotations. | False |
--exclude , -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |
--unsegmented , -U | bool | Don’t split sentences. | False |
--component , -c | str | New: 1.11 Name of NER component in the pipeline. | "ner" |
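A typical correction session might look like this (all names are illustrative); adding --update also refreshes the model in the loop with your corrections:
prodigy ner.correct gold_ner en_core_web_sm ./news_headlines.jsonl --label PERSON,ORG --update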
ner.teach
binary
Collect the best possible training data for a named entity recognition model with the model in the loop. Based on your annotations, Prodigy will decide which questions to ask next. If the suggested entity is fully correct, you can accept it. If it’s entirely or partially wrong, you should reject it. As of v1.11, the recipe will also ask you about examples containing no entities at all, which can improve overall accuracy of your model. So if you see an example with no highlighted suggestions, you can accept it if the text contains no entities, or reject it if it does contain entities of the labels you’re annotating.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
spacy_model | str | Loadable spaCy pipeline. | |
source | str | Path to text source or - to read from standard input. | |
--loader , -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
--label , -l | str | Label(s) to annotate. Accepts single label or comma-separated list. If not set, all available labels will be returned. | None |
--patterns , -pt | str | Optional path to match patterns file to pre-highlight entity spans in addition to those suggested by the model. | None |
--exclude , -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |
--unsegmented , -U | bool | Don’t split sentences. | False |
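A hypothetical session bootstrapped with a patterns file might be started like this:
prodigy ner.teach ner_drugs en_core_web_sm ./reddit_comments.jsonl --label DRUG --patterns ./drug_patterns.jsonl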
ner.silver-to-gold
manual
Take existing “silver” datasets with binary accept/reject annotations, merge the annotations to find the best possible analysis given the constraints defined in the annotations, and manually edit it to create a perfect and complete “gold” dataset.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset ID to save annotations to. | |
silver_sets | str | Comma-separated names of existing binary datasets to convert. | |
spacy_model | str | Loadable spaCy pipeline. | |
--label , -l | str | One or more labels to annotate. Supports a comma-separated list or a path to a file with one label per line. If no labels are set, Prodigy will check the model for available labels. |
ner.eval-ab
binary
Load two models and a stream of text, compare their predictions and select which result you prefer. The outputs will be randomized, so you won’t know which model is which. When you stop the server, the results are calculated. This recipe is especially helpful if you’re updating an existing model or if you’re trying out a new strategy on the same problem. Even if two models achieve similar accuracy, one of them can still be subjectively “better”, so this recipe lets you analyze that.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
model_a | str | First loadable spaCy pipeline to compare. | |
model_b | str | Second loadable spaCy pipeline to compare. | |
source | str | Path to text source or - to read from standard input. | |
--loader , -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
--label , -l | str | One or more labels to annotate. Supports a comma-separated list or a path to a file with one label per line. If no labels are set, Prodigy will check the model for available labels. | |
--exclude , -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |
--unsegmented , -U | bool | Don’t split sentences. | False |
ner.model-annotate
commandNew: 1.13.1
Leverage a model to add NER annotations to the database. You can repeat this
multiple times with different models so that you may easily compare their
predictions using the review
recipe and curate examples where models
disagree.
For more information on this method of annotating, see this guide.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
spacy_model | str | Loadable spaCy pipeline that can do named entity recognition. | |
source | str | Path to text source or - to read from standard input. | |
model_alias | str | Model alias to be used as “annotator id” in the UI. | |
--label , -l | str | Optional subset of labels to annotate. Supports a comma-separated list or a path to a file with one label per line. If no labels are set, Prodigy will check the model for available labels. | None |
--loader , -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
component | str | Specific NER component to use in the spaCy pipeline. | ner |
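For instance, you might annotate the same source with two different pipelines under different aliases (all names illustrative) and then compare them with review:
prodigy ner.model-annotate model_annotations en_core_web_sm ./news.jsonl sm_model
prodigy ner.model-annotate model_annotations en_core_web_trf ./news.jsonl trf_model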
Span Categorization
spans.manual
manualNew: 1.11
Mark entity spans in a text by highlighting them and selecting the respective labels. The model is used to tokenize the text to allow less sensitive highlighting, since the token boundaries are used to set the entity spans. The label set can be defined as a comma-separated list on the command line or as a path to a text file with one label per line. If no labels are specified, Prodigy will check if labels are present in the model. This recipe does not require an entity recognizer, and doesn’t do any active learning.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
spacy_model | str | Loadable spaCy pipeline for tokenization or blank:lang for a blank model (e.g. blank:en for English). | |
source | str | Path to text source or - to read from standard input. | |
--loader , -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
--label , -l | str | One or more labels to annotate. Supports a comma-separated list or a path to a file with one label per line. If no labels are set, Prodigy will check the model for available labels. | |
--patterns , -pt | str | Optional path to match patterns file to pre-highlight entity spans. | None |
--suggester , -sg | str | Optional name of suggester function registered in spaCy’s misc registry. If set, annotations will be validated against the suggester during annotation and you will see an error if the annotation doesn’t match any suggestions. Should be a function that creates the suggester with all required arguments. You can use the -F option to provide a Python file. | None |
--exclude , -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |
--highlight-chars , -C | bool | New: 1.14.5 Allow switching between highlighting individual characters and tokens. If set, character highlighting is enabled by default and no "tokens" information will be saved with the example. |
If you’re using a custom
suggester function for the
span categorizer, you can provide it via the --suggester
argument and Prodigy
will validate submitted annotations against it as you annotate. If you’re not
using a suggester, data-to-spacy
and train
will infer the
best-matching ngram suggester based on the available span annotations in your
data.
suggester.py
from spacy import registry
from spacy.pipeline.spancat import build_ngram_suggester

@registry.misc("123_ngram_suggester.v1")
def custom_ngram_suggester():
    return build_ngram_suggester(sizes=[1, 2, 3])  # all ngrams of size 1, 2 and 3
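Assuming the file above is saved as suggester.py, you might then point spans.manual at the registered function (dataset and source names are illustrative):
prodigy spans.manual spans_food blank:en ./recipes.jsonl --label DISH,INGREDIENT --suggester 123_ngram_suggester.v1 -F suggester.py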
spans.correct
manualNew: 1.11.1
Create gold-standard data for span categorization by correcting the model’s
predictions. Requires a spaCy pipeline with a
trained span categorizer and will show all
spans in the given group. To customize the span group to read from, you can use
the --key
argument.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
spacy_model | str | Loadable spaCy pipeline. | |
source | str | Path to text source or - to read from standard input. | |
--loader , -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
--label , -l | str | One or more labels to annotate. Supports a comma-separated list or a path to a file with one label per line. If no labels are set, Prodigy will check the model for available labels. | |
--update , -UP | bool | Update the model in the loop with the received annotations. | False |
--exclude , -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |
--component , -c | str | Name of span categorizer component in the pipeline. | "spancat" |
spans.model-annotate
commandNew: 1.13.1
Leverage a model to add span annotations to the database. You can repeat this
multiple times with different models so that you may easily compare their
predictions using the review
recipe and curate examples where models
disagree.
For more information on this method of annotating, see this guide.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
spacy_model | str | Loadable spaCy pipeline that can do span categorization. | |
source | str | Path to text source or - to read from standard input. | |
model_alias | str | Model alias to be used as “annotator id” in the UI. | |
--labels , -l | str | Optional subset of labels to annotate. Supports a comma-separated list or a path to a file with one label per line. If no labels are set, Prodigy will check the model for available labels. | None |
--loader , -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
component | str | Specific spancat component to use in the spaCy pipeline. | spancat |
Text Classification
textcat.manual
manual
Manually annotate categories that apply to a text. If only one label is set, the
classification
interface is used. If more than one label is specified,
the choice
interface is used and categories are added as multiple
choice options. If the --exclusive
flag is set, categories become mutually
exclusive, meaning that only one can be selected during annotation.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
source | str | Path to text source or - to read from standard input. | |
--loader , -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
--label , -l | str | Category label to apply. | '' |
--exclusive, -E | bool | Treat labels as mutually exclusive. If not set, an example may have multiple correct classes. | False |
--exclude , -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |
--accept_empty , -ae | bool | New: 1.14.7 Allow empty choices, even when annotating mutually exclusive classes. | False |
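A hypothetical multiple-choice setup with mutually exclusive labels might look like:
prodigy textcat.manual news_topics ./news_headlines.jsonl --label SPORTS,POLITICS,TECHNOLOGY --exclusive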
textcat.correct
manualNew: 1.11
Create training data for an existing trained text classification model by
correcting the model’s suggestions. The --threshold
is used to determine
whether a label should be pre-selected, e.g. if it’s set to 0.5
(default), all
labels with a score of 0.5
and above will be checked automatically. Prodigy
will automatically infer whether the categories are mutually exclusive, based on
the component configuration.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
spacy_model | str | Loadable spaCy pipeline. | |
source | str | Path to text source or - to read from standard input. | |
--loader , -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
--label , -l | str | One or more labels to annotate. Supports a comma-separated list or a path to a file with one label per line. If no labels are set, Prodigy will check the model for available labels. | |
--update , -UP | bool | Update the model in the loop with the received annotations. | False |
--threshold , -t | float | Score threshold to pre-select label, e.g. 0.75 to select all labels with a score of 0.75 and above. | 0.5 |
--component , -c | str | Name of text classification component in the pipeline. Will be guessed if not set. | None |
--exclude , -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |
--accept-empty , -ae | bool | New: 1.14.7 Allow empty choices, even when annotating mutually exclusive classes. | False |
textcat.teach
binary
Collect the best possible training data for a text classification model by using
a model in the loop. Based on your annotations, Prodigy will decide which
questions to ask next. All annotations will be stored in the database. If a
patterns file is supplied via the --patterns
argument, the matches will be
included in the stream and the matched spans are highlighted, so you’re able to
tell which words or phrases the selection was based on. Note that the exact
pattern matches have no influence when updating the model – they’re only used to
help pre-select examples for annotation.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
spacy_model | str | Loadable spaCy pipeline or blank:lang for a blank model (e.g. blank:en for English). | |
source | str | Path to text source or - to read from standard input. | |
--loader , -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
--label , -l | str | Category label to apply. | '' |
--patterns , -pt | str | Optional path to match patterns file to filter out examples containing terms and phrases. | None |
--exclude , -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |
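For example, a session that pre-selects examples via a patterns file might be started like this (all names illustrative):
prodigy textcat.teach textcat_insults en_core_web_sm ./comments.jsonl --label INSULT --patterns ./insult_patterns.jsonl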
textcat.model-annotate
commandNew: 1.13.1
Leverage a model to add textcat annotations to the database. You can repeat this
multiple times with different models so that you may easily compare their
predictions using the review
recipe and curate examples where models
disagree.
For more information on this method of annotating, see this guide.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
spacy_model | str | Loadable spaCy pipeline that can do text classification. | |
source | str | Path to text source or - to read from standard input. | |
model_alias | str | Model alias to be used as “annotator id” in the UI. | |
--labels , -l | str | Optional subset of labels to annotate. Supports a comma-separated list or a path to a file with one label per line. If no labels are set, Prodigy will check the model for available labels. | None |
--loader , -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
threshold | str | Override the threshold for the classification model. | None |
component | str | Specific textcat component to use in the spaCy pipeline. Will try to make an educated guess if no component is passed. | None |
Part-of-speech Tagging
pos.correct
manual
Create gold-standard data for part-of-speech tagging by correcting the model’s
suggestions. The spaCy pipeline will be used to predict fine-grained
part-of-speech tags (Token.tag_
), which the annotator can remove and correct
if necessary. It’s often more efficient to focus on a few labels at a time,
instead of annotating all labels jointly.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
spacy_model | str | Loadable spaCy pipeline. | |
source | str | Path to text source or - to read from standard input. | |
--loader , -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
--label , -l | str | One or more tags to annotate. Supports a comma-separated list or a path to a file with one label per line. If not set, all tags are shown. | |
--exclude , -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |
--unsegmented , -U | bool | Don’t split sentences. | False |
pos.teach
binary
Collect the best possible training data for a part-of-speech tagging model with the model in the loop. Based on your annotations, Prodigy will decide which questions to ask next. It’s often more efficient to focus on a few labels at a time, instead of annotating all labels jointly.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
spacy_model | str | Loadable spaCy pipeline. | |
source | str | Path to text source or - to read from standard input. | |
--loader , -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
--label , -l | str | Label(s) to annotate. Accepts single label or comma-separated list. If not set, all available labels will be returned. | None |
--exclude , -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |
--unsegmented , -U | bool | Don’t split sentences. | False |
Sentence Segmentation
sent.correct
manualNew: 1.11
Create gold-standard data for sentence segmentation by correcting the model’s
suggestions. The spaCy pipeline will be used to predict sentence boundaries,
which the annotator can correct if necessary. The recipe uses the label S
to
mark tokens that start a sentence. You can double-click a sentence start token
in the UI to add a new sentence boundary, or click on an incorrect prediction to
remove it.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
spacy_model | str | Loadable spaCy pipeline. | |
source | str | Path to text source or - to read from standard input. | |
--loader , -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
--exclude , -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |
sent.teach
binary
Collect the best possible training data for a sentence segmentation model with
the model in the loop. Based on your annotations, Prodigy will decide which
questions to ask next. The recipe uses S
to mark tokens that start sentences
and I
for all other tokens. You can then hit accept or
reject, depending on whether the suggested token is correctly
labelled as a sentence start or other token.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
spacy_model | str | Loadable spaCy pipeline with SentenceRecognizer. | |
source | str | Path to text source or - to read from standard input. | |
--loader , -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
--exclude , -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |
Dependency Parsing
dep.correct
manualNew: 1.10
Create gold-standard data for dependency parsing by correcting the model’s
suggestions. The spaCy pipeline will be used to predict dependencies for the
given labels, which the annotator can remove and correct if necessary. If
--update
is set, the model in the loop will be updated with the annotations
and its updated predictions will be reflected in future batches. The recipe
performs no example selection and all texts will be shown as they come in.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
spacy_model | str | Loadable spaCy pipeline with a dependency parser. | |
source | str | Path to text source, - to read from standard input. | |
--loader , -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
--label , -l | str | Label(s) to annotate. Accepts single label or comma-separated list. If not set, all available labels will be used. | None |
--update , -U | bool | Whether to update the model in the loop during annotation. | False |
--wrap , -W | bool | Wrap lines in the UI by default (instead of showing tokens in one row). | False |
--unsegmented , -U | bool | Don’t split sentences. | False |
--exclude , -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |
dep.teach
binary
Collect the best possible training data for a dependency parsing model with the model in the loop. Based on your annotations, Prodigy will decide which questions to ask next. It’s often more efficient to focus on a few most relevant labels at a time, instead of annotating all labels jointly.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
spacy_model | str | Loadable spaCy pipeline. | |
source | str | Path to text source or - to read from standard input. | |
--loader , -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
--label , -l | str | Label(s) to annotate. Accepts single label or comma-separated list. If not set, all available labels will be returned. | None |
--exclude , -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |
--unsegmented , -U | bool | Don’t split sentences. | False |
Coreference Resolution
coref.manual
manualNew: 1.10
Create training data for coreference resolution. Coreference resolution is the challenge of linking ambiguous mentions such as “her” or “that woman” back to an antecedent providing more context about the entity in question. This recipe allows you to focus on nouns, proper nouns and pronouns specifically, by disabling all other tokens. You can customize the labels used to extract those using the recipe arguments. Also see the usage guide on coreference annotation.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
spacy_model | str | Loadable spaCy pipeline with the required capabilities (entity recognizer, part-of-speech tagger) or blank:lang for a blank model (e.g. blank:en for English). | |
source | str | Path to text source, - to read from standard input or dataset:name to load from existing annotations. | |
--loader , -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
--label , -l | str | Label(s) to use for coreference annotation. Accepts single label or comma-separated list. | "COREF" |
--pos-tags , -ps | str | List of coarse-grained POS tags to enable for annotation. | "NOUN,PROPN,PRON,DET" |
--poss-pron-tags , -pp | str | List of fine-grained tag values for possessive pronoun to use. | "PRP$" |
--ner-labels , -nl | str | List of NER labels to use if model has a named entity recognizer. | "PERSON,ORG" |
--exclude , -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |
Relations
rel.manual
manualNew: 1.10
Annotate directional relations and dependencies between tokens and
expressions by selecting the head, child and dependency label and optionally
assign labelled spans for named entities or other expressions. This workflow
is extremely powerful and can be used for basic dependency annotation, as well
as joint named entity and entity relation annotation. If --span-label
defines
additional span labels, a second mode for span highlighting is added.
The recipe lets you take advantage of several efficiency tricks: spans can
be pre-defined using an existing NER dataset, entities or noun phrases from a
model or fully custom match patterns. You can also disable certain tokens to
make them unselectable. This lets you focus on what matters and prevents
annotators from introducing mistakes. For more details and examples, check out
the
usage guide on custom relation annotation
and see the task-specific recipes dep.correct
and coref.manual
that include pre-defined configurations.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
spacy_model | str | Loadable spaCy pipeline with the required capabilities (if entities or noun phrases should be merged) or blank:lang for a blank model (e.g. blank:en for English). | |
source | str | Path to text source, - to read from standard input or dataset:name to load from existing annotations. | |
--loader , -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
--label , -l | str | Label(s) to annotate. Accepts single label or comma-separated list. | None |
--span-label , -sl | str | Optional span label(s) to annotate. If set, an additional span highlighting mode is added. | None |
--patterns , -pt | str | Path to patterns file defining spans to be added and merged. | None |
--disable-patterns , -dpt | str | Path to patterns file defining tokens to disable (make unselectable). | None |
--add-ents , -AE | bool | Add entities predicted by the model. | False |
--add-nps , -AN | bool | Add noun phrases (if noun chunks rules are available), based on tagger and parser. | False |
--wrap , -W | bool | Wrap lines in the UI by default (instead of showing tokens in one row). | False |
--exclude , -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |
--hide-arrow-heads , -HA | bool | Hide the arrow heads visually | False |
disable_patterns.jsonl
{"pattern": [{"is_punct": true}]}{"pattern": [{"pos": "VERB"}]}{"pattern": [{"lower": {"in": ["'s", "’s"]}}]}
Computer Vision
image.manual
manual
Annotate images by drawing rectangular bounding boxes and polygon shapes. Each
shape will be added to the task’s "spans"
with its label and a "points"
property containing the [x, y]
pixel coordinate tuples.
See here for more details on the JSONL
format. You can click and drag or click and release to draw boxes. Polygon
shapes can also be closed by double-clicking when adding the last point, similar
to closing a shape in Photoshop or Illustrator. Clicking on the label will
select a shape so you can change the label or delete it.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
source | str | Path to a directory containing image files or pre-formatted JSONL file if --loader jsonl is set. | |
--loader , -lo | str | Optional ID of source loader. | images |
--label , -l | str / Path | One or more labels to annotate. Supports a comma-separated list or a path to a file with one label per line. | |
--exclude , -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |
--width , -w | int | New: 1.10 Width of card and maximum image width in pixels. | 675 |
--darken , -D | bool | Darken image to make boxes stand out more. | False |
--no-fetch , -NF | bool | New: 1.9 Don’t fetch images as base64. Ideally requires a JSONL file as input, with --loader jsonl set and all images available as URLs. | False |
--remove-base64 , -R | bool | New: 1.10 Remove base64-encoded image data before storing example in the database and only keep the reference to the local file path. Caution: If enabled, make sure to keep original files! | False |
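A minimal invocation might look like this (directory and labels are illustrative):
prodigy image.manual photo_objects ./images --label CAR,PERSON,BICYCLE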
If you organize your images in subdirectories, you can set --loader pages
to group them together in a single interface using the pages
UI. This can be especially useful for multi-page documents or collections of images that should be viewed together. If you’re working with PDFs, the Prodigy-PDF plugin also supports loading paginated documents.
Audio and Video
audio.manual
manualNew: 1.10
Manually label regions for the given labels in the audio or video file. The
recipe expects a directory of audio files as the source
argument and will use
the audio
loader (default) to load the data.
To load video files instead, you can set --loader video
. Each added region
will be added to the "audio_spans"
with a start and end timestamp and the
selected label.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
source | str | Path to a directory containing audio files or pre-formatted JSONL file if --loader jsonl is set. | |
--loader , -lo | str | Optional ID of source loader, e.g. audio or video . | audio |
--label , -l | str / Path | One or more labels to annotate. Supports a comma-separated list or a path to a file with one label per line. | |
--autoplay , -A | bool | Autoplay the audio when a new task loads. | False |
--keep-base64 , -B | bool | If audio loader is used: don’t remove the base64-encoded audio data from the task before it’s saved to the database. | False |
--fetch-media , -FM | bool | Convert local paths and URLs to base64. Can be enabled if you’re annotating a JSONL file with paths or for re-annotating an existing dataset. | False |
--exclude , -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |
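For example, to label speaker segments in a directory of recordings, or to annotate video instead by switching the loader (all names illustrative):
prodigy audio.manual speaker_data ./recordings --label SPEAKER_1,SPEAKER_2,NOISE --autoplay
prodigy audio.manual video_data ./videos --loader video --label SPEAKER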
audio.transcribe
manualNew: 1.10
Manually transcribe audio and video files by typing the transcript into a text
field. The recipe expects a directory of audio files as the source
argument
and will use the audio
loader (default) to
load the data. To load video files instead, you can set --loader video
. The
transcript will be stored as the key "transcript"
. To make it easier to toggle
play and pause as you transcribe and to prevent clashes with the text input
field (like with the default enter), this recipe lets you customize
the keyboard shortcuts. To toggle play/pause, you can press
command/option/alt/ctrl+enter
or provide your own overrides via --playpause-key
.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
source | str | Path to a directory containing audio files or pre-formatted JSONL file if --loader jsonl is set. | |
--loader , -lo | str | Optional ID of source loader, e.g. audio or video . | audio |
--autoplay , -A | bool | Autoplay the audio when a new task loads. | False |
--keep-base64 , -B | bool | If audio loader is used: don’t remove the base64-encoded audio data from the task before it’s saved to the database. | False |
--fetch-media , -FM | bool | Convert local paths and URLs to base64. Can be enabled if you’re annotating a JSONL file with paths or for re-annotating an existing dataset. | False |
--playpause-key , -pk | str | Alternative keyboard shortcuts to toggle play/pause so it doesn’t conflict with text input field. | "command+enter, option+enter, ctrl+enter" |
--text-rows , -tr | int | Height of the text input field, in rows. | 4 |
--field-id , -fi | str | New: 1.10.1 Add the transcript text to the data using this key, e.g. "transcript": "Text here" . | "transcript" |
--exclude , -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |
Training models
train
commandNew: 1.9
Train a model with one or more components (NER, text classification, tagger,
parser, sentence recognizer or span categorizer) using one or more Prodigy
datasets with annotations. The recipe calls into spaCy directly and can update
an existing model or train a new model from scratch. For each component, you can
provide optional datasets for evaluation using the eval:
prefix, e.g.
--ner dataset,eval:eval_dataset
. If no evaluation sets are specified, the
--eval-split
is used to determine the percentage held back for evaluation.
Datasets will be merged and conflicts will be filtered out. If your data
contains potentially conflicting annotations, it’s recommended to first use
review
to resolve them. If you specify an output directory as the first
argument, the best model will be saved at the end. You can then load it into
spaCy by pointing spacy.load
at
the directory.
Argument | Type | Description | Default |
---|---|---|---|
output_dir | str | Path to output directory. If not set, nothing will be saved. | None |
--ner , -n | str | One or more (comma-separated) datasets for the named entity recognizer. Use the eval: prefix for evaluation sets. | None |
--textcat , -tc | str | One or more (comma-separated) datasets for the text classifier (exclusive categories). Use the eval: prefix for evaluation sets. | None |
--textcat-multilabel , -tcm | str | One or more (comma-separated) datasets for the text classifier (non-exclusive categories). Use the eval: prefix for evaluation sets. | None |
--tagger , -t | str | One or more (comma-separated) datasets for the part-of-speech tagger. Use the eval: prefix for evaluation sets. | None |
--parser , -p | str | One or more (comma-separated) datasets for the dependency parser. Use the eval: prefix for evaluation sets. | None |
--senter , -s | str | One or more (comma-separated) datasets for the sentence recognizer. Use the eval: prefix for evaluation sets. | None |
--spancat , -sc | str | One or more (comma-separated) datasets for the span categorizer. Use the eval: prefix for evaluation sets. | None |
--coref , -co | str | New: 1.12 One or more (comma-separated) datasets for the coreference model. Requires spacy-experimental. Use the eval: prefix for evaluation sets. | None |
--config , -c | str | Optional path to training config.cfg to use. If not set, it will be auto-generated using the default settings. | None |
--base-model , -m | str | Optional spaCy pipeline to update or use for tokenization and sentence segmentation. | None |
--lang , -l | str | Code of language to use if no config or base model are provided. | "en" |
--eval-split , -es | float | If no evaluation sets are provided for a component, split off a percentage of the training examples for evaluation. | 0.2 |
--label-stats , -L | bool | Show a breakdown of per-label stats after training. | False |
--verbose , -V | bool | Enable verbose logging. | False |
--silent , -S | bool | Don’t print any updates. | False |
--gpu-id , -g | int | GPU ID for training on GPU or -1 for CPU. | -1 |
overrides | any | Config parameters to override. Should be options starting with -- that correspond to the config section and value to override, e.g. --training.max_epochs=5 . | None |
-F | str | One or more comma-separated paths to Python files to import, e.g. for custom registered functions. | None |
The example below shows how you might run the same command but with overrides. Notice how this training run uses fewer steps and has a different learning rate.
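The dataset name here is illustrative, and the override paths assume the default generated config with the standard Adam optimizer; overrides map directly onto sections of the spaCy training config:
prodigy train ./output --ner my_ner_data --training.max_steps=1000 --training.optimizer.learn_rate=0.0005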
train-curve
commandNew: 1.9
Train a model with one or more components (NER, text classification, tagger,
parser, sentence recognizer or span categorizer) with different portions of the
training examples and print the accuracy figures and accuracy improvements with
more data. This recipe takes pretty much the same arguments as train
.
--n-samples
sets the number of sample models to train at different stages. For
instance, 10
will train models for 10% of the examples, 20%, 30% and so on.
This recipe is useful to determine the quality of the collected annotations, and
whether more training examples will improve the accuracy. As a rule of thumb, if
accuracy improves within the last 25%, training with more examples will likely
result in better accuracy.
Argument | Type | Description | Default |
---|---|---|---|
--ner , -n | str | One or more (comma-separated) datasets for the named entity recognizer. Use the eval: prefix for evaluation sets. | None |
--textcat , -tc | str | One or more (comma-separated) datasets for the text classifier (exclusive categories). Use the eval: prefix for evaluation sets. | None |
--textcat-multilabel , -tcm | str | One or more (comma-separated) datasets for the text classifier (non-exclusive categories). Use the eval: prefix for evaluation sets. | None |
--tagger , -t | str | One or more (comma-separated) datasets for the part-of-speech tagger. Use the eval: prefix for evaluation sets. | None |
--parser , -p | str | One or more (comma-separated) datasets for the dependency parser. Use the eval: prefix for evaluation sets. | None |
--senter , -s | str | One or more (comma-separated) datasets for the sentence recognizer. Use the eval: prefix for evaluation sets. | None |
--spancat , -sc | str | One or more (comma-separated) datasets for the span categorizer. Use the eval: prefix for evaluation sets. | None |
--coref , -co | str | New: 1.12 One or more (comma-separated) datasets for the coreference model. Requires spacy-experimental. Use the eval: prefix for evaluation sets. | None |
--base-model , -m | str | Optional spaCy pipeline to use for tokenization and sentence segmentation. | None |
--lang , -l | str | Code of language to use if no config or base model are provided. | "en" |
--eval-split , -es | float | If no evaluation sets are provided for a component, split off a percentage of the training examples for evaluation. | 0.2 |
--verbose , -V | bool | Enable verbose logging. | False |
--n-samples , -ns | int | Number of samples to train, e.g. 4 for results at 25%, 50%, 75% and 100%. | 4 |
--show-plot , -P | bool | Show a visual plot of the curve (requires the plotext library). | False |
overrides | any | Config parameters to override. Should be options starting with -- that correspond to the config section and value to override, e.g. --training.max_epochs=3 . | None |
-F | str | One or more comma-separated paths to Python files to import, e.g. for custom registered functions. | None |
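For instance, to check annotation quality on a hypothetical NER dataset at 25% increments and plot the curve:
prodigy train-curve --ner my_ner_data --n-samples 4 --show-plot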
data-to-spacy
commandNew: 1.9
Combine multiple datasets, merge annotations on the same examples and output
training and evaluation data in spaCy’s
binary .spacy
format,
which you can use with spacy train
. The
command takes an output directory and generates all data required to train a
pipeline with spaCy, including the config and pre-generated labels data to speed
up the training process. This recipe merges annotations for the different
pipeline components and outputs a combined training corpus. If an example is
only present in one dataset type, its annotations for the other components will
be missing values. It’s recommended to use the review
recipe on the
different annotation types first to resolve conflicts properly.
Argument | Type | Description | Default |
---|---|---|---|
output_dir | str | Path to output directory. | |
--ner , -n | str | One or more (comma-separated) datasets for the named entity recognizer. Use the eval: prefix for evaluation sets. | None |
--textcat , -tc | str | One or more (comma-separated) datasets for the text classifier (exclusive categories). Use the eval: prefix for evaluation sets. | None |
--textcat-multilabel , -tcm | str | One or more (comma-separated) datasets for the text classifier (non-exclusive categories). Use the eval: prefix for evaluation sets. | None |
--tagger , -t | str | One or more (comma-separated) datasets for the part-of-speech tagger. Use the eval: prefix for evaluation sets. | None |
--parser , -p | str | One or more (comma-separated) datasets for the dependency parser. Use the eval: prefix for evaluation sets. | None |
--senter , -s | str | One or more (comma-separated) datasets for the sentence recognizer. Use the eval: prefix for evaluation sets. | None |
--spancat , -sc | str | One or more (comma-separated) datasets for the span categorizer. Use the eval: prefix for evaluation sets. | None |
--coref , -co | str | New: 1.12 One or more (comma-separated) datasets for the coreference resolver. Requires spacy-experimental. Use the eval: prefix for evaluation sets. | None |
--config , -c | str | Optional path to training config.cfg to use. If not set, it will be auto-generated using the default settings. | None |
--base-model , -m | str | Optional spaCy pipeline to use for tokenization and sentence segmentation. | None |
--lang , -l | str | Code of language to use if no config or base model are provided. | "en" |
--eval-split , -es | float | If no evaluation sets are provided for a component, split off a percentage of the training examples for evaluation. If set to 0 , no evaluation set will be generated. | 0.2 |
--verbose , -V | bool | Enable verbose logging. | False |
-F | str | One or more comma-separated paths to Python files to import, e.g. for custom registered functions. | None |
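A typical export might look like this (dataset names illustrative). Assuming the default config.cfg, train.spacy and dev.spacy output names, the directory can then be passed straight to spacy train:
prodigy data-to-spacy ./corpus --ner my_ner_data --textcat my_textcat_data
python -m spacy train ./corpus/config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy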
spacy-config
commandNew: 1.11
Generate a starter config for training from Prodigy datasets which you can use
with spacy train
. A custom reader will be
used that merges annotations on the same examples. It’s recommended to use the
review
recipe on the different annotation types first to resolve
conflicts properly (instead of relying on this recipe to just filter conflicting
annotations and decide on one).
Argument | Type | Description | Default |
---|---|---|---|
output_dir | str | Path to output directory. | |
--ner , -n | str | One or more (comma-separated) datasets for the named entity recognizer. Use the eval: prefix for evaluation sets. | None |
--textcat , -tc | str | One or more (comma-separated) datasets for the text classifier (exclusive categories). Use the eval: prefix for evaluation sets. | None |
--textcat-multilabel , -tcm | str | One or more (comma-separated) datasets for the text classifier (non-exclusive categories). Use the eval: prefix for evaluation sets. | None |
--tagger , -t | str | One or more (comma-separated) datasets for the part-of-speech tagger. Use the eval: prefix for evaluation sets. | None |
--parser , -p | str | One or more (comma-separated) datasets for the dependency parser. Use the eval: prefix for evaluation sets. | None |
--senter , -s | str | One or more (comma-separated) datasets for the sentence recognizer. Use the eval: prefix for evaluation sets. | None |
--spancat , -sc | str | One or more (comma-separated) datasets for the span categorizer. Use the eval: prefix for evaluation sets. | None |
--coref , -co | str | New: 1.12 One or more (comma-separated) datasets for the coreference resolver. Requires spacy-experimental. Use the eval: prefix for evaluation sets. | None |
--eval-split , -es | float | If no evaluation sets are provided for a component, split off a percentage of the training examples for evaluation. If set to 0 , no evaluation set will be generated. | 0.2 |
--config , -c | str | Optional path to training config.cfg to use. If not set, it will be auto-generated using the default settings. | None |
--base-model , -m | str | Optional spaCy pipeline to use for tokenization and sentence segmentation. | None |
--lang , -l | str | Code of language to use if no config or base model are provided. | "en" |
--verbose , -V | bool | Enable verbose logging. | False |
--silent , -S | bool | Don’t output any status or logs. | False |
-F | str | One or more comma-separated paths to Python files to import, e.g. for custom registered functions. | None |
Vectors and Terminology
terms.teach
binary
Build a terminology list interactively using a model’s word vectors and seed terms, either a comma-separated list or a text file containing one term per line. Based on the seed terms, a target vector is created and only terms similar to that target vector are shown. As you annotate, the recipe iterates over the vector model’s vocab and updates the target vector with the words you accept.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
vectors | str | Loadable spaCy pipeline with word vectors and a vocab, e.g. en_core_web_lg or custom vectors trained on domain-specific text. | |
--seeds , -s | str / Path | Comma-separated list or path to file with seed terms (one term per line). | '' |
--resume , -R | bool | Resume from existing terms dataset and update target vector accordingly. | False |
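For example, to build a food terminology list from a few seed terms (names illustrative):
prodigy terms.teach food_terms en_core_web_lg --seeds "pizza, pasta, ramen, cheeseburger"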
terms.to-patterns
command
Convert a dataset collected with terms.teach
or
sense2vec.teach
to a JSONL-formatted patterns file. You can
optionally provide a spaCy pipeline for tokenization to create token-based
patterns and make them case-insensitive. If no model is provided, the patterns
will be generated as exact string matches. Pattern files can be used in Prodigy
to bootstrap annotation and pre-highlight suggestions, for example in
ner.manual
. You can also use them with
spaCy’s EntityRuler
for rule-based named entity recognition.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset ID to convert. | |
output_file | str | Optional path to an output file. | sys.stdout |
--label , -l | str | Label to assign to the patterns. | None |
--spacy-model , -m | str | New: 1.9 Optional spaCy pipeline for tokenization to create token-based patterns, or blank:lang to start with a blank model (e.g. blank:en for English). | None |
--case-sensitive , -CS | bool | New: 1.9 Make patterns case-sensitive. | False |
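The collected terms might then be converted like so (dataset and output names illustrative):
prodigy terms.to-patterns food_terms ./food_patterns.jsonl --label FOOD --spacy-model blank:en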
Large Language Models
ner.llm.correct
New: 1.13
This recipe marks entity predictions obtained from a large language model configured by spacy-llm and allows you to accept them as correct, or to manually curate them. This recipe may require environment variables to be set, depending on the large language model that you’re using. The details of this are explained in this section.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
config_path | str | Path to the spacy-llm config file. | |
source | str | Path to source data to annotate. The data should at least contain a "text" field. | |
--loader , -lo | str | Loader (guessed from file extension if not set). | None |
--segment , -S | bool | Split text into sentences | False |
--component , -c | str | Name of component to use for annotation. | llm |
--overrides | str | Overrides for the spacy-llm config file | None |
Give me an example configuration file to get started.
Check the spacy-llm documentation to learn more about how to set up these configuration files.
Example spacy-llm config file for NER
[nlp]lang = "en"pipeline = ["llm"][components][components.llm]factory = "llm"[components.llm.task]@llm_tasks = "spacy.NER.v2"labels = ["DISH", "INGREDIENT", "EQUIPMENT"][components.llm.model]@llm_models = "spacy.GPT-4.v1"config = {"temperature": 0.3}[components.llm.cache]@llm_misc = "spacy.BatchCache.v1"path = "local-ner-cache"batch_size = 3max_batches_in_mem = 10
ner.llm.fetch
New: 1.13
The ner.llm.correct
recipe fetches examples from large language models
while annotating, but this recipe can fetch a large batch of examples
upfront. After downloading such a batch of examples you can use
ner.manual
to correct the annotations. This recipe may require
environment variables to be set, depending on the large language model that
you’re using. The details of this are explained
in this section.
Argument | Type | Description | Default |
---|---|---|---|
config_path | str | Path to the spacy-llm config file. | |
source | str | Path to source data to annotate. The data should at least contain a "text" field. | |
output | str | dataset:name or file path to save the annotations to. | |
--loader , -lo | str | Loader (guessed from file extension if not set). | None |
--resume , -r | bool | Resume fetching from dataset or file on disk. | False |
--segment , -S | bool | Split text into sentences | False |
--component , -c | str | Name of component to use for annotation. | llm |
--overrides | str | Overrides for the spacy-llm config file | None |
textcat.llm.correct
New: 1.13
This recipe marks category predictions obtained from a large language model configured by spacy-llm and allows you to accept them as correct, or to manually curate them. This recipe may require environment variables to be set, depending on the large language model that you’re using. The details of this are explained in this section.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
config_path | str | Path to the spacy-llm config file. | |
source | str | Path to source data to annotate. The data should at least contain a "text" field. | |
--loader , -lo | str | Loader (guessed from file extension if not set). | None |
--component , -c | str | Name of component to use for annotation. | llm |
--overrides | str | Overrides for the spacy-llm config file | None |
Give me an example configuration file to get started.
Check the spacy-llm documentation to learn more about how to set up these configuration files.
Example spacy-llm config file for textcat
[nlp]lang = "en"pipeline = ["llm"][components][components.llm]factory = "llm"[components.llm.task]@llm_tasks = "spacy.TextCat.v2"labels = ["RECIPE", "QUESTION", "FEEDBACK"][components.llm.model]@llm_models = "spacy.GPT-4.v1"config = {"temperature": 0.3}[components.llm.cache]@llm_misc = "spacy.BatchCache.v1"path = "local-ner-cache"batch_size = 3max_batches_in_mem = 10
textcat.llm.fetch
New: 1.13
The textcat.llm.correct
recipe fetches examples from large language
models while annotating, but this recipe can fetch a large batch of
examples upfront. After downloading such a batch of examples you can use
textcat.manual
to correct the annotations. This recipe may require
environment variables to be set, depending on the large language model that
you’re using. The details of this are explained
in this section.
Argument | Type | Description | Default |
---|---|---|---|
config_path | str | Path to the spacy-llm config file. | |
source | str | Path to source data to annotate. The data should at least contain a "text" field. | |
output | str | dataset:name or file path to save the annotations to. | |
--loader , -lo | str | Loader (guessed from file extension if not set). | None |
--resume , -r | bool | Resume download from dataset or file on disk | False |
--component , -c | str | Name of component to use for annotation. | llm |
--overrides | str | Overrides for the spacy-llm config file | None |
spans.llm.correct
New: 1.13
This recipe marks overlapping span predictions obtained from a large language model configured by spacy-llm and allows you to accept them as correct, or to manually curate them. This recipe may require environment variables to be set, depending on the large language model that you’re using. The details of this are explained in this section.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
config_path | str | Path to the spacy-llm config file. | |
source | str | Path to source data to annotate. The data should at least contain a "text" field. | |
--loader , -lo | str | Loader (guessed from file extension if not set). | None |
--component , -c | str | Name of component to use for annotation. | llm |
--overrides | str | Overrides for the spacy-llm config file | None |
Give me an example configuration file to get started.
Here’s what an example spacy-llm config file might look like for spancat.
Basic spacy-llm config for spans
[nlp]lang = "en"pipeline = ["llm"][components][components.llm]factory = "llm"save_io = true[components.llm.task]@llm_tasks = "spacy.SpanCat.v2"labels = ["DISH", "INGREDIENT", "EQUIPMENT"][components.llm.task.label_definitions]DISH = "Extract the name of a known dish."INGREDIENT = "Extract the name of a cooking ingredient, including herbs and spices."EQUIPMENT = "Extract any mention of cooking equipment. e.g. oven, cooking pot, grill"[components.llm.model]@llm_models = "spacy.GPT-3-5.v1"config = {"temperature": 0.3}[components.llm.task.examples]@misc = "spacy.FewShotReader.v1"path = "span_examples.yaml"[components.llm.cache]@llm_misc = "spacy.BatchCache.v1"path = "local-cached"batch_size = 3max_batches_in_mem = 10
This file refers to a span_examples.yaml
file, which might look like this:
span_examples.yaml
- text: 'Mac and Cheese is a popular American pasta variant.'
  entities:
    INGREDIENT: ['Cheese']
    DISH: ['Mac and Cheese']
spans.llm.fetch
New: 1.13
The spans.llm.correct
recipe fetches examples from large language models
while annotating, but this recipe can fetch a large batch of examples
upfront. After downloading such a batch of examples you can use
spans.manual
to correct the annotations. This recipe may require
environment variables to be set, depending on the large language model that
you’re using. The details of this are explained
in this section.
Argument | Type | Description | Default |
---|---|---|---|
config_path | str | Path to the spacy-llm config file. | |
source | str | Path to source data to annotate. The data should at least contain a "text" field. | |
output | str | dataset:name or file path to save the annotations to. | |
--loader , -lo | str | Loader (guessed from file extension if not set). | None |
--resume , -r | bool | Resume download from dataset or file on disk. | False |
--component , -c | str | Name of component to use for annotation. | llm |
--overrides | str | Overrides for the spacy-llm config file. | None |
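Following the argument table above, a hypothetical invocation might look like this; the config path, source file and output dataset name are placeholders:
prodigy spans.llm.fetch ./spancat_config.cfg ./examples.jsonl dataset:spans_fetched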
terms.llm.fetch
New: 1.13.2
This recipe generates terms and phrases obtained from a large language model. These terms can be curated and turned into patterns files, which can help with downstream annotation tasks. The recipe works by iteratively requesting terms from the LLM and deduplicating the results on each batch. For this recipe, it can be helpful to choose a high temperature setting for the LLM so that repeated requests yield more varied terms.
This recipe may require environment variables to be set, depending on the large language model that you’re using. The details of this are explained in this section.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Dataset to save annotations into. | |
config_path | str | Path to the spacy-llm config file. | |
topic | str | Description of the topic that you’re interested in. | None |
--n_requests , -r | int | Number of requests to send to LLM. | 5 |
--auto-accept , -a | bool | Automatically accept generated examples. | False |
--component , -c | str | Name of the spacy-llm component that generates terms. | llm |
--overrides | str | Overrides for the spacy-llm config file. | None |
Give me an example configuration file to get started.
Here’s what an example spacy-llm config file might look like for the terms recipe.
Example spacy-llm config for terms
[nlp]
lang = "en"
pipeline = ["llm"]

[components]

[components.llm]
factory = "llm"

[components.llm.task]
@llm_tasks = "prodigy.Terms.v1"
batch_size = 50

[components.llm.model]
@llm_models = "spacy.GPT-3-5.v1"
config = {"temperature": 0.3}
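For instance, assuming the config above is saved as terms_config.cfg, a hypothetical call could be the following; the dataset name and topic are placeholders:
prodigy terms.llm.fetch cooking_terms ./terms_config.cfg "cooking equipment"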
ab.llm.tournament
New: 1.13.2
The goal of this recipe is to quickly compare the quality of outputs from a collection of prompts by leveraging a tournament. It uses the Glicko rating system internally to determine the duels as well as the best-performing prompt. You can also compare different LLM backends using this recipe.
This recipe may require environment variables to be set, depending on the large language model that you’re using. The details of this are explained in this section.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Dataset to save annotations into. | |
inputs_path | Path | Path to jsonl inputs. | |
prompt_path | Path | Path to file/folder with jinja2 prompt config(s). | |
config_path | Path | Path to file/folder with spacy-llm config(s). | |
display_template_path , -dp | Path | Template for summarizing the arguments. | None |
--no-random , -NR | bool | Don’t randomize which annotation is shown as correct. | False |
--resume , -r | bool | Resume from the dataset, replaying the matches before starting. | False |
--no-meta , -nm | bool | Don’t add meta information to the annotation interface. | False |
Give me sample configuration files to get started.
Here’s what example spacy-llm config files might look like for the tournament recipe.
Example spacy-llm config for the tournament recipe
[nlp]
lang = "en"
pipeline = ["llm"]

[components]

[components.llm]
factory = "llm"

[components.llm.task]
@llm_tasks = "prodigy.TextPrompter.v1"

[components.llm.model]
@llm_models = "spacy.GPT-3-5.v1"
config = {"temperature": 0.3}
Example spacy-llm config for the tournament recipe
[nlp]
lang = "en"
pipeline = ["llm"]

[components]

[components.llm]
factory = "llm"

[components.llm.task]
@llm_tasks = "prodigy.TextPrompter.v1"

[components.llm.model]
@llm_models = "spacy.GPT-4.v1"
config = {"temperature": 0.3}
You may also consider these prompts.
prompt1.jinja2
Write a haiku about {{topic}} that rhymes.
prompt2.jinja2
Write a super funny haiku about {{topic}} that rhymes.
Given these prompts, you could use a topics file like this:
{"topic": "star wars"}
{"topic": "python"}
{"topic": "stroopwafels"}
ner.openai.correct
New: 1.12
This recipe marks entity predictions obtained from a large language model and
allows you to accept them as correct, or to manually curate them, letting you
quickly gather a gold-standard dataset through zero-shot or few-shot learning.
It’s very much like using the ner.correct
recipe, but with
GPT-3 as a backend model to make predictions.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
source | str | Path to source data to annotate. The data should at least contain a "text" field. | |
--labels , -L | str | Comma-separated list defining the NER labels the model should predict. | |
--model , -m | str | GPT-3 model to use for initial predictions. | "text-davinci-003" |
--examples-path , -e | Path | Path to examples to help define the task. The file can be a .yml, .yaml or .json. If set to None , zero-shot learning is applied. | None |
--lang , -l | str | Language of the input data - will be used to obtain a relevant tokenizer. | "en" |
--max-examples , -n | int | Max number of examples to include in the prompt to OpenAI. If set to 0, zero-shot learning is always applied, even when examples are available. | 2 |
--prompt_path , -p | Path | Path to custom .jinja2 prompt template | None |
--batch-size , -b | int | Batch size of queries to send to the OpenAI API. | 10 |
--segment , -S | bool | Flag to set when examples should be split into sentences. By default, the full input article is shown. | False |
--loader , -lo | str | Loader (guessed from file extension if not set). | None |
--verbose , -v | bool | Flag to print extra information to the terminal. | False |
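For example, a hypothetical call could look like the following; the dataset name, source file and label set are placeholders:
prodigy ner.openai.correct food_ner ./examples.jsonl --labels DISH,INGREDIENT,EQUIPMENT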
ner.openai.fetch
New: 1.12
The ner.openai.correct
recipe fetches examples from OpenAI while
annotating, but this recipe can fetch a large batch of examples upfront.
After downloading such a batch of examples, you can use ner.manual
to correct the OpenAI annotations.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
source | str | Path to source data to annotate. The data should at least contain a "text" field. | |
output_path | Path | Path to .jsonl file to save OpenAI annotations into. | |
--labels , -L | str | Comma-separated list defining the NER labels the model should predict. | |
--lang , -l | str | Language of the input data - will be used to obtain a relevant tokenizer. | "en" |
--model , -m | str | GPT-3 model to use for initial predictions. | "text-davinci-003" |
--examples-path , -e | Path | Path to examples to help define the task. The file can be a .yml, .yaml or .json. If set to None , zero-shot learning is applied. | None |
--max-examples , -n | int | Max number of examples to include in the prompt to OpenAI. If set to 0, zero-shot learning is always applied, even when examples are available. | 2 |
--prompt_path , -p | Path | Path to custom .jinja2 prompt template | None |
--batch-size , -b | int | Batch size of queries to send to the OpenAI API. | 10 |
--segment , -S | bool | Flag to set when examples should be split into sentences. By default, the full input article is shown. | False |
--resume , -r | bool | Resume fetch from output file. | False |
--loader , -lo | str | Loader (guessed from file extension if not set). | None |
--verbose , -v | bool | Flag to print extra information to the terminal. | False |
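Following the argument table above, a hypothetical invocation might be the following; the dataset name and paths are placeholders:
prodigy ner.openai.fetch food_ner ./examples.jsonl ./predictions.jsonl --labels DISH,INGREDIENT,EQUIPMENT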
textcat.openai.correct
New: 1.12
This recipe enables you to classify texts by correcting the annotations from an OpenAI language model. OpenAI will also provide a “reason” to explain why a particular label was chosen.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
source | Path | Path to source data to annotate. The data should at least contain a "text" field. | |
--labels , -L | str | Comma-separated list defining the text categorization labels the model should predict. | None |
--lang , -l | str | Language of the input data - will be used to obtain a relevant tokenizer. | "en" |
--model , -m | str | GPT-3 model to use for initial predictions. | "text-davinci-003" |
--batch-size , -b | int | Batch size of queries to send to the OpenAI API. | 10 |
--segment , -S | bool | Flag to set when examples should be split into sentences. By default, the full input article is shown. | False |
--prompt-path , -p | Path | Path to custom .jinja2 prompt template. Will use default template if not provided. | None |
--examples-path , -e | Path | Path to examples to help define the task. The file can be a .yml, .yaml or .json. If set to None , zero-shot learning is applied. | None |
--max-examples , -n | int | Max number of examples to include in the prompt to OpenAI. If set to 0, zero-shot learning is always applied, even when examples are available. | 2 |
--exclusive-classes , -E | bool | Flag to make the classification task exclusive. | False |
--loader , -lo | str | Loader (guessed from file extension if not set). | None |
--verbose , -v | bool | Flag to print extra information to the terminal. | False |
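For example, a hypothetical call could look like this; the dataset name, source file and labels are placeholders:
prodigy textcat.openai.correct news_topics ./news.jsonl --labels POLITICS,SPORTS,TECHNOLOGY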
textcat.openai.fetch
New: 1.12
The textcat.openai.correct
recipe fetches examples from OpenAI while
annotating, but this recipe can fetch a large batch of examples upfront.
This is helpful when you are working with highly imbalanced data and are only
interested in rare examples. After downloading such a batch of examples, you
can use textcat.manual
to correct the OpenAI annotations.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
source | Path | Path to source data to annotate. The data should at least contain a "text" field. | |
output_path | Path | Path to .jsonl file to save OpenAI annotations into. | |
--labels , -L | str | Comma-separated list defining the text categorization labels the model should predict. | None |
--lang , -l | str | Language of the input data - will be used to obtain a relevant tokenizer. | "en" |
--model , -m | str | GPT-3 model to use for initial predictions. | "text-davinci-003" |
--prompt_path , -p | Path | Path to custom .jinja2 prompt template | None |
--examples-path , -e | Path | Path to examples to help define the task. The file can be a .yml, .yaml or .json. If set to None , zero-shot learning is applied. | None |
--max-examples , -n | int | Max number of examples to include in the prompt to OpenAI. If set to 0, zero-shot learning is always applied, even when examples are available. | 2 |
--batch-size , -b | int | Batch size of queries to send to the OpenAI API. | 10 |
--segment , -S | bool | Flag to set when examples should be split into sentences. By default, the full input article is shown. | False |
--exclusive-classes , -E | bool | Make the classification task exclusive | False |
--resume , -r | bool | Resume fetch from output file. | False |
--loader , -lo | str | Loader (guessed from file extension if not set). | None |
--verbose , -v | bool | Flag to print extra information to the terminal. | False |
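A hypothetical invocation, with placeholder dataset name and paths, might look like:
prodigy textcat.openai.fetch news_topics ./news.jsonl ./predictions.jsonl --labels POLITICS,SPORTS,TECHNOLOGY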
terms.openai.fetch
New: 1.12
This recipe generates terms and phrases obtained from a large language model. These terms can be curated and turned into patterns files, which can help with downstream annotation tasks.
Argument | Type | Description | Default |
---|---|---|---|
query | str | Query to send to OpenAI | |
output_path | Path | Path to save the output | |
--seeds ,-s | str | One or more comma-separated seed phrases. | "" |
--n ,-n | int | Minimum number of items to generate | 100 |
--model , -m | str | GPT-3 model to use for completion | "text-davinci-003" |
--prompt-path , -p | Path | Path to custom jinja2 prompt template | None |
--resume , -r | bool | Resume by loading in text examples from output file | False |
--temperature ,-t | float | OpenAI temperature param | 1.0 |
--top-p , -tp | float | OpenAI top_p param | 1.0 |
--best-of , -bo | int | OpenAI best_of param | 10 |
--n-batch ,-nb | int | OpenAI batch size param | 10 |
--max-tokens , -mt | int | Max tokens to generate per call | 100 |
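For example, a hypothetical call could look like this; the query, output path and seeds are placeholders:
prodigy terms.openai.fetch "skateboard tricks" ./skateboard_terms.jsonl --seeds "kickflip,ollie"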
ab.openai.prompts
New: 1.12
The goal of this recipe is to quickly compare the quality of outputs from two prompts in a quantifiable and blind way.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save answers into | |
inputs_path | Path | Path to jsonl inputs | |
display_template_path | Path | Template for summarizing the arguments | |
prompt1_template_path | Path | Path to the first jinja2 prompt template | |
prompt2_template_path | Path | Path to the second jinja2 prompt template | |
--model , -m | str | GPT-3 model to use for completion | "text-davinci-003" |
--batch-size , -b | int | Batch size to send to OpenAI API | 10 |
--temperature ,-t | float | OpenAI temperature param | 1.0 |
--no-random ,-NR | bool | Don’t randomize which annotation is shown as correct | False |
--repeat , -r | int | How often to send the same prompt to OpenAI | 1 |
--verbose ,-v | bool | Print extra information to terminal | False |
--no-meta ,-nm | bool | Don’t add meta information to the annotation interface | False |
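A hypothetical invocation might look like the following, reusing the prompt1.jinja2 and prompt2.jinja2 templates shown later in this section; the dataset name, inputs file and display template are placeholders:
prodigy ab.openai.prompts haiku_ab ./topics.jsonl ./display.jinja2 ./prompt1.jinja2 ./prompt2.jinja2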
ab.openai.tournament
New: 1.12
The goal of this recipe is to quickly compare the quality of outputs from a collection of prompts by leveraging a tournament. It uses the Glicko rating system internally to determine the duels as well as the best-performing prompt. You can read more about the expectations of the jsonl/jinja2 files in the tournaments guide.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save answers into | |
inputs_path | Path | Path to jsonl inputs | |
display_template_path | Path | Template for summarizing the arguments | |
prompt_template_folder | Path | Path to folder with jinja2 prompt templates | |
--model , -m | str | GPT-3 model to use for completion | "text-davinci-003" |
--batch-size , -b | int | Batch size to send to OpenAI API | 1 |
--resume , -r | bool | Resume from the dataset, starting with ratings based on matches from before | False |
--temperature ,-t | float | OpenAI temperature param | 1.0 |
--no-random ,-NR | bool | Don’t randomize which annotation is shown as correct | False |
--verbose ,-v | bool | Print extra information to terminal | False |
--no-meta ,-nm | bool | Don’t add meta information to the annotation interface | False |
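A hypothetical invocation, with placeholder dataset name and paths and a ./prompts folder of jinja2 templates, could be:
prodigy ab.openai.tournament haiku_tournament ./topics.jsonl ./display.jinja2 ./prompts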
Review and Evaluate
review
New: 1.8
Review existing annotations created by multiple annotators and resolve potential
conflicts by creating one final “master annotation”. Can be used for both binary
and manual annotations and supports all interfaces except image_manual
and compare
. If the annotations were created with a manual interface,
the “most popular” version, e.g. the version most sessions agreed on, will be
pre-selected automatically.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset ID to save reviewed annotations. | |
in_sets | str | Comma-separated names of datasets to review. | |
--label , -l | str | Optional comma-separated labels to display in manual annotation mode. | None |
--view-id , -v | str | Interface to use if none present in the task, e.g. ner or ner_manual . | None |
--fetch-media , -FM | bool | New: 1.10 Temporarily replace paths and URLs with base64 strings so they can be reannotated. Will be removed again before examples are placed in the database. | False |
--show-skipped , -S | bool | New: 1.10.5 Include answers that would otherwise be skipped, like annotations with answer "ignore" or annotations with answer "reject" in manual interfaces. | False |
--auto-accept , -A | bool | New: 1.11 Automatically accept annotations with no conflicts and add them to the dataset. | False |
--accept-single , -AS | bool | New: 1.12 Also auto-accept annotations that have only been annotated by a single annotator. | False |
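For example, a hypothetical call that merges the work of two annotators into a final dataset (all names are placeholders) could look like:
prodigy review ner_final ner_alice,ner_bob --auto-accept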
compare
Compare the output of your model and the output of a baseline on the same
inputs. To prevent bias during annotation, Prodigy will randomly decide which
output to suggest as the correct answer. When you exit the application, you’ll
see detailed stats, including the preferred output. Expects two JSONL files
where each entry has an "id"
(to match up the outputs on the same input), and
an "input"
and "output"
object with the content to render, e.g. the
"text"
.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
a_file | str | First file to compare, e.g. system responses. | |
b_file | str | Second file to compare, e.g. baseline responses. | |
--no-random , -nr | bool | Don’t randomize which annotation is shown as the “correct” suggestion (always use the first option). | False |
--diff , -D | bool | Show examples as visual diff. | False |
model_a.jsonl
{"id": 1, "input": {"text": "FedEx von weltweiter Cyberattacke getroffen"}, "output": {"text": "FedEx hit by worldwide cyberattack"}}
model_b.jsonl
{"id": 1, "input": {"text": "FedEx von weltweiter Cyberattacke getroffen"}, "output": {"text": "FedEx from worldwide Cyberattacke hit"}}
metric.iaa.doc
commandNew: 1.14.3
Compute the inter-annotator agreement (IAA) for document-level annotations using
percent agreement,
Krippendorff’s Alpha,
and
Gwet’s AC2
as metrics. The algorithm implementation is ported from
https://github.com/pmbaumgartner/prodigy-iaa and is benchmarked against the
Gwet, 2015 paper. The
current implementation supports two types of annotations: multiclass and
multilabel (for binary annotations, please see metric.iaa.binary
).
Importantly, the annotations are grouped by the _input_hash
, i.e. all
annotations that have the same _input_hash
are considered to be the same
annotation task. For details on other source data assumptions and the
interpretation of the results, please see the metrics guide.
The command will output the results to the terminal.
Argument | Type | Description | Default |
---|---|---|---|
source | str | Path to source or dataset:name to load from existing annotations. Directories with JSONL files are also supported. | |
annotation_type | str | Annotation type in the source. multiclass or multilabel | |
--loader , -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
--labels , -l | str | Comma separated list of labels. If not provided, it will be inferred from the dataset. | None |
--annotators , -a | str | Comma separated list annotators. If not provided, it will be inferred from the dataset. | None |
--output , -o | str | Path to a json file to save the results on disc. | None |
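For example, a hypothetical call on an existing dataset of multilabel annotations (the dataset name is a placeholder) could be:
prodigy metric.iaa.doc dataset:textcat_annotations multilabel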
metric.iaa.span
commandNew: 1.14.3
Compute the inter-annotator agreement (IAA) for span-level text annotations
using micro-averaged pairwise F1 score as the metric. For computing IAA on
reject/accept decisions of span annotations see metric.iaa.binary
.
Importantly, the annotations are grouped by the _input_hash
, i.e. all
annotations that have the same _input_hash
are considered to be the same
annotation task. For more details on other source data assumptions and the
interpretation of the results, please see the metrics guide.
The command will output the results to the terminal.
Argument | Type | Description | Default |
---|---|---|---|
source | str | Path to source or dataset:name to load from existing annotations. Directories with JSONL files are also supported. | |
--loader , -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
--labels , -l | str | Comma separated list of labels. If not provided, it will be inferred from the dataset. | None |
--annotators , -a | str | Comma separated list annotators. If not provided, it will be inferred from the dataset. | None |
--partial , -P | bool | Consider partial span matches as agreement. | False |
--output , -o | str | Path to a json file to save the results on disc. | None |
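A hypothetical call, with a placeholder dataset name and label set, might look like:
prodigy metric.iaa.span dataset:ner_annotations --labels PERSON,ORG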
metric.iaa.binary
commandNew: 1.14.3
Compute the inter-annotator agreement (IAA) for binary annotations using percent
agreement,
Krippendorff’s Alpha,
and
Gwet’s AC2
as metrics. The algorithm implementation is ported from
https://github.com/pmbaumgartner/prodigy-iaa and is benchmarked against the
Gwet, 2015 paper.
Importantly, the annotations are grouped by the _input_hash
, i.e. all
annotations that have the same _input_hash
are considered to be the same
annotation task. For details on other source data assumptions and the
interpretation of the results, please see the metrics guide.
The command will output the results to the terminal.
Argument | Type | Description | Default |
---|---|---|---|
source | str | Path to source or dataset:name to load from existing annotations. Directories with JSONL files are also supported. | |
--loader , -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
--annotators , -a | str | Comma separated list annotators. If not provided, it will be inferred from the dataset. | None |
--output , -o | str | Path to a json file to save the results on disc. | None |
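For example, a hypothetical call (dataset name and output path are placeholders) could look like:
prodigy metric.iaa.binary dataset:binary_annotations --output ./iaa_results.json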
Other Utilities and Commands
mark
binary / manual
Start the annotation server, display whatever comes in with a given interface
and collect binary annotations. At the end of the annotation session, a
breakdown of the answer counts is printed. The --view-id
lets you specify one
of the existing annotation interfaces – just make sure
your input data includes everything the interface needs, since this recipe does
no preprocessing and will just show you whatever is in the data. The recipe is
also very useful if you want to re-annotate data exported with db-out
.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
source | str | Path to text source or - to read from standard input. | |
--loader , -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
--label , -l | str | Label to apply in classification mode or comma-separated labels to show for manual annotation. | '' |
--view-id , -v | str | Annotation interface to use. | None |
--exclude , -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |
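For example, to re-annotate data previously exported with db-out, a hypothetical call (dataset name, file path and labels are placeholders) could be:
prodigy mark reannotated_data ./exported.jsonl --view-id ner_manual --label PERSON,ORG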
match
binaryNew: 1.9.8
Select examples based on match patterns and
accept or reject the result. Unlike ner.manual
with patterns, this recipe
will only show examples if they contain pattern matches. It can be used for NER
and text classification annotations – for instance, to bootstrap a text category
if the classes are very imbalanced and not enough positive examples are
presented during manual annotation or textcat.teach
. The --label-task
and --label-span
flags can be used to specify where the label should be added.
This will also be reflected via the "label"
property (on the top-level task or
the spans) in the data you create with the recipe. If --combine-matches
is
set, all matches will be presented together. Otherwise, each match will be
presented as a separate task.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
spacy_model | str | Loadable spaCy pipeline for tokenization to initialize the matcher, or blank:lang for a blank model (e.g. blank:en for English). | |
source | str | Path to text source or - to read from standard input. | |
--loader , -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
--label , -l | str | Comma-separated label(s) to annotate or text file with one label per line. Only pattern matches for those labels will be shown. | |
--patterns , -pt | str | Path to match patterns file. | |
--label-task , -LT | bool | Whether to add a label to the top-level task if a match for that label was found. For example, if you use this recipe for text classification, you typically want to add a label to the whole task. | False |
--label-span , -LS | bool | Whether to add a label to the matched span that’s highlighted. For example, if you use this recipe for NER, you typically want to add a label to the span but not the whole task. | False |
--combine-matches , -C | bool | Whether to show all matches in one task. If False , the matcher will output one task for each match and duplicate tasks if necessary. | False |
--exclude , -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |
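For example, to bootstrap a text category from pattern matches, a hypothetical call (dataset name, paths and label are placeholders) might look like:
prodigy match recipe_textcat blank:en ./examples.jsonl --patterns ./patterns.jsonl --label RECIPE --label-task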
print-stream
commandNew: 1.9
Pretty-print the model’s predictions on the command line. Supports named
entities and text categories and will display the annotations if the model
components are available. For textcat annotations, only the category with the
highest score is shown if the score is greater than 0.5
.
Argument | Type | Description | Default |
---|---|---|---|
spacy_model | str | Loadable spaCy pipeline. | |
source | str | Path to text source or - to read from standard input. | |
--loader , -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
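For example, using a small English pipeline on a placeholder source file:
prodigy print-stream en_core_web_sm ./news.jsonl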
print-dataset
commandNew: 1.9
Pretty-print annotations from a given dataset on the command line. Supports
plain text, text classification and NER annotations. If no --style
is
specified, Prodigy will try to infer it from the data via the "_view_id"
that’s automatically added since v1.8.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset ID. | |
--style , -s | str | Dataset type: auto (try to infer from the data, default), text , spans or textcat . | auto |
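For example, to print a dataset of span annotations (the dataset name is a placeholder):
prodigy print-dataset ner_annotations --style spans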
filter-by-patterns
commandNew: 1.12
Filter data using match patterns in order to
produce a representative subset for downstream tasks. Such subsets can be
useful to jump-start a model, e.g. in an active learning setting, especially
when dealing with sparse entities. The output dataset will contain entity
spans added to matching examples. It will also display a progress bar.
Argument | Type | Description | Default |
---|---|---|---|
source | str | Data to filter (file path or ’-’ to read from standard input) | |
output | str | Path to .jsonl file, or dataset, to write subset into | |
spacy_model | str | Loadable spaCy pipeline or blank:lang (e.g. blank:en) | |
--patterns , -pt | str | Path to match patterns file | None |
--label , -l | str | Comma-separated label(s) to select subset of patterns | None |
--loader , -lo | str | Loader (guessed from file extension if not set) | None |
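A hypothetical call, with placeholder file paths, could look like:
prodigy filter-by-patterns ./examples.jsonl ./subset.jsonl blank:en --patterns ./patterns.jsonl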
db-out
command
Export annotations in Prodigy’s JSONL format. If the output directory doesn’t exist, it will be created. If no output directory is specified, the data will be printed so it can be redirected to a file.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Dataset ID to import or export. | |
out_dir | str | Optional path to output directory to export annotation file to. | None |
--dry , -D | bool | Perform a dry run and don’t save any files. | False |
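For example, since the data is printed when no output directory is given, it can be redirected to a file (names are placeholders):
prodigy db-out ner_annotations > ./annotations.jsonl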
db-merge
command
Merge two or more existing datasets into a new set, e.g. to create a final dataset that can be reviewed or used to train a model. Keeps a copy of the original datasets and creates a new set for the merged examples.
Argument | Type | Description | Default |
---|---|---|---|
in_sets | str | Comma-separated names of datasets to merge. | |
out_set | str | Name of dataset to save the merged examples to. | |
--rehash , -R | bool | New: 1.10 Force-update all hashes assigned to examples. | False |
--dry , -D | bool | Perform a dry run and don’t save anything. | False |
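For example, to merge two annotators’ datasets into a new one (all names are placeholders):
prodigy db-merge ner_alice,ner_bob ner_merged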
db-in
command
Import existing annotations to the database. Can load all file types supported by Prodigy. To import NER annotations, the files should be converted into Prodigy’s JSONL annotation format.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Dataset ID to import or export. | |
in_file | str | Path to input annotation file. | |
--answer | str | Set this answer key if none is present | "accept" |
--rehash , -rh | bool | Update and overwrite all hashes. | False |
--dry , -D | bool | Perform a dry run and don’t save any files. | False |
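For example, to import a JSONL annotations file into a new dataset (names are placeholders):
prodigy db-in ner_imported ./annotations.jsonl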
drop
command
Remove a dataset or annotation session from a project. Can’t be undone. To see
all dataset and session IDs in the database, use prodigy stats -ls
.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Dataset or session ID. | |
--batch-size , -n | int | Delete examples in batches of the given size. Prevents possible database error for large datasets. | None |
stats
command
Print Prodigy and database statistics. Specifying a dataset ID will show detailed stats for the dataset, like annotation counts and meta data. You can also choose to list all available dataset or session IDs.
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Optional Prodigy dataset ID. | |
--list-datasets , -l | bool | List IDs of all datasets in the database. | False |
--list-sessions , -ls | bool | List IDs of all datasets and sessions in the database. | False |
--no-format , -nf | bool | Don’t pretty-print the stats and print a simple dict instead. | False |
progress
commandNew: 1.11
View the annotation progress of one or more datasets over time and optionally compare it against an input source to check the coverage. The command will output the new annotations created during the given intervals, the total annotations at each point, as well as the number of unique annotations if the data contains multiple annotations on the same input data.
Argument | Type | Description | Default |
---|---|---|---|
datasets | str | One or more comma separated dataset names. | |
--interval , -i | str | Time period to calculate progress for. Can be "day" , "week" , "month" , "year" . | "month" |
--source , -s | str | Optional path to text source or - to read from standard input. If set, will be used to calculate percentage of annotated examples based on the input data. | |
--loader , -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
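For example, to view weekly progress across two datasets (the names are placeholders):
prodigy progress ner_alice,ner_bob --interval week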
prodigy
command
Run a built-in or custom Prodigy recipe. The -F
option lets you load a recipe
from a simple Python file, containing one or more recipe functions. All recipe
arguments will be available from the command line. To print usage info and a
list of available arguments, use the --help
flag.
Argument | Type | Description |
---|---|---|
recipe_name | positional | Recipe name. |
*recipe_arguments | Recipe arguments. | |
-F | str | Path to recipe file to load custom recipe. |
--help , -h | bool | Show help message and available arguments. |
recipe.py
import prodigy
from prodigy.components.stream import get_stream

@prodigy.recipe(
    "custom-recipe",
    dataset=("The dataset", "positional", None, str),
    source_file=("A positional argument", "positional", None, str),
    custom_opt=("An option", "option", "co", int),
)
def custom_recipe_function(dataset, source_file, custom_opt=10):
    stream = get_stream(source_file)
    print("Custom option passed in via command line:", custom_opt)
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "text",
    }
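Assuming the file above is saved as recipe.py, the custom recipe could then be run with something like the following, where the dataset name and source path are placeholders and -co is the short option defined in the decorator:
prodigy custom-recipe my_dataset ./data.jsonl -co 5 -F recipe.py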
Deprecated recipes
The following recipes have been deprecated in favor of newer workflows and
best practices. See the table for details and replacements. The version
numbers indicate when the feature was deprecated (but still available) and when
it was removed. For instance, 1.10
1.11 indicates that the recipe was deprecated but still
available in v1.10 and removed in v1.11. To view the recipe details and
documentation of deprecated recipes, run the recipe command with the --help
flag.
ner.match | 1.10 1.11 This recipe has been deprecated in favor of ner.manual with --patterns , which lets you match patterns and allows editing the results at the same time, and the general purpose match , which lets you match patterns and accept or reject the result. |
ner.eval | 1.10 1.11 This recipe has been deprecated in favor of creating regular gold-standard evaluation sets with ner.manual (fully manual) or ner.correct (semi-automatic). |
ner.print-stream | 1.10 1.11 This recipe has been deprecated in favor of the general-purpose print-stream command that can print streams of all supported types. |
ner.print-dataset | 1.10 1.11 This recipe has been deprecated in favor of the general-purpose print-dataset command that can print datasets of all supported types. |
ner.gold-to-spacy | 1.10 1.11 This recipe has been deprecated in favor of data-to-spacy , which can take multiple datasets of different types (e.g. NER and text classification) and outputs a JSON file in spaCy’s training format that can be used with spacy train . |
ner.iob-to-gold | 1.10 1.11 This recipe has been deprecated because it only served a very limited purpose. To convert IOB annotations, you can either use spacy convert or write a custom script. |
ner.batch-train | 1.10 1.11 This recipe will be deprecated in favor of the general-purpose train recipe that supports all components. |
ner.train-curve | 1.10 1.11 This recipe will be deprecated in favor of the general-purpose train-curve recipe that supports all components. |
textcat.eval | 1.10 1.11 This recipe has been deprecated in favor of creating regular gold-standard evaluation sets with textcat.manual . |
textcat.print-stream | 1.10 1.11 This recipe has been deprecated in favor of the general-purpose print-stream command that can print streams of all supported types. |
textcat.print-dataset | 1.10 1.11 This recipe has been deprecated in favor of the general-purpose print-dataset command that can print datasets of all supported types. |
textcat.batch-train | 1.10 1.11 This recipe will be deprecated in favor of the general-purpose train recipe that supports all components and works with both binary accept/reject annotations and multiple choice annotations out-of-the-box. |
textcat.train-curve | 1.10 1.11 This recipe will be deprecated in favor of the general-purpose train-curve recipe that supports all components. |
pos.gold-to-spacy | 1.10 1.11 This recipe has been deprecated in favor of data-to-spacy , which can take multiple datasets of different types (e.g. POS tags and NER) and outputs a JSON file in spaCy’s training format that can be used with spacy train . |
pos.batch-train | 1.10 1.11 This recipe will be deprecated in favor of the general-purpose train recipe that supports all components. |
pos.train-curve | 1.10 1.11 This recipe will be deprecated in favor of the general-purpose train-curve recipe that supports all components. |
dep.batch-train | 1.10 1.11 This recipe will be deprecated in favor of the general-purpose train recipe that supports all components. |
dep.train-curve | 1.10 1.11 This recipe will be deprecated in favor of the general-purpose train-curve recipe that supports all components. |
terms.train-vectors | 1.10 1.11 This recipe has been deprecated since wrapping word vector training in a recipe only introduces a layer of unnecessary abstraction. If you want to train your own vectors, use GloVe, fastText or Gensim directly and then add the vectors to a spaCy pipeline. |
image.test | 1.10 1.11 This recipe has been deprecated since it was mostly intended to demonstrate the new image capabilities on launch. For a real-world example of using Prodigy for object detection with a model in the loop, see this TensorFlow tutorial. |
pipe | 1.10 1.11 This command has been deprecated since it didn’t provide any Prodigy-specific functionality. To pipe data forward, you can convert the data to JSONL and run `cat data.jsonl | prodigy ...` instead. |
dataset | 1.10 1.11 This command has been deprecated since it’s mostly redundant. If a dataset doesn’t exist in the database, it’s added automatically. |