Audio and Video (new in v1.10)

Modern deep learning technologies offer much better performance on multimedia data than previous approaches, so there are lots of opportunities for cool new products and features. Prodigy lets you create training data for a variety of common tasks, such as transcription, classification and speaker diarization. You can also use Prodigy as a library of simple building blocks to construct a custom solution, even if you have to cross-reference audio, video, text and metadata.

Manual audio annotation

The audio.manual recipe lets you load in audio or video files and add labelled regions to them. Under the hood, Prodigy will save the start and end timestamps, as well as the label for each region. You can click and drag to add a region, resize an existing region by dragging its start or end, and remove a region by clicking its × button. Annotated regions can also overlap, if needed.
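
To make the saved region data more concrete, here is a minimal sketch of how start and end timestamps with labels could be processed after annotation. The "audio_spans" key, the field names and the file path are assumptions for illustration and are not confirmed by the text above. Times are taken to be in seconds.

```python
from collections import defaultdict

# Hypothetical annotated task: the "audio_spans" key and its field names
# are assumptions for illustration, with times in seconds.
task = {
    "audio": "recordings/meeting.wav",
    "audio_spans": [
        {"start": 0.5, "end": 2.0, "label": "SPEAKER_1"},
        {"start": 1.5, "end": 4.0, "label": "SPEAKER_2"},  # overlaps the first
        {"start": 4.0, "end": 5.0, "label": "SPEAKER_1"},
    ],
}


def seconds_per_label(task):
    """Sum the duration of all annotated regions for each label."""
    totals = defaultdict(float)
    for span in task.get("audio_spans", []):
        totals[span["label"]] += span["end"] - span["start"]
    return dict(totals)


totals = seconds_per_label(task)
```

Because regions may overlap, the per-label totals can add up to more than the length of the recording.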

The following command starts the Prodigy server, loads in audio files from a directory ./recordings and allows annotating regions on them for two labels, SPEAKER_1 and SPEAKER_2:

Recipe command

prodigy audio.manual speaker_data ./recordings --label SPEAKER_1,SPEAKER_2

By default, the audio loader expects to load files from a directory. The files will be encoded as base64 and the encoded data will be removed before the annotations are placed in the database.
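
The base64 round trip described above can be sketched with the standard library alone. This is not Prodigy's own implementation: the helper names encode_audio and strip_data, the "path" key and the choice to put the file path back in place of the encoded data are all assumptions for illustration.

```python
import base64
import mimetypes
from pathlib import Path


def encode_audio(path):
    """Return a task dict whose "audio" value is a base64 data URI."""
    path = Path(path)
    # Fall back to a generic MIME type if the extension is unknown
    mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    data = base64.b64encode(path.read_bytes()).decode("ascii")
    # "path" is a hypothetical key that remembers where the file came from
    return {"audio": f"data:{mime};base64,{data}", "path": str(path)}


def strip_data(task):
    """Return a copy of the task with inline base64 data replaced by the path."""
    cleaned = dict(task)
    if cleaned.get("audio", "").startswith("data:"):
        cleaned["audio"] = cleaned.get("path", "")
    return cleaned
```

Stripping the encoded data before storage keeps the database small, since base64 output is roughly a third larger than the original file.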

Manual video annotation

The audio and audio_manual interfaces also support video files out-of-the-box – all you need to do is load in data with a key "video" containing the URL or base64-encoded data. The easiest way is to use audio.manual with --loader video. The video is now displayed above the waveform and you can annotate regions referring to timestamps of the video. This is especially helpful when annotating who is speaking, as the video can hold a lot of clues.

Recipe command

prodigy audio.manual speaker_data ./recordings --loader video --label SPEAKER_1,SPEAKER_2
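
If your videos are hosted rather than stored in a local directory, the "video" key mentioned above can also hold a URL. The following sketch builds such tasks and serializes them as JSONL. The URLs and the helper name make_video_tasks are hypothetical, and how a file like this is loaded depends on the loader options your recipe supports.

```python
import json


def make_video_tasks(urls):
    """Wrap each URL in a task dict under the "video" key."""
    return [{"video": url} for url in urls]


# Hypothetical URLs, used only for illustration
urls = [
    "https://example.com/clip_01.mp4",
    "https://example.com/clip_02.mp4",
]

# One JSON object per line
jsonl = "\n".join(json.dumps(task) for task in make_video_tasks(urls))
```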

Audio or video transcription

Prodigy’s blocks interface lets you combine multiple different interfaces into one – for example, audio and text_input. The built-in audio.transcribe workflow uses this combination to provide a straightforward audio-transcription interface. The free-form text typed in by the user will be saved to the annotation task as the key "transcript". The following command starts the server with a directory of recordings and saves the annotations to a dataset:

Recipe command


To make it easier to toggle play and pause as you transcribe and to prevent clashes with the text input field (like with the default enter), this recipe lets you customize the keyboard shortcuts. To toggle play/pause, you can press command/option/alt/ctrl+enter or provide your own overrides via --playpause-key, for instance --playpause-key command+w.
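
Once transcripts have been collected, the "transcript" key can be read back from the saved tasks. The sketch below assumes the annotations are available as a list of dicts. The "path" and "answer" keys and the sample records are assumptions for illustration.

```python
def collect_transcripts(examples):
    """Map each file reference to its non-empty, accepted transcript."""
    transcripts = {}
    for eg in examples:
        # "answer" is assumed to record whether the task was accepted
        if eg.get("answer", "accept") != "accept":
            continue
        text = (eg.get("transcript") or "").strip()
        if text:
            # "path" is a hypothetical key identifying the source file
            transcripts[eg["path"]] = text
    return transcripts


# Hypothetical annotated records
examples = [
    {"path": "recordings/a.wav", "transcript": "Hello there. ", "answer": "accept"},
    {"path": "recordings/b.wav", "transcript": "", "answer": "accept"},
    {"path": "recordings/c.wav", "transcript": "Skipped clip", "answer": "ignore"},
]
transcripts = collect_transcripts(examples)
```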

Audio or video classification

Custom recipes also let you build your very own workflows for audio or video annotations. For instance, you might want to load in audio recordings and sort them into categories, e.g. to classify the type of noise and whether it’s produced by a car, a plane or something else.


The custom recipe for this workflow is pretty straightforward: using the Audio loader, you can load your files from a directory. You can then add a list of "options" to each incoming example. The "text" value is displayed to the user and the "id" is used under the hood. When you select options, their "id" values will be added to the task as "accept", e.g. "accept": ["PLANE"]. For more details on the available UI settings, check out the interface docs.


import prodigy
from prodigy.components.stream import get_stream

# Register the function under the name used on the command line
@prodigy.recipe("classify-audio")
def classify_audio(dataset, source):
    def add_options(stream):
        # Add the same list of options to each incoming task
        for eg in stream:
            eg["options"] = [
                {"id": "CAR", "text": "🚗 Car"},
                {"id": "PLANE", "text": "✈️ Plane"},
                {"id": "OTHER", "text": "Other / Unclear"},
            ]
            yield eg

    # Load the directory of audio files
    stream = get_stream(source)
    return {
        "dataset": dataset,
        # Wrap the loaded stream so that each task carries its options
        "stream": add_options(stream),
        "view_id": "choice",
        "config": {
            "choice_style": "single",  # or "multiple"
            "choice_auto_accept": True,
            "audio_loop": True,
            "show_audio_minimap": False,
        },
    }

Command-line usage

prodigy classify-audio noise_data ./recordings -F recipe.py
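
After annotating, the "accept" lists described above can be tallied to see how the recordings are distributed across categories. This is a minimal sketch that assumes the annotations are available as a list of dicts. The sample records and the helper name count_choices are hypothetical.

```python
from collections import Counter


def count_choices(examples):
    """Count how often each option ID appears in the "accept" lists."""
    counts = Counter()
    for eg in examples:
        counts.update(eg.get("accept", []))
    return counts


# Hypothetical annotated records using the option IDs from the recipe
examples = [
    {"accept": ["PLANE"]},
    {"accept": ["CAR"]},
    {"accept": ["PLANE"]},
    {"accept": []},  # nothing selected
]
counts = count_choices(examples)
```

The same function works for "choice_style": "multiple", since each selected ID in a task is counted once.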