
Develop API for integrating with annotation platforms

Open tompollard opened this issue 5 years ago • 17 comments

There are several existing platforms that could be used to gather useful annotations for PhysioNet datasets. This needs a lot more thought, but as a rough idea it would be good to develop a general API that:

  1. allows an external platform to request a data file for annotation, perhaps along with associated metadata (details to be determined, but this might include existing annotations).
  2. allows the external platform to submit a structured annotation back to PhysioNet.

Metadata profile for an annotation

The structure of the annotation will need to be developed. At minimum, the metadata should probably include:

  • name/id of the annotation platform
  • name/id of the user on the platform
  • name/id of the record being annotated
  • the annotation
  • the time when the annotation was made.
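Concretely, a minimal annotation carrying this metadata might be serialized as JSON along these lines (all field names are placeholders for discussion):

```json
{
  "platform": "label-studio",
  "platform_user": "annotator42",
  "record": "mitdb/1.0.0/100",
  "annotation": {"label": "N", "sample": 50000},
  "created_at": "2020-05-13T16:05:00Z"
}
```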

Metadata profile for an annotation task

One of the major challenges is understanding how the API can be made generalizable across PhysioNet, ideally to support multiple data types and modalities (images, waveforms, notes, etc). It feels like the annotation task will require a formal definition that would state things like:

  • which files are to be annotated and the file type.
  • a human-readable description of the annotation task (e.g. "The task is to label each R wave peak in the ECG recordings" or "The task is to draw a bounding box around pacemakers in the X-rays").
  • a machine-readable description of the annotation task that can be used by the annotation tool to render the task appropriately.
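As a strawman, the machine-readable task definition could itself be a JSON document, something like the following (all field names hypothetical):

```json
{
  "task_id": "ecg-rpeak-labeling",
  "files": {"project": "mitdb/1.0.0", "pattern": "*.dat", "file_type": "wfdb"},
  "description": "The task is to label each R wave peak in the ECG recordings.",
  "annotation_schema": {
    "location": "timeseries_instant",
    "labels": {"type": "string", "enum": ["N", "V", "A"]}
  }
}
```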

Providing an interface for the annotation functionality

Annotation tasks may be driven by the research question, and there may be multiple annotation tasks for a single dataset. We need to come up with a simple way of allowing PhysioNet users to propose and implement an annotation task. My suggestion is that we do this with the use of a new "annotation" project type (see https://github.com/MIT-LCP/physionet-build/issues/1032).

Summary of tasks

So in summary, some good first steps might be to:

  1. Design a metadata profile for a generalizable annotation. We can probably reuse or build on the format used by existing annotation platforms that we have looked at.
  2. Design a metadata profile for an annotation task.
  3. Review whether the existing project functionality (https://github.com/MIT-LCP/physionet-build/issues/1032) could be modified to allow a new project type for defining tasks and storing annotations.
  4. Design the API!

tompollard avatar May 13 '20 16:05 tompollard

@tompollard suggested GraphQL as a possible API and it looks great! It also seems like Graphene will be the best way for us to provide an easy interface with Django.

Lucas-Mc avatar Jun 02 '20 13:06 Lucas-Mc

Hey @tompollard, I have begun to write out the annotation model here:

from django.db import models

class AnnotationLabel(models.Model):
    """
    A way to save and edit annotation labels for signals.
    """
    project = models.OneToOneField('project.PublishedProject', related_name='ann',
        on_delete=models.CASCADE)
    edited_by = models.ForeignKey('user.User', related_name='ann_editor',
        on_delete=models.CASCADE)
    creation_datetime = models.DateTimeField(auto_now_add=True)
    platform_name = models.CharField(max_length=150, null=True)
    record_name = models.CharField(max_length=150, null=True)

My thoughts are that:

  • The project should be the published project that the user is currently working on; users should only have access to published projects.
  • The editor of the annotations will simply be the user who is currently logged in and viewing the project.
  • The creation datetime will be set automatically when the annotation is submitted, which is when the AnnotationLabel model instance is created.
  • The platform name can be set in the views based on where the request is submitted from, or perhaps it will come with a signature of some kind.
  • The record name can also be set in the views, though there may be a way to derive it in the model based on where the AnnotationLabel creation request is coming from.

I think this may be as general as we can get when it comes to sharing annotation models. For example, finding similarities between the actual annotation structures of signals and images may be difficult.

Lucas-Mc avatar Jun 09 '20 14:06 Lucas-Mc

As for potential annotation structures, one that is particularly appealing for signals may be the format used by Label Studio. This structure can be used for both region [PR interval, QRS complex, etc.] (setting a start and stop time) and beat [Normal, AFIB, etc.] (setting the stop time to null / the same time as the start time) annotations. Of course, we can edit and modify this however we like, but it may be a good start.

See an example of input annotations and output labeled annotation JSON here:

[Screenshot: Label Studio interface showing input annotations]

[
    {
        "id": "gyV6XOeyCz",
        "from_name": "label",
        "to_name": "audio",
        "source": "$url",
        "type": "labels",
        "original_length": 3.774376392364502,
        "value": {
            "start": -0.004971698554622573,
            "end": 0.20349676497713773,
            "labels": [
                "Politics"
            ]
        }
    },
    {
        "id": "PJqb8mmmsC",
        "from_name": "label",
        "to_name": "audio",
        "source": "$url",
        "type": "labels",
        "original_length": 3.774376392364502,
        "value": {
            "start": 0.39002117971608113,
            "end": 0.6698078018244963,
            "labels": [
                "Business"
            ]
        }
    },
    {
        "id": "xcHF2NJUcs",
        "from_name": "label",
        "to_name": "audio",
        "source": "$url",
        "type": "labels",
        "original_length": 3.774376392364502,
        "value": {
            "start": 0.867304240959848,
            "end": 3.127541266619986,
            "labels": [
                "Education"
            ]
        }
    }
]

Lucas-Mc avatar Jun 09 '20 14:06 Lucas-Mc

Currently the attributes of the WFDB Annotation class used for writing the WFDB-format annotation files are:

[ 'ann_len', 'aux_note', 'chan', 'contained_labels', 'custom_labels', 'description', 'extension', 'fs',
'label_store', 'num', 'record_name', 'sample', 'subtype', 'symbol']

ann_len : int
    The number of samples in the annotation.
aux_note : list, optional
    A list containing the auxiliary information string (or None for
    annotations without notes) for each annotation.
chan : ndarray, optional
    A numpy array containing the signal channel associated with each
    annotation.
contained_labels : pandas dataframe, optional
    The unique labels contained in this annotation. Same structure as
    `custom_labels`.
custom_labels : pandas dataframe, optional
    The custom annotation labels defined in the annotation file. Maps
    the relationship between the three label fields. The data type is a
    pandas DataFrame with three columns:
    ['label_store', 'symbol', 'description'].
description : list, optional
    A list containing the descriptive string of each annotation label.
extension : str
    The file extension of the file the annotation is stored in.
fs : int, float, optional
    The sampling frequency of the record.
label_store : ndarray, optional
    The integer value used to store/encode each annotation label.
num : ndarray, optional
    A numpy array containing the labelled annotation number for each
    annotation.
record_name : str
    The base file name (without extension) of the record that the
    annotation is associated with.
sample : ndarray
    A numpy array containing the annotation locations in samples relative to
    the beginning of the record.
subtype : ndarray, optional
    A numpy array containing the marked class/category of each annotation.
symbol : list, numpy array, optional
    The symbols used to display the annotation labels. List or numpy array.
    If this field is present, `label_store` must not be present.

These are some of the things that we should consider when building this new annotation model, especially if we decide to incorporate some of the functionality of Label Studio. I think some of these could be cut out, but should we keep them for compatibility in case we decide to write a conversion method in the future?

*Some background on the conversion issue: @tompollard suggested, and I agreed, that it would be best to store these labels in XML (or possibly JSON) format, since it's easier to access and much more flexible. If someone wanted these annotations in WFDB format, we could then provide a conversion method.
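For the signal case, such a converter would mostly need to rescale Label Studio's times (seconds) into WFDB sample indices using the record's sampling frequency. A minimal sketch, assuming the Label Studio JSON shape shown above and a hypothetical label-to-symbol mapping:

```python
def labelstudio_to_wfdb(results, fs, symbol_map):
    """Convert Label Studio 'labels' results (times in seconds) into
    (sample, symbol) pairs suitable for a WFDB annotation writer.

    results: list of result dicts shaped like the JSON above
    fs: sampling frequency of the record in Hz
    symbol_map: hypothetical mapping from Label Studio label -> WFDB symbol
    """
    samples, symbols = [], []
    for res in results:
        start_s = max(res["value"]["start"], 0.0)  # clamp slightly negative starts
        label = res["value"]["labels"][0]
        samples.append(round(start_s * fs))
        symbols.append(symbol_map.get(label, '"'))  # '"' = WFDB comment annotation
    return samples, symbols
```

The resulting sample and symbol arrays could then be written out with something like wfdb.wrann, which accepts per-annotation sample and symbol sequences.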

Lucas-Mc avatar Jun 29 '20 17:06 Lucas-Mc

Label Studio is releasing a time-series dedicated annotation platform which allows the user to make annotations for both ranges of times and singular times. Here is what the demo looks like:

[Screenshot: Label Studio time-series annotation demo]

You'll note that the user can specify the event they wish to annotate and then perform the desired annotation using a double-click for singular time point annotations and a click-and-drag for time range annotations. You can also see the previous completions, which we can use to track multiple users who wish to annotate a single project. Additionally, we have the ability to set a ground truth set of annotations if we ever desire that functionality. Here is the resulting JSON (note that single time annotations are saved with the same start and end time):

Result

[
    {
        "id": "QKaimQjoTQ",
        "from_name": "label",
        "to_name": "ts",
        "source": "$csv",
        "type": "timeserieslabels",
        "parent_id": null,
        "value": {
            "start": 1592250821941.2595,
            "end": 1592250831927.112,
            "instant": false,
            "timeserieslabels": [
                "Event 1"
            ]
        }
    },
    {
        "id": "RSj46Dzkhe",
        "from_name": "label",
        "to_name": "ts",
        "source": "$csv",
        "type": "timeserieslabels",
        "parent_id": null,
        "value": {
            "start": 1592250921955.7407,
            "end": 1592250921955.7407,
            "instant": true,
            "timeserieslabels": [
                "Event 1"
            ]
        }
    },
    {
        "id": "RKODZiMgsp",
        "from_name": "label",
        "to_name": "ts",
        "source": "$csv",
        "type": "timeserieslabels",
        "parent_id": null,
        "value": {
            "start": 1592251211907.621,
            "end": 1592251211907.621,
            "instant": true,
            "timeserieslabels": [
                "Event 1"
            ]
        }
    },
    {
        "id": "nkRg1P9L5L",
        "from_name": "label",
        "to_name": "ts",
        "source": "$csv",
        "type": "timeserieslabels",
        "parent_id": null,
        "value": {
            "start": 1592251461993.5276,
            "end": 1592251711941.2742,
            "instant": false,
            "timeserieslabels": [
                "Event 2"
            ]
        }
    },
    {
        "id": "NE7unB1-J1",
        "from_name": "label",
        "to_name": "ts",
        "source": "$csv",
        "type": "timeserieslabels",
        "parent_id": null,
        "value": {
            "start": 1592252101985.5444,
            "end": 1592252101985.5444,
            "instant": true,
            "timeserieslabels": [
                "Event 3"
            ]
        }
    },
    {
        "id": "oHQC4dE7-u",
        "from_name": "label",
        "to_name": "ts",
        "source": "$csv",
        "type": "timeserieslabels",
        "parent_id": null,
        "value": {
            "start": 1592252011979.126,
            "end": 1592252441979.4265,
            "instant": false,
            "timeserieslabels": [
                "Event 1"
            ]
        }
    },
    {
        "id": "M-dMRAbRxu",
        "from_name": "label",
        "to_name": "ts",
        "source": "$csv",
        "type": "timeserieslabels",
        "parent_id": null,
        "value": {
            "start": 1592251341969.1328,
            "end": 1592251341969.1328,
            "instant": true,
            "timeserieslabels": [
                "Event 1"
            ]
        }
    },
    {
        "id": "agpadQD5i_",
        "from_name": "label",
        "to_name": "ts",
        "source": "$csv",
        "type": "timeserieslabels",
        "parent_id": null,
        "value": {
            "start": 1592252721959.5007,
            "end": 1592252851914.7446,
            "instant": false,
            "timeserieslabels": [
                "Event 3"
            ]
        }
    }
]

Lucas-Mc avatar Jul 27 '20 12:07 Lucas-Mc

It's worth noting that WFDB has a function called rr2ann, which converts a series of RR intervals to annotations. I have already developed the reverse, ann2rr, in the latest 3.1.0 release of WFDB-Python and plan to add this functionality in the next release. We can use the beat annotations generated with the Label Studio annotation platform to generate RR intervals and convert them to annotations in WFDB format using WFDB-Python.
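The beat-to-RR direction is essentially a first difference over the beat sample indices; a stdlib-only sketch of the idea (independent of the actual WFDB-Python implementation):

```python
def beats_to_rr(samples, fs):
    """Compute RR intervals in seconds from sorted beat annotation
    sample indices, given the sampling frequency fs in Hz."""
    return [(b - a) / fs for a, b in zip(samples, samples[1:])]
```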

Lucas-Mc avatar Jul 27 '20 12:07 Lucas-Mc

Thanks for your comments on this @Lucas-Mc. Very useful! I am going to try to pick this up and work on a first draft of an Annotation API.

tompollard avatar Sep 04 '25 15:09 tompollard

See also: https://github.com/MIT-LCP/physionet-build/issues/127

tompollard avatar Sep 04 '25 15:09 tompollard

@bemoody here is an approach for discussion tomorrow:

Task

Add an annotation API service to PhysioNet that will allow users to submit and retrieve annotations for published datasets. We could limit annotation permissions to a select group of users (for example, users who have completed a certain training course).

Example use case

Arrhythmia Detection (ECG): We would like to develop an algorithm for detecting ventricular tachycardia.

  1. Define an AnnotationType with start/end indices and event type.
  2. Create an AnnotationCollection, announce the task, and invite contributions.
  3. Annotators mark intervals of ECG records where the arrhythmia occurs, saving these as structured JSON.
  4. Submissions are validated via the API, then used as ground truth for training and evaluation.

Other use cases might include:

  • Pneumonia Detection (X-ray): To train models that localize pneumonia, radiologists draw bounding boxes or masks on chest X-rays.
  • Symptom Extraction (Clinical Text): To build text models that identify patient symptoms from clinical notes, annotators highlight spans of text.

Design

The service would be created in a new annotation app, with the following core models:

  • AnnotationCollection: Groups of annotations (e.g. annotations relating to a community annotation task)
  • AnnotationType: Schema definitions for different annotation types (e.g., cardiac events, sleep stages)
  • Annotation: Individual annotation instances with JSON data and validation

Examples:

class AnnotationCollection(models.Model):
    """A collection of annotations that can span multiple projects"""
    name = models.CharField(max_length=200)
    description = models.TextField(blank=True)
    created_by = models.ForeignKey('user.User', on_delete=models.CASCADE)
    created_datetime = models.DateTimeField(auto_now_add=True)
    updated_datetime = models.DateTimeField(auto_now=True)
    is_active = models.BooleanField(default=True)
    metadata = models.JSONField(default=dict)

class Annotation(models.Model):
    """Individual annotation instances"""
    collection = models.ForeignKey(AnnotationCollection, on_delete=models.CASCADE)
    annotation_type = models.ForeignKey(AnnotationType, on_delete=models.CASCADE)
    
    # Reference to specific project(s) this annotation relates to
    project = models.ForeignKey('project.PublishedProject', on_delete=models.CASCADE, null=True, blank=True)
    
    data = models.JSONField()
    created_by = models.ForeignKey('user.User', on_delete=models.CASCADE)
    created_datetime = models.DateTimeField(auto_now_add=True)
    updated_datetime = models.DateTimeField(auto_now=True)
    is_active = models.BooleanField(default=True)
    
    # Reference to specific files
    file_path = models.CharField(max_length=500, blank=True)

class AnnotationType(models.Model):
    """Defines different types of annotations (e.g., 'event', 'segment', 'label')"""
    name = models.CharField(max_length=100, unique=True)  # e.g., "cardiac_event", "sleep_stage", "artifact"
    description = models.TextField(blank=True)
    schema = models.JSONField()  # JSON schema for validation
    is_active = models.BooleanField(default=True)
    created_datetime = models.DateTimeField(auto_now_add=True)

Annotation definitions

We would define different annotation types over time. We would initially begin with a single type. Examples of annotation types:

# 1. Cardiac Event Annotation

{
  "name": "cardiac_event",
  "description": "Annotations for cardiac events like arrhythmias",
  "schema": {
    "type": "object",
    "properties": {
      "event_type": {"type": "string", "enum": ["afib", "vtach", "bradycardia", "normal"]},
      "start_time": {"type": "string", "format": "date-time"},
      "end_time": {"type": "string", "format": "date-time"},
      "confidence": {"type": "number", "minimum": 0, "maximum": 1},
      "notes": {"type": "string", "maxLength": 500}
    },
    "required": ["event_type", "start_time", "end_time"]
  }
}

# 2. Sleep Stage Annotation

{
  "name": "sleep_stage",
  "description": "Sleep stage classifications",
  "schema": {
    "type": "object",
    "properties": {
      "stage": {"type": "string", "enum": ["wake", "n1", "n2", "n3", "rem"]},
      "start_time": {"type": "string", "format": "date-time"},
      "duration_seconds": {"type": "number", "minimum": 0},
      "quality": {"type": "string", "enum": ["high", "medium", "low"]}
    },
    "required": ["stage", "start_time", "duration_seconds"]
  }
}

# 3. Signal Quality Annotation

{
  "name": "signal_quality",
  "description": "Signal quality assessments",
  "schema": {
    "type": "object",
    "properties": {
      "quality_rating": {"type": "string", "enum": ["excellent", "good", "fair", "poor", "unusable"]},
      "start_time": {"type": "string", "format": "date-time"},
      "end_time": {"type": "string", "format": "date-time"},
      "issues": {"type": "array", "items": {"type": "string"}},
      "annotator_notes": {"type": "string", "maxLength": 1000}
    },
    "required": ["quality_rating", "start_time", "end_time"]
  }
}

API endpoints

Some example API calls are:

# Get all annotations for MIMIC-III v1.4
GET /api/v1/projects/mimiciii/1.4/annotations/

# Add an annotation to a collection
POST /api/v1/annotations/collections/123/
{
  "annotation_type": "cardiac_event",
  "data": {
    "event_type": "afib",
    "start_time": "2023-01-01T10:30:00Z",
    "end_time": "2023-01-01T10:35:00Z",
    "confidence": 0.95
  },
  "file_path": "data/record001.wfdb",
  "record_id": "record001"
}

# Search for all atrial fibrillation annotations
GET /api/v1/annotations/search/?q=afib&annotation_type=cardiac_event

tompollard avatar Sep 05 '25 03:09 tompollard

Discussion with @bemoody:

  • He likes this.
  • Should we move annotation type to models?
    • More efficient search
    • Tighter control of data and stronger standardization
    • Cost of flexibility
  • How do we manage access?
    • Any annotation that is attached to controlled access data needs to be controlled in the same way.
    • We should recognize that without reviewing annotations, we are publishing unreviewed data.
    • Also need to be conscious of user privacy e.g. possibly not reveal the annotator unless this is explicitly part of the arrangement.

tompollard avatar Sep 05 '25 15:09 tompollard

LabelStudio is a popular annotation app, also built in Django: https://github.com/HumanSignal/label-studio/tree/develop. Might be worth looking at the structure they use for annotations.

Possibly Label?: https://github.com/HumanSignal/label-studio/blob/develop/label_studio/labels_manager/models.py (in which case, the details are here: value = models.JSONField('value', null=False, help_text='Label value'))

tompollard avatar Sep 08 '25 19:09 tompollard

@xborrat and team are using LabelStudio for annotation on a research project that we collaborate on. I'm sure they would be willing to share what they have learned about it, if interested. It appeared to be working well in the demo they gave recently.

briangow avatar Sep 08 '25 20:09 briangow

After discussion with @bemoody @emmyxth and @thomas-sounack, my proposal is:

Core annotation concepts

  • Anchor: Which file the annotation applies to (project, file_path, optional file_sha256, file_format)
  • Location (or target): Where within the file (typed models; e.g., timeseries interval, image bbox, text span)
  • Labels (or content): What the label means semantically (JSON validated against an AnnotationType schema)
  • AnnotationType: Declares the "contract" (i.e. the allowed location/target and a JSON Schema for labels/content)

Data models

# apps/annotations/models.py
import uuid
from enum import Enum

from django.db import models


class AnnotationCollection(models.Model):
    """
    A collection of annotations that can span multiple projects or be project-specific.
    
    Collections provide a way to group related annotations together, whether they're
    from a single PhysioNet project or span multiple datasets. Examples:

        - "Multi-Dataset Sleep Stages" - sleep annotations across multiple datasets
        - "Research Study XYZ Annotations" - all annotations for a specific study
    """
    name = models.CharField(max_length=200)
    description = models.TextField(blank=True)
    created_by = models.ForeignKey('user.User', on_delete=models.CASCADE)
    created_datetime = models.DateTimeField(auto_now_add=True)
    updated_datetime = models.DateTimeField(auto_now=True)


class LocationKind(Enum):
    """
    Supported location types for annotations.
    
    Each location type represents a different way of specifying "where" within
    a file/record. The choice of location type depends on the data modality
    and the nature of the annotation, e.g.:
    
    - TIMESERIES_INTERVAL: Time-based intervals for signals (ECG, EEG, etc.)
    - IMAGE_BBOX: Rectangular bounding boxes in images
    - TEXT_SPAN: Character ranges in text documents
    
    Each AnnotationType specifies which location kind(s) it supports via the
    allowed_location_kind field. This ensures type safety and validation
    when creating annotations.
    
    Examples:
        >>> LocationKind.TIMESERIES_INTERVAL.value
        'timeseries_interval'
        >>> LocationKind.choices()
        [('timeseries_interval', 'Timeseries Interval'), ...]
    """
    TIMESERIES_INTERVAL = 'timeseries_interval'
    IMAGE_BBOX = 'image_bbox'
    TEXT_SPAN = 'text_span'

    @classmethod
    def choices(cls):
        return [(choice.value, choice.value.replace('_', ' ').title()) for choice in cls]


class AnnotationType(models.Model):
    """
    Defines the contract/schema for a specific type of annotation.
    
    This model acts as a formal contract that specifies:
    - What data structure is required for annotation labels (via label_schema)
    - What location type must be used (via allowed_location_kind)
    - What validation rules apply (via location_schema)
    - What the annotation represents semantically (via name, description)
    
    All annotations of this type must follow this contract. The system enforces
    the contract during validation to ensure consistency and data integrity.
    
    Example:
        An "ECG arrhythmia interval" type might require:
        - TimeseriesIntervalLocation for the "where"
        - Labels with event_type (enum), confidence (0-1), notes (optional)
        - Validation that start < end and coordinates are non-negative
    """
    slug = models.SlugField(max_length=100, unique=True)  # e.g., "ecg_interval_label"
    name = models.CharField(max_length=120)
    description = models.TextField(blank=True)

    # JSON Schema for Annotation.labels (semantic labels)
    label_schema = models.JSONField()

    # Which Location model must be used
    allowed_location_kind = models.CharField(
        max_length=40,
        choices=LocationKind.choices(),
        default=LocationKind.TIMESERIES_INTERVAL.value
    )
    version = models.CharField(max_length=20, default='1.0.0')
    created_datetime = models.DateTimeField(auto_now_add=True)


class Annotation(models.Model):
    """
    An individual annotation instance that anchors to a specific file/record.
    
    An annotation represents a single labeled piece of data, consisting of:
    - An anchor (which file/record it applies to)
    - A location (where within that file/record)
    - Label (what the label means)
    
    The annotation must follow the contract defined by its AnnotationType, which
    specifies the required data structure and validation rules.
    
    Key Components:
    - Anchor: Links to a specific file_path within a project (optional)
    - Location: One-to-one relationship with a concrete Location model (e.g., 
      TimeseriesIntervalLocation for time-based annotations)
    - Labels: JSON data validated against the AnnotationType's label_schema
    - Provenance: Metadata about who created it, using what tool, etc.
    
    Examples:
        - ECG arrhythmia annotation: "afib from sample 1000 to 2000 in record001.wfdb"
        - Image bounding box: "dense region at (x,y) with width/height in scan.dcm"
        - Text span: "medical term from character 150 to 200 in report.txt"
    """
    id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)

    collection = models.ForeignKey(AnnotationCollection, on_delete=models.CASCADE, related_name='annotations')
    annotation_type = models.ForeignKey(AnnotationType, on_delete=models.PROTECT, related_name='annotations')

    # Anchor to the file
    project = models.ForeignKey('project.PublishedProject', on_delete=models.CASCADE, null=True, blank=True)
    file_path = models.CharField(max_length=500)
    file_sha256 = models.CharField(max_length=64, blank=True)  # optional content hash
    file_format = models.CharField(max_length=32, blank=True)  # e.g., "wfdb", "dicom", "png", "txt"

    # Labels: validated by AnnotationType.label_schema
    labels = models.JSONField(default=dict, blank=True)

    created_by = models.ForeignKey('user.User', on_delete=models.CASCADE, related_name='created_annotations')
    created_datetime = models.DateTimeField(auto_now_add=True)
    updated_datetime = models.DateTimeField(auto_now=True)


class BaseLocation(models.Model):
    """
    Abstract base class for all Location types that define "where" within a file.
    
    Locations specify the spatial, temporal, or textual position of an annotation
    within its anchored file. Each Location type represents a different
    way of describing position, e.g.:
    
    - TimeseriesIntervalLocation: Time-based intervals (e.g., ECG segments)
    - ImageBBoxLocation: Rectangular regions in images (e.g., bounding boxes)
    - TextSpanLocation: Character ranges in text (e.g., named entity spans)
    
    Common fields:
    - coord_system: The coordinate system used (e.g., 'samples', 'seconds', 'pixels')
    - channel: Optional channel identifier (useful for multi-channel data like ECG leads)
    
    Each annotation must have exactly one Location instance that matches the
    AnnotationType's allowed_location_kind. The Location provides the "where"
    component that, combined with the annotation's labels (the "what"),
    forms a complete annotation.
    
    Examples:
        - TimeseriesIntervalLocation: "from sample 1000 to 2000 in lead II"
        - ImageBBoxLocation: "rectangle at (50,100) with size 200x150 pixels"
        - TextSpanLocation: "characters 150-200 in the diagnosis section"
    """
    annotation = models.OneToOneField(
        Annotation, on_delete=models.CASCADE, related_name='location'
    )
    coord_system = models.CharField(max_length=24, blank=True)  # e.g., 'samples','seconds','pixels','char_offset'
    created_datetime = models.DateTimeField(auto_now_add=True)

    class Meta:
        abstract = True


class TimeseriesIntervalLocation(BaseLocation):
    coord_system = models.CharField(max_length=24, default='samples')
    channel = models.CharField(max_length=32, blank=True)
    start = models.BigIntegerField()
    end = models.BigIntegerField()

class ImageBBoxLocation(BaseLocation):
    coord_system = models.CharField(max_length=24, default='pixels')
    x = models.IntegerField()
    y = models.IntegerField()
    width = models.IntegerField()
    height = models.IntegerField()

class TextSpanLocation(BaseLocation):
    coord_system = models.CharField(max_length=24, default='char_offset')
    begin = models.IntegerField()
    end = models.IntegerField()
    encoding = models.CharField(max_length=16, default='utf-8')

Example of usage for ECG Arrhythmia Annotation

Step 1: Create an Annotation Collection

The PhysioNet team creates a new collection for their project:

collection = AnnotationCollection.objects.create(
    name="Cardiac Arrhythmias",
    description="Annotations for cardiac arrhythmia events",
    created_by=admin
)

Step 2: Define Annotation Type (Schema Selection)

The PhysioNet team defines what types of annotations they want to collect.

# Create a new annotation type with custom schema
arrhythmia_type = AnnotationType.objects.create(
    slug="ecg_arrhythmia_interval",
    name="ECG Arrhythmia Interval",
    description="Time intervals containing cardiac arrhythmia events",
    allowed_location_kind=LocationKind.TIMESERIES_INTERVAL.value,
    version="1.0.0",
    label_schema={
        "type": "object",
        "properties": {
            "event_type": {
                "type": "string",
                "enum": ["afib", "vtach", "pvc", "asystole", "normal", "bradycardia"]
            },
            "confidence": {
                "type": "number",
                "minimum": 0.0,
                "maximum": 1.0
            },
            "severity": {
                "type": "string",
                "enum": ["mild", "moderate", "severe"]
            },
            "notes": {
                "type": "string",
                "maxLength": 500
            }
        },
        "required": ["event_type"],
        "additionalProperties": False
    }
)

Step 3: Receive annotations via the Annotations API

POST /api/v1/annotations/collections/{collection_id}/annotations/
{
    "annotation_type": "ecg_arrhythmia_interval",
    "project": "mimiciii/1.4",  # Project slug/version
    "file_path": "waveforms/p000001/p000001-2133-01-01-00-00-00.wfdb",
    "labels": {
        "event_type": "afib",
        "confidence": 0.95
    },
    "location": {
        "type": "timeseries_interval",
        "coord_system": "samples",
        "channel": "lead_II",
        "start": 50000,
        "end": 55000
    }
}

Step 4: Validation

The system automatically validates incoming annotations against the contract:

# Valid annotation - follows the contract
valid_annotation = {
    "labels": {
        "event_type": "afib",        # Required, valid enum value
        "confidence": 0.95,          # Optional, within range
        "notes": "Clear A-fib pattern"  # Optional, under 500 chars
    },
    "location": {
        "type": "timeseries_interval",  # Matches allowed_location_kind
        "coord_system": "samples",      # Valid coordinate system
        "start": 1000,                  # Required for timeseries
        "end": 2000                     # Required for timeseries
    }
}

# Invalid annotation - violates the contract
invalid_annotation = {
    "labels": {
        "event_type": "invalid_type",  # Not in allowed enum
        "confidence": 1.5              # Outside 0-1 range
    },
    "location": {
        "type": "image_bbox",          # Wrong type - contract says timeseries_interval
        "x": 10, "y": 20               # Wrong fields for timeseries
    }
}
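To make the contract enforcement concrete, here is a minimal pure-Python checker covering just the JSON Schema subset used in these examples (required fields, enums, numeric ranges, additionalProperties); a real implementation would presumably delegate to a library such as jsonschema:

```python
def check_labels(labels, schema):
    """Validate a labels dict against a restricted JSON-Schema-like contract.
    Returns a list of violation messages (an empty list means valid)."""
    errors = []
    props = schema.get("properties", {})
    # Required fields must be present
    for field in schema.get("required", []):
        if field not in labels:
            errors.append(f"missing required field: {field}")
    for field, value in labels.items():
        spec = props.get(field)
        if spec is None:
            if not schema.get("additionalProperties", True):
                errors.append(f"unexpected field: {field}")
            continue
        if "enum" in spec and value not in spec["enum"]:
            errors.append(f"{field}: {value!r} not in {spec['enum']}")
        if isinstance(value, (int, float)):
            if "minimum" in spec and value < spec["minimum"]:
                errors.append(f"{field}: {value} below minimum")
            if "maximum" in spec and value > spec["maximum"]:
                errors.append(f"{field}: {value} above maximum")
    return errors
```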

tompollard avatar Sep 11 '25 20:09 tompollard

Hi @tompollard looks great so far. Questions I'd love to discuss are:

  1. How are we thinking about authentication here? Would users be passing in a token when making a POST request?
  2. We're using serializers to validate schemas, correct? Or pydantic as we were flirting with that notion before.

emmyxth avatar Sep 12 '25 05:09 emmyxth

How are we thinking about authentication here? Would users be passing in a token when making a POST request?

We could use OAuth tokens for authentication. Users would go to /settings/tokens and then generate a personal token, and then this token would be passed in the header when making a request. We do need to think more about how permissions are managed (e.g. who is allowed to submit an annotation? who is allowed to search annotations?)
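Assuming personal tokens along those lines, a client would attach the token as a bearer header on every request; a minimal stdlib sketch (the endpoint and token value are hypothetical):

```python
import urllib.request

def build_request(url, token, body=None):
    """Build an authenticated request for the (hypothetical) annotations API."""
    req = urllib.request.Request(url, data=body, method="POST" if body else "GET")
    req.add_header("Authorization", f"Bearer {token}")
    req.add_header("Content-Type", "application/json")
    return req

req = build_request(
    "https://physionet.org/api/v1/annotations/search/?q=afib",
    token="pn_personal_token",  # hypothetical token value
)
```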

tompollard avatar Sep 12 '25 14:09 tompollard

We're using serializers to validate schemas, correct? Or pydantic as we were flirting with that notion before.

I'm open to suggestions but we could use the standard Django REST serializers I think, e.g.

import jsonschema
from rest_framework import serializers

class AnnotationSerializer(serializers.ModelSerializer):
    class Meta:
        model = Annotation
        fields = ['id', 'collection', 'annotation_type', 'labels', 'created_datetime']

    def validate_labels(self, value):
        # Look up the AnnotationType and validate the labels against its JSON schema
        type_slug = self.initial_data.get('annotation_type')
        if type_slug:
            annotation_type = AnnotationType.objects.filter(slug=type_slug).first()
            if annotation_type:
                jsonschema.validate(value, annotation_type.label_schema)
        return value

tompollard avatar Sep 12 '25 14:09 tompollard

For consistency, LocationKind probably makes more sense as LocationType

tompollard avatar Sep 12 '25 15:09 tompollard

Feedback from our community suggests that it would be helpful to be able to capture relationships between annotations. In particular, we should be able to support the annotation of a clinical diagnosis or treatment decision (an inference) and link this to the underlying evidence/reason.

For example: a clinician might annotate that a patient has sepsis (the inference) and then provide evidence such as elevated white blood cell count, elevated temperature etc. We should think about how to incorporate this into the annotation structure.
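One lightweight way to model this, sketched against the Annotation model proposed earlier (names hypothetical), is a directed self-referential link with a relation type, so an "evidence" annotation can point at the "inference" annotation it supports:

```python
class AnnotationLink(models.Model):
    """Directed relationship between two annotations,
    e.g. evidence (elevated WBC) -> inference (sepsis)."""
    source = models.ForeignKey('Annotation', on_delete=models.CASCADE,
                               related_name='outgoing_links')
    target = models.ForeignKey('Annotation', on_delete=models.CASCADE,
                               related_name='incoming_links')
    relation = models.CharField(max_length=40, default='evidence_for')
    created_datetime = models.DateTimeField(auto_now_add=True)
```

A sepsis annotation would then have incoming 'evidence_for' links from annotations marking the elevated white blood cell count, elevated temperature, and so on.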

tompollard avatar Nov 05 '25 22:11 tompollard