feat: Make BED v1 a primitive data format

Open manzt opened this issue 2 years ago • 4 comments

Motivation

BED (Browser Extensible Data) format provides a flexible way to define the data lines that are displayed in an annotation track. It has recently been formalized in the v1 specification.

Gosling currently supports BED via CSV, but the specification is quite verbose, and users can define any field names they'd like for the standard BED fields:

Specifying BED12+1 in Gosling as CSV
{
  "type": "csv",
  "url": "https://localhost:8080/data.bed",
  "headerNames": ["chrom", "chromStart", "chromEnd", "name", "score", "strand", "thickStart", "thickEnd", "itemRgb", "blockCount", "blockSizes", "blockStarts", "myField"],
  "chromosomeField": "chrom",
  "genomicFields": ["chromStart", "chromEnd"],
  "quantitativeFields": ["score", "thickStart", "thickEnd", "blockCount"],
  "separator": "\t"
}

Proposal

Add BED as a new data-type in Gosling. BED is designed for this exact use case, and should be the preferred format for representing text-based genomic annotation data (over a custom CSV capturing identical information). Using BED will make specifications less verbose and more reusable. Using BED has the additional side-effect of ensuring datasets behind a Gosling visualization are more likely to be interoperable with other genomics tools.

interface BED {
  type: "bed";
  url: string;
  customFields?: string[];
  separator?: string;
}
Specifying BED12+1 in Gosling as BED
{
  "type": "bed",
  "url": "https://localhost:8080/data.bed",
  "customFields": ["myField"]
}

manzt avatar Nov 16 '21 16:11 manzt

Thank you for creating this issue! This will be a helpful update to make our grammar more genomic-specific.

One quick clarification - By the length of customFields, we will infer the number of standard and custom fields, i.e., if the length is 1, then we consider the last column to be the custom one while the other fields are standard ones.

sehilyi avatar Nov 16 '21 16:11 sehilyi

One quick clarification - By the length of customFields, we will infer the number of standard and custom fields, i.e., if the length is 1, then we consider the last column to be the custom one while the other fields are standard ones.

Yes exactly. We can determine BEDn+m from the custom fields alone (n = total # of columns - m). Custom fields can only follow standard fields, so the order of customFields matters and the number of custom fields tells us how many standard fields are present.

e.g.

For a TSV with 4 columns

{
  "type": "bed",
  "url": "https://localhost:8080/data.bed"
}

Interpretation is BED4 (chrom, chromStart, chromEnd, name)

{
  "type": "bed",
  "url": "https://localhost:8080/data.bed",
  "customFields": ["custom"]
}

Interpretation is BED3+1 (chrom, chromStart, chromEnd, custom)
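
The inference above could be sketched roughly as follows in TypeScript. This is only an illustration, not Gosling's implementation: `BED_STANDARD_FIELDS` and `inferBedFields` are hypothetical names, and the sketch assumes the total column count has already been read from the file.

```typescript
// Standard BED12 column names, in the order defined by the BED format.
const BED_STANDARD_FIELDS: readonly string[] = [
  "chrom", "chromStart", "chromEnd", "name", "score", "strand",
  "thickStart", "thickEnd", "itemRgb", "blockCount", "blockSizes", "blockStarts",
];

// Hypothetical helper: derive column names for a BEDn+m file, where
// m = customFields.length and n = totalColumns - m.
function inferBedFields(totalColumns: number, customFields: string[] = []): string[] {
  const standardCount = totalColumns - customFields.length;
  // BED requires at least chrom, chromStart, chromEnd and defines at most 12 standard fields.
  if (standardCount < 3 || standardCount > BED_STANDARD_FIELDS.length) {
    throw new Error(
      `Cannot interpret ${totalColumns} columns with ${customFields.length} custom field(s) as BED`
    );
  }
  // Custom fields always follow the standard ones, in the order given.
  return [...BED_STANDARD_FIELDS.slice(0, standardCount), ...customFields];
}

// The two cases for a TSV with 4 columns:
const bed4 = inferBedFields(4);                  // BED4: four standard fields
const bed3plus1 = inferBedFields(4, ["custom"]); // BED3+1: three standard fields, then "custom"
```

The sketch only checks that n falls between 3 and 12; any further constraints from the specification would need to be validated separately.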

manzt avatar Nov 16 '21 16:11 manzt

The final thing here is whether types need to be defined for the custom fields. This is similar to part of the discussion in #579, and I'd argue for a similar reason they are not necessary.

manzt avatar Nov 16 '21 16:11 manzt

This is similar to part of the discussion in #579, and I'd argue for a similar reason they are not necessary.

I assume the custom fields will be either nominal or quantitative. If so, I agree with not requiring users to specify the field types.
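
If types are not declared, one possible way to tell nominal from quantitative custom fields is to look at the parsed values. The sketch below is an assumption about how that could work, not a description of Gosling's behavior; `FieldType` and `inferFieldType` are hypothetical names.

```typescript
type FieldType = "nominal" | "quantitative";

// Hypothetical helper: treat a column as quantitative only if every
// non-empty value parses as a finite number; otherwise fall back to nominal.
function inferFieldType(values: string[]): FieldType {
  const nonEmpty = values.filter((v) => v.trim() !== "");
  const allNumeric =
    nonEmpty.length > 0 && nonEmpty.every((v) => Number.isFinite(Number(v)));
  return allNumeric ? "quantitative" : "nominal";
}

// Example: values taken from a single custom column.
const numericColumn = inferFieldType(["1", "2.5", "-3"]);
const labelColumn = inferFieldType(["geneA", "2"]);
```

A column with no usable values is treated as nominal here, which is a design choice of the sketch rather than anything stated in this thread.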

sehilyi avatar Nov 16 '21 17:11 sehilyi