gosling.js

API change (CsvData): A single way to define genomic fields

Open etowahadams opened this issue 1 year ago • 3 comments

Background

Genomic positions are often defined in terms of a chromosome and a chromosome position (chrom, chromPos).

In order to show multiple chromosomes on the same linear or circular axis, however, these chromosome positions need to be converted to "absolute" position, based on some fixed ordering of the chromosomes. For the human genome, there is a conventional ordering of the chromosomes.

To convert a relative genomic position to an absolute genomic position, you need to know the order of the chromosomes and the size of each chromosome. The absolute position of a given (chrom, chromPos) pair is the sum of the lengths of all chromosomes that come before chrom in the ordering, plus chromPos. For example, the absolute position of position 200 on chromosome 2 is the length of chromosome 1 plus 200: len(chrom1) + 200.
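The conversion described above can be sketched as follows. This is only an illustration: the chromosome sizes are toy values rather than a real assembly, and `buildOffsets` and `toAbsolute` are hypothetical helper names, not Gosling functions.

```javascript
// Toy chromosome sizes in a fixed order (illustrative values, NOT a real assembly).
const chromSizes = [
  ["chr1", 1000],
  ["chr2", 800],
  ["chr3", 600],
];

// Map each chromosome name to the summed length of all chromosomes before it.
function buildOffsets(sizes) {
  const offsets = new Map();
  let cumulative = 0;
  for (const [name, length] of sizes) {
    offsets.set(name, cumulative);
    cumulative += length;
  }
  return offsets;
}

// Convert a (chrom, chromPos) pair to an absolute position.
function toAbsolute(chrom, chromPos, offsets) {
  if (!offsets.has(chrom)) {
    throw new Error(`Unknown chromosome: ${chrom}`);
  }
  return offsets.get(chrom) + chromPos;
}

const offsets = buildOffsets(chromSizes);
console.log(toAbsolute("chr2", 200, offsets)); // len(chr1) + 200 = 1200
```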

Given that a CSV file contains chromosome fields and chromosome position fields, there needs to be some way of associating the right pairs together such that Gosling can calculate the correct absolute position.

Current API

Currently there are two different ways to associate chromosome fields with chromosome position fields, depending on the number of chromosome fields.

Single chromosome field

Most CSV files will have a single chromosome field and one or more position fields. In the example below, we want Gosling to use the CHROM and POS columns to determine the absolute position.

CHROM   POS
chr2    100
chr2    200
chr3    150

data: {
   url: 'my_csv.csv',
   chromosomeField: 'CHROM',
   genomicFields: ['POS']
}

Multiple chromosome fields

More complex CSV files have multiple chromosome fields. genomicFieldsToConvert is a way to associate different position fields with different chromosome fields.

CHROMa  POSa   CHROMb  POSb
chr2    100    chr3    120
chr2    200    chr1    700
chr3    150    chr2    200

data: {
   url: 'my_csv.csv',
   genomicFieldsToConvert: [{
       chromosomeField: "CHROMa",
       genomicFields: ["POSa"]
   },
   {
       chromosomeField: "CHROMb",
       genomicFields: ["POSb"]
   }]
}
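To make the pairing concrete, here is a rough sketch of how each group in genomicFieldsToConvert could be applied to the rows of the table above. The offsets are toy values, `convertRows` is a hypothetical helper, and overwriting the position fields in the returned rows is an assumption for illustration, not a description of Gosling's internals.

```javascript
// Toy cumulative offsets (hypothetical: chr1 is 1000 bp long, chr2 is 800 bp long).
const toyOffsets = { chr1: 0, chr2: 1000, chr3: 1800 };

// Rows as they might look after parsing the CSV above.
const rows = [
  { CHROMa: "chr2", POSa: 100, CHROMb: "chr3", POSb: 120 },
  { CHROMa: "chr2", POSa: 200, CHROMb: "chr1", POSb: 700 },
  { CHROMa: "chr3", POSa: 150, CHROMb: "chr2", POSb: 200 },
];

const genomicFieldsToConvert = [
  { chromosomeField: "CHROMa", genomicFields: ["POSa"] },
  { chromosomeField: "CHROMb", genomicFields: ["POSb"] },
];

// Hypothetical helper: returns new rows in which every listed position field
// holds the absolute position derived from its paired chromosome field.
function convertRows(inputRows, groups, offsets) {
  return inputRows.map((row) => {
    const out = { ...row };
    for (const { chromosomeField, genomicFields } of groups) {
      const offset = offsets[row[chromosomeField]];
      if (offset === undefined) {
        throw new Error(`Unknown chromosome: ${row[chromosomeField]}`);
      }
      for (const field of genomicFields) {
        out[field] = offset + Number(row[field]);
      }
    }
    return out;
  });
}

const converted = convertRows(rows, genomicFieldsToConvert, toyOffsets);
console.log(converted[0].POSa, converted[0].POSb); // 1100 1920
```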

Proposed change: A single way to define genomic fields

Rather than having different ways to define these two use cases, we would like to have a single way to associate the chromosome fields with the chromosome position fields.
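For context, the two current forms carry the same information, so one can be mapped onto the other. A minimal sketch, assuming the list form is the target; `normalizeGenomicFields` is a hypothetical helper, and rejecting specs that mix both forms is an assumption rather than existing Gosling behavior:

```javascript
// Hypothetical normalizer: accepts either of the current forms and
// returns the list form used by genomicFieldsToConvert.
function normalizeGenomicFields(data) {
  const hasSingle = data.chromosomeField !== undefined;
  const hasMulti = data.genomicFieldsToConvert !== undefined;
  if (hasSingle && hasMulti) {
    // Assumption: mixing the two forms is treated as an error.
    throw new Error("Use either chromosomeField or genomicFieldsToConvert, not both");
  }
  if (hasMulti) {
    return data.genomicFieldsToConvert;
  }
  if (hasSingle) {
    return [
      {
        chromosomeField: data.chromosomeField,
        genomicFields: data.genomicFields ?? [],
      },
    ];
  }
  return [];
}

const normalized = normalizeGenomicFields({
  url: "my_csv.csv",
  chromosomeField: "CHROM",
  genomicFields: ["POS"],
});
console.log(JSON.stringify(normalized));
// [{"chromosomeField":"CHROM","genomicFields":["POS"]}]
```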

Option 1: Keep current way to define multiple chromosomes together

@sehilyi

Explicit key names (e.g., chromosomeField) can be mistyped and so result in an error, but they make it clear to users what each value is for and are a little more consistent with other parts of the grammar.

genomicFieldsToConvert: [{
    chromosomeField: "CHROMa",
    genomicFields: ["POSa"]
},
{
    chromosomeField: "CHROMb",
    genomicFields: ["POSb"]
}]

Option 2: Represent chromosome name and positions as key:value pairs

Proposed by @manzt

"genomicFieldsToConvert": {
   "CHROMa": ["POSa"],
   "CHROMb": ["POSb"]
}

A side effect of this design: if each chromosomeField may appear only once when several are defined, then a map makes the invalid state (a duplicated chromosomeField) unrepresentable, whereas with a list we would need to detect and handle duplicates ourselves.
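A small sketch of the difference, assuming a chromosomeField should appear at most once. Both helper names are hypothetical. With the list form a duplicate check is needed; with the map form, converting to the list form cannot produce duplicates. (Note that `JSON.parse` keeps only the last value when a key is repeated, so a repeated key in a JSON spec would be dropped silently rather than reported.)

```javascript
// List form (Option 1): duplicates are representable, so they must be detected.
function findDuplicateChromosomeFields(groups) {
  const seen = new Set();
  const duplicates = new Set();
  for (const { chromosomeField } of groups) {
    if (seen.has(chromosomeField)) {
      duplicates.add(chromosomeField);
    }
    seen.add(chromosomeField);
  }
  return [...duplicates];
}

// Map form (Option 2): keys are unique by construction, so converting
// to the list form never yields a duplicated chromosomeField.
function mapToList(fieldMap) {
  return Object.entries(fieldMap).map(([chromosomeField, genomicFields]) => ({
    chromosomeField,
    genomicFields,
  }));
}

const listWithDuplicate = [
  { chromosomeField: "CHROMa", genomicFields: ["POSa"] },
  { chromosomeField: "CHROMa", genomicFields: ["POSb"] },
];
console.log(findDuplicateChromosomeFields(listWithDuplicate)); // [ 'CHROMa' ]

const fromMap = mapToList({ CHROMa: ["POSa"], CHROMb: ["POSb"] });
console.log(findDuplicateChromosomeFields(fromMap)); // []
```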

Option 3: A data transform

Proposed by @sehilyi

Rather than keeping the relative-to-absolute conversion implicit inside Gosling, it could be made explicit to the user. The user would configure a data transform that creates a new field holding the absolute genomic position. This option is probably the most verbose but also the most flexible.

dataTransform: [
    { "type": "relToAbsCoordinates", "chromosomeField": "CHROMa", "genomicField": "POSa", "newField": "POSa_absolute" },
    { "type": "relToAbsCoordinates", "chromosomeField": "CHROMb", "genomicField": "POSb", "newField": "POSb_absolute" }
]
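As a rough illustration of what such a transform might do to each row: `relToAbsCoordinates` is only the name proposed in this issue, `applyRelToAbs` is a hypothetical helper, and the offsets are toy values. Unlike the implicit conversion, the original position fields stay untouched and the result lands in newField.

```javascript
// Toy cumulative offsets (hypothetical: chr1 is 1000 bp long, chr2 is 800 bp long).
const offsetsByChrom = { chr1: 0, chr2: 1000, chr3: 1800 };

const dataTransform = [
  { type: "relToAbsCoordinates", chromosomeField: "CHROMa", genomicField: "POSa", newField: "POSa_absolute" },
  { type: "relToAbsCoordinates", chromosomeField: "CHROMb", genomicField: "POSb", newField: "POSb_absolute" },
];

// Hypothetical helper: applies each relToAbsCoordinates step in order,
// adding newField to every row and leaving other transform types alone.
function applyRelToAbs(inputRows, transforms, offsets) {
  return transforms.reduce((current, t) => {
    if (t.type !== "relToAbsCoordinates") {
      return current;
    }
    return current.map((row) => {
      const offset = offsets[row[t.chromosomeField]];
      if (offset === undefined) {
        throw new Error(`Unknown chromosome: ${row[t.chromosomeField]}`);
      }
      return { ...row, [t.newField]: offset + Number(row[t.genomicField]) };
    });
  }, inputRows);
}

const transformed = applyRelToAbs(
  [{ CHROMa: "chr2", POSa: 100, CHROMb: "chr3", POSb: 120 }],
  dataTransform,
  offsetsByChrom
);
console.log(transformed[0].POSa_absolute, transformed[0].POSb_absolute); // 1100 1920
```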

etowahadams avatar May 05 '23 19:05 etowahadams