jbrowse-components

UCSC data loading

Open cmdcolin opened this issue 2 years ago • 2 comments

UCSC is a great data resource, and they do a lot of curation, but the data is organized in such a way that "not all the interesting data will be in a simple GTF".

An example of a non-GTF file that contains interesting curated info is their "kgXref" table, which holds the cross-references to other databases: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/kgXref.txt.gz
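To make the kgXref idea concrete, here is a minimal sketch of reading that gzipped, tab-separated dump into a lookup keyed by transcript ID. The column names and their order are an assumption based on my reading of UCSC's kgXref.sql schema and should be checked against that file; `parse_kgxref` and `KGXREF_COLUMNS` are hypothetical names, not part of any JBrowse or UCSC tool.

```python
import csv
import gzip
import io

# ASSUMPTION: column order taken from UCSC's kgXref.sql for hg38.
# Verify against the .sql file that ships next to kgXref.txt.gz.
KGXREF_COLUMNS = [
    "kgID", "mRNA", "spID", "spDisplayID", "geneSymbol",
    "refseq", "protAcc", "description", "rfamAcc", "tRnaName",
]


def parse_kgxref(fileobj):
    """Return {kgID: row dict} from a binary file object of gzipped TSV rows."""
    xrefs = {}
    with gzip.open(fileobj, mode="rt", encoding="utf-8", newline="") as text:
        reader = csv.reader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
        for fields in reader:
            if not fields:
                continue  # skip blank lines
            # Pad short rows so every assumed column gets a value.
            fields = fields + [""] * (len(KGXREF_COLUMNS) - len(fields))
            row = dict(zip(KGXREF_COLUMNS, fields))
            xrefs[row["kgID"]] = row
    return xrefs
```

With a downloaded copy of the file, usage would look like `with open("kgXref.txt.gz", "rb") as fh: xrefs = parse_kgxref(fh)`, after which `xrefs[transcript_id]["geneSymbol"]` could be attached to the matching knownGene feature.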

It may be useful to continue the UCSC compatibility efforts (the ucsc-to-json of JBrowse 1 was quite powerful), downloading the data either statically or dynamically from their REST API.
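For the dynamic option, a rough sketch of querying the UCSC REST API for one region might look like the following. It assumes the `/getData/track` endpoint on api.genome.ucsc.edu with `genome`, `track`, `chrom`, `start` and `end` parameters, and that `&`-separated parameters are accepted; the function names are hypothetical and nothing here is how JBrowse actually does it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# ASSUMPTION: public UCSC REST API host and endpoint layout.
API_BASE = "https://api.genome.ucsc.edu"


def track_data_url(genome, track, chrom, start, end):
    """Build a getData/track URL; coordinates are passed through unchanged."""
    query = urlencode({
        "genome": genome,
        "track": track,
        "chrom": chrom,
        "start": start,
        "end": end,
    })
    return f"{API_BASE}/getData/track?{query}"


def fetch_track_data(genome, track, chrom, start, end, timeout=30):
    """Fetch one region and return the decoded JSON response (needs network)."""
    url = track_data_url(genome, track, chrom, start, end)
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

One request per region keeps nothing on disk, but every pan or zoom then depends on the remote service responding, which is the trade-off against static bulk downloads.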

inspired by their blog post here http://genome.ucsc.edu/goldenPath/newsarch.html#090723

cmdcolin avatar Sep 08 '23 21:09 cmdcolin

there is more in the "linked tables" beyond even kgXref too; see this screenshot from their Table Browser

[Screenshot: "Select Fields from hg38 knownGene"]

key point: this uses the "select fields from primary and related tables" option

cmdcolin avatar Sep 08 '23 21:09 cmdcolin

might try to create a new ucsc-to-json script following discussions from yesterday. the alternative is using their REST API, but I think it will be slower and less reliable than bulk loading. will try to get a handle on how much data (gigabytes, etc.) is used in the process
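As a rough way to get that handle on data volume, one could total the on-disk size of whatever the bulk download produces. This is only a sketch: `total_bytes` and `human_size` are hypothetical helpers, and the directory to measure is whatever the loading script writes to.

```python
from pathlib import Path


def total_bytes(root):
    """Sum the sizes of all regular files under root, recursively."""
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())


def human_size(num_bytes):
    """Format a byte count using binary units (KiB, MiB, GiB, TiB)."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} TiB"
```

Running `human_size(total_bytes(download_dir))` before and after processing would show both the raw download size and how much the converted output adds.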

cmdcolin avatar Mar 22 '24 18:03 cmdcolin

we now have a ucsc browser at http://s3.amazonaws.com/jbrowse.org/code/jb2/main/index.html?config=%2Fjbrowse.org%2Fdemos%2Fucsc%2Fconfig.json

it uses https://github.com/cmdcolin/ucsc2jbrowse to bulk load files

it can be improved on, potentially by loading things like the kgXref table mentioned above to expose a bunch of extra feature metadata

cmdcolin avatar Apr 17 '24 15:04 cmdcolin