
Indexing on uploaded files of certain types

Open pranjan77 opened this issue 11 years ago • 0 comments

In Shock, when a user uploads a GFF file and wants only a subset of its data, the only option is to download the complete file (76 MB in this case) and parse it locally. It would be great if Shock indexed the uploaded file and saved the index as part of the metadata. Uploaded VCF files for a large sequencing project could be much larger, in the range of roughly 100 GB to 600 GB.

Tabix is a tool for indexing files in certain formats, including GFF, BED, SAM, VCF and psltab, and it lets a user retrieve a subset of the data in a file. For example, a GFF file stores information about features on the genome and has the following format. (See ftp://ftp.jgi-psf.org/pub/compgen/phytozome/v9.0/Ptrichocarpa/annotation/Ptrichocarpa_210_gene.gff3.gz for a sample file.)

Chr01 phytozome9_0 gene 1660 2502 . - . ID=Potri.001G000100;Name=Potri.001G000100
Chr01 phytozome9_0 mRNA 1660 2502 . - . ID=PAC:27043735;Name=Potri.001G000100.1;pacid=27043735;longest=1;Parent=Potri.001G000100
Chr01 phytozome9_0 CDS 1660 2502 . - 0 ID=PAC:27043735.CDS.1;Parent=PAC:27043735;pacid=27043735
Chr01 phytozome9_0 gene 2906 6646 . - . ID=Potri.001G000200;Name=Potri.001G000200
Chr01 phytozome9_0 mRNA 2906 6646 . - . ID=PAC:27045395;Name=Potri.001G000200.1;pacid=27045395;longest=1;Parent=Potri.001G000200
Chr01 phytozome9_0 CDS 6501 6644 . - 0 ID=PAC:27045395.CDS.1;Parent=PAC:27045395;pacid=27045395

The following is how I would do it if the GFF file were on my local system, but I am not sure how to do this in Shock.

(grep ^"#" in.gff; grep -v ^"#" in.gff | sort -k1,1 -k4,4n) | bgzip > sorted.gff.gz
tabix -p gff sorted.gff.gz
tabix sorted.gff.gz Chr01:6644
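As a rough sketch of the sort step above, plus a brute-force cross-check of what a region query should return, the following uses only grep, sort and awk. It assumes tab-separated GFF3 input. The function names `sort_gff` and `gff_region` are hypothetical and are not part of Shock or tabix. `gff_region` scans the whole file, so it is only meant for verifying tabix output on small inputs, not as a replacement for an index.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of Shock or tabix.

# Print header lines first, then features sorted by sequence name
# (column 1) and numeric start position (column 4).
sort_gff() {
    grep '^#' "$1" || true   # a file without header lines is not an error
    grep -v '^#' "$1" | LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k4,4n
}

# Print features on sequence $2 that overlap the closed interval [$3, $4].
# Sequence names are compared exactly, so "chr01" does not match "Chr01".
gff_region() {
    awk -F '\t' -v chrom="$2" -v lo="$3" -v hi="$4" \
        '!/^#/ && $1 == chrom && ($4 + 0) <= (hi + 0) && ($5 + 0) >= (lo + 0)' "$1"
}
```

For example, `sort_gff in.gff | bgzip > sorted.gff.gz` could replace the first command above, and `gff_region in.gff Chr01 6644 6644` lists the features covering position 6644.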

pranjan77, Aug 08 '13 18:08