gat
Not an issue, but I am confused ...
Hi AndreasHeger,
Problem:
- I want to calculate whether certain annotation features (genes, repeats, etc.) are enriched or depleted in a particular subset of contigs in an assembly.
--workspace: BED file of all regions in genome (excluding regions composed of N's)
--segments: BED file of annotations in subset of contigs
contig_1001 21 792 RepeatMasker
contig_1001 27 34 dust
contig_1001 93 159 dust
contig_1001 246 255 dust
contig_1001 266 339 dust
contig_1001 415 422 dust
--annotations: BED file of annotations across the whole genome (same format as above, but for the whole genome)
The output I get when running:
gat-run.py --ignore-segment-tracks --segments=segments.bed --annotations=annotations.bed --workspace=workspace.bed --num-samples=100 --log=gat.log --num-threads=8 > gat.out
is
track annotation observed expected CI95low CI95high stddev fold l2fold pvalue qvalue track_nsegments track_size track_density annotation_nsegments annotation_size annotation_density overlap_nsegments overlap_size overlap_density percent_overlap_nsegments_track percent_overlap_size_track percent_overlap_nsegments_annotation percent_overlap_size_annotation
merged ncrnas_predicted 2913 1709.1200 1300.0000 1994.0000 209.0009 1.7040 0.7689 1.0000e-02 1.0000e-02 62983 6935174 6.6911e+00 1025 163283 1.5754e-01 30 2913 2.8105e-03 0.0476 0.0420 2.9268 1.7840
merged gene 389744 170648.2000 163172.0000 177856.0000 5359.9760 2.2839 1.1915 1.0000e-02 1.0000e-02 62983 6935174 6.6911e+00 18574 37934616 3.6599e+01 278 389744 3.7603e-01 0.4414 5.6198 1.4967 1.0274
merged tandem 368130 158513.4400 154952.0000 162625.0000 2399.6840 2.3224 1.2156 1.0000e-02 1.0000e-02 62983 6935174 6.6911e+00 47134 4562430 4.4018e+00 4994 368130 3.5517e-01 7.9291 5.3082 10.5953 8.0687
merged RepeatMasker 1492404 610641.4800 602042.0000 620429.0000 6353.3404 2.4440 1.2892 1.0000e-02 1.0000e-02 62983 6935174 6.6911e+00 117147 21502336 2.0745e+01 8705 1492404 1.4399e+00 13.8212 21.5193 7.4308 6.9407
merged dust 3200967 1182955.4000 1172992.0000 1190872.0000 4343.2429 2.7059 1.4361 1.0000e-02 1.0000e-02 62983 6935174 6.6911e+00 382880 14706492 1.4189e+01 63463 3200967 3.0883e+00 100.7621 46.1555 16.5752 21.7657
I am confused:
- shouldn't percent_overlap_size_track and co be 100% for all?
Thank you in advance.
cheers,
dom
Good question. From memory, I think percent_overlap_size_track is the proportion of nucleotides in 'segments' that overlap annotations within the workspace.
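As a quick sanity check on what the percent columns contain, the sketch below recomputes them for the dust row using only the other columns in the posted output. It is inferred from the numbers alone, not from the gat source, so the formulas are an assumption; the helper name `percent` is made up for illustration.

```python
# Values copied from the "dust" row of the output posted above.
row = {
    "track_nsegments": 62983,
    "track_size": 6935174,
    "annotation_nsegments": 382880,
    "annotation_size": 14706492,
    "overlap_nsegments": 63463,
    "overlap_size": 3200967,
}


def percent(numerator, denominator):
    """Return numerator/denominator as a percentage, rounded to 4 decimals."""
    return round(100.0 * numerator / denominator, 4)


# Assumed definitions: each percent column is overlap / (track or annotation).
recomputed = {
    "percent_overlap_nsegments_track": percent(
        row["overlap_nsegments"], row["track_nsegments"]),
    "percent_overlap_size_track": percent(
        row["overlap_size"], row["track_size"]),
    "percent_overlap_nsegments_annotation": percent(
        row["overlap_nsegments"], row["annotation_nsegments"]),
    "percent_overlap_size_annotation": percent(
        row["overlap_size"], row["annotation_size"]),
}

for name, value in recomputed.items():
    print(name, value)
```

Under these assumed formulas the four values come out as 100.7621, 46.1555, 16.5752 and 21.7657, matching the dust row. If that reading is right, percent_overlap_size_track is simply overlap_size divided by track_size, and the values above 100% in the nsegments column arise because more overlapping pieces were counted than there are merged segments.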
It might well be a bug. Are your segments non-overlapping?
There is also the --ignore-segment-tracks option, which merges all the segments. The 46% might mean that 46% of the nucleotides are in dust segments, though I would then expect the total to be 100%. I need to go through the code to remember what happens.
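To answer the overlap question for the sample rows posted above, here is a minimal sketch that merges the intervals per contig and compares the raw and merged sizes. It is not gat's implementation. The function `merge_intervals` is a hypothetical stand-in for what merging segments would conceptually do, and it assumes BED-style half-open coordinates and that touching intervals are merged too.

```python
from collections import defaultdict

# The six sample rows posted above: (contig, start, end, track).
rows = [
    ("contig_1001", 21, 792, "RepeatMasker"),
    ("contig_1001", 27, 34, "dust"),
    ("contig_1001", 93, 159, "dust"),
    ("contig_1001", 246, 255, "dust"),
    ("contig_1001", 266, 339, "dust"),
    ("contig_1001", 415, 422, "dust"),
]


def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


# Group intervals by contig, ignoring the track name (as a merge would).
by_contig = defaultdict(list)
for contig, start, end, _track in rows:
    by_contig[contig].append((start, end))

merged = {contig: merge_intervals(ivs) for contig, ivs in by_contig.items()}

raw_size = sum(end - start for _contig, start, end, _track in rows)
merged_size = sum(end - start for ivs in merged.values() for start, end in ivs)

print("raw size:", raw_size)
print("merged size:", merged_size)
print("segments overlap:", raw_size != merged_size)
```

For these six rows every dust interval falls inside the RepeatMasker interval, so the segments do overlap and the merged size is smaller than the raw sum. Running the same check on the full segments file would show how much the merged track differs from the per-track totals.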