tombo
Extracting raw data from a Tombo LevelStats statistics file to CSV
I ran the following command on my direct RNA sequencing data:
tombo detect_modifications level_sample_compare \
--fast5-basedirs /native/singlefast5/ \
--alternate-fast5-basedirs /ivt/singlefast5 \
--statistics-file-basename level_testing_strain \
--store-p-value \
--statistic-type ks --processes 30
I want to extract the data from the resulting statistics file, and I used the Tombo Python API as follows:
from tombo import tombo_helper, tombo_stats, resquiggle
import pandas as pd
sample_level_stats = tombo_stats.LevelStats('/data/level_testing_strain.tombo.stats')
reg_level_stats = sample_level_stats.get_reg_stats('chrm', '+', 1, 1525)
pd.DataFrame(reg_level_stats).to_csv("/results/tombotest.csv")
The resulting CSV looks like this:
,stat,pos,cov,control_cov
0,2.6928592163926735e-28,2,219,456
1,1.1185329170881968e-21,3,226,463
2,4.624989306606759e-18,4,261,529
3,1.7881359403179843e-25,5,306,533
4,2.540133370261695e-69,6,880,567
5,9.020681930756034e-76,7,1391,574
6,2.3636818898014833e-85,8,1754,578
7,1.1817788672225994e-58,9,2731,582
8,4.566511057754994e-49,10,3743,586
I assume the first column is just the index.
How do I interpret the statistic in the second column? Does a value closer to one indicate a modified base (my guess), or are the most significant values (< 0.05) the modified ones?
The third column is the position of the nucleotide in the reference, and the fourth and fifth columns are the coverage for the sample and the control.
Are the steps I took correct for extracting the statistics from the level_sample_compare output?
The first column is the DataFrame index, which the to_csv method writes by default. Pass index=False to get rid of it. The second column is the p-value from a Kolmogorov-Smirnov test of two populations of current levels, one from the sample and the other from the control. The p-values are lower when the sample and control differ more, so small values (not values close to one) point to positions that are likely modified.
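As a minimal sketch of the two points above, the snippet below writes the table without the index column and ranks positions by p-value. It does not call Tombo: the rows are copied from the CSV in the question, and the column names are taken from its header, so the list of tuples only stands in for whatever get_reg_stats returns. The in-memory buffer is also an assumption, used in place of a real output path.

```python
import io

import pandas as pd

# Rows copied from the CSV in the question: (stat, pos, cov, control_cov).
# In real use this would be the array returned by get_reg_stats.
rows = [
    (2.6928592163926735e-28, 2, 219, 456),
    (1.1185329170881968e-21, 3, 226, 463),
    (4.624989306606759e-18, 4, 261, 529),
    (1.7881359403179843e-25, 5, 306, 533),
    (2.540133370261695e-69, 6, 880, 567),
]
df = pd.DataFrame(rows, columns=["stat", "pos", "cov", "control_cov"])

# index=False drops the unnamed first column from the output.
buffer = io.StringIO()  # stands in for a path such as "/results/tombotest.csv"
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Smallest p-value first: these positions differ most between sample and control.
ranked = df.sort_values("stat").reset_index(drop=True)
print(ranked.head())
```

With real data, passing the output path to to_csv instead of the buffer gives a file whose first column is stat rather than the row index.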