TreeToReads
TreeToReads copied to clipboard
Support for threading and additional read simulation parameters
I'm not sure if there's any interest in this PR but I use TreeToReads quite a bit to simulate data for courses I teach and try and incorporate additional errors and variance to make read sets more realistic. This PR adds support for all the relevant ART parameters for more control over the simulation (and adding "randomness"). This also adds supports for threads (via the threads
config item) and pigz
, if available, for compression. All the updates are backwards compatible, so existing TTR config files will work the exact same way.
New or updated config options:
threads
: Number of threads for read simulation, if not set or set to {0,1} only 1 thread is used
read_length
: Now supports comma separated lists of read length (e.g. 150,250) to generate reads with variable lengths
Fragment length can be randomized using the frag_{low,high,step} options
frag_low
: Start of random frag length range
frag_high
: End of random frag length range
frag_step
: Step size for binning frag length (e.g. 5 = count from low to high by 5's)
Fragment length standard deviation can be randomized using the stdev_frag_{low,high,step} options
stdev_frag_low
: Start of random frag length stdev range
stdev_frag_high
: End of random frag length stdev range
stdev_frag_step
: Step size for binning frag length stdev (e.g. 5 = count from low to high by 5's)
Average genomic read coverage. Can be uniform [coverage] across dataset or randomly distributed between [coverage_low, coverage_high].
coverage
= 20
coverage_low
: Start of random coverage range
coverage_high
: End of random coverage range
coverage_step
: Step size for binning random coverage
Average QS2 bias to introduce. Can be 0 (no value provided) across dataset or randomly distributed between [qs2_low, qs2_high]
qs2_low
: Start of QS2 bias range
qs2_high
: End of QS2 bias range
qs2_step
: Step size for binning QS2 bias
Platform read error profile used to generate reads. Supports the following (between read length), and defaults to MiSeq v3 to allow reads up to 250bp. If read lengths are provided that are greater than the profile(s) select support, they'll be truncated to the maxmin of the profile(s) selected
GA1 - GenomeAnalyzer I (1-44bp), GA2 - GenomeAnalyzer II (1-75bp)
HS10 - HiSeq 1000 (1-100bp), HS20 - HiSeq 2000 1-100bp), HS25 - HiSeq 2500 (1-150bp)
HSXn - HiSeqX PCR free (1-150bp), HSXt - HiSeqX TruSeq (1-150bp), MinS - MiniSeq TruSeq (1-50bp)
MSv1 - MiSeq v1 (1-250bp), MSv3 - MiSeq v3 (1-250bp), NS50 - NextSeq500 v2 (1-75bp)
Thanks so much! It makes me happy to hear you are using it, and I appreciate the PR. It'll take a look over the next few days, and merge soon!
I started taking a look at this - but the example and tests aren't running. When I clone your version and run the example I get an error.
$python treetoreads.py example.config
Random seed is 2860921143519957845
Running TreetoReads using configuration file example.config
Arguments read
Using 10 threads for read simulation
Number of variable sites is 20
clustering proportion is 0.25
exponential_mean is 2
Using uniform coverage, coverage = 20
Using uniform fragment size
Using fixed-size stdev
No QS2 bias applied
Invalid read profile provided, M, setting to MSv3
Traceback (most recent call last):
File "treetoreads.py", line 1091, in <module>
ttr = TreeToReads(configfi=args.config_file, main=1)
File "treetoreads.py", line 44, in __init__
self._check_args()
File "treetoreads.py", line 330, in _check_args
self.argdict['readProfile'][self.argdict['readProfile'].index(profile)] = 'MSv3'
TypeError: 'str' object does not support item assignment
I can dig into after the semester is over - but in the meantime, any ideas?
Alright, this should fix that error. All the ART params that are/can be randomized have to be stored in lists and I was failing to do that when no readProfile config item was provided.
Thanks for the update and sorry for my super slow replies! I'm still getting an error running the example.config
Simulating FASTQ readsTraceback (most recent call last):
File "treetoreads.py", line 1088, in <module>
ttr = TreeToReads(configfi=args.config_file, main=1)
File "treetoreads.py", line 52, in __init__
self.run_art()
File "treetoreads.py", line 932, in run_art
with Pool(processes=self.threads) as pool:
AttributeError: __exit__
Potentially because I am running an older version of Art?
AR`T_Illumina (2008-2016)
Q Version 2.5.8 (June 6, 2016)
It's no problem, we're all busy! By chance are you using Python 2.x or an early 3.x? As far as I'm aware multiprocessing.Pool doesn't support context management (with ...
statements) pre-Python 3.4 (or there abouts). If you want to maintain Py2.x compatibility, I can rewrite the threading to be compatible.
Ah right. I was running it in a python2 virtual env. Even though it is 2020 it is probably worth maintaining compatibility. I made a code on comment on a simple way to do that, as well as a fix to the zipping with pigz. Otherwise looks great and is much quicker!!
OK, this should give us Python 2.7.x and 3.y (where x > 6 and y > 4) compatibility with threads still.
python2 --version; python2 treetoreads.py example.config > py2.log; diskus example_out; rm -rf example_out
Python 2.7.11
Using 10 threads for read simulation
Using uniform coverage, coverage = 20
Using uniform fragment size
Using fixed-size stdev
No QS2 bias applied
Read length greater than allowed by profile, setting to profile max
2.83 MB (2,834,432 bytes)
And Python3
python3 --version; python3 treetoreads.py example.config > py3.log; diskus example_out; rm -rf example_out
Python 3.7.3
Using 10 threads for read simulation
Using uniform coverage, coverage = 20
Using uniform fragment size
Using fixed-size stdev
No QS2 bias applied
Read length greater than allowed by profile, setting to profile max
2.77 MB (2,772,992 bytes)