TreeToReads Support for threading and additional read simulation parameters

I'm not sure if there's any interest in this PR but I use TreeToReads quite a bit to simulate data for courses I teach and try and incorporate additional errors and variance to make read sets more realistic. This PR adds support for all the relevant ART parameters for more control over the simulation (and adding "randomness"). This also adds supports for threads (via the threads config item) and pigz, if available, for compression. All the updates are backwards compatible, so existing TTR config files will work the exact same way.

New or updated config options:

threads: Number of threads for read simulation, if not set or set to {0,1} only 1 thread is used read_length: Now supports comma separated lists of read length (e.g. 150,250) to generate reads with variable lengths

Fragment length can be randomized using the frag_{low,high,step} options frag_low : Start of random frag length range frag_high: End of random frag length range frag_step: Step size for binning frag length (e.g. 5 = count from low to high by 5's)

Fragment length standard deviation can be randomized using the stdev_frag_{low,high,step} options stdev_frag_low : Start of random frag length stdev range stdev_frag_high: End of random frag length stdev range stdev_frag_step: Step size for binning frag length stdev (e.g. 5 = count from low to high by 5's)

Average genomic read coverage. Can be uniform [coverage] across dataset or randomly distributed between [coverage_low, coverage_high].
coverage = 20 coverage_low: Start of random coverage range coverage_high: End of random coverage range coverage_step: Step size for binning random coverage

Average QS2 bias to introduce. Can be 0 (no value provided) across dataset or randomly distributed between [qs2_low, qs2_high] qs2_low: Start of QS2 bias range qs2_high: End of QS2 bias range qs2_step: Step size for binning QS2 bias

Platform read error profile used to generate reads. Supports the following (between read length), and defaults to MiSeq v3 to allow reads up to 250bp. If read lengths are provided that are greater than the profile(s) select support, they'll be truncated to the maxmin of the profile(s) selected

GA1 - GenomeAnalyzer I (1-44bp), GA2 - GenomeAnalyzer II (1-75bp)
HS10 - HiSeq 1000 (1-100bp), HS20 - HiSeq 2000 1-100bp), HS25 - HiSeq 2500 (1-150bp)
HSXn - HiSeqX PCR free (1-150bp), HSXt - HiSeqX TruSeq (1-150bp),  MinS - MiniSeq TruSeq (1-50bp)
MSv1 - MiSeq v1 (1-250bp), MSv3 - MiSeq v3 (1-250bp), NS50 - NextSeq500 v2 (1-75bp)

Nov 25 '19 14:11 ar0ch

Thanks so much! It makes me happy to hear you are using it, and I appreciate the PR. It'll take a look over the next few days, and merge soon!

Nov 25 '19 19:11 snacktavish

I started taking a look at this - but the example and tests aren't running. When I clone your version and run the example I get an error.

$python treetoreads.py example.config
Random seed is 2860921143519957845
Running TreetoReads using configuration file example.config
Arguments read
Using 10 threads for read simulation
Number of variable sites is 20
clustering proportion is 0.25
exponential_mean is 2
Using uniform coverage, coverage = 20
Using uniform fragment size
Using fixed-size stdev
No QS2 bias applied
Invalid read profile provided, M, setting to MSv3
Traceback (most recent call last):
  File "treetoreads.py", line 1091, in <module>
    ttr = TreeToReads(configfi=args.config_file, main=1)
  File "treetoreads.py", line 44, in __init__
    self._check_args()
  File "treetoreads.py", line 330, in _check_args
    self.argdict['readProfile'][self.argdict['readProfile'].index(profile)] = 'MSv3'
TypeError: 'str' object does not support item assignment

I can dig into after the semester is over - but in the meantime, any ideas?

Dec 05 '19 19:12 snacktavish

Alright, this should fix that error. All the ART params that are/can be randomized have to be stored in lists and I was failing to do that when no readProfile config item was provided.

Dec 05 '19 19:12 ar0ch

Thanks for the update and sorry for my super slow replies! I'm still getting an error running the example.config

Simulating FASTQ readsTraceback (most recent call last):
  File "treetoreads.py", line 1088, in <module>
    ttr = TreeToReads(configfi=args.config_file, main=1)
  File "treetoreads.py", line 52, in __init__
    self.run_art()
  File "treetoreads.py", line 932, in run_art
    with Pool(processes=self.threads) as pool:
AttributeError: __exit__

Potentially because I am running an older version of Art?

AR`T_Illumina (2008-2016)          
          Q Version 2.5.8 (June 6, 2016)

Jan 18 '20 01:01 snacktavish

It's no problem, we're all busy! By chance are you using Python 2.x or an early 3.x? As far as I'm aware multiprocessing.Pool doesn't support context management (with ... statements) pre-Python 3.4 (or there abouts). If you want to maintain Py2.x compatibility, I can rewrite the threading to be compatible.

Jan 20 '20 13:01 ar0ch

Ah right. I was running it in a python2 virtual env. Even though it is 2020 it is probably worth maintaining compatibility. I made a code on comment on a simple way to do that, as well as a fix to the zipping with pigz. Otherwise looks great and is much quicker!!

Jan 20 '20 18:01 snacktavish

OK, this should give us Python 2.7.x and 3.y (where x > 6 and y > 4) compatibility with threads still.

python2 --version; python2 treetoreads.py example.config > py2.log; diskus example_out; rm -rf example_out
Python 2.7.11
Using 10 threads for read simulation
Using uniform coverage, coverage = 20
Using uniform fragment size
Using fixed-size stdev
No QS2 bias applied
Read length greater than allowed by profile, setting to profile max
2.83 MB (2,834,432 bytes)

And Python3

python3 --version; python3 treetoreads.py example.config > py3.log; diskus example_out; rm -rf example_out
Python 3.7.3
Using 10 threads for read simulation
Using uniform coverage, coverage = 20
Using uniform fragment size
Using fixed-size stdev
No QS2 bias applied
Read length greater than allowed by profile, setting to profile max
2.77 MB (2,772,992 bytes)

Jan 21 '20 13:01 ar0ch

TreeToReads TreeToReads copied to clipboard

Support for threading and additional read simulation parameters

TreeToReads
TreeToReads copied to clipboard