anvio
[FEATURE REQUEST] Make a better anvi-script-gen-short-reads
The need
A better anvi-script-gen-short-reads to generate more realistic short reads from longer sequences.
I want to turn long-read metagenomes into short-read metagenomes.
The solution
Currently anvi-script-gen-short-reads requires a config file like this one:
[general]
short_read_length = 10
error_rate = 0.05
coverage = 1
contig = CTGTGGTTACGCCACCTTGAGAGATATTAGTCGCGTATTGCATCCGTGCCGACAAATTGCCCAACGCATCGTTCCTTCTCCTAAGTAATTTAACATGCGT
You can see what it does: it generates 10 bp reads at the coverage set in the config, with an error_rate of 0.05, and writes everything to a single FASTA file.
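To make the behaviour described above concrete, here is a minimal Python sketch of what such a config-driven generator could look like. It is not the script's actual implementation: the `simulate_reads` helper is hypothetical, and it assumes that `coverage` means mean depth and that errors are simple substitutions.

```python
import configparser
import random

CONFIG_TEXT = """
[general]
short_read_length = 10
error_rate = 0.05
coverage = 1
contig = CTGTGGTTACGCCACCTTGAGAGATATTAGTCGCGTATTGCATCCGTGCCGACAAATTGCCCAACGCATCGTTCCTTCTCCTAAGTAATTTAACATGCGT
"""


def simulate_reads(config_text, seed=0):
    """Hypothetical helper: sample single-end reads uniformly from the contig."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    general = parser["general"]
    read_length = general.getint("short_read_length")
    error_rate = general.getfloat("error_rate")
    coverage = general.getfloat("coverage")
    contig = general["contig"]

    rng = random.Random(seed)
    # Assumption: coverage is mean depth, so total read bases ~= coverage * contig length.
    num_reads = int(coverage * len(contig) / read_length)

    reads = []
    for _ in range(num_reads):
        start = rng.randint(0, len(contig) - read_length)
        read = list(contig[start:start + read_length])
        # Assumption: errors are substitutions, applied independently per base.
        for i, base in enumerate(read):
            if rng.random() < error_rate:
                read[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(read))
    return reads


if __name__ == "__main__":
    # Everything goes to a single FASTA stream, as described above.
    for index, read in enumerate(simulate_reads(CONFIG_TEXT)):
        print(f">read_{index}\n{read}")
```

Because read starts are drawn at random, a low coverage value leaves some positions of the contig uncovered.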
Here is a list of improvements:
- Use a fasta.txt as input sequence
- Other parameters as command line arguments
- Create R1 and R2 with insert size parameter
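The three items above could be sketched roughly as follows in Python. This only illustrates the idea and is not an existing anvi'o interface: all flag names (`--read-length`, `--insert-size`, `--coverage`, `--error-rate`, `--output-prefix`, `--seed`) and helper functions are hypothetical. It assumes the input is a plain FASTA file and that "insert size" means the full fragment length, with R1 and R2 taken from the two ends of the fragment.

```python
import argparse
import random

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def read_fasta(path):
    """Return {name: sequence} from a plain FASTA file."""
    sequences, name = {}, None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:].split()[0]
                sequences[name] = []
            elif line and name is not None:
                sequences[name].append(line.upper())
    return {n: "".join(parts) for n, parts in sequences.items()}


def add_errors(read, error_rate, rng):
    """Substitute each base with probability error_rate."""
    bases = []
    for base in read:
        if rng.random() < error_rate:
            base = rng.choice([b for b in "ACGT" if b != base])
        bases.append(base)
    return "".join(bases)


def paired_reads(seq, read_length, insert_size, coverage, rng, error_rate=0.0):
    """Yield (R1, R2) pairs; sequences shorter than insert_size yield nothing."""
    if insert_size < read_length or len(seq) < insert_size:
        return
    # Each pair contributes 2 * read_length bases towards the target depth.
    num_pairs = int(coverage * len(seq) / (2 * read_length))
    for _ in range(num_pairs):
        start = rng.randint(0, len(seq) - insert_size)
        fragment = seq[start:start + insert_size]
        r1 = fragment[:read_length]
        r2 = reverse_complement(fragment[-read_length:])
        yield add_errors(r1, error_rate, rng), add_errors(r2, error_rate, rng)


def build_parser():
    parser = argparse.ArgumentParser(description="Sketch: paired-end reads from a FASTA file")
    parser.add_argument("fasta")
    parser.add_argument("--read-length", type=int, default=150)
    parser.add_argument("--insert-size", type=int, default=300)
    parser.add_argument("--coverage", type=float, default=1.0)
    parser.add_argument("--error-rate", type=float, default=0.0)
    parser.add_argument("--output-prefix", default="reads")
    parser.add_argument("--seed", type=int, default=0)
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    rng = random.Random(args.seed)
    r1_path, r2_path = args.output_prefix + "-R1.fa", args.output_prefix + "-R2.fa"
    with open(r1_path, "w") as r1_out, open(r2_path, "w") as r2_out:
        for name, seq in read_fasta(args.fasta).items():
            pairs = paired_reads(seq, args.read_length, args.insert_size,
                                 args.coverage, rng, args.error_rate)
            for i, (r1, r2) in enumerate(pairs):
                r1_out.write(f">{name}_{i}/1\n{r1}\n")
                r2_out.write(f">{name}_{i}/2\n{r2}\n")
```

Calling `main(["contigs.fa", "--read-length", "150", "--insert-size", "400"])` would write `reads-R1.fa` and `reads-R2.fa`; the file name `contigs.fa` is only a placeholder.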
We could include variable insert sizes, or even a two-column file to specify a different coverage for each contig/sequence. But that's optional IMO.
PS: To turn a long-read metagenome into a short-read one, I would use a coverage of 1, but I would also like a detection of 1 for the same contigs/long reads, which could be fun to code for paired-end reads :)
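One way to get both a coverage of 1 and a detection of 1 with paired-end reads is to tile each sequence instead of sampling it at random. The Python sketch below is only one possible approach, with a hypothetical `tile_paired_reads` helper. It assumes back-to-back fragments of `2 * read_length`, with R1 taken from the first half and R2 as the reverse complement of the second half, so every base ends up in exactly one read.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def tile_paired_reads(seq, read_length):
    """Cover every base of seq exactly once with non-overlapping read pairs.

    The last fragment may be shorter than 2 * read_length, in which case
    R2 (or both mates) is shorter than read_length, and R2 may be empty.
    """
    pairs = []
    step = 2 * read_length
    for start in range(0, len(seq), step):
        fragment = seq[start:start + step]
        r1 = fragment[:read_length]
        r2 = reverse_complement(fragment[read_length:])
        pairs.append((r1, r2))
    return pairs


if __name__ == "__main__":
    for r1, r2 in tile_paired_reads("CTGTGGTTACGCCACCTTGAGAGAT", 5):
        print(r1, r2)
```

The trade-off is that the insert size is fixed at twice the read length, so this does not combine directly with variable insert sizes.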
Beneficiaries
The few people who want to compare read recruitment patterns between long and short reads.
Also, that's how we could leverage short-read-based analyses (like anvi-report-inversions) for long-read-only samples.
I'll think about how to do this best. I think we may need to get rid of this script and incorporate https://github.com/merenlab/reads-for-assembly into anvi'o, and go from there :/
Oh yes! That would work great!