[FEATURE REQUEST] Make a better anvi-script-gen-short-reads

Open FlorianTrigodet opened this issue 1 year ago • 2 comments

The need

A better anvi-script-gen-short-reads that generates more realistic short reads from longer sequences. I want to turn long-read metagenomes into short-read metagenomes.

The solution

Currently anvi-script-gen-short-reads requires a config file like this one:

[general]
short_read_length = 10
error_rate = 0.05
coverage = 100
contig = CTGTGGTTACGCCACCTTGAGAGATATTAGTCGCGTATTGCATCCGTGCCGACAAATTGCCCAACGCATCGTTCCTTCTCCTAAGTAATTTAACATGCGT

You can see what it does: it generates 10 bp reads at a final coverage of 100x with an error rate of 0.05, and writes everything to a single FASTA file.
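For readers who have not used the script, here is a minimal sketch of what this kind of simulation amounts to. It is not the code of anvi-script-gen-short-reads: the function names `simulate_short_reads` and `to_fasta` are hypothetical, reads are assumed to be sampled uniformly along the contig, and errors are assumed to be simple substitutions.

```python
import random

BASES = "ACGT"


def simulate_short_reads(contig, read_length, coverage, error_rate, seed=None):
    """Sample fixed-length reads uniformly from `contig` (hypothetical helper)."""
    if read_length > len(contig):
        raise ValueError("read_length cannot exceed the contig length")
    rng = random.Random(seed)
    # Mean coverage = (number of reads * read length) / contig length.
    num_reads = round(coverage * len(contig) / read_length)
    reads = []
    for _ in range(num_reads):
        start = rng.randint(0, len(contig) - read_length)
        read = list(contig[start:start + read_length])
        for i, base in enumerate(read):
            if rng.random() < error_rate:
                # Assumption: an "error" is a substitution by a different base.
                read[i] = rng.choice([b for b in BASES if b != base])
        reads.append("".join(read))
    return reads


def to_fasta(reads, prefix="read"):
    """Format all reads as a single FASTA string, one record per read."""
    return "".join(f">{prefix}_{i}\n{seq}\n" for i, seq in enumerate(reads))
```

With a 100 bp contig and 10 bp reads, a mean coverage of c needs 10 × c reads. Uniform sampling gives that coverage only on average, so some positions can end up with no read at all.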

Here is a list of improvements:

  • Use a fasta.txt as input sequence
  • Other parameters as command line arguments
  • Create R1 and R2 with insert size parameter

We could include variable insert sizes, or even a two-column file to specify a different coverage for each contig/sequence. But that's optional IMO.
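To make the R1/R2 idea concrete, here is a rough sketch of paired-end generation. Everything in it is an assumption for illustration: the name `simulate_read_pairs` is hypothetical, "insert size" is taken to mean the full fragment length, and R2 is taken to be the reverse complement of the fragment's far end.

```python
import random

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def simulate_read_pairs(sequence, read_length, insert_size, coverage, seed=None):
    """Sample fragments of `insert_size` and return (R1, R2) pairs (hypothetical helper)."""
    if not read_length <= insert_size <= len(sequence):
        raise ValueError("need read_length <= insert_size <= len(sequence)")
    rng = random.Random(seed)
    # Each pair contributes 2 * read_length sequenced bases.
    num_pairs = round(coverage * len(sequence) / (2 * read_length))
    pairs = []
    for _ in range(num_pairs):
        start = rng.randint(0, len(sequence) - insert_size)
        fragment = sequence[start:start + insert_size]
        r1 = fragment[:read_length]                        # forward read
        r2 = reverse_complement(fragment[-read_length:])   # reverse read
        pairs.append((r1, r2))
    return pairs
```

A per-sequence coverage from a two-column file could then be handled by reading it into a dictionary of name to coverage and calling the function once per sequence. A variable insert size would mean drawing `insert_size` from a distribution inside the loop.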

PS: to turn a long-read metagenome into a short-read one, I would use a coverage of 1, but I would also like to have a detection of 1 for the same contigs/long reads, which could be fun to code for paired-end reads :)
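One possible way to get both a coverage of 1 and a detection of 1 is to tile the sequence instead of sampling it at random. The sketch below is only an illustration of that idea: `tile_read_pairs` is a hypothetical name, and the insert size is assumed to be fixed at exactly twice the read length so that R1 and R2 sit back to back.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def tile_read_pairs(sequence, read_length):
    """Cut `sequence` into consecutive fragments of 2 * read_length bases.

    R1 is the first half of each fragment and R2 is the reverse complement
    of the second half, so every base falls in exactly one read.
    """
    step = 2 * read_length
    pairs = []
    for start in range(0, len(sequence), step):
        fragment = sequence[start:start + step]
        r1 = fragment[:read_length]
        # The last fragment may be short, leaving R2 shorter or even empty.
        r2 = reverse_complement(fragment[read_length:])
        pairs.append((r1, r2))
    return pairs
```

The trade-off is that the reads are no longer random and the insert size is fixed, so a real implementation would need to decide how to handle the short trailing fragment and whether some randomness should be kept.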

Beneficiaries

The few people who want to compare read recruitment patterns between long and short reads. Also, that's how we could leverage short-read-based analyses (like anvi-report-inversions) for long-read-only samples.

FlorianTrigodet, May 22 '23 12:05

I'll think about how to do this best. I think we may need to get rid of this script and incorporate https://github.com/merenlab/reads-for-assembly into anvi'o, and go from there :/

meren, May 22 '23 13:05

Oh yes! That would work great!

FlorianTrigodet, May 22 '23 13:05