all .pro assemblies

Open rchikhi opened this issue 3 years ago • 10 comments

This thread will be for updates of the .pro assemblies.

number of .pro.gz files analyzed (all of s3://serratus-public/out/21* except *r1p*):

5,726,283

number of .fasta.gz obtained after converting .pro to FASTA and discarding empty files:

3,379,127

rchikhi avatar Jan 23 '21 19:01 rchikhi

Assemblies done (measured by before_rr.fasta existing):

3,378,813

(no idea why in ~300 cases, no before_rr.fasta was created)

Number of empty assemblies:

2,890,521

Thus, non-empty assemblies (i.e. both before_rr.fasta and contigs.fasta exist and are non-empty):

488,292 (14.4%)

For reference, 19% of the rVert assemblies were non-empty.
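As a quick sanity check on the counts above (a minimal sketch that only redoes the arithmetic from the numbers quoted in this thread; the variable names are made up for illustration):

```python
# Counts copied from the comments above.
fasta_inputs = 3_379_127      # non-empty .fasta.gz files after .pro conversion
assemblies_done = 3_378_813   # runs where before_rr.fasta exists
empty_assemblies = 2_890_521  # assemblies that came out empty

# Runs with no before_rr.fasta (the "~300 cases").
missing = fasta_inputs - assemblies_done

# Non-empty assemblies and their share of the completed runs.
non_empty = assemblies_done - empty_assemblies
fraction = non_empty / assemblies_done

print(missing)
print(non_empty)
print(f"{100 * fraction:.2f}%")
```

The difference between the two file counts is what is referred to as "~300 cases", and the share is computed relative to completed assemblies rather than to the number of input files.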

rchikhi avatar Jan 24 '21 11:01 rchikhi

@rchikhi

(no idea why in ~300 cases, no before_rr.fasta was created)

Likely the assembly failed. Can you collect a few logs from those runs?

asl avatar Jan 24 '21 13:01 asl

Can do, let me just finish with the bulk of the results first.

Number of non-empty trim.LHF.fa motifator files:

168,460

rchikhi avatar Jan 24 '21 13:01 rchikhi

Hi @rchikhi, a minor feature request/suggestion for future runs: can you combine all micro-assemblies into one FASTA file? This file should not be too big, only around 1 Gb or so. It would be easier to process on Linux than millions of small FASTAs or millions of directories, each with a small or empty FASTA. This would require embedding the SRA identifier in the sequence label (a.k.a. the FASTA defline), e.g. as a prefix >SRA1234567|NODE_1..., something like that.

rcedgar avatar Jan 24 '21 17:01 rcedgar

Data availability

Individual assemblies (excluding empty files):

s3://serratus-rayan/pro-assembly/individual/

Individual motifator analyses of the above assemblies:

s3://serratus-rayan/pro-assembly/individual_motifator/

For download convenience, the above two folders (assemblies and motifator analyses) are packaged into a tar.gz file each:

s3://serratus-rayan/pro-assembly/individual_assemblies.tar.gz
s3://serratus-rayan/pro-assembly/individual_motifator.tar.gz

All these folders are relatively small (~10 GB) but contain on the order of millions of files.

rchikhi avatar Jan 24 '21 18:01 rchikhi

In addition, for @rcedgar, here are all the motifator outputs (just the LHF files) concatenated into a single file:

s3://serratus-rayan/pro-assembly/all.before_rr.LHF.fasta
s3://serratus-rayan/pro-assembly/all.contigs.LHF.fasta

SRR id is added as follows: >[SRR id][a single space][contig name] e.g. >SRR0123123 NODE_1_xxx.
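To illustrate the defline convention, here is a minimal sketch of how such a prefix could be added and later split off again. This is not the script that produced the files; the function names and the example contig name are hypothetical, and the only thing taken from the description above is the `>[SRR id][single space][contig name]` layout.

```python
import io


def prefix_deflines(accession, fasta_lines):
    """Yield FASTA lines with the accession and one space prepended to each defline."""
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield f">{accession} {line[1:]}"
        elif line:
            # Sequence lines pass through unchanged; blank lines are dropped.
            yield line


def split_defline(defline):
    """Return (accession, contig_name) from a combined defline."""
    accession, contig_name = defline[1:].split(" ", 1)
    return accession, contig_name


# Hypothetical single-contig assembly standing in for one per-accession FASTA.
example = io.StringIO(">NODE_1_length_8_cov_2.0\nACGTACGT\n")
combined = list(prefix_deflines("SRR0123123", example))

print(combined)
print(split_defline(combined[0]))
```

Splitting on the first space only keeps any later spaces in the contig name intact, which is why `split(" ", 1)` is used rather than a plain `split()`.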

rchikhi avatar Jan 24 '21 19:01 rchikhi

And concatenated unitigs/contigs:

s3://serratus-rayan/pro-assembly/all.before_rr.fasta
s3://serratus-rayan/pro-assembly/all.contigs.fasta

rchikhi avatar Jan 24 '21 19:01 rchikhi

For reference, these assemblies were performed using this script:

https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-rdrp-analysis/-/blob/master/assemble_individually.sh

and motifator was run using this script:

https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-rdrp-analysis/-/blob/master/motifator_analyses/run_indiv.sh

rchikhi avatar Jan 24 '21 23:01 rchikhi

Here's an exhaustive list of "reads" that are above 600 bp among the single-end libraries:

https://serratus-rayan.s3.amazonaws.com/rdrp-pan-assembly/prelim/all_se.above_600bp.txt

From that list I extracted the set of 719 accessions that are deemed not to be Illumina short reads:

https://serratus-rayan.s3.amazonaws.com/rdrp-pan-assembly/prelim/nonILMN.txt

rchikhi avatar Jan 25 '21 18:01 rchikhi

Coverage analysis of the motifator hits within the .pro assemblies

s3://serratus-rayan/pro-assembly/depth_summary.csv

schema: sra, header, contig_type, p_cvg1, p_cvg2, p_cvg3-4, p_cvg5-8, p_cvg9plus

where p_cvgX is the percentage of bases of the region where coverage is >= X
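A minimal sketch of how such percentages could be computed from per-base depths, following the ">= X" definition above. It is not the code in depth_summary.py: the function names and the depth values are hypothetical, and using the lower bound of each column name (3 for p_cvg3-4, 5 for p_cvg5-8, 9 for p_cvg9plus) as the threshold X is an assumption.

```python
# Assumed mapping from column name to the coverage threshold X.
THRESHOLDS = {
    "p_cvg1": 1,
    "p_cvg2": 2,
    "p_cvg3-4": 3,
    "p_cvg5-8": 5,
    "p_cvg9plus": 9,
}


def pct_at_least(depths, threshold):
    """Percentage of positions in the region whose depth is >= threshold."""
    if not depths:
        return 0.0
    covered = sum(1 for d in depths if d >= threshold)
    return 100.0 * covered / len(depths)


def summarize(depths):
    """Build the p_cvg* columns for one region from its per-base depths."""
    return {name: pct_at_least(depths, x) for name, x in THRESHOLDS.items()}


# Hypothetical per-base depths over an 8-base region.
depths = [0, 1, 1, 2, 3, 5, 9, 12]
row = summarize(depths)
print(row)
```

Under this reading the columns are cumulative, so each value is less than or equal to the one before it; if the real columns are disjoint bins instead, the thresholds would need an upper bound as well.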

Code used to generate these results:

https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-rdrp-analysis/-/blob/master/bed_analysis.sh
https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-rdrp-analysis/-/blob/master/depth_analysis.sh
https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-rdrp-analysis/-/blob/master/depth_summary.py

rchikhi avatar Feb 04 '21 19:02 rchikhi