
RNA-Bloom Generates Empty FASTA Without Error

Open schorlton opened this issue 1 year ago • 8 comments

As per title. Input file: test.fastq.gz

Command:

rnabloom -t 2 -outdir test_out -long test.fastq -ntcard

Shouldn't it again report that there is too little input data? Big thanks for all of your help!!



RNA-Bloom v2.0.0

java --version
openjdk 17.0.3-internal 2022-04-19
OpenJDK Runtime Environment (build 17.0.3-internal+0-adhoc..src)
OpenJDK 64-Bit Server VM (build 17.0.3-internal+0-adhoc..src, mixed mode, sharing)

schorlton avatar Aug 09 '22 01:08 schorlton

Thanks for reporting this! Yes, this happens when there are too few reads.

kmnip avatar Aug 09 '22 23:08 kmnip

I was able to replicate this, but this is not a bug. The assembled sequences are too short and they all end up in rnabloom.transcripts.short.fa (instead of rnabloom.transcripts.fa).

I have added a warning message for this scenario. The changes will be incorporated in the next release!
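Until that release is available, a check along these lines can tell the two situations apart after a run. This is only a minimal sketch, not part of RNA-Bloom: the two file names are the ones mentioned above, while the helper names `count_fasta_records` and `diagnose_output` and the returned status strings are made up for illustration.

```python
from pathlib import Path


def count_fasta_records(path):
    """Count FASTA records by counting header lines; a missing file counts as 0."""
    p = Path(path)
    if not p.exists():
        return 0
    with p.open() as handle:
        return sum(1 for line in handle if line.startswith(">"))


def diagnose_output(outdir):
    """Return a status string for an RNA-Bloom output directory.

    The status values are hypothetical labels chosen for this sketch:
      "ok"         - the main transcripts file has at least one record
      "only_short" - the main file is empty but the short file is not
      "empty"      - neither file has any records
    """
    outdir = Path(outdir)
    n_long = count_fasta_records(outdir / "rnabloom.transcripts.fa")
    n_short = count_fasta_records(outdir / "rnabloom.transcripts.short.fa")
    if n_long > 0:
        return "ok"
    if n_short > 0:
        return "only_short"
    return "empty"


if __name__ == "__main__":
    # "test_out" is the -outdir value from the command in this issue.
    print(diagnose_output("test_out"))
```

A result of `only_short` would correspond to the case described here: sequences were assembled, but all of them fell below the length cutoff.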

kmnip avatar Aug 13 '22 03:08 kmnip

What is the difference between these files besides above/below length threshold? Is there evidence that the longer transcripts are better supported/higher quality?

schorlton avatar Aug 13 '22 03:08 schorlton

Not at all. The length threshold is the only determining factor for assigning sequences to these two files.

kmnip avatar Aug 13 '22 03:08 kmnip

> Not at all. The length threshold is the only determining factor for assigning sequences to these two files.

Cool. If that's the case, why separate the files at all? Why not have a single assembly output file, with an optional param to filter contigs shorter than x length, with default x=0?

schorlton avatar Aug 13 '22 04:08 schorlton

There is already an option for that (i.e. -length) and its default value is 200, which is what separates the sequences in the two files. All RNA-seq assemblers I can think of have a similar length cutoff option and its default is 100~200 nt. It is not set to zero because very short sequences can potentially be noise.
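To illustrate the idea of a length cutoff, here is a rough sketch of how sequences could be partitioned by a threshold of 200. It is not RNA-Bloom's actual code. The function names are hypothetical, and whether a sequence of exactly 200 nt counts as long or short is an assumption made here (treated as long).

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def split_by_length(records, min_length=200):
    """Partition records into (long, short) lists by sequence length.

    Assumption for this sketch: sequences with len >= min_length are "long".
    """
    long_records, short_records = [], []
    for name, seq in records:
        if len(seq) >= min_length:
            long_records.append((name, seq))
        else:
            short_records.append((name, seq))
    return long_records, short_records


if __name__ == "__main__":
    example = [">a", "ACGT" * 60, ">b", "ACGT"]
    long_records, short_records = split_by_length(read_fasta(example))
    print(len(long_records), "long;", len(short_records), "short")
```

With a small input where every assembled sequence is under the cutoff, the long list comes back empty, which matches the empty rnabloom.transcripts.fa reported in this issue.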

kmnip avatar Aug 13 '22 07:08 kmnip

Thanks for explaining. Contrary to your earlier answer then, it does sound like there is evidence that the longer transcripts are likely higher quality. I guess a warning message will suffice if the non-short transcripts file is empty. Thanks again!

schorlton avatar Aug 13 '22 16:08 schorlton

Sorry, I thought you were asking whether RNA-Bloom uses any evidence to determine that threshold.

kmnip avatar Aug 13 '22 17:08 kmnip