Transcript headers follow different formats

Open schorlton opened this issue 1 year ago • 3 comments

Please report

  • [x] version of RNA-Bloom with java -jar RNA-Bloom.jar -version
  • [x] version of java with java -version
  • [x] exact command used to run RNA-Bloom

I'm running RNA-Bloom indiscriminately on input files to see whether they assemble. I don't check the files beforehand because I want to leave it to RNA-Bloom to decide whether it can assemble anything. Interestingly, RNA-Bloom produces different FASTA header formats for different outputs.

Sometimes I get:

 >3 l=228 c=1.1 s=8

Other times I get:

 >s1

Note that these are with different inputs. Is it possible to output the same header format each time? In the latter format, does coverage=1?

Thanks!!

RNA-Bloom v2.0.0

java --version
openjdk 17.0.3-internal 2022-04-19
OpenJDK Runtime Environment (build 17.0.3-internal+0-adhoc..src)
OpenJDK 64-Bit Server VM (build 17.0.3-internal+0-adhoc..src, mixed mode, sharing)

Command:

rnabloom -outdir rnabloom_out -t 8 -long input.fastq -ntcard

Sample input read to reproduce single-element header:

@read1
AATTTGGGTGTTTAACCAGTCATCGCCTACCGTGACTTCGGATTCATCGTGTTTCGTTTTCGTGCGCCGCTTCAACATGGGGCTAATCATTGCTTTCGTGCGCCATTCAACATGGAATAATCATTGCTTTTTCGTGCGCCGCTTCAACATGGGGGGCCACGCGCGCGTCCCCCGAAGGCGCGTAACGCTGTGGCGGCCTGCTT
+
%*'('((,./;:3,''%%&#$%(*$$&(*-30441004/*.1110)*.06{?;?<)57??@76341{9334?C9B@:999JA?;88<@::7610/--+224.,,'&&''-612105'&&,127<<820.-:::34475{;545-?8454;==??8877...F{{{{<//101/.*,/12{{1.'&&$$$$%$'('''$%&&&'

schorlton, Aug 08 '22 19:08

Hi @schorlton,

Are you seeing different FASTA header formats in the final output (i.e. rnabloom.transcripts.fa) of different assemblies? Or do you mean that different output FASTA files from the same assembly have different header formats?

If it is the latter, then it is actually intentional.

Ka Ming

kmnip, Aug 08 '22 20:08

> Are you seeing different FASTA header formats in the final output (i.e. rnabloom.transcripts.fa) of different assemblies?

Yes, this. Different input reads lead to differently formatted FASTA headers. Sorry that wasn't clear. I like the

 >3 l=228 c=1.1 s=8

header format because I use the coverage and length information. However, not all transcripts have this information in the header; e.g., if you run RNA-Bloom on the example read above, you'll only get a FASTA header with a sequence identifier and no coverage or length information.

schorlton, Aug 09 '22 01:08

Ah, ok. You see this header style in some outputs but not others because some assemblies may have ended at an earlier stage.

To resolve this issue, I will try to standardize the final output FASTA regardless of the assembly endpoint.
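
Until the headers are standardized, a downstream script could tolerate both styles. Below is a minimal sketch of that idea in Python, not RNA-Bloom code: `parse_header` and `iter_records` are hypothetical helper names, and the sketch assumes that `l` is the length and `c` is the coverage, as this thread implies. The meaning of `s` is not stated here, so it is kept as an opaque key.

```python
import re

# Hypothetical helpers, not part of RNA-Bloom.
# Assumption: header fields look like "key=value" (e.g. l=228 c=1.1 s=8).
_FIELD = re.compile(r"^([A-Za-z]+)=(\S+)$")


def parse_header(line):
    """Split a FASTA header into (sequence_id, {key: value}).

    Works for both ">3 l=228 c=1.1 s=8" and the bare ">s1" style;
    the bare style simply yields an empty field dictionary.
    """
    if not line.startswith(">"):
        raise ValueError("not a FASTA header: %r" % line)
    tokens = line[1:].split()
    if not tokens:
        raise ValueError("empty FASTA header")
    fields = {}
    for token in tokens[1:]:
        match = _FIELD.match(token)
        if match:
            fields[match.group(1)] = match.group(2)
    return tokens[0], fields


def iter_records(lines):
    """Yield (sequence_id, fields, sequence) from an iterable of FASTA lines."""
    seq_id, fields, chunks = None, {}, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, fields, "".join(chunks)
            seq_id, fields = parse_header(line)
            chunks = []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, fields, "".join(chunks)
```

With this, the length can fall back to the sequence itself, e.g. `int(fields.get("l", len(seq)))`, while `fields.get("c")` returns `None` when the header carries no coverage value, so the script does not have to guess one.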

kmnip, Aug 09 '22 23:08