NanoSim produces invalid fastq files

Open jkomyno opened this issue 5 years ago • 11 comments

Hi, I've characterized and then simulated 20000 reads from the E. coli genome. It seems that the simulated_aligned_reads.fastq file generated in the simulation phase isn't a valid FASTQ file, according to fqtools's validate command.

The characterization phase command is:

read_analysis.py genome \
  -i "/data/original/ecoli_R73_2D.fasta" \
  -rg "/data/original/ecoli_K12_MG1655_ref.fa" \
  -o "/data/training" \
  -a minimap2 \
  -t 4

The simulation phase command is:

simulator.py genome \
  -rg "./data/original/ecoli_K12_MG1655_ref.fa" \
  -c "./data/training/training" \
  -o "./data/simulated/simulated" \
  -n 20000 \
  -max 10000 \
  -min 100 \
  -b albacore \
  --seed 42 \
  -dna_type circular \
  --fastq \
  -t 4

fqtools command and validation error:

./fqtools validate ./data/simulated/simulated_aligned_reads.fastq 
ERROR [line 5]: expected header sequence

On the other hand, unaligned reads are ok:

./fqtools validate ./data/simulated/simulated_unaligned_reads.fastq 
OK

jkomyno avatar Nov 27 '20 21:11 jkomyno

Hi @jkomyno, it seems that the error lies in line 5, so could you check what line 5 looks like?
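For anyone following along, here is a minimal Python sketch for looking at specific lines of a large FASTQ file without opening it in an editor. The helper name `show_lines` is hypothetical (it is not part of NanoSim or fqtools), and the path in the usage comment is simply the one from the report above. Printing lines 5 through 8 rather than line 5 alone shows the whole record, which helps if the real problem sits in the sequence or quality line.

```python
from itertools import islice


def show_lines(path, first, last):
    """Return lines first..last (1-based, inclusive) of a text file.

    Hypothetical helper for inspection only; trailing newlines are stripped.
    """
    with open(path) as handle:
        # islice skips the first (first - 1) lines and stops after line `last`.
        return [line.rstrip("\n") for line in islice(handle, first - 1, last)]


# Example usage (path taken from the report above):
# for number, text in enumerate(
#     show_lines("./data/simulated/simulated_aligned_reads.fastq", 5, 8), start=5
# ):
#     print(number, text)
```

Reading lazily with `islice` keeps memory use flat even for very large read files.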

cheny19 avatar Nov 28 '20 06:11 cheny19

I've added the simulated fastq file here (I'm sorry, I thought I had already linked it in the original issue, but I forgot).

Line 5 is the following:

@ENA|U00096|U00096_2138149;aligned_4_R_13_2748_29

jkomyno avatar Nov 28 '20 12:11 jkomyno

It looks like a normal header generated by NanoSim. My guess is that the ; is causing the problem. I quickly checked the fqtools manual, and it seems you can specify which characters are expected, so if ; is not in the default list, the header would be considered invalid. That said, I'm not entirely sure what went wrong. Since I'm busy with my thesis these days, could you try that and let me know how it goes? Thanks!

cheny19 avatar Nov 30 '20 02:11 cheny19

Hi, I ran fqtools -p ';' validate ./data/simulated/simulated_aligned_reads.fastq, but I got the same error.

jkomyno avatar Nov 30 '20 23:11 jkomyno

I thought you said there was no error with unaligned reads before?

cheny19 avatar Dec 01 '20 05:12 cheny19

That was a typo, sorry. I edited the comment so it's clearer.

jkomyno avatar Dec 02 '20 19:12 jkomyno

Hi @cheny19, any update?

jkomyno avatar Dec 08 '20 15:12 jkomyno

Hi @jkomyno, sorry for the lack of updates recently. I don't know much about fqtools's validity criteria. Based on your comment in isONclust, it seems that the tool didn't read the quality scores properly.

@theottlo, do you have any thoughts about this?

cheny19 avatar Dec 10 '20 10:12 cheny19

Hi @jkomyno, I apologize for the delay! I was wondering which version of NanoSim you were using to simulate the reads. It looks like the sequence and quality score lengths are different in the aligned fastq file, which is a known bug in NanoSim v2.6.0 and is fixed in the v3.0.0 pre-release.
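A quick way to check for this kind of mismatch independently of fqtools is sketched below in Python. It assumes the file uses strict four-line records (header, sequence, `+`, quality) with no line wrapping; that layout is an assumption about the simulated output, not something confirmed in this thread. The function name `find_length_mismatches` is hypothetical and not part of NanoSim.

```python
def find_length_mismatches(lines):
    """Yield (header_line_number, header, seq_len, qual_len) for each record
    whose sequence and quality strings differ in length.

    Assumes strict 4-line FASTQ records; a trailing incomplete record
    (fewer than 4 lines) is silently ignored.
    """
    stripped = (line.rstrip("\r\n") for line in lines)
    # Zipping the same iterator four times groups consecutive lines into records.
    records = zip(stripped, stripped, stripped, stripped)
    for index, (header, seq, _plus, qual) in enumerate(records):
        if len(seq) != len(qual):
            yield index * 4 + 1, header, len(seq), len(qual)


# Example usage (path taken from the report above):
# with open("./data/simulated/simulated_aligned_reads.fastq") as handle:
#     for line_no, header, seq_len, qual_len in find_length_mismatches(handle):
#         print(f"line {line_no}: {header} seq={seq_len} qual={qual_len}")
```

If this reports a mismatch in the first record, that would be consistent with a validator complaining at line 5, since a parser still expecting quality characters would not see the next header where it expects one.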

theottlo avatar Dec 15 '20 11:12 theottlo

Hi @theottlo, I believe you have access to the fastq file. I cloned the NanoSim repository a few days after v3.0.0 was released.

jkomyno avatar Dec 15 '20 14:12 jkomyno

Hi @jkomyno,

Sorry for the late reply. I finally had time to install fqtools. I repeated your simulation command, but with the pre-trained human DNA dataset models as input. Unfortunately, I couldn't reproduce the error: the validate results are OK for both aligned and unaligned reads. Could you make sure you are using the latest commit, try simulating with that pre-trained model, and see how it goes?

Cheers, Chen

cheny19 avatar Feb 01 '21 14:02 cheny19