varsim
varsim copied to clipboard
VarSim generated readnames too long for SAM/BAM
VarSim may generate read names that cannot be written into SAM or BAM format. bwa-mem
fails to perform alignment on these reads.
chr13_31453404_33401170-1248109-,chr13_31453404_33401170-1248192-I,chr13_31453404_33401170-1248107-,chr13_31453404_33401170-1248218-D,chr13_31453404_33401170-1248109-:chr13_31453404_33401170-1248218-D-,chr13_31453404_33401170-1248217--:::::::::::622992MjYyMDg=:0
samtools import
can be used to check validity of read names.
I used seqkit
to rename all reads.
seqkit replace -p .+ -r "seq_{nr}"
Hi @gabeng what commands did you use? Also can you provide a small example to reproduce the issue?
Hi,
these are the steps to reproduce the issue:
Create a small contig from hg38.
This was my region.bed
file.
chr13 31453404 33401170
I used the helper script to extract variants and sequences:
/usr/bin/python /opt/varsim/generate_small_test_ref.py
--regions region.bed
--vcfs hg001.vcf.gz
--outdir small_ref_out
--reference GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18.fa
Simulate reads
Then, reads were simulated with this command:
varsim.py \
--reference small_ref_out/ref.fa \
--id simulated-sample \
--read_length 150 \
--disable_rand_vcf \
--disable_rand_dgv \
--vcfs small_ref_out/hg001.vcf.gz \
--mean_fragment_size 350 \
--sd_fragment_size 50 \
--nlanes 4 \
--total_coverage 1000 \
--java_max_mem 20g \
--simulator_executable /opt/varsim/opt/ART/art_bin_VanillaIceCream/art_illumina \
--out_dir ~/sim-1000 \
--log_dir log \
--work_dir work \
--simulator art \
--vc_prop_het 0 \
--seed 200
I ran about 20 different simulations with different coverage. A single simulation did not contain invalid FASTQ names. Not every FASTQ name is invalid.