varsim icon indicating copy to clipboard operation
varsim copied to clipboard

VarSim generated readnames too long for SAM/BAM

Open gabeng opened this issue 10 months ago • 2 comments

VarSim may generate read names that cannot be written into SAM or BAM format. bwa-mem fails to perform alignment on these reads.

chr13_31453404_33401170-1248109-,chr13_31453404_33401170-1248192-I,chr13_31453404_33401170-1248107-,chr13_31453404_33401170-1248218-D,chr13_31453404_33401170-1248109-:chr13_31453404_33401170-1248218-D-,chr13_31453404_33401170-1248217--:::::::::::622992MjYyMDg=:0

samtools import can be used to check validity of read names.

I used seqkit to rename all reads.

seqkit replace -p .+ -r "seq_{nr}"

gabeng avatar Apr 24 '24 14:04 gabeng

Hi @gabeng what commands did you use? Also can you provide a small example to reproduce the issue?

yunfeiguo avatar Apr 25 '24 02:04 yunfeiguo

Hi,

these are the steps to reproduce the issue:

Create a small contig from hg38.

This was my region.bed file.

chr13	31453404	33401170

I used the helper script to extract variants and sequences:

/usr/bin/python /opt/varsim/generate_small_test_ref.py 
  --regions region.bed 
  --vcfs hg001.vcf.gz 
  --outdir small_ref_out 
  --reference GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18.fa

Simulate reads

Then, reads were simulated with this command:

varsim.py \
	--reference small_ref_out/ref.fa \
	--id simulated-sample \
	--read_length 150 \
	--disable_rand_vcf \
	--disable_rand_dgv \
	--vcfs small_ref_out/hg001.vcf.gz \
	--mean_fragment_size 350 \
	--sd_fragment_size 50 \
	--nlanes 4 \
	--total_coverage 1000 \
	--java_max_mem 20g \
	--simulator_executable /opt/varsim/opt/ART/art_bin_VanillaIceCream/art_illumina \
	--out_dir ~/sim-1000 \
	--log_dir log \
	--work_dir work \
	--simulator art \
	--vc_prop_het 0 \
	--seed 200

I ran about 20 different simulations with different coverage. A single simulation did not contain invalid FASTQ names. Not every FASTQ name is invalid.

gabeng avatar Apr 25 '24 07:04 gabeng