needlestack
needlestack copied to clipboard
`--output_vcf` parameter doesn't work when including a relative path
Hi, i am trying to run the program using:
nextflow run iarcbioinfo/needlestack -with-docker --bed bed/SOPHIA_targetregions_hg19.bed --input_bams bam/SOPHIA/ --ref ref/ucsc.hg19.fasta --output_vcf output/SOPHIA_needlestack.vcf
The bed file is sorted, the bam files are in the corresponding folder with their respective .bai, and in the reference folder are the .fasta, .fai, .dict, .amb, .ann, .bwt, . pac and .sa.
But I get an error in the execution process. I would appreciate your response, thank you.
Error message:
executor > local (3)
[76/9e2dc1] process > bed [100%] 1 of 1 ✔
[18/a87fe0] process > split_bed [100%] 1 of 1 ✔
[91/4ac861] process > mpileup2vcf (chr1_17345370-chrX_14883636... [ 0%] 0 of 1
[- ] process > collect_vcf_result -
Error executing process > 'mpileup2vcf (chr1_17345370-chrX_14883636_regions)'
Caused by:
Process `mpileup2vcf (chr1_17345370-chrX_14883636_regions)` terminated with an error exit status (1)
Command executed:
for cur_bam in BAM/*.bam
do
if [ "BAM" == "FILE" ]; then
# use bam file name as sample name
bam_file_name=$(basename "${cur_bam%.*}")
# remove whitespaces from name
SM1="$(echo -e "${bam_file_name}" | tr -d '[[:space:]]')"
SM2="$(echo -e "${bam_file_name}" | tr -d '[[:space:]]')"
else
# get bam file names
bam_file_name=$(basename "${cur_bam%.*}")
# remove whitespaces from name
SM1="$(echo -e "${bam_file_name}" | tr -d '[[:space:]]')"
# extract sample name from bam file read group info field
SM2=$(samtools view -H $cur_bam | grep "^@RG" | tail -n1 | sed "s/.*SM:\([^ ]*\).*/\1/" | tr -d '[:space:]')
fi
printf "$SM1 $SM2\n" >> names.txt
done
if [ "FALSE" != "FALSE" ]; then
abs_pairs_file=$(readlink -f FALSE)
else
abs_pairs_file="FALSE"
fi
set -o pipefail
i=1
...
{ while read bed_line; do
samtools mpileup --fasta-ref ucsc.hg19.fasta --region $bed_line --ignore-RG --min-BQ 13 --min-MQ 0 --max-idepth 1000000 --max-depth 50000 BAM/*.bam | sed 's/ / * */g'
i=$((i+1))
done < chr1_17345370-chrX_14883636_regions
} | mpileup2readcounts 0 -5 false 3 0 | Rscript /home/fer/.nextflow/assets/iarcbioinfo/needlestack/bin/needlestack.r --pairs_file=${abs_pairs_file} --source_path=/home/fer/.nextflow/assets/iarcbioinfo/needlestack/bin/ --out_file=chr1_17345370-chrX_14883636_regions.vcf --fasta_ref=ucsc.hg19.fasta --input_bams=BAM/ --ref_genome=null --GQ_threshold=50 --min_coverage=30 --min_reads=3 --min_af=0 --SB_type=SOR --SB_threshold_SNV=100 --SB_threshold_indel=100 --output_all_SNVs=false --do_plots=ALL --do_alignments=false --plot_labels=true --add_contours=true --extra_rob=false --min_af_extra_rob=0.2 --min_prop_extra_rob=0.1 --max_prop_extra_rob=0.5 --afmin_power=-1 --sigma=0.1
Command exit status:
1
...
Command error:
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
...
[warning] samtools mpileup option `L` is functional, but deprecated. Please switch to using bcftools mpileup in future.
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
..
[mpileup] fail to load index for BAM/ISEM114.recal.reads.bam
Error : Pileup file is empty
Work dir:
/home/fer/Documents/work/91/4ac861e494a55ec631f77ea31b0ce6
Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
Hi,
These are errors produced by samtools when trying to extract read counts information. There might be an issue with your BAM file(s):
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
Maybe try to do a samtools quickcheck
? (https://www.htslib.org/doc/samtools-quickcheck.html)
Can you also check this particular .bai
is present:
[mpileup] fail to load index for BAM/ISEM114.recal.reads.bam
Thank you very much for your reply. Download the bam files again, and run the program again. Apparently the previous problem was solved, but it shows me another error when collecting the results of the vcf
Error message:
executor > local (4)
[f5/bc2a25] process > bed [100%] 1 of 1 ✔
[45/b0edda] process > split_bed [100%] 1 of 1 ✔
[7b/5fbeff] process > mpileup2vcf (chr1_17345370-chrX_14883636... [100%] 1 of 1 ✔
[84/dfbf6c] process > collect_vcf_result [ 0%] 0 of 1
Error executing process > 'collect_vcf_result'
Caused by:
Process `collect_vcf_result` terminated with an error exit status (1)
Command executed:
mkdir VCF
mv *.vcf VCF
# Extract the header from the first VCF
sed '/^#CHROM/q' VCF/chr1_17345370-chrX_14883636_regions.vcf > header.txt
# Add contigs in the VCF header
cat ucsc.hg19.fasta.fai | cut -f1,2 | sed -e 's/^/##contig=<ID=/' -e 's/[ ][ ]*/,length=/' -e 's/$/>/' > contigs.txt
...
Command executed:
mkdir VCF
mv *.vcf VCF
# Extract the header from the first VCF
sed '/^#CHROM/q' VCF/chr1_17345370-chrX_14883636_regions.vcf > header.txt
# Add contigs in the VCF header
cat ucsc.hg19.fasta.fai | cut -f1,2 | sed -e 's/^/##contig=<ID=/' -e 's/[ ][ ]*/,length=/' -e 's/$/>/' > contigs.txt
sed -i '/##reference=.*/ r contigs.txt' header.txt
# Add version numbers in the VCF header
echo '##command=nextflow run iarcbioinfo/needlestack -with-docker --bed bed/SOPHIA_targetregions_hg19_ORDERED.bed --input_bams bam/SOPHIA/ --ref ref/ucsc.hg19.fasta --output_vcf output/SOPHIA_needlestack.vcf' > versions.txt
...
# Check if sort command allows sorting in natural order (chr1 chr2 chr10 instead of chr1 chr10 chr2)
if [ `sort --help | grep -c 'version-sort' ` == 0 ]
then
sort_ops="-k1,1d"
else
sort_ops="-k1,1V"
fi
# Add all VCF contents and sort
grep --no-filename -v '^#' VCF/*.vcf | LC_ALL=C sort -t ' ' $sort_ops -k2,2n >> header.txt
mv header.txt output/SOPHIA_needlestack.vcf
...
``
I think the actual error message has been cut, could copy the content of the .command.err
file in the work/84/dfbf6c...
folder?
Also can you check the content of the VCF/chr1_17345370-chrX_14883636_regions.vcf
file in the same folder? This last process is simply merging all VCF produced when splitting your bed into multiple regions, but in your case you have only one .
When reviewing the .command.sh
file, the last command could not be executed (mv header.txt output/all_variants.vcf
) because at the time of executing that command it was in the path work/84/dfbf6c..
and therefore it could not find the specified output path, since it had used a relative path at the time of executing the program. By correcting that, the program worked perfectly. Thank you very much.
Indeed the --output_vcf
parameter is supposed to be only the name of the VCF file (not including its path) but indeed that's confusing. You can use it together with the --output_folder
parameter. We made this choice because there are other outputs but we could easily also allow the --output_vcf
parameter to include a relative path.
Thanks for spotting this, I hope you will find needlestack useful!
Thank you very much for everything, excellent program