EmptyDataError in Aviary recover with long and short reads
Aviary v0.5.3 error in finalize_stats rule. 27/29 steps done, so I guess this is the last job and the other results are fine to use?
Simplified command (recovery from long-read assembly using 20 short reads and 2 long reads):
aviary recover --assembly 719_E1_20-24.ccs.filter.fasta -1 MainAutochamber.201907_E_1_30to34.1.fq.gz ... -2 MainAutochamber.201907_E_1_30to34.2.fq.gz ... --longreads 719_E1_1-5.ccs.filter.fastq.gz 719_E1_20-24.ccs.filter.fastq.gz --longread-type ccs --output results/aviary/binning/long/20221013/719_E1_20-24.ccs.filter -n 64 -m 500
Error:
rule finalize_stats:
input: bins/checkm.out, bins/checkm2_output/quality_report.tsv, data/coverm_abundances.tsv, data/gtdbtk/done
output: bins/bin_info.tsv, bins/checkm_minimal.tsv
jobid: 1
reason: Missing output files: bins/bin_info.tsv; Input files updated by another job: data/coverm_abundances.tsv, bins/checkm2_output/quality_report.tsv, data/gtdbtk/done, bins/checkm.out
resources: mem_mb=1000, disk_mb=1000, tmpdir=/data1/tmp
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 64
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=1000, disk_mb=1000
Select jobs to execute...
[Fri Oct 14 07:38:30 2022]
Error in rule finalize_stats:
jobid: 0
output: bins/bin_info.tsv, bins/checkm_minimal.tsv
RuleException:
EmptyDataError in line 715 of /mnt/hpccs01/work/microbiome/sw/aviary_repos/aviary-v0.5.3/aviary/aviary/modules/binning/binning.smk:
No columns to parse from file
File "/mnt/hpccs01/work/microbiome/sw/aviary_repos/aviary-v0.5.3/aviary/aviary/modules/binning/binning.smk", line 715, in __rule_finalize_stats
File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.3/lib/python3.10/site-packages/pandas/util/_decorators.py", line 211, in wrapper
File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.3/lib/python3.10/site-packages/pandas/util/_decorators.py", line 317, in wrapper
File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.3/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 950, in read_csv
File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.3/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 605, in _read
File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.3/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1442, in __init__
File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.3/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1747, in _make_engine
File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.3/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 92, in __init__
File "pandas/_libs/parsers.pyx", line 554, in pandas._libs.parsers.TextReader.__cinit__
File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.3/lib/python3.10/concurrent/futures/thread.py", line 58, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Were the CheckM results empty?
No, they were as expected.
Oh, this looks like it is complaining about the coverm_abundances.tsv file; was that empty?
Yes, coverm_abundances.tsv is indeed empty.
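That matches the traceback: pandas' read_csv raises EmptyDataError ("No columns to parse from file") when handed a zero-byte file. A minimal standalone reproduction (a sketch assuming finalize_stats parses the TSV with read_csv, as the traceback indicates; this is not Aviary's actual code):

```python
import os
import tempfile

import pandas as pd
from pandas.errors import EmptyDataError

# Simulate the zero-byte coverm_abundances.tsv that finalize_stats reads.
path = os.path.join(tempfile.mkdtemp(), "coverm_abundances.tsv")
open(path, "w").close()

try:
    pd.read_csv(path, sep="\t")
except EmptyDataError as err:
    # pandas raises before parsing any rows: "No columns to parse from file"
    print(f"EmptyDataError: {err}")

# A cheap guard a rule could apply before parsing:
if os.path.getsize(path) == 0:
    print("coverm_abundances.tsv is empty; the upstream CoverM step likely failed")
```

So the EmptyDataError is a downstream symptom; the interesting question is why CoverM wrote an empty abundances file in the first place.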
Also, coverm.cov, coverm.filt.cov, long_abundances.tsv, long_cov.tsv and short_cov.tsv are not empty. But short_abundances.tsv is empty.
Does coverm.cov have the short read information?
And can you find any error information for the get_abundances rule in the snakemake log?
Yes, coverm.cov does have the short read information.
I can't see any error information for get_abundances.
[Fri Oct 14 06:31:13 2022]
rule get_abundances:
input: bins/checkm.out
output: data/coverm_abundances.tsv
jobid: 25
reason: Missing output files: data/coverm_abundances.tsv; Input files updated by another job: bins/checkm.out
threads: 8
resources: mem_mb=512000, disk_mb=1000, tmpdir=/data1/tmp
Activating conda environment: ../../../../../../../../../mnt/hpccs01/work/microbiome/conda/66a8b59755f121e40e3a82a9714b3ad5
[Fri Oct 14 06:50:20 2022]
Finished job 25.
25 of 29 steps (86%) done
Select jobs to execute...
Has it happened with any other samples? Nothing is jumping out at me that would cause it to fail here.
I've done 18 assemblies (6 each of long-only, long+short, short-only). All 10 that have finished recover so far have this error.
Okay, this isn't reproducible with the test data that Ben generated. Is this only occurring when you have both long and short reads?
Could you also provide the complete list of rules that aviary is attempting to complete?
I haven't tried with only long or only short yet but I can give that a go.
job count min threads max threads
--------------------- ------- ------------- -------------
checkm2 1 8 8
checkm_das_tool 1 8 8
checkm_metabat2 1 8 8
checkm_rosella 1 8 8
checkm_semibin 1 8 8
concoct 1 8 8
das_tool 1 8 8
finalize_stats 1 1 1
get_abundances 1 8 8
get_bam_indices 1 8 8
gtdbtk 1 8 8
maxbin2 1 8 8
metabat2 1 8 8
metabat_sens 1 8 8
metabat_spec 1 8 8
metabat_ssens 1 8 8
metabat_sspec 1 8 8
prepare_binning_files 1 8 8
recover_mags 1 8 8
refine_dastool 1 8 8
refine_metabat2 1 8 8
refine_rosella 1 8 8
refine_semibin 1 8 8
rosella 1 8 8
semibin 1 8 8
singlem_appraise 1 8 8
singlem_pipe_reads 1 1 1
vamb 1 8 8
vamb_jgi_filter 1 8 8
total 29 1 8
"I haven't tried with only long or only short yet but I can give that a go."
This doesn't make sense with my understanding of this:
"I've done 18 assemblies (6 each of long-only, long+short, short-only). All 10 that have finished recover so far have this error."
Wouldn't some of the ones that have finished have to have been long-only or short-only?
What you could try is deleting all the abundances files and seeing whether you can target finalize_stats so that only the abundance rules rerun. If it tries to run other rules, you can pass " --rerun-triggers mtime" via --snakemake-cmds to see if that prevents the rest of the pipeline rerunning in case the code has been updated.
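Concretely, that rerun might look something like this (a sketch only: the assembly and output paths are taken from the original command above, and the elided read arguments need to be filled back in):

```shell
# Remove the suspect outputs so Snakemake has to regenerate them.
rm -f data/coverm_abundances.tsv data/short_abundances.tsv data/long_abundances.tsv

# Re-run recover with the same arguments as before, passing
# "--rerun-triggers mtime" through to Snakemake so that code changes
# alone do not re-trigger the already-finished rules.
aviary recover \
    --assembly 719_E1_20-24.ccs.filter.fasta \
    ... \
    --output results/aviary/binning/long/20221013/719_E1_20-24.ccs.filter \
    --snakemake-cmds ' --rerun-triggers mtime '
```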
Oh, I meant that the assemblies were done with short-only, long-only, and short+long reads, but that recovery was run on the same samples each time (for comparison). So recovery was always done with short+long.
Ok thanks.
This happened again with only short reads. I noticed that the real error is:
ERROR coverm::bam_generator] Not continuing since when input file pairs have unequal numbers of reads this usually means incorrect / corrupt files were specified
It looks like the forward/reverse reads given to CoverM are mismatched (i.e. from different samples). I double-checked and they are specified correctly in the original command.
The order of short_reads_2 in the config doesn't match that of short_reads_1, and neither matches the order in the initial command.
This might be due to the set() conversion from commit 4eaefb4b35faec0d77cfa3979f44212227cb7d40.
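That would explain the mismatch: converting the read lists through set() discards their order. A hypothetical illustration of the failure mode (not Aviary's actual code), plus an order-preserving deduplication:

```python
# Paired read files: index i of one list must correspond to index i of the other.
short_reads_1 = ["sampleA.1.fq.gz", "sampleB.1.fq.gz", "sampleC.1.fq.gz"]
short_reads_2 = ["sampleA.2.fq.gz", "sampleB.2.fq.gz", "sampleC.2.fq.gz"]

# Deduplicating each list independently via set() yields arbitrary orderings,
# so the forward and reverse files at the same index can come from different
# samples -- exactly what CoverM's "unequal numbers of reads" error suggests.
dedup_1 = list(set(short_reads_1))
dedup_2 = list(set(short_reads_2))

# dict.fromkeys() deduplicates while preserving insertion order (Python 3.7+),
# which keeps the pairing between the two lists intact.
safe_1 = list(dict.fromkeys(short_reads_1))
safe_2 = list(dict.fromkeys(short_reads_2))
assert safe_1 == short_reads_1 and safe_2 == short_reads_2
```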