EmptyDataError in Aviary recover with long and short reads

Open AroneyS opened this issue 1 year ago • 12 comments

Aviary v0.5.3 fails in the finalize_stats rule. 27 of 29 steps were done, so I assume this is the last job and the other results are fine to use?

Simplified command (recovery from a long-read assembly using 20 short-read samples and 2 long-read samples):

aviary recover --assembly 719_E1_20-24.ccs.filter.fasta -1 MainAutochamber.201907_E_1_30to34.1.fq.gz ... -2 MainAutochamber.201907_E_1_30to34.2.fq.gz ... --longreads 719_E1_1-5.ccs.filter.fastq.gz 719_E1_20-24.ccs.filter.fastq.gz --longread-type ccs --output results/aviary/binning/long/20221013/719_E1_20-24.ccs.filter -n 64 -m 500

Error:

rule finalize_stats:
    input: bins/checkm.out, bins/checkm2_output/quality_report.tsv, data/coverm_abundances.tsv, data/gtdbtk/done
    output: bins/bin_info.tsv, bins/checkm_minimal.tsv
    jobid: 1
    reason: Missing output files: bins/bin_info.tsv; Input files updated by another job: data/coverm_abundances.tsv, bins/checkm2_output/quality_report.tsv, data/gtdbtk/done, bins/checkm.out
    resources: mem_mb=1000, disk_mb=1000, tmpdir=/data1/tmp

Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 64
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=1000, disk_mb=1000
Select jobs to execute...
[Fri Oct 14 07:38:30 2022]
Error in rule finalize_stats:
    jobid: 0
    output: bins/bin_info.tsv, bins/checkm_minimal.tsv

RuleException:
EmptyDataErrorin line 715 of /mnt/hpccs01/work/microbiome/sw/aviary_repos/aviary-v0.5.3/aviary/aviary/modules/binning/binning.smk:
No columns to parse from file
  File "/mnt/hpccs01/work/microbiome/sw/aviary_repos/aviary-v0.5.3/aviary/aviary/modules/binning/binning.smk", line 715, in __rule_finalize_stats
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.3/lib/python3.10/site-packages/pandas/util/_decorators.py", line 211, in wrapper
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.3/lib/python3.10/site-packages/pandas/util/_decorators.py", line 317, in wrapper
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.3/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 950, in read_csv
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.3/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 605, in _read
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.3/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1442, in __init__
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.3/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1747, in _make_engine
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.3/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 92, in __init__
  File "pandas/_libs/parsers.pyx", line 554, in pandas._libs.parsers.TextReader.__cinit__
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.3/lib/python3.10/concurrent/futures/thread.py", line 58, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

AroneyS, Oct 13 '22 22:10

Were the CheckM results empty?

rhysnewell, Oct 18 '22 03:10

No, they were as expected.

AroneyS, Oct 18 '22 03:10

Oh, this looks like it is complaining about the coverm_abundances.tsv file. Was that empty?

rhysnewell, Oct 19 '22 21:10

Yes, coverm_abundances.tsv is indeed empty. short_abundances.tsv is also empty, but coverm.cov, coverm.filt.cov, long_abundances.tsv, long_cov.tsv and short_cov.tsv are not.
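For anyone hitting the same traceback: pandas raises EmptyDataError ("No columns to parse from file") when read_csv is given a zero-byte file, so listing the empty tables is a quick first check. A minimal sketch, assuming the tables sit in one directory and end in .tsv; the function name find_empty_files is hypothetical and not part of Aviary:

```python
from pathlib import Path


def find_empty_files(directory, pattern="*.tsv"):
    """Return the sorted names of zero-byte files in `directory` matching `pattern`."""
    return sorted(
        p.name
        for p in Path(directory).glob(pattern)
        if p.is_file() and p.stat().st_size == 0
    )


if __name__ == "__main__":
    # "data" is the directory that appears in the rule outputs in this thread;
    # run this from the Aviary output directory, or adjust the path.
    for name in find_empty_files("data"):
        print(f"empty: {name}")
```

A zero-byte file means the upstream rule wrote nothing, so the rule's own log is the next place to look.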

AroneyS, Oct 19 '22 23:10

Does coverm.cov have the short read information? And can you find any error information for the get_abundances rule in the snakemake log?

rhysnewell, Oct 19 '22 23:10

Yes, coverm.cov does have short read information. I can't see any error information for get_abundances.

[Fri Oct 14 06:31:13 2022]
rule get_abundances:
    input: bins/checkm.out
    output: data/coverm_abundances.tsv
    jobid: 25
    reason: Missing output files: data/coverm_abundances.tsv; Input files updated by another job: bins/checkm.out
    threads: 8
    resources: mem_mb=512000, disk_mb=1000, tmpdir=/data1/tmp

Activating conda environment: ../../../../../../../../../mnt/hpccs01/work/microbiome/conda/66a8b59755f121e40e3a82a9714b3ad5
[Fri Oct 14 06:50:20 2022]
Finished job 25.
25 of 29 steps (86%) done
Select jobs to execute...

AroneyS, Oct 19 '22 23:10

Has it happened with any other samples? Nothing is jumping out at me that would cause it to fail here.

rhysnewell, Oct 19 '22 23:10

I've done 18 assemblies (6 each of long-only, long+short, short-only). All 10 that have finished recover so far have this error.

AroneyS, Oct 19 '22 23:10

Okay, this isn't reproducible with the test data that Ben generated. Is this only occurring when you have both long and short reads?

Could you also provide the complete list of rules that aviary is attempting to complete?

rhysnewell, Oct 20 '22 01:10

I haven't tried with only long or only short yet but I can give that a go.

job                      count    min threads    max threads
---------------------  -------  -------------  -------------
checkm2                      1              8              8
checkm_das_tool              1              8              8
checkm_metabat2              1              8              8
checkm_rosella               1              8              8
checkm_semibin               1              8              8
concoct                      1              8              8
das_tool                     1              8              8
finalize_stats               1              1              1
get_abundances               1              8              8
get_bam_indices              1              8              8
gtdbtk                       1              8              8
maxbin2                      1              8              8
metabat2                     1              8              8
metabat_sens                 1              8              8
metabat_spec                 1              8              8
metabat_ssens                1              8              8
metabat_sspec                1              8              8
prepare_binning_files        1              8              8
recover_mags                 1              8              8
refine_dastool               1              8              8
refine_metabat2              1              8              8
refine_rosella               1              8              8
refine_semibin               1              8              8
rosella                      1              8              8
semibin                      1              8              8
singlem_appraise             1              8              8
singlem_pipe_reads           1              1              1
vamb                         1              8              8
vamb_jgi_filter              1              8              8
total                       29              1              8

AroneyS, Oct 20 '22 01:10

"I haven't tried with only long or only short yet but I can give that a go."

This doesn't fit with my understanding of your earlier comment:

"I've done 18 assemblies (6 each of long-only, long+short, short-only). All 10 that have finished recover so far have this error."

Wouldn't some of the ones that have finished have to have been long-only or short-only?

What you could try is deleting all the abundance files and then targeting finalize_stats to see whether it reruns only the abundance rules. If it tries to run other rules, you can pass "--rerun-triggers mtime" via --snakemake-cmds to see if that prevents the rest of the pipeline from rerunning in case the code has been updated.
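The deletion step can be sketched as below. The three file names come from earlier in this thread; only coverm_abundances.tsv is confirmed to live under data/, so the location of the other two is an assumption, and remove_abundance_files is a hypothetical helper. It defaults to a dry run so nothing is removed by accident:

```python
from pathlib import Path

# Names mentioned in this thread. That all three sit under "<output>/data"
# is an assumption; only coverm_abundances.tsv is shown there in the logs.
ABUNDANCE_FILES = (
    "coverm_abundances.tsv",
    "short_abundances.tsv",
    "long_abundances.tsv",
)


def remove_abundance_files(output_dir, dry_run=True):
    """Return the abundance tables found under <output_dir>/data.

    With dry_run=False the files are also deleted.
    """
    affected = []
    for name in ABUNDANCE_FILES:
        path = Path(output_dir) / "data" / name
        if path.exists():
            affected.append(path)
            if not dry_run:
                path.unlink()
    return affected
```

Call it once with the default dry_run=True to see what would go, then again with dry_run=False before rerunning Aviary as described above.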

rhysnewell, Oct 20 '22 01:10

Oh, I mean that the assemblies were done with short, long, and short+long reads, but the recovery was done with the same samples each time (for comparison). So recovery was always done with short+long.

Ok thanks.

AroneyS, Oct 20 '22 02:10

This happened again with only short reads. I noticed that the real error comes from CoverM: "ERROR coverm::bam_generator] Not continuing since when input file pairs have unequal numbers of reads this usually means incorrect / corrupt files were specified". It looks like the forward/reverse reads given to CoverM are mismatched (from different samples). I double-checked, and they are specified correctly in the original command.
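The CoverM complaint can be checked independently by counting the reads on each side of every pair. A minimal sketch, assuming standard four-line FASTQ records; count_fastq_reads and mismatched_pairs are hypothetical helpers, not CoverM or Aviary functions:

```python
import gzip


def count_fastq_reads(path):
    """Count records in a FASTQ file (gzipped or plain), assuming 4 lines per record."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for _ in handle) // 4


def mismatched_pairs(forward_files, reverse_files):
    """Return (forward, reverse, n_forward, n_reverse) for pairs whose read counts differ."""
    if len(forward_files) != len(reverse_files):
        raise ValueError("forward and reverse lists differ in length")
    problems = []
    for fwd, rev in zip(forward_files, reverse_files):
        n_fwd, n_rev = count_fastq_reads(fwd), count_fastq_reads(rev)
        if n_fwd != n_rev:
            problems.append((fwd, rev, n_fwd, n_rev))
    return problems
```

This reads every file in full, so it is slow on large inputs. Equal counts do not prove the pairing is right, but unequal counts do prove that it is wrong.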

AroneyS, Nov 10 '22 03:11

The order of short_reads_2 in the config doesn't match that of short_reads_1, and neither matches the order in the initial command.
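A quick way to confirm this kind of mismatch is to check, position by position, that each reverse file is the mate implied by the forward file's name. A minimal sketch, assuming the .1.fq.gz / .2.fq.gz naming seen in the command above and that the two config lists have already been loaded into Python lists; unpaired_positions is a hypothetical helper:

```python
def unpaired_positions(short_reads_1, short_reads_2,
                       fwd_suffix=".1.fq.gz", rev_suffix=".2.fq.gz"):
    """Return the indices where the reverse file is not the mate of the forward file."""
    if len(short_reads_1) != len(short_reads_2):
        raise ValueError("short_reads_1 and short_reads_2 differ in length")
    bad = []
    for i, (fwd, rev) in enumerate(zip(short_reads_1, short_reads_2)):
        if not fwd.endswith(fwd_suffix):
            bad.append(i)  # mate name cannot be derived from this file name
            continue
        expected = fwd[: -len(fwd_suffix)] + rev_suffix
        if rev != expected:
            bad.append(i)
    return bad


# Hypothetical sample names, for illustration only.
reads_1 = ["sampleA.1.fq.gz", "sampleB.1.fq.gz"]
reads_2 = ["sampleB.2.fq.gz", "sampleA.2.fq.gz"]
print(unpaired_positions(reads_1, reads_2))  # prints [0, 1]
```

An empty result means the two lists line up. It does not say whether they also match the order given on the command line.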

AroneyS, Nov 10 '22 03:11

This might be due to the set() conversion introduced in commit 4eaefb4b35faec0d77cfa3979f44212227cb7d40.
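To illustrate why a set() conversion would cause this: Python sets have no defined order, and string hashing is randomised per process by default. Two lists passed through set() separately can therefore come back in different orders and stop lining up. The sketch below is not the code from that commit, which is not shown here. It only demonstrates the general pitfall and one order-preserving alternative, with hypothetical file names:

```python
def dedupe_keep_order(items):
    """Remove duplicates while keeping first-seen order.

    Relies on dicts preserving insertion order (guaranteed since Python 3.7).
    """
    return list(dict.fromkeys(items))


# Hypothetical file names, for illustration only.
forward = ["s3.1.fq.gz", "s1.1.fq.gz", "s2.1.fq.gz", "s1.1.fq.gz"]
print(dedupe_keep_order(forward))
# prints ['s3.1.fq.gz', 's1.1.fq.gz', 's2.1.fq.gz']

# list(set(forward)) holds the same three names, but in an arbitrary order
# that can change from run to run.
```

If deduplication is needed for paired reads, deduplicating the (forward, reverse) tuples together would keep the mates aligned as well.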

AroneyS, Nov 10 '22 03:11