Running BUSCO with your own database
Description of the bug
Hi, I tried to use BUSCO for bin checking. Since our cluster does not allow downloads from inside a Slurm job, I downloaded the database and installed it manually. I did not use BUSCO itself to download it, as suggested in issue #545. However, my BUSCO job fails in two ways: it does not seem to apply the --auto-lineage-prok option, and it does not use the database I downloaded.
My params file looks like this:
{
"input": "samplesheet_small_TP_reads.csv",
"outdir": "\/cluster\/projects\/nn10070k\/projects\/phagedrive\/pd_data_control\/results\/20240916_MAG_results",
"multiqc_title": "TP_cleaned_reads",
"reads_minlength": 50,
"igenomes_base" : "s3://ngi-igenomes/igenomes",
"gtdb_db": "\/cluster\/projects\/nn10070k\/databases\/gtdbtk_r220_data.tar.gz",
"host_genome":"GRCh38",
"kraken2_db": "\/cluster\/projects\/nn10070k\/databases\/kraken2_pluspfp_05.06.2024\/hash.k2d",
"cat_db": "\/cluster\/projects\/nn10070k\/databases\/20240422_CAT_nr",
"binqc_tool": "busco",
"busco_db": "\/cluster\/shared\/biobases\/BUSCO\/2024-10-04",
"busco_auto_lineage_prok": true,
"busco_clean": true,
"checkm_db": "\/cluster\/projects\/nn10070k\/databases\/checkm_db_2015.01.16",
"refine_bins_dastool": true,
"postbinning_input": "refined_bins_only",
"run_virus_identification": false
}
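For reference, this is the directory layout I believe `busco --offline --download_path <dir>` expects, based on my reading of the BUSCO manual (the `lineages/` subfolder is what `busco --download` normally creates; this is a sketch, not verified against my cluster install):

```shell
#!/bin/sh
# Sketch of the offline download_path layout BUSCO appears to expect
# (assumption from the manual; bacteria_odb10 is just an example lineage):
demo=$(mktemp -d)
mkdir -p "$demo/lineages/bacteria_odb10"
ls "$demo"   # prints: lineages
```

My `busco_db` directory (`/cluster/shared/biobases/BUSCO/2024-10-04`) is laid out this way, as far as I can tell.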
When I check the log file from BUSCO I get this:
2024-10-17 09:23:16 INFO: ***** Start a BUSCO v5.4.3 analysis, current time: 10/17/2024 09:23:16 *****
2024-10-17 09:23:16 INFO: Configuring BUSCO with local environment
2024-10-17 09:23:16 INFO: Mode is genome
2024-10-17 09:23:16 INFO: Input file is /cluster/work/users/thhaverk/nf_mag/00/62b73b0fdc4a15c0b3e9bfa7b6270c/SPAdes-DASToolUnbinned-DNA_H1H_10_A1.fa
2024-10-17 09:23:16 INFO: No lineage specified. Running lineage auto selector.
2024-10-17 09:23:16 INFO: ***** Starting Auto Select Lineage *****
This process runs BUSCO on the generic lineage datasets for the domains archaea, bacteria and eukaryota. Once the optimal domain is selected, BUSCO automatically attempts to find the most appropriate BUSCO dataset to use based on phylogenetic placement.
--auto-lineage-euk and --auto-lineage-prok are also available if you know your input assembly is, or is not, an eukaryote. See the user guide for more information.
A reminder: Busco evaluations are valid when an appropriate dataset is used, i.e., the dataset belongs to the lineage of the species to test. Because of overlapping markers/spurious matches among domains, busco matches in another domain do not necessarily mean that your genome/proteome contains sequences from this domain. However, a high busco score in multiple domains might help you identify possible contaminations.
I saw in the BUSCO manual that there is a flag called --offline that you can use when you provide your own database. I can see in the .command.sh file that the flag is present, but it does not seem to take effect when I run BUSCO with my own database.
The .command.sh file looks like this:
#!/bin/bash -euo pipefail
run_busco.sh \
"--auto-lineage-prok --offline --download_path 2024-10-04" \
"Y" \
"2024-10-04" \
"SPAdes-DASToolUnbinned-DNA_H1H_10_A1.fa" \
8 \
"N" \
"Y" \
"--offline"
most_spec_db=$(<info_most_spec_db.txt)
cat <<-END_VERSIONS > versions.yml
"NFCORE_MAG:MAG:BUSCO_QC:BUSCO":
python: $(python --version 2>&1 | sed 's/Python //g')
R: $(R --version 2>&1 | sed -n 1p | sed 's/R version //' | sed 's/ (.*//')
busco: $(busco --version 2>&1 | sed 's/BUSCO //g')
END_VERSIONS
# capture process environment
set +u
set +e
cd "$NXF_TASK_WORKDIR"
nxf_eval_cmd() {
{
IFS=$'\n' read -r -d '' "${1}";
IFS=$'\n' read -r -d '' "${2}";
(IFS=$'\n' read -r -d '' _ERRNO_; return ${_ERRNO_});
} < <((printf '\0%s\0%d\0' "$(((({ shift 2; "${@}"; echo "${?}" 1>&3-; } | tr -d '\0' 1>&4-) 4>&2- 2>&1- | tr -d '\0' 1>&4-) 3>&1- | exit "$(cat)") 4>&1-)" "${?}" 1>&2) 2>&1)
}
echo '' > .command.env
#
echo most_spec_db="${most_spec_db[@]}" >> .command.env
echo /most_spec_db/ >> .command.env
I do not understand what the error is here; it looks like only the last component of the database path is being passed as the download path.
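To illustrate what I mean: the `--download_path 2024-10-04` value in `.command.sh` matches the basename of my `busco_db` path, as if the directory part had been stripped somewhere. A hypothetical reproduction (the path is the one from my params file):

```shell
#!/bin/sh
# The busco_db path from my params file:
busco_db="/cluster/shared/biobases/BUSCO/2024-10-04"

# If the pipeline keeps only the last path component, the result
# matches the "--download_path 2024-10-04" seen in .command.sh:
basename "$busco_db"   # prints: 2024-10-04
```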
Command used and terminal output
My nextflow command was:
nextflow run nf-core/mag -r 3.0.3 -profile apptainer -work-dir $USERWORK/nf_mag -resume -c saga_mag.simple.config -params-file params_test_2.json
Relevant files
No response
System information
Nextflow version: 24.04.3
Hardware: HPC
Executor: Slurm
Container engine: Apptainer
OS: CentOS Linux