Error trying to pull Azure Open Data sets

Open vsmalladi opened this issue 2 years ago • 14 comments

Bug report

Expected behavior and actual behavior

I expected to be able to provide the https path to an Azure Open Datasets file and have Nextflow download it from that path.

However, Nextflow tries to resolve the file using the SAS token and the Azure Blob Storage account provided in the config.

Steps to reproduce the problem

Use the nf-core/sarek repo and use the following genomes.config

'custom' {
  fasta                   = "https://datasetpublicbroadref.blob.core.windows.net/dataset/hg38/v0/Homo_sapiens_assembly38.fasta?sv=2020-04-08&si=prod&sr=c&sig=DQxmjB4D1lAfOW9AxIWbXwZx6ksbwjlNkixw597JnvQ%3D"
  snpeff_db               = 'GRCh38.86'
  species                 = 'homo_sapiens'
  vep_cache_version       = '99'
}

nextflow run sarek/main.nf --igenomes_ignore --genomes_base 'az://' --tools HaplotypeCaller --genome 'custom'

Program output

Error executing process > 'BuildFastaFai (Homo_sapiens_assembly38.fasta?sv=2020-04-08&si=prod&sr=c&sig=DQxmjB4D1lAfOW9AxIWbXwZx6ksbwjlNkixw597JnvQ%3D)'

Caused by: Process BuildFastaFai (Homo_sapiens_assembly38.fasta?sv=2020-04-08&si=prod&sr=c&sig=DQxmjB4D1lAfOW9AxIWbXwZx6ksbwjlNkixw597JnvQ%3D) terminated with an error exit status (1)

Command executed:

samtools faidx Homo_sapiens_assembly38.fasta?sv=2020-04-08&si=prod&sr=c&sig=DQxmjB4D1lAfOW9AxIWbXwZx6ksbwjlNkixw597JnvQ%3D

Command exit status: 1

Command output: (empty)

Command wrapper: Unable to download path: https://havocdata.blob.core.windows.net/work/stage/b4/7771dc4451be7737063833b9d7674c/Homo_sapiens_assembly38.fasta?sv=2020-04-08&si=prod&sr=c&sig=DQxmjB4D1lAfOW9AxIWbXwZx6ksbwjlNkixw597JnvQ=

Work dir: az://work/f8/2092c2c8afb6c25d0b4391ab5680f3

Tip: when you have fixed the problem you can continue the execution adding the option -resume to the run command line
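
Comparing the URL from genomes.config with the URL in the "Unable to download path" message makes the mismatch visible. The following is a minimal Python sketch that only inspects the two URLs quoted in this report; it makes no claim about how Nextflow builds the second one. The helper names account and sas_params are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit

# URL given in genomes.config
requested = (
    "https://datasetpublicbroadref.blob.core.windows.net/dataset/hg38/v0/"
    "Homo_sapiens_assembly38.fasta"
    "?sv=2020-04-08&si=prod&sr=c&sig=DQxmjB4D1lAfOW9AxIWbXwZx6ksbwjlNkixw597JnvQ%3D"
)
# URL reported by the command wrapper
attempted = (
    "https://havocdata.blob.core.windows.net/work/stage/b4/"
    "7771dc4451be7737063833b9d7674c/Homo_sapiens_assembly38.fasta"
    "?sv=2020-04-08&si=prod&sr=c&sig=DQxmjB4D1lAfOW9AxIWbXwZx6ksbwjlNkixw597JnvQ="
)


def account(url: str) -> str:
    """Storage account name, i.e. the first label of the blob host name."""
    return urlsplit(url).hostname.split(".")[0]


def sas_params(url: str) -> dict:
    """Query parameters (the SAS token fields), percent-decoded."""
    return parse_qs(urlsplit(url).query)


print(account(requested), "->", account(attempted))
print("same SAS parameters:", sas_params(requested) == sas_params(attempted))
```

The two URLs name different storage accounts but carry the same SAS parameters, which matches the behavior described above: the public dataset's token ends up attached to a path in the configured account.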

Environment

  • Nextflow version: 21.10.5
  • Java version: 11.0.9.1
  • Operating system: macOS
  • Bash version: zsh 5.8

vsmalladi avatar Jan 26 '22 21:01 vsmalladi

Nextflow uses azcopy to pull data into the container, but I have no idea why it is not working for open datasets.

https://github.com/nextflow-io/nextflow/blob/c3677c31126ecdcb095a51f1bfe278be1a842011/plugins/nf-azure/src/main/nextflow/cloud/azure/file/AzBashLib.groovy#L51-L65

pditommaso avatar Jan 27 '22 13:01 pditommaso

Yeah, I will test with the newest edge release once it's out.

vsmalladi avatar Jan 27 '22 17:01 vsmalladi

There are no changes related to this problem in that release. I wonder whether there's some specific azcopy option to access public data.

pditommaso avatar Jan 27 '22 17:01 pditommaso

Yeah, I can look at the code further.

vsmalladi avatar Jan 27 '22 17:01 vsmalladi

Guys, I'm looking into this one, and here are some observations:

  1. Using an authenticated azcopy (after azcopy login), I was able to download the file (with the SAS token).
  2. Interestingly, the name of the downloaded file was simply Homo_sapiens_assembly38.fasta, as opposed to the Homo_sapiens_assembly38.fasta?sv=2020-04-08&si=prod&sr=c&sig=DQxmjB4D1lAfOW9AxIWbXwZx6ksbwjlNkixw597JnvQ%3D shown in the initial comment - not sure why this might be happening 🤔

Will keep you posted here in case I find the root cause.
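
The file-name difference in observation 2 comes down to whether the query string is treated as part of the file name. A minimal Python sketch of deriving the local name from the URL path only follows; it is not Nextflow's or azcopy's actual code, and the helper name local_name is hypothetical.

```python
from posixpath import basename
from urllib.parse import unquote, urlsplit


def local_name(url: str) -> str:
    """Return the file name from the URL path, ignoring the query string."""
    path = urlsplit(url).path        # drops everything from '?' onwards
    return unquote(basename(path))   # decode any %-escapes in the name


url = (
    "https://datasetpublicbroadref.blob.core.windows.net/dataset/hg38/v0/"
    "Homo_sapiens_assembly38.fasta"
    "?sv=2020-04-08&si=prod&sr=c&sig=DQxmjB4D1lAfOW9AxIWbXwZx6ksbwjlNkixw597JnvQ%3D"
)
print(local_name(url))  # Homo_sapiens_assembly38.fasta
```

Taking the last segment of the raw URL string instead would keep the SAS token in the name, which is what the BuildFastaFai task name in the initial comment shows.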

abhi18av avatar Feb 13 '22 17:02 abhi18av

That could mean we need to add an azcopy login to the script initialization

https://github.com/nextflow-io/nextflow/blob/202b5c9c93a8972231c938d9a646d09e8a790424/plugins/nf-azure/src/main/nextflow/cloud/azure/file/AzBashLib.groovy#L33-L33

pditommaso avatar Feb 13 '22 18:02 pditommaso

It shouldn't need to log in, since the SAS token is passed as part of the URL. We should be able to do a wget, like with any public URL, right?
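
The wget-style download mentioned here can be sketched with the Python standard library alone. This is an illustration under the assumption that the SAS token in the URL is still valid; the function name fetch is hypothetical and nothing in it is Azure-specific.

```python
import shutil
from pathlib import Path
from urllib.request import urlopen


def fetch(url: str, dest: Path) -> Path:
    """Download url to dest with a plain GET; no Azure credentials are used."""
    with urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)  # stream to disk in chunks
    return dest
```

Calling fetch with the full URL from genomes.config (query string included) and Path("Homo_sapiens_assembly38.fasta") as the destination would be the equivalent of the wget call, with the local file name chosen explicitly rather than taken from the URL.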

vsmalladi avatar Feb 13 '22 19:02 vsmalladi

Yup, azcopy (without azcopy login) doesn't need any authentication and downloads the file as expected:

(base)~/projects/_scratch$ azcopy copy 'https://datasetpublicbroadref.blob.core.windows.net/dataset/hg38/v0/Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz?sv=2020-04-08&si=prod&sr=c&sig=DQxmjB4D1lAfOW9AxIWbXwZx6ksbwjlNkixw597JnvQ%3D' ./
INFO: Scanning...
INFO: Any empty folders will not be processed, because source and/or destination doesn't have full folder support

Job ae1a4f3f-6ebf-0142-5af8-950765e5c2eb has started
Log file is located at: /home/abhinav/.azcopy/ae1a4f3f-6ebf-0142-5af8-950765e5c2eb.log

0.0 %, 0 Done, 0 Failed, 1 Pending, 0 Skipped, 1 Total, 2-sec Throughput (Mb/s): 12.2113


Job ae1a4f3f-6ebf-0142-5af8-950765e5c2eb summary
Elapsed Time (Minutes): 0.0667
Number of File Transfers: 1
Number of Folder Property Transfers: 0
Total Number of Transfers: 1
Number of Transfers Completed: 1
Number of Transfers Failed: 0
Number of Transfers Skipped: 0
TotalBytesTransferred: 3053999
Final Job Status: Completed

(base) ~/projects/_scratch$ ls
Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz  

@vsmalladi, would it be possible for you to share the .command.run and .command.sh from the relevant work directory az://work/f8/2092c2c8afb6c25d0b4391ab5680f3?

abhi18av avatar Feb 14 '22 08:02 abhi18av

@abhi18av Will need to rerun as I deleted that work directory.

I wonder if this is part of a bigger discussion of how to download data from multiple blob storage accounts with multiple SAS tokens.
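
One way to picture the multi-account case is a lookup from storage account name to SAS token, applied only to URLs that do not already carry a token. The Python sketch below is purely hypothetical: the SAS_TOKENS mapping, its placeholder values, and the with_sas helper are assumptions for illustration and do not correspond to any existing Nextflow or nf-azure option.

```python
from urllib.parse import urlsplit

# Hypothetical per-account tokens; real values would come from configuration.
SAS_TOKENS = {
    "datasetpublicbroadref": "<sas-token-for-public-dataset>",
    "havocdata": "<sas-token-for-work-account>",
}


def with_sas(url: str, tokens: dict = SAS_TOKENS) -> str:
    """Append the SAS token that belongs to the URL's own storage account."""
    parts = urlsplit(url)
    if parts.query:
        # The URL already carries a token; never replace or extend it.
        return url
    account = (parts.hostname or "").split(".")[0]
    token = tokens.get(account)
    return f"{url}?{token}" if token else url


print(with_sas("https://havocdata.blob.core.windows.net/work/sample.txt"))
```

The design choice in this sketch is that a token is selected by the host of the file being fetched, so a token for one account is never attached to a URL in another.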

vsmalladi avatar Feb 14 '22 15:02 vsmalladi

@abhi18av the newest version is trying to stage the file but can't. I uploaded the log (nextflow.log). There is no .command.run or .command.sh in the stage/work directory.

vsmalladi avatar Feb 14 '22 16:02 vsmalladi

Thanks @vsmalladi for sharing these, but the errors here seem to be different from the ones mentioned in the first comment https://github.com/nextflow-io/nextflow/issues/2595#issue-1115515572

Pasting some crucial data points here:

  • The command invoked

Feb-03 11:04:06.085 [main] DEBUG nextflow.cli.Launcher - $> nextflow run nf-core/sarek -c /ARQUIVOS/data/azure --igenomes_ignore --genomes_base 'az://genomas-raros/sarek' --genome custom --input 'az://genomas-raros/sarek_azure.tsv' --outdir 'az://genomas-raros/results' --tools HaplotypeCaller -w 'az://genomas-raros/work' -profile docker

  • Nextflow version - 21.10.6

Feb-03 11:04:06.149 [main] INFO nextflow.cli.CmdRun - N E X T F L O W ~ version 21.10.6

  • nf-azure version - nf-azure 0.11.2

Feb-03 11:04:09.702 [main] INFO org.pf4j.AbstractPluginManager - Start plugin 'nf-azure@0.11.2'

  • Sarek master branch

Feb-03 11:04:07.413 [main] DEBUG nextflow.scm.AssetManager - Git config: /root/.nextflow/assets/nf-core/sarek/.git/config; branch: master; remote: origin; url: https://github.com/nf-core/sarek.git

  • NoSuchFileException for .command.err and .command.out

Feb-03 15:50:20.371 [main] DEBUG nextflow.processor.TaskRun - Unable to dump error of process 'GenotypeGVCFs (1543_18-chr1_228608365-248946422)' -- Cause: java.nio.file.NoSuchFileException: az://genomas-raros/work/98/efb791f051a3ce620e3ab59960621f/.command.err
Feb-03 15:50:20.392 [main] DEBUG nextflow.processor.TaskRun - Unable to dump output of process 'GenotypeGVCFs (1543_18-chr1_228608365-248946422)' -- Cause: java.nio.file.NoSuchFileException: az://genomas-raros/work/98/efb791f051a3ce620e3ab59960621f/.command.out

  • Error on workflow.onComplete, probably the mail trigger mechanism https://github.com/nf-core/sarek/blob/68b9930a74962f3c42eee71f51e6dd2646269199/main.nf#L3879

Feb-03 15:50:20.522 [main] ERROR nextflow.script.WorkflowMetadata - Failed to invoke `workflow.onComplete` event handler
java.lang.NullPointerException: Cannot invoke method size() on null object
	at org.codehaus.groovy.runtime.NullObject.invokeMethod(NullObject.java:91)
	at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:44)
	at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:47)
	at org.codehaus.groovy.runtime.callsite.NullCallSite.call(NullCallSite.java:34)
	at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:47)
	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:125)
	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:130)
	at Script_c3d27a41$_runScript_closure159.doCall(Script_c3d27a41:3955)
	at Script_c3d27a41$_runScript_closure159.doCall(Script_c3d27a41)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	
...
...

abhi18av avatar Feb 15 '22 08:02 abhi18av

@abhi18av sorry, I uploaded the wrong log; I was debugging another person's log. Uploading no nextflow.log w

vsmalladi avatar Feb 15 '22 16:02 vsmalladi

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jul 31 '22 17:07 stale[bot]

I recently saw some change here https://github.com/nextflow-io/nextflow/issues/2918 dealing with the query params for files sourced via HTTP(s) location, which might address this functionality, unless I'm mistaken.

Worth testing again as soon as the latest edge is out.

abhi18av avatar Aug 01 '22 07:08 abhi18av