Azure URIs and GenomicsDB

Open nalinigans opened this issue 1 year ago • 1 comments

@lbergelson, just want to discuss some issues here-

  1. We currently have to use --avoid-nio with --sample-name-map and --bypass-feature-reader to get GenomicsDBImport to work with Azure URIs. Why don't we just merge the --avoid-nio functionality into --bypass-feature-reader, that is, allow GenomicsDB to process the URIs by default?
  2. I noticed that the only way to use Azure URIs for VCF names is through --sample-name-map. Specifying VCFs directly with the -V option is not possible because --avoid-nio cannot be used in conjunction with it. Should this be supported?
  3. @lbergelson, w.r.t malformed Azure URIs, GenomicsDB does put out an error -
11:10:12.658 error NativeGenomicsDB - pid=30608 tid=2980282 htslib_plugin could not open file az://genomicsdb@oda/vcfs/t0.vcf.gz [TileDB::StorageManagerConfig] Error: Azure Storage Blob initialization failed for home=az://genomicsdb@container/vcfs/sample.vcf.gz; ; Azure Blob URI does not seem to have either an account or a container: Protocol error
[E::hts_open_format] Failed to open file "az://genomicsdb@container/vcfs/sample.vcf.gz" : Input/output error

Is this not sufficient? These are the Azure URI forms currently accepted:

az://<container_name>@<account_name>.blob/<folder>/<file> # for default endpoints
az://<container_name>@<account_name>.blob.core.windows.net/<folder>/<file> # if the endpoint is blob.core.windows.net
azb://<container_name>/<folder>/<file> # following java.nio for azure URIs
azb://<container_name>/<folder>/<file>?account=<account_name>&endpoint=<endpoint>
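For anyone who wants to catch a malformed URI before it reaches the native layer, here is a rough sketch of a pre-check against the four shapes listed above. It is not GenomicsDB's actual validation: the function name `check_azure_uri` is made up for illustration, and the rules it applies (the `az://` host must start with `<account_name>.blob`, and an `azb://` query must carry both `account` and `endpoint`) are assumptions read off the list, not taken from the GenomicsDB source.

```python
from urllib.parse import parse_qs, urlsplit


def check_azure_uri(uri):
    """Return None if uri matches one of the listed shapes, else a reason string.

    Hypothetical helper: the rules are inferred from the four URI forms in
    this thread, not from GenomicsDB's own parser.
    """
    parts = urlsplit(uri)
    if parts.scheme == "az":
        # az://<container_name>@<account_name>.blob[.<rest of endpoint>]/...
        container, host = parts.username, parts.hostname
        if not container or not host:
            return "az:// URI needs <container_name>@<account_name>.blob"
        labels = host.split(".")
        if len(labels) < 2 or not labels[0] or labels[1] != "blob":
            return "az:// host should look like <account_name>.blob[...]"
    elif parts.scheme == "azb":
        # azb://<container_name>/...[?account=<account_name>&endpoint=<endpoint>]
        if not parts.netloc:
            return "azb:// URI needs a container name"
        query = parse_qs(parts.query)
        if parts.query and not {"account", "endpoint"} <= query.keys():
            return "azb:// query should give both account and endpoint"
    else:
        return "not an az:// or azb:// URI"
    if not parts.path.strip("/"):
        return "URI has no <folder>/<file> path"
    return None


if __name__ == "__main__":
    # The second URI is the malformed one from the error log above.
    for uri in (
        "az://vcfs@myaccount.blob/dir/sample.vcf.gz",
        "az://genomicsdb@container/vcfs/sample.vcf.gz",
    ):
        print(uri, "->", check_azure_uri(uri) or "looks fine")
```

The URI from the error log is rejected here because its host part (`container`) has no `.blob` component, which seems to be what the native "does not seem to have either an account or a container" message is complaining about.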

nalinigans avatar Dec 19 '23 19:12 nalinigans

  1. @nalinigans It's a very reasonable question. It's true, the --avoid-nio flag is technically redundant: you can recreate it with a combination of other flags. I added it because a) I didn't realize that was the case when I started adding it, and b) the combination of flags was kind of complicated, so it was helpful to have something that gave you clear instructions about what you needed to enable.

I think we could merge them, although I think there is one sanity check we do even when --bypass-feature-reader is turned on that we would need to turn off. I basically added "something that works for Megan's project right now."

  2. Yes, the various cases were getting complicated and I had a bug when -V was enabled, so I just disabled it as an option. It would make sense to add -V support for Azure files. I just didn't do it because I was in a rush and I figured it was better to disable it than to have it potentially be wrong.

  3. Yeah, that's the error I saw. It's definitely better than nothing. It would be great if it could be propagated back up to the Java layer as a Java exception, though. It currently ends the program with SIGABRT, I think, which doesn't play nicely with various reporting and retry mechanisms. Not super high priority, but nice if you have the cycles.
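To illustrate the point about reporting and retry mechanisms, here is a small generic sketch (not GATK or GenomicsDB code) of what a wrapper sees in the two cases. It assumes a POSIX system; the helper names `run_child` and `describe` are made up, and the child processes are just stand-ins for a tool that aborts versus one that raises an ordinary exception.

```python
import signal
import subprocess
import sys


def run_child(code):
    """Run a snippet in a child interpreter and capture its output."""
    return subprocess.run(
        [sys.executable, "-c", code], capture_output=True, text=True
    )


def describe(result):
    """Summarise what a wrapper script can learn from the finished child."""
    if result.returncode < 0:
        # On POSIX, a negative return code means the child died from a signal.
        return f"killed by signal {signal.Signals(-result.returncode).name}"
    if result.returncode != 0:
        lines = result.stderr.strip().splitlines()
        last = lines[-1] if lines else ""
        return f"exited with code {result.returncode}: {last}"
    return "ok"


# Stand-in for a native abort: no exception, only a signal number.
aborted = run_child("import os; os.abort()")
# Stand-in for an error surfaced as a normal exception with a message.
raised = run_child("raise ValueError('malformed Azure URI')")

if __name__ == "__main__":
    print(describe(aborted))
    print(describe(raised))
```

In the aborted case the wrapper only learns which signal killed the process, so it cannot easily tell a bad URI from any other native crash; in the exception case it gets an exit code and a message it can log or match on when deciding whether to retry.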

lbergelson avatar Dec 20 '23 19:12 lbergelson