Globus/GridFTP file source plugin
Needed for working efficiently with ENA data.
It seems to me this is high priority for BRC, and I'd really appreciate it if we could update this issue with whatever we already know, have tried, or mean to try next.
https://github.com/galaxyproject/galaxy/compare/dev...mvdbeek:galaxy:gridftp is a WIP for GridFTP. However, this is pretty complicated, and I don't think it is even conceptually going to work: it seems to be built for cloud-to-cloud movement, and you need to register an app, etc.
It's also possible to use ascp, but that means installing ascp and key files wherever we access URLs, which is probably not very portable, and I see difficulty maintaining this. https://raw.githubusercontent.com/laurent-martin/aspera-api-examples/refs/heads/main/app/python/src/examples/server.py has an example for doing this via an autogenerated Python API client. This seems portable enough, but it'd mean maintaining our own client. Also, the example download doesn't actually work in my hands. It might be worth pursuing further, however ...
somewhat hidden in https://ena-docs.readthedocs.io/en/latest/retrieval/file-download.html#using-aspera there's a link to https://embl.service-now.com/kb?id=kb_article_view&sys_kb_id=4cc60cf8c398a610bf313dfc0501314c#mcetoc_1idpn4k0to, where the last item says:
> Globus : This is not the default globus way, but it also has this HTTP option. See Globus documentation for in-depth understanding.
>
> `$ curl https://g-a8b222.dd271.03c0.data.globus.org/1000g/ftp/CHANGELOG`
which also works for the /vol1/fastq collection. Given this is a one-liner on the BRC side, I think we should just use that and see if it is more reliable than the FTP download. I'll subscribe to the mailing list in the hope that if that URL ever changes they'll send out a warning.
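A minimal sketch of what that one-liner could look like in Python, assuming the path under `ftp.sra.ebi.ac.uk` (e.g. `/vol1/fastq/...`) is served unchanged by the Globus HTTPS host quoted above. The function names `to_globus_https` and `download` are made up for illustration and are not part of Galaxy; the Globus hostname is hard-coded from this thread and may change.

```python
import shutil
import urllib.request
from urllib.parse import urlsplit, urlunsplit

# Hostnames as quoted in this thread; the Globus one may change without notice.
ENA_FTP_HOST = "ftp.sra.ebi.ac.uk"
GLOBUS_HTTPS_HOST = "g-a8b222.dd271.03c0.data.globus.org"


def to_globus_https(url: str) -> str:
    """Rewrite an ftp.sra.ebi.ac.uk URL to the Globus HTTPS host.

    Assumption: the path is identical on both hosts.
    URLs pointing at any other host are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.hostname != ENA_FTP_HOST:
        return url
    return urlunsplit(("https", GLOBUS_HTTPS_HOST, parts.path, parts.query, ""))


def download(url: str, dest: str, timeout: float = 60.0) -> None:
    """Stream the (possibly rewritten) URL to a local file."""
    target = to_globus_https(url)
    with urllib.request.urlopen(target, timeout=timeout) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)
```

Calling `download("ftp://ftp.sra.ebi.ac.uk/vol1/fastq/...", "out.fastq.gz")` would then fetch the same path from the Globus host over HTTPS, while anything that is not an ENA URL goes through untouched.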
No luck subscribing to https://listserver.ebi.ac.uk/mailman/listinfo/transfer-announce, but I'll send an email to the ENA person we met at the codeathon.
A very small data point: SRR208161.fastq.gz - a 370 MB file found at https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR208/SRR208161/
SRR208161.fastq.gz transfers at 195 KB/s via normal curl, 510 KB/s via the Globus HTTPS link (https://g-a8b222.dd271.03c0.data.globus.org/vol1/fastq/SRR208/SRR208161/SRR208161.fastq.gz), and 457 KB/s via a Globus transfer to a Globus Connect Personal endpoint on the same machine. This machine is on a 20 Mbit/s connection.
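To put those rates in perspective, here is a quick back-of-the-envelope calculation of the full-file transfer time and the share of the 20 Mbit/s link each method used. It assumes decimal units (1 MB = 1000 KB, 1 Mbit = 1000 kbit) and that the measured rates would hold for the whole file; the labels are just shorthand for the three measurements above.

```python
FILE_SIZE_MB = 370   # size of SRR208161.fastq.gz as reported above
LINK_MBIT_S = 20     # connection speed of the test machine

# Measured rates from this thread, in KB/s.
RATES_KB_S = {
    "curl": 195,
    "Globus HTTPS link": 510,
    "Globus transfer (Connect Personal)": 457,
}


def transfer_minutes(size_mb: float, rate_kb_s: float) -> float:
    """Minutes needed to move size_mb at a constant rate_kb_s (decimal units)."""
    return size_mb * 1000 / rate_kb_s / 60


# 20 Mbit/s expressed in KB/s: 8 bits per byte.
link_kb_s = LINK_MBIT_S * 1000 / 8

for method, rate in RATES_KB_S.items():
    minutes = transfer_minutes(FILE_SIZE_MB, rate)
    share = rate / link_kb_s
    print(f"{method}: {minutes:.1f} min, {share:.0%} of link capacity")
```

Under these assumptions the file takes roughly 32 minutes via curl, 12 via the Globus HTTPS link, and 13.5 via the Globus transfer, and none of the three uses more than about a fifth of the link.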
~~BTW I used SRR208161 as an example whereas I wanted to use SRR2081166, but could not find SRR2081166 in the Globus collection, not sure why?~~ Found it at /vol1/srr/SRR208/006/SRR2081166.
A comment from a colleague at ENA is that the effective bandwidth of HTTPS transfers is lower and that Globus to Globus transfers are most efficient (but of course technically more challenging to implement).
So feedback from ENA is discouraging wrt Globus HTTP. I'd love to be able to update this to say Aspera plugin or similar instead, now that that seems to be the direction we're heading, but oh well. After some talk with Marius, the current plan is to implement a generic ascp plugin, which we can configure for ENA or possibly subclass later if needed. I'll have a PR for that soon, hopefully.
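As a rough illustration of the "generic and configurable" idea, here is a sketch that only builds the ascp command line from a small config object. `AscpConfig` and `build_command` are hypothetical names, not the planned Galaxy plugin API. The flags and the ENA values (`era-fasp@fasp.sra.ebi.ac.uk`, port 33001, `-l 300m`) follow the example on the ENA file-download documentation page linked above as I remember it, so treat them as assumptions to verify. The key path is a placeholder.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AscpConfig:
    """Hypothetical settings for one ascp-backed source (not a Galaxy class)."""

    user: str
    host: str
    key_path: str
    ssh_port: int = 33001      # assumed default, taken from ENA's example
    rate_limit: str = "300m"   # assumed default, taken from ENA's example
    ascp_path: str = "ascp"

    def build_command(self, remote_path: str, dest: str) -> List[str]:
        """Return the argv list for downloading remote_path into dest."""
        return [
            self.ascp_path,
            "-QT",                      # fair transfer policy, no encryption
            "-l", self.rate_limit,      # maximum transfer rate
            "-P", str(self.ssh_port),   # SSH port used for session setup
            "-i", self.key_path,        # private key file
            f"{self.user}@{self.host}:{remote_path.lstrip('/')}",
            dest,
        ]


# ENA would then be one configuration among others.
ena = AscpConfig(
    user="era-fasp",
    host="fasp.sra.ebi.ac.uk",
    key_path="/path/to/asperaweb_id_dsa.openssh",  # placeholder path
)
```

The resulting list can be handed to `subprocess.run(..., check=True)`; keeping command construction separate from execution makes it easy to test without ascp installed, and another Aspera server would just be another `AscpConfig` instance.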