Globus/GridFTP file source plugin

Open mvdbeek opened this issue 1 month ago • 5 comments

Needed for working efficiently with ENA data.

mvdbeek avatar Nov 18 '25 20:11 mvdbeek

It seems to me this is high priority for BRC, and I'd really appreciate it if we could update this issue with whatever we already know, have tried, mean to try next, etc.

d-callan avatar Nov 20 '25 14:11 d-callan

https://github.com/galaxyproject/galaxy/compare/dev...mvdbeek:galaxy:gridftp is a WIP for GridFTP. However, it is pretty complicated, and I don't think it is going to work even conceptually. It seems to be built for cloud-to-cloud movement: you need to register an app, etc.

It's also possible to use ascp, but that means installing ascp and its key files wherever we access URLs, which is probably not very portable, and I see difficulty maintaining it. https://raw.githubusercontent.com/laurent-martin/aspera-api-examples/refs/heads/main/app/python/src/examples/server.py has an example of doing this via an autogenerated Python API client. This seems portable enough, but it would mean maintaining our own client, and the example download doesn't actually work in my hands. It might be worth pursuing further, however.

Somewhat hidden in https://ena-docs.readthedocs.io/en/latest/retrieval/file-download.html#using-aspera there's a link to https://embl.service-now.com/kb?id=kb_article_view&sys_kb_id=4cc60cf8c398a610bf313dfc0501314c#mcetoc_1idpn4k0to, where the last item says:

Globus : This is not the default globus way, but it also has this HTTP option. See Globus documentation for in-depth understanding.

$ curl https://g-a8b222.dd271.03c0.data.globus.org/1000g/ftp/CHANGELOG

This also works for the /vol1/fastq collection. Given this is a one-liner on the BRC side, I think we should just use that and see whether it is more reliable than the FTP download. I'll subscribe to the mailing list in the hope that they'd send out a warning if that URL ever changes.
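A minimal sketch of what that one-liner could look like, assuming the change is just a host swap on ENA FTP URLs. The function name `to_globus_https` and the `ENA_HOSTS` set are hypothetical; the Globus hostname is the one quoted above, and only the /vol1/fastq paths have been confirmed to resolve in this thread.

```python
from urllib.parse import urlsplit, urlunsplit

# Hostname taken from the ENA knowledge-base example quoted above.
GLOBUS_HTTPS_HOST = "g-a8b222.dd271.03c0.data.globus.org"

# Assumption: the only source host we want to rewrite.
ENA_HOSTS = {"ftp.sra.ebi.ac.uk"}


def to_globus_https(url: str) -> str:
    """Rewrite an ENA FTP/HTTP(S) URL to the Globus HTTPS endpoint.

    URLs on any other host are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.hostname not in ENA_HOSTS:
        return url
    # Keep the path and query, swap scheme and host, drop any fragment.
    return urlunsplit(("https", GLOBUS_HTTPS_HOST, parts.path, parts.query, ""))


if __name__ == "__main__":
    print(to_globus_https(
        "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR208/SRR208161/SRR208161.fastq.gz"
    ))
```

Keeping the hostname in a single constant means a change on the ENA side only needs one edit.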

mvdbeek avatar Nov 21 '25 11:11 mvdbeek

No luck subscribing to https://listserver.ebi.ac.uk/mailman/listinfo/transfer-announce, but I'll send an email to the ENA person we met at the codeathon.

mvdbeek avatar Nov 21 '25 11:11 mvdbeek

A very small data point: SRR208161.fastq.gz, a 370 MB file found at https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR208/SRR208161/SRR208161.fastq.gz, transfers at 195 KB/s via normal curl, 510 KB/s using the Globus link (https://g-a8b222.dd271.03c0.data.globus.org/vol1/fastq/SRR208/SRR208161/SRR208161.fastq.gz), and 457 KB/s using a Globus transfer to a Globus Connect Personal endpoint on the same machine. This machine is on a 20 Mbit/s connection.
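To put those rates in context, here is a rough back-of-the-envelope calculation. It assumes decimal units (1 MB = 1000 KB, 1 Mbit/s = 125 KB/s), which the measurements above do not specify; the method labels are just shorthand for the three transfers described. Under that assumption the full file takes roughly 32 minutes via plain curl, 12 minutes via the Globus HTTPS link, and 13.5 minutes via Globus transfer, with the fastest method using about a fifth of the 20 Mbit/s link.

```python
# Assumption: decimal units throughout (1 MB = 1000 KB, 1 Mbit/s = 125 KB/s).
FILE_SIZE_KB = 370 * 1000
LINK_KB_PER_S = 20 * 125  # 20 Mbit/s expressed in KB/s

# Observed rates from the measurements above, in KB/s.
RATES_KB_PER_S = {
    "curl (ftp.sra.ebi.ac.uk over HTTPS)": 195,
    "Globus HTTPS link": 510,
    "Globus transfer to Connect Personal": 457,
}


def summarize(rate_kb_per_s: float) -> tuple[float, float]:
    """Return (seconds for the whole file, fraction of link capacity used)."""
    return FILE_SIZE_KB / rate_kb_per_s, rate_kb_per_s / LINK_KB_PER_S


if __name__ == "__main__":
    for method, rate in RATES_KB_PER_S.items():
        seconds, fraction = summarize(rate)
        print(f"{method}: {seconds / 60:.1f} min, {fraction:.1%} of link")
```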

~~BTW I used SRR208161 as an example whereas I wanted to use SRR2081166, but could not find SRR2081166 in the Globus collection, not sure why?~~ Found it at /vol1/srr/SRR208/006/SRR2081166.

A comment from a colleague at ENA is that the effective bandwidth of HTTPS transfers is lower and that Globus to Globus transfers are most efficient (but of course technically more challenging to implement).

pvanheus avatar Nov 22 '25 14:11 pvanheus

So feedback from ENA is discouraging with respect to Globus HTTP. I'd love to be able to update this to say Aspera plugin or similar instead, now that that seems to be the direction we're heading, but oh well, I guess. After some talk with Marius, the current plan is to implement a generic sort of ascp plugin, which we can configure for ENA or possibly subclass later if needed. I'll have a PR for that soon, hopefully.
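As a rough illustration of what "generic and configurable" might mean here, the sketch below only builds an ascp command line from a small config object; it is not the planned plugin and does not use any Galaxy file source API. `AscpConfig` and `build_ascp_command` are hypothetical names. The defaults (user `era-fasp`, host `fasp.sra.ebi.ac.uk`, port 33001, `-QT -l 300m`) are assumed to match the Aspera example in the ENA file-download documentation linked earlier and should be verified against that page. The key file path has no default because it depends on the local ascp installation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AscpConfig:
    """Hypothetical per-server settings; defaults are assumed ENA values."""

    key_file: str                 # path to the private key shipped with ascp
    user: str = "era-fasp"
    host: str = "fasp.sra.ebi.ac.uk"
    port: int = 33001
    rate_limit: str = "300m"      # passed to -l (maximum transfer rate)
    executable: str = "ascp"


def build_ascp_command(config: AscpConfig, remote_path: str, dest_dir: str) -> list[str]:
    """Build the argv list for a single-file ascp download (does not run it)."""
    return [
        config.executable,
        "-QT",
        "-l", config.rate_limit,
        "-P", str(config.port),
        "-i", config.key_file,
        # The remote path is given relative to the account root.
        f"{config.user}@{config.host}:{remote_path.lstrip('/')}",
        dest_dir,
    ]


if __name__ == "__main__":
    cfg = AscpConfig(key_file="/path/to/asperaweb_id_dsa.openssh")
    print(" ".join(build_ascp_command(
        cfg, "/vol1/fastq/SRR208/SRR208161/SRR208161.fastq.gz", "."
    )))
```

Returning an argv list rather than a shell string keeps the command safe to pass to `subprocess.run` without shell quoting, and another Aspera server would only need a different `AscpConfig`.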

d-callan avatar Nov 25 '25 00:11 d-callan