tskit icon indicating copy to clipboard operation
tskit copied to clipboard

Support loading from URLs in Python

Open hyanwong opened this issue 4 years ago • 28 comments

For teaching purposes, and possibly for other reasons too, I think it would be really useful to be able to load a tree sequence from a URL. IMO the nicest way to make this easy for a user would be to have a url argument to tskit.load:

tskit.load(url="https://tskit.dev/tutorials/data/basics.trees")

I would hope that this would mainly be used for small tree sequences, although I suppose it could be an easy way for someone to load up larger ones e.g. from zenodo - depends how long that would take I guess.

I suspect implementing this using the urllib.request library would be quite easy, although I don't know how we would unit test it - probably mock the urllib.request.urlopen function when testing, or something.

hyanwong avatar Jul 10 '21 16:07 hyanwong

NB: this would mean that content in e.g. jupyterbooks could simply be pasted into the user's jupyter session and (as long as the right libraries were installed), the examples should all run. It's less error-prone and fiddly than trying to run the jupyterbook e.g. in binder (see https://github.com/tskit-dev/tutorials/issues/114)

hyanwong avatar Jul 10 '21 16:07 hyanwong

The right way to do this I think would be to make the string form of the file input support URLs, which we intercept in Python. BUT this would be quite a lot of work to do well (it's simple to do badly) so I'm not super keen on it right now.

jeromekelleher avatar Jul 10 '21 16:07 jeromekelleher

Just to say that I have wanted to do this for a while, for other purposes too (not just the tutorials), so I am keen to get something like this done. But maybe others could weigh in if they have needed to do this in the past (as it could just be me).

Out of interest, why would it be a lot of work to do well? I would have thought the standard URL libraries would do most of the hard lifting?

hyanwong avatar Jul 10 '21 16:07 hyanwong

Presumably the workaround for the moment is simply e.g.

import urllib.request
file = urllib.request.urlopen('file:///Users/yan/Documents/GitHub/tutorials/data/basics.trees')
ts = tskit.load(file)

Is there any reason that we shouldn't be doing this as a workaround?

hyanwong avatar Jul 10 '21 16:07 hyanwong

I've been thinking about this, and actually I do think that the form load(url="http://...") is worth considering, for two reasons. Firstly the silly one, that you can still load a file starting with "http" in this case. Secondly, perhaps a more sensible one, that it makes it much easier to document and code without complications, as you can simply pass the url parameter to urllib.request.urlopen. In fact, I think that makes it about a 4 line addition to tskit, and to unit test it you can simply try it with a file:/// url, which should be pretty simple. Then we simply say "anything passed as a url parameter is passed on to urllib.request to open" (or something like that. I think this would tick the box of "doing it well" without having to check on the format of the string, which, as @jeromekelleher points out, is a bit icky.

hyanwong avatar Jul 10 '21 21:07 hyanwong

It's about much more than just the parameters @hyanwong - once you start interacting with the network all sorts of extra complications come in, like http proxies, error handling, retries, etc etc etc. The nominal case is trivial - it's the error and edge cases that are really tricky. These are hard to implement and especially hard to test. I've done it several times and don't particularly want to spend the time doing it now.

jeromekelleher avatar Jul 11 '21 09:07 jeromekelleher

import urllib.request
file = urllib.request.urlopen('file:///Users/yan/Documents/GitHub/tutorials/data/basics.trees')
ts = tskit.load(file)

This seems like a perfectly reasonable thing to do to me (assuming it works).

jeromekelleher avatar Jul 11 '21 09:07 jeromekelleher

It's about much more than just the parameters @hyanwong - once you start interacting with the network all sorts of extra complications come in, like http proxies, error handling, retries, etc etc etc. The nominal case is trivial - it's the error and edge cases that are really tricky. These are hard to implement and especially hard to test. I've done it several times and don't particularly want to spend the time doing it now.

Right, that sounds reasonable. I thought that we could simply say that we are handing all this stuff over to urllib.request.urlopen and therefore if it doesn't work (e.g. problems with proxies etc) it's not something we are responsible for. But maybe that's a cop out.

hyanwong avatar Jul 11 '21 09:07 hyanwong

I'd be much more motivated to do this if tskit had any sense of being able to use part of a file, as zarr for example does. tskit needs the whole file to do anything, so you have to download it all anyway. The tutorials should also be using the most common way of loading a .trees - from disk. Adding url arguments at that point adds confusion.

Pointing out that urlopen can be used in the load docs seems a good idea though.

benjeffery avatar Jul 19 '21 00:07 benjeffery

Annoyingly I find that urlopen doesn't actually work for http(s) urls:

import urllib.request
file = urllib.request.urlopen("https://raw.github.com/tskit-dev/tutorials/main/data/basics.trees")
tskit.load(file)

Gives FileFormatError: File not in KAS format, presumably because, although file.fileno() is valid here, which I think is the requirement tested in KAS, it's not actually stored on disk quite like a normal file, and the part of the tskit C library that loads from a file doesn't know what to make of it. So yes, as Jerome says, this is more complex than I had hoped, and can be kicked down the line, I guess. Shame.

Note that I presume this means that there's no way to load in tree sequence from a stream of data: we require a file on disk, not e.g. some data spooled to us by a DB or equivalent.

hyanwong avatar Jul 19 '21 21:07 hyanwong

We have pre-exisiting tests for sockets: https://github.com/tskit-dev/tskit/blob/main/python/tests/test_fileobj.py#L289 and streams: https://github.com/tskit-dev/tskit/blob/main/python/tests/test_fileobj.py#L246

benjeffery avatar Jul 20 '21 10:07 benjeffery

Oh, useful, thanks. Maybe it's because I'm on OS X. Weird that OS X doesn't do file descriptors properly though.

hyanwong avatar Jul 20 '21 10:07 hyanwong

Although actually I still get the _tskit.FileFormatError: File not in KAS format on holly, so I guess I'm doing something wrong here.

hyanwong avatar Jul 20 '21 10:07 hyanwong

Digging a little more. I would have thought that if I pass a fileno to tskit.load it should "just work", regardless of where the fileno comes from, but tskit seems to distinguish a fileno from a local file with one from urllib:

f = open("tmp.trees", "rb")
ts = tskit.load(f.fileno())  # works

f = urllib.request.urlopen("https://raw.github.com/tskit-dev/tutorials/main/data/basics.trees")
ts = tskit.load(f.fileno())  # fails

The failure is:

---------------------------------------------------------------------------
FileFormatError                           Traceback (most recent call last)
/usr/local/lib/python3.9/site-packages/tskit/trees.py in load(cls, file_or_path)
   3480             ts = _tskit.TreeSequence()
-> 3481             ts.load(file)
   3482             return TreeSequence(ts)

FileFormatError: File not in KAS format

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-131-02f25af848d6> in <module>
      3 
      4 f = urllib.request.urlopen("https://raw.github.com/tskit-dev/tutorials/main/data/basics.trees")
----> 5 ts = tskit.load(f.fileno())  # fails

/usr/local/lib/python3.9/site-packages/tskit/trees.py in load(file)
   2829     :rtype: :class:`tskit.TreeSequence`
   2830     """
-> 2831     return TreeSequence.load(file)
   2832 
   2833 

/usr/local/lib/python3.9/site-packages/tskit/trees.py in load(cls, file_or_path)
   3483         except exceptions.FileFormatError as e:
   3484             # TODO Fix this for new file semantics
-> 3485             formats.raise_hdf5_format_error(file_or_path, e)
   3486         finally:
   3487             if local_file:

/usr/local/lib/python3.9/site-packages/tskit/formats.py in raise_hdf5_format_error(filename, original_exception)
    271     h5py = get_h5py()
    272     try:
--> 273         with h5py.File(filename, "r") as root:
    274             version = tuple(root.attrs["format_version"])
    275             raise exceptions.VersionTooOldError(

/usr/local/lib/python3.9/site-packages/h5py/_hl/files.py in __init__(self, name, mode, driver, libver, userblock_size, swmr, rdcc_nslots, rdcc_nbytes, rdcc_w0, track_order, fs_strategy, fs_persist, fs_threshold, **kwds)
    410                 name = repr(name).encode('ASCII', 'replace')
    411             else:
--> 412                 name = filename_encode(name)
    413 
    414             if track_order is None:

/usr/local/lib/python3.9/site-packages/h5py/_hl/compat.py in filename_encode(filename)
     17     filenames in h5py for more information.
     18     """
---> 19     filename = fspath(filename)
     20     if sys.platform == "win32":
     21         if isinstance(filename, str):

TypeError: expected str, bytes or os.PathLike object, not int

hyanwong avatar Jul 20 '21 10:07 hyanwong

~The HTTPResponse doesn't support seek, so won't work. Not all fileno's are equal.~

EDIT: This was very wrong.

benjeffery avatar Jul 20 '21 11:07 benjeffery

Ah, that explains it. Didn't know we used seek when reading in, thanks.

hyanwong avatar Jul 20 '21 11:07 hyanwong

It looks like that file is a hdf5 file, not a kastore file. Non-seek file objects should work just fine through kastore.

grahamgower avatar Jul 20 '21 11:07 grahamgower

It looks like that file is a hdf5 file, not a kastore file. Non-seek file objects should work just fine through kastore.

I'm pretty sure it's a kastore file:

>>> f = urllib.request.urlopen("https://raw.github.com/tskit-dev/tutorials/main/data/basics.trees")
>>> f.read()[0:10]
b'\x89KAS\r\n\x1a\n\x01\x00'

(same as you get when read()ing from an msprime-generated tree sequence.

hyanwong avatar Jul 20 '21 11:07 hyanwong

Well something funny is going on, because seek() shouldn't be needed. File streaming support depends on avoiding seek calls.

grahamgower avatar Jul 20 '21 11:07 grahamgower

stdin doesn't support seek() for example, and this works just fine

$ python -c "import tskit, sys; tskit.load(sys.stdin)" < basics.trees

grahamgower avatar Jul 20 '21 11:07 grahamgower

Ah, I understand now. After failing to read the kastore, tskit tries to load as h5py, and that's when it tries to seek. So yes, something weird here.

benjeffery avatar Jul 20 '21 11:07 benjeffery

I've tracked this down a bit - as tskit's python interface is taking the file descriptor of the HTTPResponse and then reopening it in the C API, it is getting the encrypted data! Fixing this in the general case is interesting - maybe creating a proxy file descriptor that is passed to the C API?

benjeffery avatar Jul 21 '21 17:07 benjeffery

Not a high priority I think, but nicely done tracking it down. This stuff is always more complicated than it seems...

jeromekelleher avatar Jul 21 '21 17:07 jeromekelleher

No, not planning to fix soon. Tracked it down as unexplained behaviour is always concerning!

benjeffery avatar Jul 21 '21 17:07 benjeffery

Nicely triaged, @benjeffery, thanks. Agree about low priority.

hyanwong avatar Jul 22 '21 11:07 hyanwong

Just coming back to this as a result of workshops and pyodide. It's more than likely that online tree sequences will be posted in .tsz format (e.g. the unified genealogies on zenodo), so perhaps it's more of a priority to allow tszip.decompress(url="https://...") I don't know if that's any easier to sort than tskit.load(url=...): I assume not.

For reference, here's what I'm doing as a workaround:

import tszip
import urllib.request
temporary_filename, _ = urllib.request.urlretrieve(
    "https://zenodo.org/record/5495535/files/hgdp_tgp_sgdp_chr22_q.dated.trees.tsz"
)
ts = tszip.decompress(temporary_filename)
urllib.request.urlcleanup() # should remove temporary_filename

hyanwong avatar Jun 07 '22 09:06 hyanwong

Adding url support to tszip sounds like a great idea - zarr already supports some forms of url access so we should hopefully be able to tap into that.

jeromekelleher avatar Jun 07 '22 09:06 jeromekelleher

https://github.com/tskit-dev/tszip/issues/66

benjeffery avatar Jun 07 '22 09:06 benjeffery