Segmenting the interface
There seem to be four principal modes by which people access data in files:
- Reading (this is pure reading, maybe you want to read some specific byte range or read incrementally)
- Writing (specifically writing to a single file)
- Examining file structure (`ls`, `joinpath`, `du`, etc.)
- Manipulating structure (`mv`, `tempdir`, etc.)
As far as I can tell, the FilePathsBase API doesn't currently make a formal distinction between these modes. Would it make sense to do so?
This way, things like HTTP paths can simply opt in to the pure-reading interface, whereas a local path could also implement the writing and manipulating interfaces. We can then also have nice interface tests for that, and it would probably make it conceptually easier to implement random filesystems, like zip files (which wouldn't support a tempdir, for example).
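A trait-based split along these lines might look like the following sketch. All names here are hypothetical and not part of the current FilePathsBase API; the point is just that each path type opts in to the interfaces it can actually support, and generic code (or interface tests) queries the traits instead of assuming a full filesystem:

```julia
# Hypothetical capability traits -- none of these names exist in
# FilePathsBase today.
abstract type AbstractPath end

struct LocalPath <: AbstractPath
    path::String
end

struct HTTPPath <: AbstractPath
    url::String
end

# Default: a path type supports nothing until it opts in.
supports_read(::Type{<:AbstractPath}) = false
supports_write(::Type{<:AbstractPath}) = false
supports_structure(::Type{<:AbstractPath}) = false   # ls, joinpath, du, ...
supports_manipulate(::Type{<:AbstractPath}) = false  # mv, tempdir, ...

# A local path implements everything.
supports_read(::Type{LocalPath}) = true
supports_write(::Type{LocalPath}) = true
supports_structure(::Type{LocalPath}) = true
supports_manipulate(::Type{LocalPath}) = true

# An HTTP path opts in to the pure-reading interface only.
supports_read(::Type{HTTPPath}) = true
```

A generic interface test suite could then exercise only the traits a type claims, rather than failing on operations the backend can never provide.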
Interesting. I hadn't thought of splitting it up like this, but that might work nicely with some preliminary work @ExpandingMan was doing on supporting more of a key-value store interface.
https://github.com/rofinn/FilePathsBase.jl/pull/159
Basically, right now we're assuming a filesystem interface includes two things:
- An IO interface like reading and writing objects/files
- A tree navigation and manipulation interface like `ls`, `joinpath`, and `mkdir`
I think the IO interface is a strict requirement, but the tree interface could easily be a hash table or some other associative data structure.
Perhaps tangential to this particular issue, but I think it'd be kinda cool if you could map "filesystem" operations directly to data structure ops.
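As a rough illustration of that mapping (a standalone sketch, not FilePathsBase code), a plain `Dict` can stand in for a key-value "filesystem", with `ls` becoming a key-prefix query — which is essentially how S3-style stores fake directories:

```julia
# A Dict standing in for a key-value "filesystem". There are no real
# directories; the '/' separators in the keys are just a convention.
store = Dict(
    "data/a.csv" => b"1,2,3",
    "data/b.csv" => b"4,5,6",
    "readme.txt" => b"hello",
)

# "ls" as a prefix query: list the immediate children of a prefix by
# filtering keys and splitting off the first path segment.
function kv_ls(store, prefix)
    names = Set{String}()
    for key in keys(store)
        startswith(key, prefix) || continue
        rest = key[length(prefix)+1:end]
        push!(names, first(split(rest, '/')))
    end
    return sort!(collect(names))
end
```

Here `mkdir` would naturally be a no-op and `rm` a ranged key deletion, which is roughly the behavior remote object stores already exhibit.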
It's been quite a while since I looked at this. Basically I was interested in supporting S3, which is a key-value store (see AWSS3.jl). It works as is, but is more than a little hacky. There's a whole bunch of things that go horribly wrong on remote filesystems that are not necessarily related to the posted issue.
In a perfect world, I wouldn't think generalizing this package to things like key-value stores or HTTP makes much sense at all. It is built around a tree-like abstraction in which directories are nodes, and I think that's fine, especially since that's how file systems actually work. The problem is that in real life, for better or worse, S3 (and now even S3-compatible key-value storage alternatives) is really important and arguably becoming more so. So maybe it's worth doing? I don't know. I still see the S3 use case as important; HTTP is probably stretching it way further.
FWIW, I have use cases in S3, HTTP, and S3 derivatives (Google Cloud, MinIO, etc). The main things I want to do are:
- `ls` on a "directory". This may involve caching results or constructing a key-value store locally.
- `readbytes(path, start, stop)` to read a byte range from a remote file.
I'm not so concerned about e.g. write or tempdir as such, at least not for some of the more exotic stores like HTTP.
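For the byte-range case, one plausible implementation over HTTP (a sketch only: `HTTPPath` and `readbytes` are hypothetical names, and it assumes the server honors `Range` requests; the `Downloads` stdlib is used to avoid extra dependencies) would be:

```julia
using Downloads  # Julia stdlib

struct HTTPPath
    url::String
end

# Build an RFC 7233 Range header value for a zero-based, inclusive
# byte range; kept as a separate function so it can be tested offline.
range_header(start::Integer, stop::Integer) = "bytes=$start-$stop"

# Fetch only the requested bytes instead of downloading the whole
# file. A 206 status confirms the server actually honored the range.
function readbytes(p::HTTPPath, start::Integer, stop::Integer)
    buf = IOBuffer()
    resp = Downloads.request(p.url; output=buf,
                             headers=["Range" => range_header(start, stop)])
    resp.status == 206 || error("server ignored the Range request")
    return take!(buf)
end
```

Note that not every HTTP server supports range requests, so a real implementation would need a documented fallback or error path.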
The idea about `HTTPPath` is mostly to support a similar interface to read from HTTP "file stores", which are somehow quite common in large geospatial datasets. In that case I might not have `ls`, for example, implemented. I also need to get all of this to work with a `ReferenceFileSystem`, which is basically an in-memory fake filesystem that can have data either inline as bytes, or as a combination of `[filepath, start_byte_index, stop_byte_index]`.
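That reference scheme can be sketched in a few lines. This is illustrative only (the names are made up, and it mirrors the general idea rather than any actual `ReferenceFileSystem` implementation): each entry is either inline bytes or a byte-range reference into a backing file.

```julia
# An entry is either inline bytes or a (filepath, start, stop)
# reference, with zero-based inclusive byte indices as described above.
const Entry = Union{Vector{UInt8}, Tuple{String,Int,Int}}

refs = Dict{String,Entry}(
    "meta.json" => Vector{UInt8}(codeunits("{\"n\":2}")),  # inline bytes
    "chunk/0.0" => ("backing.bin", 0, 99),                 # byte-range ref
)

# Resolve an entry to raw bytes: inline entries are returned directly;
# range entries seek into the backing file and read stop - start + 1 bytes.
function resolve(refs, key; open_backing=open)
    entry = refs[key]
    entry isa Vector{UInt8} && return entry
    path, start, stop = entry
    open_backing(path) do io
        seek(io, start)
        read(io, stop - start + 1)
    end
end
```

The `open_backing` keyword is there so the backing reads could themselves go through any path type that supports byte-range reading, which is what ties this back to the reading-only interface idea.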
Overall the idea is to make getting data from arbitrary filesystem-like data stores painless and easy.