feat(gist): fsspec file system for GitHub gists (resolves #888)
This PR introduces a new filesystem backend, GistFileSystem, which allows read-only access to files within a single GitHub Gist (as suggested in #888). I'd find this really useful in combination with Universal Pathlib (also an fsspec project)!
- Gists are essentially flat collections of files, so there is no subdirectory concept. (Technically they are git repos that can store dirs too but we only need to support them as flat file lists, that's all the website UI will show them as)
- The implementation is closely based on
GithubFileSystembut simplified for a single gist. - Supports both public and private gists, latter needed user/token (PAT).
Users can do:
import fsspec
# For a public gist
fs = fsspec.filesystem("gist", gist_id="729837f14264089288178a5f632221ab")
print(fs.ls("")) # lists files
with fs.open("test1.txt", "rb") as f:
print(f.read().decode())
For a private gist, the same but also passing username and token args.
- [x] Implemented FS (methods:
ls,_open,cat,invalidate_cache), read-only impl - [x] Added to registry (alphabetically)
- [x] Added a test - I changed the gist ID to one by martindurant so you wouldn't have to worry about relying on someone else preserving their artifact for your tests to pass, also it's fairly small so shouldn't be slow to load
- [x] Added documentation in
docs/source/api.rst. - [x] Verified that read-only operations (ls, cat, open) are working with public gists.
Example usage
Below is a short snippet showing how to retrieve files from a public gist:
import fsspec
gist_id = "16bee4256595d3b6814be139ab1bd54e"
print("Gist ID:", gist_id)
fs = fsspec.filesystem("gist", gist_id=gist_id)
file_list = fs.ls("")
print("Files in the Gist (via fsspec):", file_list)
contents = fs.cat(["gistfile1.txt"])
print(contents["gistfile1.txt"].decode()[:120] + "\n...")
⇣
Gist ID: 16bee4256595d3b6814be139ab1bd54e
Files in the Gist (via fsspec): ['gistfile1.txt']
import astropy.io.fits._tiled_compression as tiled
from astropy.io import fits
import numcodecs
import fsspec
import zar
...
Thanks for providing! I haven't had a chance to look yet, but I will soon :)
Most welcome, no worries! 😃
Quick suggestion: it would be good to enable bundling the gist ID with the URL:
with fsspec.open("gist://16bee4256595d3b6814be139ab1bd54e@/test1.txt", "rb")
like github: allows. It would require enabling extracting kwargs from the URL.
Please ping me when I should have another look
Thanks for reviewing Martin, gotten sidetracked in a CI fixing rabbit hole this week I've thankfully emerged and can return to revisit this!
Please ping me when I should have another look
Will do :saluting_face:
I just noticed this is still stalled. Please ask for help if you need it.
Oh snap I’m sorry, yeah let me take a look…
Updated now with the ability to specify just a single file (and ran the linter, sorry I missed that last time)
Some TODOs:
- [x] Test for grabbing all files (changed the choice of gist to one of yours with multiple files so I can assert)
- [x] Test for grabbing a single file (and assert number of files is 1)
- [x] Test for non-existent file should raise FileNotFound
- [x] Tests for
gist://...parsing with what we expect - [x] add the ability to specify filename in the URL (I think to keep it conventional the URI will just be to specify a single file, the multi-file syntax would just be via
filenameskwarg) - [x] add the ability to specify SHA (revision) in the URL
A little test that the URL is parsed as expected
@pytest.mark.parametrize(
"gist_id,sha,file,token,user",
[
("my-gist-id-12345", "sha_hash_a0b1", "a_file.txt", "secret_token", "my-user"),
("my-gist-id-12345", "sha_hash_a0b1", "a_file.txt", "secret_token", ""),
("my-gist-id-12345", None, "a_file.txt", "secret_token", "my-user"), # No SHA
],
)
def test_gist_url_parse(gist_id, sha, file, token, user):
if sha:
fmt_str = f"gist://{user}:{token}@{gist_id}/{sha}/{file}"
else:
fmt_str = f"gist://{user}:{token}@{gist_id}/{file}"
parsed = GistFileSystem._get_kwargs_from_urls(fmt_str)
expected = {"gist_id": gist_id, "token": token}
if user: # Only include username if it's not empty
expected["username"] = user
if sha: # Only include SHA if it's specified
expected["sha"] = sha
assert parsed == expected
Cool, all done. A "round trip" might be nice too
Checks passed, 3.10 failed with an intermitten HTTP error from conda repodata (I don't have the ability to re-run it), LGTM
I like it! Let's put it in, and see if the public has feedback once using it.
These tests need a vcr or a network marker. Edit: actually, did you get rid of cassettes, or am I misremembering that bit?
A network marker is fair. I don't think it's worth setting up vcr for this.