filesystem_spec icon indicating copy to clipboard operation
filesystem_spec copied to clipboard

feat(gist): fsspec file system for GitHub gists (resolves #888)

Open lmmx opened this issue 1 year ago • 3 comments

This PR introduces a new filesystem backend, GistFileSystem, which allows read-only access to files within a single GitHub Gist (as suggested in #888). I'd find this really useful in combination with Universal Pathlib (also an fsspec project)!

  • Gists are essentially flat collections of files, so there is no subdirectory concept. (Technically they are git repos that can store dirs too but we only need to support them as flat file lists, that's all the website UI will show them as)
  • The implementation is closely based on GithubFileSystem but simplified for a single gist.
  • Supports both public and private gists, latter needed user/token (PAT).

Users can do:

import fsspec

# For a public gist
fs = fsspec.filesystem("gist", gist_id="729837f14264089288178a5f632221ab")
print(fs.ls(""))  # lists files
with fs.open("test1.txt", "rb") as f:
    print(f.read().decode())

For a private gist, the same but also passing username and token args.

  • [x] Implemented FS (methods: ls, _open, cat, invalidate_cache), read-only impl
  • [x] Added to registry (alphabetically)
  • [x] Added a test - I changed the gist ID to one by martindurant so you wouldn't have to worry about relying on someone else preserving their artifact for your tests to pass, also it's fairly small so shouldn't be slow to load
  • [x] Added documentation in docs/source/api.rst.
  • [x] Verified that read-only operations (ls, cat, open) are working with public gists.

Example usage

Below is a short snippet showing how to retrieve files from a public gist:

import fsspec

gist_id = "16bee4256595d3b6814be139ab1bd54e"
print("Gist ID:", gist_id)
fs = fsspec.filesystem("gist", gist_id=gist_id)
file_list = fs.ls("")
print("Files in the Gist (via fsspec):", file_list)
contents = fs.cat(["gistfile1.txt"])
print(contents["gistfile1.txt"].decode()[:120] + "\n...")

Gist ID: 16bee4256595d3b6814be139ab1bd54e
Files in the Gist (via fsspec): ['gistfile1.txt']
import astropy.io.fits._tiled_compression as tiled
from astropy.io import fits
import numcodecs
import fsspec
import zar
...

lmmx avatar Feb 16 '25 19:02 lmmx

Thanks for providing! I haven't had a chance to look yet, but I will soon :)

martindurant avatar Feb 21 '25 23:02 martindurant

Most welcome, no worries! 😃

lmmx avatar Feb 22 '25 13:02 lmmx

Quick suggestion: it would be good to enable bundling the gist ID with the URL:

with fsspec.open("gist://16bee4256595d3b6814be139ab1bd54e@/test1.txt", "rb")

like github: allows. It would require enabling extracting kwargs from the URL.

martindurant avatar Feb 22 '25 21:02 martindurant

Please ping me when I should have another look

martindurant avatar Feb 27 '25 15:02 martindurant

Thanks for reviewing Martin, gotten sidetracked in a CI fixing rabbit hole this week I've thankfully emerged and can return to revisit this!

Please ping me when I should have another look

Will do :saluting_face:

lmmx avatar Feb 28 '25 16:02 lmmx

I just noticed this is still stalled. Please ask for help if you need it.

martindurant avatar May 20 '25 15:05 martindurant

Oh snap I’m sorry, yeah let me take a look…

lmmx avatar May 22 '25 14:05 lmmx

Updated now with the ability to specify just a single file (and ran the linter, sorry I missed that last time)

Some TODOs:

  • [x] Test for grabbing all files (changed the choice of gist to one of yours with multiple files so I can assert)
  • [x] Test for grabbing a single file (and assert number of files is 1)
  • [x] Test for non-existent file should raise FileNotFound
  • [x] Tests for gist://... parsing with what we expect
  • [x] add the ability to specify filename in the URL (I think to keep it conventional the URI will just be to specify a single file, the multi-file syntax would just be via filenames kwarg)
  • [x] add the ability to specify SHA (revision) in the URL

lmmx avatar May 22 '25 17:05 lmmx

A little test that the URL is parsed as expected

@pytest.mark.parametrize(
    "gist_id,sha,file,token,user",
    [
        ("my-gist-id-12345", "sha_hash_a0b1", "a_file.txt", "secret_token", "my-user"),
        ("my-gist-id-12345", "sha_hash_a0b1", "a_file.txt", "secret_token", ""),
        ("my-gist-id-12345", None, "a_file.txt", "secret_token", "my-user"),  # No SHA
    ],
)
def test_gist_url_parse(gist_id, sha, file, token, user):
    if sha:
        fmt_str = f"gist://{user}:{token}@{gist_id}/{sha}/{file}"
    else:
        fmt_str = f"gist://{user}:{token}@{gist_id}/{file}"
    
    parsed = GistFileSystem._get_kwargs_from_urls(fmt_str)
    
    expected = {"gist_id": gist_id, "token": token}
    if user:  # Only include username if it's not empty
        expected["username"] = user
    if sha:  # Only include SHA if it's specified
        expected["sha"] = sha
    
    assert parsed == expected

lmmx avatar May 22 '25 17:05 lmmx

Cool, all done. A "round trip" might be nice too

lmmx avatar May 22 '25 18:05 lmmx

Checks passed, 3.10 failed with an intermitten HTTP error from conda repodata (I don't have the ability to re-run it), LGTM

lmmx avatar May 22 '25 19:05 lmmx

I like it! Let's put it in, and see if the public has feedback once using it.

martindurant avatar May 23 '25 15:05 martindurant

These tests need a vcr or a network marker. Edit: actually, did you get rid of cassettes, or am I misremembering that bit?

QuLogic avatar May 25 '25 07:05 QuLogic

A network marker is fair. I don't think it's worth setting up vcr for this.

martindurant avatar May 25 '25 20:05 martindurant