new feature: Generic git service with LFS support

Open siomporas opened this issue 1 month ago • 4 comments

Feature Description

Currently, OpenDAL has no support for arbitrary git services over HTTP with LFS support.

Problem and Solution

Problem:

We have a project that relies heavily on OpenDAL for remote data access, dealing primarily with AI data like weights and datasets. The HuggingFace service is awesome for this (as are S3 and others)! But we want to be able to support any git repository that may house the same sort of data, including internal repositories running in our self-hosted GitLab instance.

We currently launch git in a subprocess to fetch the ref history, check out the right commit, and then pull the LFS files. This feels clunky in a Rust application, and it forces us to finish downloading the model before we can stream its contents to clients.

Solution:

Using gix along with the OpenDAL HTTP service, I was able to make a functioning prototype that can fetch the state of any remote repo at any ref or oid, pull the repository files, then go through the LFS pointers and start streaming the referenced objects down with the OpenDAL HTTP service.
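
For illustration, the "go through the LFS pointers" step could be sketched as below. This is not the prototype's code: it is a std-only sketch that assumes the Git LFS v1 pointer format (a `version` line followed by `oid sha256:...` and `size` lines), and `LfsPointer` and `parse_lfs_pointer` are hypothetical names.

```rust
/// A parsed Git LFS v1 pointer: the SHA-256 of the real object and its size in bytes.
/// (Hypothetical type for illustration; not part of OpenDAL or gix.)
#[derive(Debug, PartialEq, Eq)]
pub struct LfsPointer {
    pub oid_sha256: String,
    pub size: u64,
}

/// Version URL that identifies a v1 pointer file.
const LFS_SPEC_V1: &str = "https://git-lfs.github.com/spec/v1";

/// Returns `Some` if `blob` looks like an LFS v1 pointer, `None` for ordinary content.
pub fn parse_lfs_pointer(blob: &[u8]) -> Option<LfsPointer> {
    // Pointer files are tiny; this sketch assumes a 1 KiB cap and treats
    // anything larger as real file content.
    if blob.len() > 1024 {
        return None;
    }
    let text = std::str::from_utf8(blob).ok()?;
    let mut lines = text.lines();

    // The version line must come first.
    let (key, value) = lines.next()?.split_once(' ')?;
    if key != "version" || value != LFS_SPEC_V1 {
        return None;
    }

    let mut oid = None;
    let mut size = None;
    for line in lines {
        let (key, value) = line.split_once(' ')?;
        match key {
            "oid" => oid = value.strip_prefix("sha256:").map(str::to_owned),
            "size" => size = value.parse::<u64>().ok(),
            _ => {} // unknown keys (e.g. extensions) are ignored in this sketch
        }
    }

    // A SHA-256 digest is 64 hex characters.
    let oid = oid?;
    if oid.len() != 64 || !oid.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    Some(LfsPointer { oid_sha256: oid, size: size? })
}
```

A service could call `parse_lfs_pointer` on each small blob it reads from the repository and fall back to serving the blob bytes directly when it returns `None`.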

This fits our particular use case, but before I say good enough and call it a day, I wanted to know if this functionality might be of interest to the project maintainers here - and if so what is a good way to get this formally added as a feature request, and eventually contribute a crate feature? I wouldn't want to go through the effort of porting what I did thus far to OpenDAL's service APIs unless I had a path forward. Thanks!

Additional Context

No response

Are you willing to contribute to the development of this feature?

  • [x] Yes, I am willing to contribute to the development of this feature.

siomporas avatar Nov 27 '25 16:11 siomporas

Hi, this idea seems quite interesting. I'm willing to accept it as part of opendal.


So ideally, we can just read all git repos this way, right? It’s not limited to LFS files?

Xuanwo avatar Nov 28 '25 05:11 Xuanwo

@siomporas This sounds interesting. May I ask for a clarification on the call flow? From your description, I'm not quite sure whether this is about using OpenDAL to call git services, or using OpenDAL to provide the functions that LFS support needs.

tisonkun avatar Nov 28 '25 06:11 tisonkun

> Hi, this idea seems quite interesting. I'm willing to accept it as part of opendal.
>
> So ideally, we can just read all git repos this way, right? It’s not limited to LFS files?

Correct. Downloading LFS files from the pointers stored in the repository would be optional behavior. For any of this to work, the service has to resolve the state of a git repo at a given ref, including file contents at the oid/sha. The core functionality would work with any HTTP-accessible git repo.

> @siomporas This sounds interesting. May I ask for a clarification on the call flow? From your description, I'm not quite sure whether this is about using OpenDAL to call git services, or using OpenDAL to provide the functions that LFS support needs.

Primarily it would be to provide a new OpenDAL service for git repos, including LFS, conceptually similar to how the HuggingFace service works in OpenDAL but not limited to one service provider.

It requires using something like gix to resolve a ref to an oid, fetching the repository file contents at the sha, then using OpenDAL's HTTP client to stream the objects referenced by any LFS pointers, if present.
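
As a rough sketch of that last step, assuming the server implements the standard Git LFS batch API and default server discovery (the LFS endpoint lives under `<repo>.git/info/lfs`), the helpers below build the batch URL and the JSON body for a download request. `lfs_batch_url` and `lfs_batch_download_body` are hypothetical names, not OpenDAL or gix APIs, and authentication is left out.

```rust
/// Builds the LFS batch endpoint from a repository's HTTP(S) URL.
/// Assumes default discovery: `.git/info/lfs` is appended to the remote URL,
/// or just `/info/lfs` when the URL already ends in `.git`.
pub fn lfs_batch_url(repo_url: &str) -> String {
    let base = repo_url.trim_end_matches('/');
    if base.ends_with(".git") {
        format!("{}/info/lfs/objects/batch", base)
    } else {
        format!("{}.git/info/lfs/objects/batch", base)
    }
}

/// Builds the JSON body of a batch "download" request for `(oid, size)` pairs.
/// The oids are expected to be hex SHA-256 strings, so no JSON escaping is
/// done here; a real implementation would use a JSON library instead.
pub fn lfs_batch_download_body(objects: &[(&str, u64)]) -> String {
    let items: Vec<String> = objects
        .iter()
        .map(|(oid, size)| format!(r#"{{"oid":"{}","size":{}}}"#, oid, size))
        .collect();
    format!(
        r#"{{"operation":"download","transfers":["basic"],"objects":[{}]}}"#,
        items.join(",")
    )
}
```

The body would be POSTed with `Accept` and `Content-Type` set to `application/vnd.git-lfs+json`. The response lists a download `href` per object, and those URLs can then be streamed over plain HTTP.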

I'll throw together a demo this weekend or Monday and share it in this thread. If it shows promise and you think it would make a good addition, I'll work on making it a service backend for OpenDAL.

siomporas avatar Nov 28 '25 06:11 siomporas

Okay, so I was able to get this working as an OpenDAL service with transparent LFS file streaming, plus a little demo project.

There is one design issue I wasn't able to work around: gix requires disk IO to fetch the packs for a given git oid and to reconstitute the repository's object database before it can read the files at a given ref. Modern git servers don't seem to support the old dumb HTTP protocol, which would allow bypassing this. For now I am using tempfile to create an ephemeral temp directory for the packs that gix downloads. I wish there was a cleaner solution, but without rewriting a lot of how gix manages repository data IO, I am kind of stuck with this design.
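
The ephemeral-directory pattern looks roughly like the sketch below. The actual service uses the `tempfile` crate; this version is a std-only stand-in so it compiles without dependencies, and `ScratchDir` is a hypothetical name. The directory is removed when the guard is dropped, so downloaded packs don't outlive the operation that needed them.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::sync::atomic::{AtomicU64, Ordering};

/// Per-process counter so that several guards get distinct directory names.
static NEXT_ID: AtomicU64 = AtomicU64::new(0);

/// RAII guard around a scratch directory (hypothetical, std-only stand-in
/// for what the `tempfile` crate provides).
pub struct ScratchDir {
    path: PathBuf,
}

impl ScratchDir {
    /// Creates `<system temp dir>/<prefix>-<pid>-<counter>`.
    pub fn new(prefix: &str) -> io::Result<Self> {
        let id = NEXT_ID.fetch_add(1, Ordering::Relaxed);
        let name = format!("{}-{}-{}", prefix, std::process::id(), id);
        let path = std::env::temp_dir().join(name);
        // `create_dir` fails if the path already exists, so an existing
        // directory is never silently reused.
        fs::create_dir(&path)?;
        Ok(Self { path })
    }

    /// Location to hand to whatever needs on-disk storage (e.g. pack downloads).
    pub fn path(&self) -> &Path {
        &self.path
    }
}

impl Drop for ScratchDir {
    fn drop(&mut self) {
        // Best-effort cleanup; errors are ignored because `drop` cannot fail.
        let _ = fs::remove_dir_all(&self.path);
    }
}
```

Keeping the guard alive for as long as the repository data is being read, and dropping it afterwards, ties the on-disk footprint to the lifetime of a single operation.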

I'll try to get this posted later today or tomorrow with the demo project and open a PR to elicit feedback.

siomporas avatar Nov 29 '25 18:11 siomporas