Replace HuggingFace Datasets Python library with Rust version (hf-hub)
Issue Description:
We are currently using the HuggingFace datasets Python library in our project, which is primarily developed in Rust. To maintain consistency and avoid depending on a Python executable, we propose transitioning to the newly available Rust client for the HuggingFace hub, known as hf-hub.
Rationale:
Our project, built with the Rust-based deep learning framework 'Burn', benefits from a consistent, Rust-centric ecosystem. Integrating hf-hub aligns with this philosophy, ensuring our dependencies are uniformly Rust-based. This transition would potentially improve performance, reduce complexity, and streamline our development process.
Proposed Changes:
- Evaluate the compatibility and feature parity of hf-hub with the current Python-based datasets library.
- Develop a migration plan to replace the Python datasets library with hf-hub.
- Test the integration of hf-hub in our environment, focusing on performance and reliability.
- Update documentation to reflect the change in the library and provide guidance for any new usage patterns.
Usage Example:
A sample usage of hf-hub can be found here: https://github.com/huggingface/candle/blob/7c3cfd1086ecdc08a0b350f30f1fbedf2f00c269/candle-examples/examples/whisper/main.rs#L495
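For reference, here is a minimal sketch of what such a download could look like with hf-hub's sync API (Api::new, model, get), loosely mirroring the linked candle example. The repository and file names are placeholders, and the exact API surface may differ between hf-hub versions:

```rust
use hf_hub::api::sync::Api;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Client for the HuggingFace hub, using the default local cache.
    let api = Api::new()?;

    // Fetch a single file from a model repository; the file is cached locally
    // and its path is returned. Names below are placeholders.
    let repo = api.model("openai/whisper-tiny.en".to_string());
    let weights = repo.get("model.safetensors")?;

    println!("cached at: {}", weights.display());
    Ok(())
}
```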
Request for Comments: We invite team members and contributors to provide their feedback, suggestions, and any concerns regarding this proposed transition. This discussion will help us assess the feasibility and plan the implementation effectively.
CC: @nathanielsimard , @louisfd, @Luni-4
I like this idea! We can also contribute to hf-hub if some necessary features are missing.
Last time I tried to benchmark a few things against Burn, I had to give up because I could not get it to work with the Python I had on the cluster in time for the submission. I probably could have solved it given more time, but failing to benchmark a Rust project due to a Python dependency was not a great feeling, so I'd love to see this.
Looks more straightforward and easier to maintain. But it seems totally different from the current Python and SQLite interface. Maybe we need to rewrite the crate to use it.
As long as we are able to extract the values, it's straightforward to insert data into a sqlite db.
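For illustration, a minimal sketch of that insertion step with the rusqlite crate; the table layout and the extracted values are hypothetical:

```rust
use rusqlite::{params, Connection};

fn main() -> rusqlite::Result<()> {
    // Open (or create) the local database file.
    let conn = Connection::open("dataset.db")?;
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, label INTEGER, image BLOB)",
        [],
    )?;

    // Hypothetical values extracted from a downloaded dataset file.
    let label: i64 = 7;
    let image: Vec<u8> = vec![0u8; 28 * 28];

    conn.execute(
        "INSERT INTO items (label, image) VALUES (?1, ?2)",
        params![label, image],
    )?;
    Ok(())
}
```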
I agree that it would be nice to move away from the python dependency, especially as it is in the first example of the Burn book. But the hf-hub crate is really limited in terms of API.
In essence, hf-hub is really just for interacting with the HF hub to get repository details, files, etc. So we can download the files on the hub with this API, but it only helps with the download; support for the different formats would have to be re-implemented.
The datasets python interface hides a lot of implementation details for dataset management utilities and conversion. A vast majority of datasets on the hub include a python config file for their DatasetBuilder. This config is the main entrypoint. It defines the dataset attributes, file sources and dataset splits. The sources can be local to the hub (uploaded) or a different web source (e.g., standard benchmark datasets like COCO) where the data is fetched when the dataset is "built". They support different formats (CSV, JSON, text and parquet) which are all internally converted to parquet to create their Dataset instance.
It is not impossible to get there, but it would most likely require adding support for the same dataset types (CSV, JSON, text and parquet) on top of finding a way to parse the Python config file for their DatasetBuilder. That is, unless we can find a way to use their CLI like a simple binary without having to set up Python and the dependencies.
TL;DR: their dataset management tools are really centered around Python, and moving away from it will require quite some work to preserve support similar to the current HuggingfaceDatasetLoader without the Python dependency.
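To make the gap concrete, here is a rough sketch of the "download only" part that hf-hub covers: fetch a Parquet split from a dataset repository and inspect it with the parquet crate. The repo id and file path are assumptions about how the hub lays out its auto-converted Parquet files, and everything beyond this point (configs, splits, conversion into a Burn dataset) would still need to be implemented on our side:

```rust
use std::fs::File;

use hf_hub::api::sync::Api;
use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Download one Parquet file from a dataset repository (names are illustrative).
    let api = Api::new()?;
    let repo = api.dataset("mnist".to_string());
    let local = repo.get("mnist/train-00000-of-00001.parquet")?;

    // hf-hub only gets us this far; turning the rows into a Burn dataset is the
    // part the Python `datasets` library currently hides from us.
    let reader = SerializedFileReader::new(File::open(local)?)?;
    let metadata = reader.metadata();
    println!(
        "row groups: {}, total rows: {}",
        metadata.num_row_groups(),
        metadata.file_metadata().num_rows()
    );
    Ok(())
}
```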
@laggui, thank you for looking into this. Yes, I suspected that there is more Python processing happening in their toolkit. It's unfortunate that they haven't made the preprocessing declarative and language-independent.
However, there might still be benefits for simple datasets in retrieving just the data. Thus, we could consider creating an additional HF dataset retriever that would pull the raw data and rely on Rust users' custom logic for further modifications.
For example, MNIST does not involve a lot of preprocessing logic (see https://huggingface.co/datasets/mnist/blob/main/mnist.py), and in our HF dataset, we already bypass image data conversion (we use the raw file instead of converted image pixels).
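As an illustration of what "raw data plus user-provided logic" could look like, here is a hypothetical sketch that parses the header of an already downloaded (and decompressed) MNIST IDX image file using only the standard library; the file name is a placeholder:

```rust
use std::fs::File;
use std::io::Read;

fn main() -> std::io::Result<()> {
    // Hypothetical raw file previously fetched from the hub (or any mirror).
    let mut file = File::open("train-images-idx3-ubyte")?;

    // IDX image header: magic number, image count, rows, cols (big-endian u32s).
    let mut header = [0u8; 16];
    file.read_exact(&mut header)?;
    let read_u32 = |b: &[u8]| u32::from_be_bytes([b[0], b[1], b[2], b[3]]);
    let (magic, count, rows, cols) = (
        read_u32(&header[0..4]),
        read_u32(&header[4..8]),
        read_u32(&header[8..12]),
        read_u32(&header[12..16]),
    );
    assert_eq!(magic, 2051, "not an IDX image file");

    // The remaining bytes are the raw pixels; mapping them into a Burn dataset
    // is exactly the custom logic the user would provide.
    let mut pixels = vec![0u8; (count * rows * cols) as usize];
    file.read_exact(&mut pixels)?;
    println!("{count} images of {rows}x{cols}");
    Ok(())
}
```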
Python is the number one complaint among new users trying out our tools. It would be incredibly beneficial and user-friendly if installing Python wasn't a requirement for the MNIST example.
Therefore, I am in favor of keeping this ticket open and attempting to add a new, simple HF dataset to our tools.
@Luni-4, @nathanielsimard, @louisfd, what are your thoughts?
@antimora Totally agree regarding the Python complaint, I had the same impression myself (hence why I looked into this issue). So I am definitely in favor of removing the Python requirement for the MNIST example.
To be honest, I would love to remove the Python requirement for all Hugging Face datasets; it's just that keeping compatibility with this source without implementing the processing logic ourselves will require a tad more work, as I discovered in my investigation.
But as you mention, perhaps simplifying only the MNIST example for now with a source that doesn't require the Python dependency is a good start.
(Hopefully minor) request: would it be possible to also consider CIFAR10, besides (or instead of?) MNIST? In the rust-ML group we had various independent neural network libraries being developed. All of them achieved almost perfect results on MNIST without much work, but almost all of them fell short on CIFAR10. Therefore most group members ended up taking CIFAR10 as the much more interesting benchmark. Neither dataset has a (complex) encoding, so hopefully there is no, or not much, extra work involved? Either way, thanks a lot for putting in the effort to remove Python for some cases!
Yes, as an additional feature or example, it would be great. But we still need to keep the MNIST examples. The MNIST data and examples are meant as a "Hello, world!" example. Pretty much everyone knows the MNIST model by now, and it serves as a knowledge bridge when someone starts using Burn.
Please feel free to file a ticket (you may just copy your comment). I am sure someone would be happy to pick it up and implement it.
@ZuseZ4 MNIST isn't meant to be a benchmark or a test, it's just the simplest complete demo we can create.
@nathanielsimard understood, but for a previous submission I actually wanted to benchmark Burn and couldn't do so, since I don't have the capacity to debug Python build failures on two servers/supercomputers where I don't have admin rights. But it's also fair if you point out that Python dependencies work out for most people and you therefore don't prioritize this.
@ZuseZ4 Btw we added an example that uses Cifar10 with the new ImageFolderDataset in the last release if that helps :)
Thanks a lot @laggui, I am currently just cleaning up some code and will then test how the autodiff/ndarray backend does against Enzyme.