Dataset Loaders: HuggingFace

Open slavakurilyak opened this issue 2 years ago • 1 comments

While LangChain has already explored using Hugging Face Datasets to evaluate models, it would be great to see loaders for HuggingFace Datasets.

I see several benefits to creating a loader for streaming-enabled HuggingFace datasets:

1. Integration with Hugging Face models: Hugging Face datasets are designed to work seamlessly with the rest of the Hugging Face ecosystem, such as the Transformers and Tokenizers libraries. This means that you can easily use streaming datasets to provide context for your LangChain-powered LLMs or other Hugging Face models.

2. Customization: Hugging Face datasets provide a flexible and customizable way to process and transform data. You can apply custom functions or transformations to the prompts as they are streamed. For example, you can preprocess the prompts by removing stop words or punctuation, or you can extract features from the prompts using a feature extraction model.

3. Compatibility with different data formats: Hugging Face datasets support a wide range of data formats, including CSV, JSON, and Parquet. This means that you can easily stream prompts from different sources and formats.

4. Dynamic updating: Streaming datasets can be updated in real-time, which can enable you to add new prompts or remove outdated prompts from the dataset without having to reload the entire dataset.

5. Real-time processing: Streaming datasets can enable real-time processing of user prompts, which can be useful in applications that require fast response times.
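
To make the request concrete, here is a minimal sketch of the lazy, row-by-row conversion such a loader could perform. It uses only the standard library: `Document` is a hypothetical stand-in for LangChain's document class, and `lazy_load`, `page_content_column`, and `fake_stream` are illustrative names, not an existing API. In practice the rows could come from `datasets.load_dataset(..., streaming=True)`, which yields one dict per example.

```python
from dataclasses import dataclass, field
from itertools import islice
from typing import Any, Dict, Iterable, Iterator


@dataclass
class Document:
    """Hypothetical stand-in for LangChain's Document (text plus metadata)."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def lazy_load(
    rows: Iterable[Dict[str, Any]],
    page_content_column: str = "text",
) -> Iterator[Document]:
    """Yield one Document per row without materializing the whole dataset."""
    for row in rows:
        # Every column except the text column is kept as metadata.
        metadata = {k: v for k, v in row.items() if k != page_content_column}
        yield Document(page_content=str(row[page_content_column]), metadata=metadata)


def fake_stream() -> Iterator[Dict[str, Any]]:
    """Illustrative endless row source standing in for a streamed dataset."""
    n = 0
    while True:
        yield {"text": f"prompt {n}", "label": n % 2}
        n += 1


# Only three rows are ever pulled from the (infinite) stream.
first_three = list(islice(lazy_load(fake_stream()), 3))
```

Because `lazy_load` is a generator, a custom transformation (point 2) could be applied per row inside the loop, and the consumer decides how many rows to pull.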

slavakurilyak avatar Apr 14 '23 03:04 slavakurilyak

Hi @slavakurilyak, thanks for the suggestion! I will start working on it.

azamiftikhar1000 avatar Apr 18 '23 13:04 azamiftikhar1000

Hi, @slavakurilyak! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, you requested the creation of loaders for HuggingFace Datasets to integrate with LangChain-powered LLMs and other Hugging Face models. It looks like azamiftikhar1000 has acknowledged the suggestion and will start working on it.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your contribution!

dosubot[bot] avatar Sep 01 '23 16:09 dosubot[bot]

Can you also implement setting the split size for the dataset? Some datasets are humongous, but I just want 1/10th of one for experimentation. I tried to edit the present LangChain HuggingFaceDatasetLoader class to accommodate the split size, but it threw errors. I would really appreciate it if this were implemented.

AIWithShrey avatar Jul 10 '24 11:07 AIWithShrey
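
For the split-size request above, one possible approach is a sketch that relies on the split-slicing syntax of the Hugging Face `datasets` library (for example `split="train[:10%]"`) rather than on any change to the LangChain loader. The helper names `percent_split` and `load_fraction` are hypothetical, and `load_fraction` assumes `datasets` is installed; this slice syntax applies to regular (non-streaming) loading.

```python
def percent_split(split: str = "train", percent: int = 10) -> str:
    """Build a split string such as 'train[:10%]' for the first N percent."""
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    return f"{split}[:{percent}%]"


def load_fraction(path: str, split: str = "train", percent: int = 10):
    """Load only the first `percent` of a split (hypothetical helper)."""
    # Imported lazily so the helper above works without the third-party package.
    from datasets import load_dataset

    return load_dataset(path, split=percent_split(split, percent))


tenth_of_train = percent_split("train", 10)
```

The resulting dataset could then be converted to documents row by row, so only the requested fraction is ever turned into LangChain documents.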