
[Feature Request] add load/save function in StaticDataset

Open Lawhy opened this issue 8 months ago • 4 comments

Required prerequisites

  • [x] I have searched the Issue Tracker and Discussions that this hasn't already been reported. (+1 or comment there if it has.)
  • [ ] Consider asking first in a Discussion.

Motivation

Requested for the Loong project: please add load and save functions to StaticDataset. Loading already appears to be supported, so the missing piece is a save function that writes either to local disk or to the Hugging Face Hub. This would help the Loong teams align on a single dataset format. It may also be worth considering a standalone function such as load_dataset or load_loong_dataset, mirroring the Hugging Face convention.

Please also take care of the JSON serialization problem.

Solution

Possible reference code: https://github.com/camel-ai/loong_private/blob/master/domain/logic/loong_logic/data.py

Alternatives

No response

Additional context

No response

Lawhy avatar Apr 06 '25 12:04 Lawhy

hey @Lawhy the link is private or unavailable

JINO-ROHIT avatar Apr 06 '25 13:04 JINO-ROHIT

Hi @JINO-ROHIT, thanks for mentioning that. I was thinking this should be tackled by someone in the Loong project. But anyway, I will share the relevant code here:

from camel.datasets import StaticDataset
from datasets import load_dataset
import json

def load_loong_dataset(dataset_path: str):
    """Load loong dataset.

    Args:
        dataset_path (str): Path to the dataset.

    Returns:
        StaticDataset: The loaded dataset.
    """
    # Note: load_dataset may parse entries such as `data_created` into datetime objects
    return StaticDataset(load_dataset("json", data_files=dataset_path)["train"])


def save_loong_dataset(dataset: StaticDataset, dataset_path: str):
    """Save loong dataset.

    Args:
        dataset (StaticDataset): The dataset to save.
        dataset_path (str): Path to save the dataset.
    """
    with open(dataset_path, "w") as f:
        for dp in dataset:
            # load_loong_dataset may turn date strings into datetime objects, which need to be converted back
            dp = dp.to_dict() if not isinstance(dp, dict) else dp
            # TODO: to take care of serialisation problem
            f.write(json.dumps(dp) + "\n")
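
One possible way to handle the serialisation TODO above, sketched with the standard library only: pass a `default` hook to `json.dumps` so that date and datetime values are written as ISO 8601 strings instead of raising a TypeError. The helper names `json_default` and `to_jsonl_line` and the example field names are hypothetical; they are not part of CAMEL or the Loong code.

```python
import json
from datetime import date, datetime
from typing import Any


def json_default(value: Any) -> str:
    """Fallback for json.dumps: write date/datetime values as ISO 8601 strings."""
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    # Anything else is still rejected, matching json's default behaviour.
    raise TypeError(
        f"Object of type {type(value).__name__} is not JSON serializable"
    )


def to_jsonl_line(record: dict) -> str:
    """Serialize one record as a single JSON Lines row (hypothetical helper)."""
    return json.dumps(record, default=json_default) + "\n"


# Example record with a datetime value, as load_dataset might produce.
example = {"question": "1 + 1?", "created": datetime(2025, 4, 6, 12, 4)}
line = to_jsonl_line(example)
```

Inside save_loong_dataset, the `f.write(json.dumps(dp) + "\n")` call could then become `f.write(to_jsonl_line(dp))`. Note that the dates are stored as strings, so a later load may parse them back into datetime objects again.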

Lawhy avatar Apr 06 '25 13:04 Lawhy

oh okay alright, no worries then.

JINO-ROHIT avatar Apr 06 '25 13:04 JINO-ROHIT

Hi @JINO-ROHIT. You are welcome to join the project as well if you are interested in it.

lightaime avatar Apr 06 '25 15:04 lightaime