feat: add from_parquet to dataloader
A parquet file loader would be a convenient addition to the DataLoader class. I particularly like that parquet files preserve types better than CSV files.
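To make the type-preservation point concrete, here is a small sketch with pandas (the file names are illustrative and the parquet write needs pyarrow or fastparquet; none of this is part of the PR itself):

    import pandas as pd

    # String IDs with leading zeros: a CSV round trip re-infers them as integers,
    # while parquet stores the dtype in the file and preserves it on read.
    df = pd.DataFrame({"id": ["001", "002"], "score": [0.5, 1.0]})
    df.to_csv("demo.csv", index=False)
    df.to_parquet("demo.parquet")  # assumes pyarrow or fastparquet is installed

    print(pd.read_csv("demo.csv")["id"].tolist())          # [1, 2], strings lost
    print(pd.read_parquet("demo.parquet")["id"].tolist())  # ['001', '002']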
The primary change is the addition of the from_parquet() function:
    from datasets import load_dataset
    import dspy

    def from_parquet(self, file_path: str, fields: list[str] | None = None, input_keys: tuple[str, ...] = ()) -> list[dspy.Example]:
        dataset = load_dataset("parquet", data_files=file_path)["train"]
        # Default to every column in the file when no fields are given.
        if not fields:
            fields = list(dataset.features)
        # Unpack input_keys so each key is marked as an input individually.
        return [dspy.Example({field: row[field] for field in fields}).with_inputs(*input_keys) for row in dataset]
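For reference, a minimal usage sketch, assuming the method lands on dspy's DataLoader; the train.parquet file and its question/answer columns are hypothetical:

    from dspy.datasets import DataLoader

    dl = DataLoader()
    # Load every row, keep two columns, and mark "question" as the input field.
    examples = dl.from_parquet("train.parquet", fields=["question", "answer"], input_keys=("question",))
    print(examples[0].question)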
The rest of the changes were made automatically by the Ruff pre-commit hook.
There might be a conflict to be resolved, but other than that this looks good!
Happy to merge once there are no conflicts! Thank you @JamesHWade!
should be good to go. thanks!
@JamesHWade There still seem to be some conflicts. Aside from that, any reason to remove the typing types in favor of the builtin Python types?
I don't oppose the idea as such, just wanted to know the reasoning behind it. Thanks for contributing!
I'll fix that. There was no real reason for the type change; Ruff made it automatically via the pre-commit hook, but I'm happy to revert it.
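For anyone following along, a minimal sketch of the two styles in question (function names are illustrative, not from the PR):

    from typing import List, Optional

    # typing-module generics (the older style)
    def first_or_default(xs: List[str], default: Optional[str] = None) -> Optional[str]:
        return xs[0] if xs else default

    # builtin Python generics (PEP 585/604, requires Python 3.10+),
    # which Ruff's pyupgrade rules rewrite to automatically
    def first_or_default_builtin(xs: list[str], default: str | None = None) -> str | None:
        return xs[0] if xs else default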
Should be merge-ready now. I just skipped pre-commit this time.
No worries, thanks for informing! I actually have no issues with python types either!
Some tests seem to be failing on this though. Will take a look soon!
I think it was a temporary network error. It failed to checkout the repo for testing.
Very odd. Here is what I get if I try to merge into main on my fork:
Sorry for the trouble with this PR. For reasons unclear to me, the checkout@v4 action is looking for my branch within this repo. I'm not sure how to fix it other than to abandon this PR and create a new one. @krypticmouse, LMK if that sounds good to you and I'll create that quickly.
@JamesHWade just following up on this PR. Feel free to close this one and open a new PR without conflicts! Thanks
Closing to resubmit as separate PR.