dspy icon indicating copy to clipboard operation
dspy copied to clipboard

feat: add from_parquet to dataloader

Open JamesHWade opened this issue 11 months ago • 10 comments

A parquet file loader would be a convenient addition to the dataloader class. I particularly like that parquet files preserve types better than a csv file.

The primary change is the addition of the from_parquet() function:

def from_parquet(self, file_path: str, fields: list[str] = None, input_keys: tuple[str] = ()) -> list[dspy.Example]:
        dataset = load_dataset("parquet", data_files=file_path)["train"]

        if not fields:
            fields = list(dataset.features)

        return [dspy.Example({field: row[field] for field in fields}).with_inputs(input_keys) for row in dataset]

The rest of the changes were caused by the ruff precommit hook.

JamesHWade avatar Feb 27 '24 22:02 JamesHWade

There might be a conflict to be resolved but other than that this looks good!

krypticmouse avatar Mar 02 '24 21:03 krypticmouse

happy to merge once no conflict! thank you @JamesHWade !

okhat avatar Mar 02 '24 23:03 okhat

should be good to go. thanks!

JamesHWade avatar Mar 03 '24 02:03 JamesHWade

@JamesHWade There seems to be some conflicts still. Aside from that, any reason to remove typing types in favor of python type?

I don't oppose the idea as such, just wanted to know the reasoning behind it. Thanks for contributing!

krypticmouse avatar Mar 04 '24 14:03 krypticmouse

I'll fix. No reason for the type change. They were added by Ruff automatically with the precommit hook, but happy to revert to python types.

JamesHWade avatar Mar 04 '24 14:03 JamesHWade

Should be merge-ready now. I just skipped pre-commit this time.

JamesHWade avatar Mar 04 '24 14:03 JamesHWade

No worries, thanks for informing! I actually have no issues with python types either!

Some tests seem to be failing on this though. Will take a look soon!

krypticmouse avatar Mar 04 '24 17:03 krypticmouse

I think it was a temporary network error. It failed to checkout the repo for testing.

JamesHWade avatar Mar 04 '24 18:03 JamesHWade

Very odd. Here is what I get if I try to merge into main on my fork:

image

JamesHWade avatar Mar 04 '24 19:03 JamesHWade

Sorry for the trouble with this PR. For reasons unclear to me, the checkout@v4 action is look for my branch within this repo. I'm not sure how to fix it other than to abandon this PR and create a new one. @krypticmouse, LMK if that sounds good to you and I'll create that quickly.

JamesHWade avatar Mar 04 '24 23:03 JamesHWade

@JamesHWade just following up on this PR. feel free to close this and open the new one without conflicts! Thanks

arnavsinghvi11 avatar Mar 24 '24 20:03 arnavsinghvi11

Closing to resubmit as separate PR.

JamesHWade avatar Mar 25 '24 00:03 JamesHWade