feat: add from_parquet to dataloader
A parquet file loader would be a convenient addition to the DataLoader class. I particularly like that parquet files preserve types better than CSV files.
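To make the type-preservation point concrete, here is a small sketch with pandas (the file names are illustrative and the parquet write needs pyarrow or fastparquet; none of this is part of the PR itself):

    import pandas as pd

    # String IDs with leading zeros: a CSV round trip re-infers them as integers,
    # while parquet stores the dtype in the file and preserves it on read.
    df = pd.DataFrame({"id": ["001", "002"], "score": [0.5, 1.0]})
    df.to_csv("demo.csv", index=False)
    df.to_parquet("demo.parquet")  # assumes pyarrow or fastparquet is installed

    print(pd.read_csv("demo.csv")["id"].tolist())          # [1, 2], strings lost
    print(pd.read_parquet("demo.parquet")["id"].tolist())  # ['001', '002']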
The primary change is the addition of the from_parquet() function:
    from datasets import load_dataset
    import dspy

    def from_parquet(self, file_path: str, fields: list[str] | None = None, input_keys: tuple[str, ...] = ()) -> list[dspy.Example]:
        dataset = load_dataset("parquet", data_files=file_path)["train"]
        # Default to every column in the file when no fields are given.
        if not fields:
            fields = list(dataset.features)
        # Unpack input_keys so each key is marked as an input individually.
        return [dspy.Example({field: row[field] for field in fields}).with_inputs(*input_keys) for row in dataset]
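For reference, a minimal usage sketch, assuming the method lands on dspy's DataLoader; the train.parquet file and its question/answer columns are hypothetical:

    from dspy.datasets import DataLoader

    dl = DataLoader()
    # Load every row, keep two columns, and mark "question" as the input field.
    examples = dl.from_parquet("train.parquet", fields=["question", "answer"], input_keys=("question",))
    print(examples[0].question)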
The rest of the changes were made automatically by the Ruff pre-commit hook.
There might be a conflict to be resolved, but other than that this looks good!
Happy to merge once there are no conflicts! Thank you @JamesHWade!
should be good to go. thanks!
@JamesHWade There still seem to be some conflicts. Aside from that, any reason to remove the typing types in favor of the builtin Python types?
I don't oppose the idea as such, just wanted to know the reasoning behind it. Thanks for contributing!
I'll fix that. There was no real reason for the type change; Ruff made it automatically via the pre-commit hook, but I'm happy to revert it.
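For anyone following along, a minimal sketch of the two styles in question (function names are illustrative, not from the PR):

    from typing import List, Optional

    # typing-module generics (the older style)
    def first_or_default(xs: List[str], default: Optional[str] = None) -> Optional[str]:
        return xs[0] if xs else default

    # builtin Python generics (PEP 585/604, requires Python 3.10+),
    # which Ruff's pyupgrade rules rewrite to automatically
    def first_or_default_builtin(xs: list[str], default: str | None = None) -> str | None:
        return xs[0] if xs else default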
Should be merge-ready now. I just skipped pre-commit this time.
No worries, thanks for informing! I actually have no issues with python types either!
Some tests seem to be failing on this though. Will take a look soon!
I think it was a temporary network error. It failed to checkout the repo for testing.
Very odd. Here is what I get if I try to merge into main on my fork:
Sorry for the trouble with this PR. For reasons unclear to me, the checkout@v4 action is looking for my branch within this repo. I'm not sure how to fix it other than to abandon this PR and create a new one. @krypticmouse, LMK if that sounds good to you and I'll create that quickly.
@JamesHWade just following up on this PR. Feel free to close this one and open a new PR without conflicts! Thanks
Closing to resubmit as separate PR.