gradio
gradio copied to clipboard
Add a flagging callback to save json files to a hugging face dataset
Description
Based on issue #1676 I have created the HuggingFaceDatasetJSONSaver
class which saves the files as JSONL.
Specifically, for each flagged sample:
- I create a unique ID (a hash of random numbers and strings) and create a folder with the name of the ID. In the code I call the new folder
folder_name
. - Save the files (images, audio) inside
folder_name
- Save the other details (output, numbers, etc) in a
metadata.jsonl
file inside thefolder_name
folder.
Advantages of this:
- The major advantage is that we bypass the need to read and write to one CSV. Where the advantage of this is useful is if there are three users on their devices simultaneously flagging a sample. With CSV there would be an error because there can't be more than one simultaneous edit to a CSV file. But this way I propose enables parallel flagging.
- any additional dependencies that are required for this change.
- no additional dependencies are required. I tried my best to make sure I leverage functions from the original
HuggingFaceDatasetSaver
.
Closes: #1676
Checklist:
- [ X] I have performed a self-review of my own code
- [X] My code follows the style guidelines of this project
- [X] I have commented my code in hard-to-understand areas
- [X] I have made corresponding changes to the documentation
- [X] I have added tests that prove my fix is effective or that my feature works
- [X] New and existing unit tests pass locally with my changes
Hi @chrisemezue,
This is very cool, two quick high-level questions:
- Can you run the formatter on your code? That way, the CI won't complain about the formatting. The easiest way is to run this script:
bash scripts/format_backend.sh
- Just to confirm, is this way of saving data into a HuggingFace Dataset (having folders for each sample) compatible with the Dataset previewer on the Hub?
I think the title of this PR is a bit misleading. HuggingFaceDatasetSaver
will still not work in parallel afaik. For me it's a bit confusing if this is saving a data.csv
or a json based on the name (HuggingFaceDatasetJSONSaver
). I see log_file
uses csv
but that does not seem to be used anywhere. If it's a csv, should we directly update HuggingFaceDatasetJSONSaver
?
I think the title of this PR is a bit misleading.
HuggingFaceDatasetSaver
will still not work in parallel afaik. For me it's a bit confusing if this is saving adata.csv
or a json based on the name (HuggingFaceDatasetJSONSaver
). I seelog_file
usescsv
but that does not seem to be used anywhere. If it's a csv, should we directly updateHuggingFaceDatasetJSONSaver
?
Thanks @osanseviero for your feedback
- I will remove the
self.log_file
and other redundant variables not used. - the class
HuggingFaceDatasetJSONSaver
is just saving the flagged samples to a jsonl format instead of csv. Is there a better name you suggest? - I changed the name of PR to the name of its issue. If you have a better easier to understand suggestion I will love to hear.
Hey @chrisemezue, thank you! I was a bit confused by the mentions of csvs, but now that you mention it's a json
then it's great! Thanks!
Please let us know whenever this is ready for review :)
@osanseviero I am done now. Ready for review.
Hi @chrisemezue this looks really good! I left some suggestions / clarification questions in the PR, but once these are addressed, we should be good to merge
Pushed some changes which should fix the tests. As discussed over Slack, we just have a couple of minor fixes, and then we should be good to merge!
Thanks so much @chrisemezue for making the PR and addressing the suggestions! And thanks all for reviewing.
LGTM -- will merge in after the tests run
I'll resolve the conflicts and fix the last few suggestions you made @osanseviero, so that we can get this merged in.
Thanks a bunch @chrisemezue!