gradio icon indicating copy to clipboard operation
gradio copied to clipboard

Add a flagging callback to save json files to a hugging face dataset

Open chrisemezue opened this issue 1 year ago • 8 comments

Description

Based on issue #1676 I have created the HuggingFaceDatasetJSONSaver class which saves the files as JSONL.

Specifically, for each flagged sample:

  • I create a unique ID (a hash of random numbers and strings) and create a folder with the name of the ID. In the code I call the new folder folder_name.
  • Save the files (images, audio) inside folder_name
  • Save the other details (output, numbers, etc) in a metadata.jsonl file inside the folder_name folder.

Advantages of this:

  • The major advantage is that we bypass the need to read and write to one CSV. Where the advantage of this is useful is if there are three users on their devices simultaneously flagging a sample. With CSV there would be an error because there can't be more than one simultaneous edit to a CSV file. But this way I propose enables parallel flagging.
  • any additional dependencies that are required for this change.
  • no additional dependencies are required. I tried my best to make sure I leverage functions from the original HuggingFaceDatasetSaver.

Closes: #1676

Checklist:

  • [ X] I have performed a self-review of my own code
  • [X] My code follows the style guidelines of this project
  • [X] I have commented my code in hard-to-understand areas
  • [X] I have made corresponding changes to the documentation
  • [X] I have added tests that prove my fix is effective or that my feature works
  • [X] New and existing unit tests pass locally with my changes

chrisemezue avatar Jul 18 '22 22:07 chrisemezue

Hi @chrisemezue,

This is very cool, two quick high-level questions:

  1. Can you run the formatter on your code? That way, the CI won't complain about the formatting. The easiest way is to run this script:
bash scripts/format_backend.sh
  1. Just to confirm, is this way of saving data into a HuggingFace Dataset (having folders for each sample) compatible with the Dataset previewer on the Hub?

abidlabs avatar Jul 19 '22 03:07 abidlabs

I think the title of this PR is a bit misleading. HuggingFaceDatasetSaver will still not work in parallel afaik. For me it's a bit confusing if this is saving a data.csv or a json based on the name (HuggingFaceDatasetJSONSaver). I see log_file uses csv but that does not seem to be used anywhere. If it's a csv, should we directly update HuggingFaceDatasetJSONSaver?

osanseviero avatar Jul 19 '22 11:07 osanseviero

I think the title of this PR is a bit misleading. HuggingFaceDatasetSaver will still not work in parallel afaik. For me it's a bit confusing if this is saving a data.csv or a json based on the name (HuggingFaceDatasetJSONSaver). I see log_file uses csv but that does not seem to be used anywhere. If it's a csv, should we directly update HuggingFaceDatasetJSONSaver?

Thanks @osanseviero for your feedback

  • I will remove the self.log_file and other redundant variables not used.
  • the class HuggingFaceDatasetJSONSaver is just saving the flagged samples to a jsonl format instead of csv. Is there a better name you suggest?
  • I changed the name of PR to the name of its issue. If you have a better easier to understand suggestion I will love to hear.

chrisemezue avatar Jul 19 '22 17:07 chrisemezue

Hey @chrisemezue, thank you! I was a bit confused by the mentions of csvs, but now that you mention it's a json then it's great! Thanks!

osanseviero avatar Jul 19 '22 20:07 osanseviero

Please let us know whenever this is ready for review :)

osanseviero avatar Jul 21 '22 10:07 osanseviero

@osanseviero I am done now. Ready for review.

chrisemezue avatar Jul 25 '22 20:07 chrisemezue

@osanseviero here is an example of a small dataset with this flagging callback.

chrisemezue avatar Aug 08 '22 19:08 chrisemezue

Hi @chrisemezue this looks really good! I left some suggestions / clarification questions in the PR, but once these are addressed, we should be good to merge

abidlabs avatar Aug 11 '22 05:08 abidlabs

Pushed some changes which should fix the tests. As discussed over Slack, we just have a couple of minor fixes, and then we should be good to merge!

abidlabs avatar Aug 11 '22 18:08 abidlabs

Thanks so much @chrisemezue for making the PR and addressing the suggestions! And thanks all for reviewing.

LGTM -- will merge in after the tests run

abidlabs avatar Aug 12 '22 03:08 abidlabs

I'll resolve the conflicts and fix the last few suggestions you made @osanseviero, so that we can get this merged in.

Thanks a bunch @chrisemezue!

abidlabs avatar Aug 23 '22 22:08 abidlabs