new task for tabular data synthesis?

Open peter-sk opened this issue 2 years ago • 8 comments

Generating synthetic data is gaining increasing attention, particularly in areas such as health data, where data sharing is restricted by strict data protection laws (for good reasons).

I propose to add data synthesis as a task, starting from tabular data and later expanding to other types of data.

I am more than happy to contribute by programming, and my research group would also contribute with models, datasets, and a library for synthetic data generation that can train models with datasets and generate synthetic data.

What is the process from here? And what should I consider/be aware of?

peter-sk avatar Mar 28 '23 16:03 peter-sk

I'm moving the issue to hub-docs as it's not related to huggingface_hub (the Python client) but more generally to the 🤗 Hub.

@osanseviero any opinion on the issue itself? (cc @adrinjalali working on tabular data also)

Wauplin avatar Mar 28 '23 16:03 Wauplin

Thanks, @Wauplin! I guess the documentation for how to add tasks needs to be updated, too. It currently says to open an issue with the huggingface_hub repository.

@osanseviero @adrinjalali Do you agree that a new task is needed? If not, which existing tasks subsume this? If yes, let's figure out how best to add it.

My research group and I are motivated to not only add the tasks but also to contribute a library and a thriving ecosystem for data synthesis models and datasets.

peter-sk avatar Mar 28 '23 17:03 peter-sk

Any thoughts on this one?

peter-sk avatar Mar 31 '23 07:03 peter-sk

Hi there! I don't have a strong opinion, so I would love to hear what @merveenoyan has to say.

In general, we always welcome new tasks given that:

  • They don't fall into one of the existing tasks
  • They have the right level of granularity
  • They will lead to a significant number of models/datasets. I.e. we avoid having a new task that will only have 5 models, as that might make for a poor user experience and could confuse users

Having tabular-generation or tabular-synthesis sounds like the right level of granularity and does not fall within the existing tasks. My main concern is the number of models/datasets. Hence this part sounded quite interesting:

"My research group and I are motivated to not only add the tasks but also to contribute a library and a thriving ecosystem for data synthesis models and datasets."

Would love to hear more about this! If you want, you can already start uploading some models and datasets and once we have some, we can add a tag for discovering these!

osanseviero avatar Mar 31 '23 17:03 osanseviero

Thanks for the feedback. We will start by uploading some models and datasets :-)

peter-sk avatar Apr 03 '23 09:04 peter-sk

@peter-sk hello 👋 do you mean models like CTGAN that generate tabular data when you talk about this task? if so, it would be nice to first see some models, like @osanseviero said. Then we can add them to the ecosystem in general, i.e. have people filter them like this, maybe enable a widget so people can try them right away, and so forth. are there any specific libraries used for this task? (I only know of CTGAN)

merveenoyan avatar Apr 03 '23 09:04 merveenoyan

Yes, I mean models like CTGAN. Our group has just completed two systematic comprehensive reviews of the field, and there is a zoo of models and model-based generative tools out there. Just a few examples - there are many more: HealthGAN, ADS-GAN, CTGAN, DPGAN, synthpop, IVEware, PrivBayes, DataSynthesizer

We are planning to create a meta framework/library making as many of these available as possible through a common interface and with common evaluation metrics.
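To make the idea of a common interface with common evaluation metrics a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration only: `TabularSynthesizer`, `IndependentMarginals`, and `category_coverage` are hypothetical names, not part of any existing library, and the toy model merely resamples each column independently rather than wrapping CTGAN or any of the tools listed above.

```python
import random
from abc import ABC, abstractmethod
from typing import Dict, List

Row = Dict[str, object]


class TabularSynthesizer(ABC):
    """Hypothetical common interface; each wrapped tool would implement it."""

    @abstractmethod
    def fit(self, rows: List[Row]) -> "TabularSynthesizer":
        """Learn from real rows and return self."""

    @abstractmethod
    def sample(self, n: int) -> List[Row]:
        """Generate n synthetic rows."""


class IndependentMarginals(TabularSynthesizer):
    """Toy baseline (assumption): resamples every column independently,
    so correlations between columns are deliberately ignored."""

    def __init__(self, seed: int = 0) -> None:
        self._rng = random.Random(seed)
        self._columns: Dict[str, List[object]] = {}

    def fit(self, rows: List[Row]) -> "IndependentMarginals":
        if not rows:
            raise ValueError("need at least one row to fit")
        # Store the observed values of each column.
        self._columns = {name: [row[name] for row in rows] for name in rows[0]}
        return self

    def sample(self, n: int) -> List[Row]:
        return [
            {name: self._rng.choice(values) for name, values in self._columns.items()}
            for _ in range(n)
        ]


def category_coverage(real: List[Row], synthetic: List[Row], column: str) -> float:
    """Hypothetical shared metric: fraction of distinct real values in
    `column` that also occur in the synthetic data."""
    real_values = {row[column] for row in real}
    synthetic_values = {row[column] for row in synthetic}
    return len(real_values & synthetic_values) / len(real_values)


real = [{"age": 30, "sex": "F"}, {"age": 45, "sex": "M"}, {"age": 52, "sex": "F"}]
synthetic = IndependentMarginals(seed=1).fit(real).sample(50)
print(category_coverage(real, synthetic, "sex"))
```

The point of the sketch is only the shape: every model exposes the same `fit`/`sample` pair, so a metric such as `category_coverage` can be applied to any of them without model-specific code.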

peter-sk avatar Apr 03 '23 10:04 peter-sk

@peter-sk great! we could also add that library to the library filter on the Hub 😊

merveenoyan avatar Apr 03 '23 11:04 merveenoyan