huggingface.js
New task for tabular data synthesis?
Generating synthetic data is gaining increased attention, particularly in areas such as health data, where data sharing is restricted by strict data protection laws (for good reasons).
I propose to add data synthesis as a task, starting from tabular data and later expanding to other types of data.
I am more than happy to contribute by programming, and my research group would also contribute with models, datasets, and a library for synthetic data generation that can train models with datasets and generate synthetic data.
What is the process from here? And what should I consider/be aware of?
I'm moving the issue to hub-docs as it's not related to huggingface_hub (the Python client) but concerns the 🤗 Hub more generally.
@osanseviero any opinion on the issue itself? (cc @adrinjalali working on tabular data also)
Thanks, @Wauplin! I guess the documentation for how to add tasks needs to be updated, too. It currently says to open an issue with the huggingface_hub repository.
@osanseviero @adrinjalali Do you agree that a new task is needed? If not, what other tasks are subsuming this? If yes, let's figure out how to best add it.
My research group and I are motivated to not only add the tasks but also to contribute a library and a thriving ecosystem for data synthesis models and datasets.
Any thoughts on this one?
Hi there! I don't have a strong opinion, so I would love to hear what @merveenoyan has to say.
In general, we always welcome new tasks given that:
- They don't fall into one of the existing tasks
- They have the right level of granularity
- They will lead to a significant number of models/datasets, i.e. we avoid adding a new task that will only have 5 models, as that might make for a poor user experience and could confuse users
Having tabular-generation or tabular-synthesis sounds like the right level of granularity and does not fall within the existing tasks. My main concern is the number of models/datasets. Hence this part sounded quite interesting:
> My research group and I are motivated to not only add the tasks but also to contribute a library and a thriving ecosystem for data synthesis models and datasets.
Would love to hear more about this! If you want, you can already start uploading some models and datasets and once we have some, we can add a tag for discovering these!
Thanks for the feedback. We will start by uploading some models and datasets :-)
@peter-sk hello 👋 Do you mean models like CTGAN that generate tabular data when you talk about this task? If so, it would be nice to first see some models, like @osanseviero said. Then we can add them to the ecosystem in general, i.e. have people filter them like this, maybe enable a widget so people can try them right away, and so forth. Are there any specific libraries used for this task? (I only know of CTGAN)
Yes, I mean models like CTGAN. Our group has just completed two systematic comprehensive reviews of the field, and there is a zoo of models and model-based generative tools out there. Just a few examples - there are many more: HealthGAN, ADS-GAN, CTGAN, DPGAN, synthpop, IVEware, PrivBayes, DataSynthesizer
We are planning to create a meta framework/library making as many of these available as possible through a common interface and with common evaluation metrics.
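To make the idea concrete, here is a minimal sketch of what such a common interface with a shared metric could look like. Everything in it is hypothetical and for illustration only: the names `Synthesizer`, `IndependentMarginals`, and `marginal_tv_distance` are not taken from any existing library, and the toy baseline is a stand-in for real models such as CTGAN, not an implementation of any of them.

```python
import random
from abc import ABC, abstractmethod
from collections import Counter


class Synthesizer(ABC):
    """Hypothetical common interface: every wrapped model exposes fit() and sample()."""

    @abstractmethod
    def fit(self, rows):
        """Learn from a list of rows, each a dict mapping column name to value."""

    @abstractmethod
    def sample(self, n, seed=None):
        """Return n synthetic rows in the same dict format."""


class IndependentMarginals(Synthesizer):
    """Toy baseline (an assumption for this sketch): draws each column
    independently from its observed values, ignoring all correlations."""

    def fit(self, rows):
        if not rows:
            raise ValueError("need at least one row to fit")
        # Store the observed values per column.
        self.columns_ = {col: [row[col] for row in rows] for col in rows[0]}
        return self

    def sample(self, n, seed=None):
        rng = random.Random(seed)
        return [
            {col: rng.choice(values) for col, values in self.columns_.items()}
            for _ in range(n)
        ]


def marginal_tv_distance(real, synthetic, column):
    """Shared metric: total variation distance between the empirical
    distributions of one categorical column (0 = identical, 1 = disjoint)."""
    p = Counter(row[column] for row in real)
    q = Counter(row[column] for row in synthetic)
    return 0.5 * sum(
        abs(p[v] / len(real) - q[v] / len(synthetic)) for v in set(p) | set(q)
    )


# Usage: any model behind the interface can be scored with the same metric.
real = [
    {"sex": "F", "smoker": "no"},
    {"sex": "M", "smoker": "yes"},
    {"sex": "F", "smoker": "no"},
    {"sex": "M", "smoker": "no"},
]
synthetic = IndependentMarginals().fit(real).sample(100, seed=0)
distance = marginal_tv_distance(real, synthetic, "sex")
```

The point of the design is that evaluation code only depends on `fit` and `sample`, so different generators could be compared under identical metrics.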
@peter-sk great! We could also add that library to the library filter on the Hub 😊