Default config override
@galtay
Proposal for making `DEFAULT_CONFIG_NAME` an override: as opposed to creating a new unique class that inherits from `BigBioConfig`, one option we can use is to make a partial function that carries the default schema name and config name.
As mentioned in #246, when `data_dir` is passed, a brand new config is created from `BUILDER_CONFIG_CLASS`; as this is usually set to `BigBioConfig`, this means that all necessary terms (name + schema, essentially) are `None`.
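To make the failure mode concrete, here is a minimal sketch of the triggering call; the loader path and data directory are placeholders:

```python
from datasets import load_dataset

# Hypothetical local loader script and data directory.
# Because only data_dir is passed, `datasets` builds a fresh config
# from BUILDER_CONFIG_CLASS, leaving name and schema as None.
data = load_dataset("some_dataloader.py", data_dir="/path/to/local/data")
```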
The solution seems a bit unsatisfactory - all local datasets will need the statement `from functools import partial`, whereas this is not a concern for non-local datasets, which can automatically peek at the right builder config. I'm not sure of a good workaround given the logic here (which seems to suggest that `config_kwargs` is populated with `data_dir` and thus never goes through the default-setting loop).
I was a little stuck on a way to inherit the default config name without declaring a new class. This basically does that, albeit more modularly. Ideally, we would use `DEFAULT_CONFIG_NAME` to point to the right builder class.
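A minimal sketch of the partial-function approach, assuming a simplified `BigBioConfig`; the dataset name and default values below are placeholders, not taken from the codebase:

```python
from dataclasses import dataclass
from functools import partial

import datasets


@dataclass
class BigBioConfig(datasets.BuilderConfig):
    """Simplified stand-in for the real BigBioConfig."""

    name: str = None
    schema: str = None
    subset_id: str = None


# Bake the defaults into the config constructor itself. When `datasets`
# builds a brand new config (e.g. because only data_dir was passed),
# the resulting object has a usable name and schema instead of None.
DefaultConfig = partial(
    BigBioConfig,
    name="example_dataset_source",
    schema="source",
    subset_id="example_dataset",
)


class ExampleDatasetBuilder(datasets.GeneratorBasedBuilder):
    # A partial is callable just like a class, so it can stand in for
    # BUILDER_CONFIG_CLASS when datasets instantiates a fresh config.
    BUILDER_CONFIG_CLASS = DefaultConfig
    DEFAULT_CONFIG_NAME = "example_dataset_source"
```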
putting some discussion / comments here. this problem concerns the ability to set a default config when extra kwargs are passed to `load_dataset`. it seems we have 3 choices on the table:
- just make `name` a required kwarg by forcing it to be present in the `__post_init__` method of `BigBioConfig` (sketched after this list)
- use the solution implemented here (a partial function that puts in default args)
- have each dataset_loader implement its own subclass of `BigBioConfig` with whatever defaults they want (also sketched after this list)
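For reference, a rough sketch of what (1) and (3) could look like; the two are independent options shown together for brevity, and the class and default values are illustrative, not from the codebase:

```python
from dataclasses import dataclass

import datasets


# Option (1): make `name` effectively required by validating it in
# __post_init__ (simplified stand-in for the real BigBioConfig).
@dataclass
class BigBioConfig(datasets.BuilderConfig):
    name: str = None
    schema: str = None
    subset_id: str = None

    def __post_init__(self):
        if self.name is None:
            raise ValueError("BigBioConfig requires an explicit `name`")


# Option (3): each dataloader script ships its own subclass that
# carries whatever defaults it wants.
@dataclass
class ExampleDatasetConfig(BigBioConfig):
    name: str = "example_dataset_source"
    schema: str = "source"
    subset_id: str = "example_dataset"
```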
I think I'm leaning towards 3 at the moment b/c it will work as a stand-alone object when we eventually move some of these to the hub, and it's an easier-to-understand implementation b/c most of the code stays in the dataloader script
Granted, I may not understand all the implications of this, but to my eyes the easiest solution would be to make `name` required.
I actually talked with @hakunanatasha about this, but I was too pissed at my mic not working to remember the details of why it may not be a good solution (requiring `name`).
The problem I see with (3) is that we would need to thoroughly edit all merged datasets, right? I am thinking of the scenario where someone in the future wants to add a dataset and picks as an example one which does not do the subclassing.
@galtay @sg-wbi I don't think (3) is scalable; (2) inherits all the expected defaults of the `source` config, which is a programmatic option with consistent behavior. (1) and (2) seem like more consistent methods to me. Both (1) and (2) will require changes to every single dataset script, since we will need to remove the default-class option from all datasets that provide it in order to maintain the expected behavior.

IMO I mildly prefer (2), since users may want to just peer into the data ASAP. For (1), we would need to change the expected use case to the following:
```python
from datasets import load_dataset
from bigbio.utils import find_my_configs  # this function does not exist yet

configs = find_my_configs("some_dataloader.py")  # spits out a list of all available configs
data = load_dataset("some_dataloader", name=configs[0])  # or the user looks explicitly at configs and picks the one they want
```
The current default for `datasets` is to be able to do the following:

```python
from datasets import load_dataset

data = load_dataset("some_dataloader")
```
@debajyotidatta is working on a visualizer, which can streamline this behavior, but my preference is to give users the quickest access to the information and let them explore more details if they need to.
Ok, this is why requiring `name` was a bad solution. I'm with @hakunanatasha on this. I would go for (2) as well. Let's just make the name more pythonic :).