biomedical icon indicating copy to clipboard operation
biomedical copied to clipboard

Default config override

Open hakunanatasha opened this issue 2 years ago • 4 comments

@galtay

Proposal for making DEFAULT_CONFIG_NAME override; as opposed to creating a new unique class that inherits the bigbioconfig, one option we can use is to make a partial function that inherits the default schema name and config name.

As mentioned in #246, when data_dir is passed, a brand new config is created from 'BUILDER_CONFIG_CLASS'; as this is usually set to BigBioConfig, this means that all necessary terms (name + schema, essentially) are None.

The solution seems a bit unsatisfactory - all local datasets will need the statement from functools import partial as this will not be a concern for local datasets, that automatically can peek at the right builder config. I'm not sure a good workaround given this logic here (that seems to suggest that config_kwargs is populated with data_dir and thus will not go through setting the default loop).

I was a little stuck on a way to inherit the default config name without declaring a new form. This basically does that, albeit more modularly. Ideally, we would use DEFAULT_CONFIG_NAME to point to the right builder class.

hakunanatasha avatar Apr 28 '22 02:04 hakunanatasha

putting some discussion / comments here. this problem concerns the ability to set a default config when extra kwargs are passed to load_dataset. it seems we have 3 choices on the table,

  1. just make name a required kwarg by forcing it to be present in __post_init__ method of BigBioConfig
  2. use the solution implemented here (a partial function that puts in default args)
  3. have each dataset_loader implement its own subclass of BigBioConfig with whatever defaults they want

I think I'm leaning towards 3 at the moment b/c it will work as a stand alone object when we eventually move some of these to the hub and its an easier to understand implementation b/c most of the code stays in the dataloader script

galtay avatar Apr 30 '22 20:04 galtay

Provided that I may not understand all the implications of this, but to my eyes the easiest solution would be to make name required.

I actually talked with @hakunanatasha about this but I was too pissed at my mic not working to remember the details of why it may not be a good solution (requiring name).

The problem I see with 3. is that we would need to thoroughly edit all datasets merged, right? I am thinking of the scenario where someone in the future wants to add a datasets and it picks as an example a dataset which does not do the subclassing.

sg-wbi avatar May 02 '22 16:05 sg-wbi

@galtay @sg-wbi I don't think 3 is scalable; (2) inherits all the expected defaults of the source, which is a programmatic option that has consistent behavior. (1) and (2) seem like more consistent methods to me. Both (1) and (2) will require changes to every single dataset script, since we will need to remove the default_class option for all datasets that provide it to maintain expected behavior.

IMO I mildly prefer (2) since users may want to just peer into the data ASAP. For (1), we would need to change expected use case to the following:

from datasets import load_dataset
from bigbio.utils import find_my_configs  # this function does not exist yet

configs = find_my_configs("some_dataloader.py")   # spit out list of all available configs

data = load_dataset("some_dataloader", name=configs[0])  # or the user looks explicitly at configs and picks the one they want

The current default for datasets is to be able to do as follows:

from datasets import load_dataset

data = load_dataset("some_dataloader") 

@debajyotidatta is working on a visualizer, hence it can streamline behavior, but my preference is to allow users the most quick access to information, and let them explore more details if they need it.

hakunanatasha avatar May 04 '22 23:05 hakunanatasha

but my preference is to allow users the most quick access to information, and let them explore more details if they need it.

Ok this is way required name was a bad solution. I'm with @hakunanatasha on this. I would go for 2) as well. Let's just make the name more pythonic :).

sg-wbi avatar May 06 '22 07:05 sg-wbi