
Find a way to not load all the tasks infos.

Open thomasw21 opened this issue 3 years ago • 1 comments

Running from promptsource.seqio_tasks import tasks takes a huge amount of time. One of the main reasons is that the import queries all dataset infos: https://github.com/bigscience-workshop/promptsource/blob/dba1d41e63a7af883fd7dc2727b4c7fd03e714c9/promptsource/seqio_tasks/tasks.py#L84 This is problematic for two reasons:

  • One has to load ALL dataset infos as soon as one uses a single task.
  • Even when the infos are cached, it still queries URLs to check that nothing has changed. One can bypass this by setting HF_DATASETS_OFFLINE=1, as described in https://github.com/bigscience-workshop/promptsource/issues/703#issuecomment-1003061062
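For reference, a minimal sketch of the offline workaround from the second point. It assumes the flag is read when datasets is first imported, so it has to be set before that import; the helper name enable_hf_datasets_offline is made up for illustration.

```python
import os


def enable_hf_datasets_offline() -> None:
    """Set the offline flag. Hypothetical helper; call it before `datasets` is imported."""
    # Assumption: HF datasets reads this variable at import time, so setting it
    # later in the process may have no effect.
    os.environ["HF_DATASETS_OFFLINE"] = "1"


enable_hf_datasets_offline()

# Only after this point:
# from promptsource.seqio_tasks import tasks
```

Exporting the variable in the shell before launching Python should be equivalent. Note that this only avoids the URL checks; it does not address loading every dataset info.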

IMO both are unnecessary and should be fixed. Is there a reason why one cannot load seqio tasks dynamically, in the sense of fetching only what is necessary? Something along the lines of:

def add_seqio_task(task_name):
    seqio.TaskRegistry.add(...)

thomasw21 • Jan 03 '22 19:01

In order to use the module import functionality of seqio, importing the module needs to add the task you want to use to the task registry without calling any additional code. So, we either need to have a separate file for each task or change the underlying functionality in HF datasets.

craffel • Jan 04 '22 16:01