Move CSV reading, DataFrame reading, etc. into a "Dataset loader" class

Open nihit opened this issue 2 years ago • 6 comments

https://github.com/refuel-ai/autolabel/blob/81b5ff6a88a3d9d66a99a8fc493f41a9871d3547/src/autolabel/labeler.py#L59

https://github.com/refuel-ai/autolabel/blob/81b5ff6a88a3d9d66a99a8fc493f41a9871d3547/src/autolabel/labeler.py#L84

nihit avatar Jun 07 '23 22:06 nihit

@rajasbansal this will also help with adding support for JSONL

nihit avatar Jun 07 '23 22:06 nihit

We can add another function called _read_jsonl to read JSONL files. Here are some examples of JSONL data: https://jsonlines.org/examples/. This would let us read datasets with mixed types, for tasks like question answering where a field can be a list of options instead of just a string.
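A minimal sketch of what such a reader might look like, using only the standard library. The function name read_jsonl, its signature, and the sample records are assumptions for illustration, not the actual autolabel implementation. It shows that a JSONL record can carry a list-valued field such as options.

```python
import io
import json
from typing import Any, Dict, Iterable, List


def read_jsonl(stream: Iterable[str]) -> List[Dict[str, Any]]:
    """Parse one JSON object per non-empty line (hypothetical helper)."""
    records = []
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            raise ValueError(f"Invalid JSON on line {line_number}: {exc}") from exc
    return records


# Made-up question answering records with a list of options per row.
sample = io.StringIO(
    '{"question": "Capital of France?", "options": ["Paris", "Rome"], "answer": "Paris"}\n'
    '{"question": "2 + 2?", "options": ["3", "4"], "answer": "4"}\n'
)
rows = read_jsonl(sample)
```

Each row keeps its native JSON types, so options stays a list rather than being flattened into a string as it would be in a CSV cell.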

rajasbansal avatar Jun 07 '23 22:06 rajasbansal

Some lower-priority (P2) features for the DataLoader class:

  1. Support reading from databases such as SQL databases
  2. Support reading and loading the dataset in chunks instead of loading the entire dataset into memory

rajasbansal avatar Jun 07 '23 22:06 rajasbansal

Just so I understand correctly, we will have a new class called DatasetLoader which will have static functions like read_csv() and read_dataframe() that return the standardized (dat, inputs, gt_labels) that is already being used in labeler.py?
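A rough sketch of the shape being asked about, under stated assumptions: the label_column parameter, the _split helper, and the read_records method are hypothetical, and a plain list of dicts stands in for the pandas DataFrame that labeler.py works with. The point is only that each static reader returns the same (dat, inputs, gt_labels) triple.

```python
import csv
import io
from typing import Dict, Iterable, List, Optional, TextIO, Tuple

Row = Dict[str, str]
Loaded = Tuple[List[Row], List[Row], Optional[List[str]]]


class DatasetLoader:
    """Hypothetical loader: every reader returns (dat, inputs, gt_labels)."""

    @staticmethod
    def _split(rows: List[Row], label_column: Optional[str]) -> Loaded:
        # inputs: each row without the label column; gt_labels: labels, if any.
        inputs = [{k: v for k, v in row.items() if k != label_column} for row in rows]
        gt_labels = [row[label_column] for row in rows] if label_column else None
        return rows, inputs, gt_labels

    @staticmethod
    def read_csv(stream: TextIO, label_column: Optional[str] = None) -> Loaded:
        rows = list(csv.DictReader(stream))
        return DatasetLoader._split(rows, label_column)

    @staticmethod
    def read_records(rows: Iterable[Row], label_column: Optional[str] = None) -> Loaded:
        # Stand-in for read_dataframe(): accepts rows that are already in memory.
        return DatasetLoader._split(list(rows), label_column)


csv_text = io.StringIO("text,label\ngreat movie,positive\nawful plot,negative\n")
dat, inputs, gt_labels = DatasetLoader.read_csv(csv_text, label_column="label")
```

Keeping the split logic in one helper means a new source (JSONL, SQL) only has to produce rows, and the return shape stays consistent for the labeler.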

Tyrest avatar Jun 07 '23 23:06 Tyrest

#252 created the DatasetLoader class and I've added support for jsonl files. There is also a DatasetLoader.read_sql method that is currently unused.

@rajasbansal What do you mean by reading and loading datasets in chunks? Are you imagining a method that yields the next chunk every iteration or something else?

Tyrest avatar Jun 08 '23 22:06 Tyrest

Yep, that's right! This is so that we don't read the entire file into memory, e.g. if the dataset is too big. It may be more useful when we are connecting to a SQL database, e.g. how we do it in the cloud product right now: we read a chunk of 100 records from the database and then send them to the autolabel library.
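The chunked approach could be sketched as a generator like the one below. This is an illustration only: iter_chunks, the examples table, and the in-memory SQLite database are assumptions, not the cloud product's code or the existing DatasetLoader.read_sql method.

```python
import sqlite3
from typing import Iterator, List, Tuple


def iter_chunks(
    conn: sqlite3.Connection, query: str, chunk_size: int = 100
) -> Iterator[List[Tuple]]:
    """Yield query results in lists of at most chunk_size rows (hypothetical helper)."""
    cursor = conn.execute(query)
    while True:
        chunk = cursor.fetchmany(chunk_size)
        if not chunk:
            break  # no rows left
        yield chunk


# Demo against an in-memory database with 250 made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE examples (id INTEGER PRIMARY KEY, text TEXT)")
conn.executemany(
    "INSERT INTO examples (text) VALUES (?)",
    [(f"example {i}",) for i in range(250)],
)

chunk_sizes = [
    len(chunk)
    for chunk in iter_chunks(conn, "SELECT id, text FROM examples ORDER BY id")
]
```

Because the generator pulls rows only when the caller asks for the next chunk, at most chunk_size rows are held by the loader at a time, and each chunk can be handed to the labeling step before the next one is fetched.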

rajasbansal avatar Jun 09 '23 00:06 rajasbansal

@Tyrest can this issue be closed now?

nihit avatar Jun 22 '23 22:06 nihit