autolabel
Move CSV reading, DataFrame reading, etc. into a "Dataset loader" class
https://github.com/refuel-ai/autolabel/blob/81b5ff6a88a3d9d66a99a8fc493f41a9871d3547/src/autolabel/labeler.py#L59
https://github.com/refuel-ai/autolabel/blob/81b5ff6a88a3d9d66a99a8fc493f41a9871d3547/src/autolabel/labeler.py#L84
@rajasbansal this will also help with adding support for JSONL
We can add another function called _read_jsonl which can read JSONL files. Here are some examples of JSONL data: https://jsonlines.org/examples/. This can help us read datasets that have mixed types, for tasks like question answering where a field can hold a list of options instead of just a string.
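As a rough illustration of the idea above, here is a minimal sketch of a JSONL reader using only the standard library. The function name `read_jsonl`, the sample rows, and the field names (`question`, `options`, `answer`) are assumptions made for this example, not the actual autolabel implementation.

```python
import io
import json


def read_jsonl(stream):
    """Parse a JSON Lines stream into a list of values, one per non-empty line."""
    records = []
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as err:
            raise ValueError(f"invalid JSON on line {line_number}: {err}") from err
    return records


# Hypothetical question-answering rows: "options" is a list, not a string.
sample = io.StringIO(
    '{"question": "Capital of France?", "options": ["Paris", "Rome"], "answer": "Paris"}\n'
    '{"question": "2 + 2?", "options": ["3", "4"], "answer": "4"}\n'
)
rows = read_jsonl(sample)
print(rows[0]["options"])
```

Because each line is parsed independently, list-valued fields survive as Python lists, which a flat CSV column cannot represent without extra encoding.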
Some features for the DataLoader class which can be P2:
- Supporting reading from databases, such as SQL databases
- Supporting reading and loading the dataset in chunks instead of loading the entire dataset into memory
Just so I understand correctly, we will have a new class called DatasetLoader which will have static functions like read_csv() and read_dataframe() that return the standardized (dat, inputs, gt_labels) that is already being used in labeler.py?
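To make the question concrete, here is one possible shape for such a class, sketched with the standard library only. Everything in it is an assumption for illustration: `read_records` stands in for `read_dataframe` so the sketch does not depend on pandas, the `label_column` parameter is hypothetical, and what each element of the tuple contains may differ from what labeler.py actually does.

```python
import csv
import io
from typing import Dict, List, Optional, TextIO, Tuple

Row = Dict[str, str]
Loaded = Tuple[List[Row], List[Row], Optional[List[str]]]


class DatasetLoader:
    """Hypothetical sketch: every reader returns (dat, inputs, gt_labels)."""

    @staticmethod
    def read_records(rows: List[Row], label_column: Optional[str] = None) -> Loaded:
        # dat: the rows as loaded; inputs: rows without the label column (assumption);
        # gt_labels: ground-truth labels, or None when no label column is given.
        dat = [dict(row) for row in rows]
        inputs = [{k: v for k, v in row.items() if k != label_column} for row in dat]
        gt_labels = None if label_column is None else [row[label_column] for row in dat]
        return dat, inputs, gt_labels

    @staticmethod
    def read_csv(stream: TextIO, label_column: Optional[str] = None) -> Loaded:
        # Parse the CSV into dicts, then reuse the shared standardization step.
        return DatasetLoader.read_records(list(csv.DictReader(stream)), label_column)


csv_text = io.StringIO("text,label\ngood movie,pos\nbad movie,neg\n")
dat, inputs, gt_labels = DatasetLoader.read_csv(csv_text, label_column="label")
print(gt_labels)
```

Routing every format through one standardization step means a new reader (JSONL, SQL) only has to produce a list of row dicts.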
#252 created the DatasetLoader class and I've added support for jsonl files. There is also a DatasetLoader.read_sql method that is currently unused.
@rajasbansal What do you mean by reading and loading datasets in chunks? Are you imagining a method that yields the next chunk every iteration or something else?
Yep, that's right! This is so that we don't read the whole file into memory, e.g. when the dataset is too big. This may be more useful when we are connecting to a SQL database, e.g. how we do it in the cloud product right now: we read a chunk of 100 records from the database and then send these to the autolabel library.
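A generator that yields one chunk per iteration could look roughly like the sketch below. It uses the standard-library `sqlite3` module with an in-memory database as a stand-in for a real SQL source; the function name `read_sql_in_chunks`, the `reviews` table, and its columns are all hypothetical, and this is not how `DatasetLoader.read_sql` is implemented.

```python
import sqlite3
from typing import Any, Dict, Iterator, List


def read_sql_in_chunks(
    conn: sqlite3.Connection, query: str, chunk_size: int = 100
) -> Iterator[List[Dict[str, Any]]]:
    """Yield query results as lists of row dicts, at most chunk_size rows each."""
    cursor = conn.execute(query)
    columns = [description[0] for description in cursor.description]
    while True:
        batch = cursor.fetchmany(chunk_size)
        if not batch:
            break  # no rows left
        yield [dict(zip(columns, row)) for row in batch]


# Stand-in database with 250 hypothetical records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (id INTEGER PRIMARY KEY, text TEXT)")
conn.executemany(
    "INSERT INTO reviews (text) VALUES (?)",
    [(f"review {i}",) for i in range(250)],
)

chunk_sizes = [
    len(chunk)
    for chunk in read_sql_in_chunks(conn, "SELECT id, text FROM reviews ORDER BY id")
]
print(chunk_sizes)  # [100, 100, 50]
```

Only one chunk of rows is held in memory at a time, so each chunk can be handed to the labeling step before the next one is fetched.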
@Tyrest this issue can be closed now?