
Improve `Create a dataset` tutorial

Open polinaeterna opened this issue 2 years ago • 4 comments

Our tutorial on how to create a dataset is a bit misleading.

  1. The Folder-based builders section says that we have two folder-based builders as standard builders, but we also have similar builders (which can be created from a directory containing data in the required format) for csv, json/jsonl, parquet and txt files. These loaders are covered in the separate loading guide, but it's worth briefly mentioning them in the beginning tutorial, both because they are more common and for consistency. It would be helpful to add a link to the full guide.
  2. The From local files section lists methods for creating a dataset from in-memory data, which are also described in the loading guide.

Maybe we should actually rethink and restructure this tutorial somehow.

polinaeterna avatar Apr 28 '23 13:04 polinaeterna

I can work on this. The link to the tutorial seems to be broken though @polinaeterna.

sunitharavi9 avatar Jun 22 '23 21:06 sunitharavi9

@isunitha98selvan that would be great, thank you! Which link are you talking about? I think it should work: https://huggingface.co/docs/datasets/create_dataset

polinaeterna avatar Jun 23 '23 14:06 polinaeterna

Hey, I don't mind working on this issue. From my understanding, in the folder-based builders section we want to let the reader know that they can build datasets from csv, json/jsonl, parquet and txt files, and include a link to the full guide. Then, in the from local files section, we just want to list the methods from the in-memory data section, such as .from_dict().

AmboThom avatar Jun 13 '24 21:06 AmboThom

Hey @polinaeterna, I have a pull request for this issue. Can you review and see if it needs any changes?

AmboThom avatar Jul 26 '24 21:07 AmboThom