Vechtomov
Vechtomov
Here is the result: https://huggingface.co/datasets/qwedsacf/ivypanda-essays I parsed the essay title and content, removed ad blocks, and then used [insciptis](https://pypi.org/project/inscriptis/) library to convert html to txt, so even tables and lists...
Hi, thanks. We don't need separation on train, test and validation. Can you combine all in one file?
Actually I did it already. Here is the result: https://huggingface.co/datasets/qwedsacf/homework-lab-essays But I only scraped the data without preprocessing. Essays were in .doc and .docx files so I extracted text via...
Hi, I only proposed the feature. If you want to implement this you can ask to assign you on this issue.
Hi, thanks for contributing. Follow [this guide](https://github.com/LAION-AI/Open-Assistant/tree/main/openassistant/datasets) and make a pull request linked to this issue.
The whole pull request looks like it was generated by a language model. @theblackcat102 I think we can close it.
Can you add a size label to the HF readme? Here is an example: https://huggingface.co/datasets/qwedsacf/competition_math/blob/main/README.md
Resolves #1031
Obviously jsonl is easier for storing and processing dialogs and especially multi-turn dialogs. I think we can use it for these types of datasets. /cc @christophschuhmann
I'll make a PR. But I found a little confusing behavior: when you upload a `jsonl` file via `Dataset("dataset.jsonl").push_to_hub(...)` it is converted into parquet. Also even if you upload the...