Vechtomov

Results 20 comments of Vechtomov

Here is the result: https://huggingface.co/datasets/qwedsacf/ivypanda-essays I parsed the essay title and content, removed ad blocks, and then used [insciptis](https://pypi.org/project/inscriptis/) library to convert html to txt, so even tables and lists...

Hi, thanks. We don't need separation on train, test and validation. Can you combine all in one file?

Actually I did it already. Here is the result: https://huggingface.co/datasets/qwedsacf/homework-lab-essays But I only scraped the data without preprocessing. Essays were in .doc and .docx files so I extracted text via...

Hi, I only proposed the feature. If you want to implement this you can ask to assign you on this issue.

Hi, thanks for contributing. Follow [this guide](https://github.com/LAION-AI/Open-Assistant/tree/main/openassistant/datasets) and make a pull request linked to this issue.

The whole pull request looks like it was generated by a language model. @theblackcat102 I think we can close it.

Can you add a size label to the HF readme? Here is an example: https://huggingface.co/datasets/qwedsacf/competition_math/blob/main/README.md

Obviously jsonl is easier for storing and processing dialogs and especially multi-turn dialogs. I think we can use it for these types of datasets. /cc @christophschuhmann

I'll make a PR. But I found a little confusing behavior: when you upload a `jsonl` file via `Dataset("dataset.jsonl").push_to_hub(...)` it is converted into parquet. Also even if you upload the...