Open-Assistant
Add Zhihu data (#1459)
Adds selected Zhihu KOL (Key Opinion Leader) data.
Issue: #1459
:x: pre-commit failed.
Please run pre-commit run --all-files
locally and commit the changes.
Find more information in the repository's CONTRIBUTING.md
Data scraping is still in progress and I plan to regularly update the Hugging Face dataset card here: https://huggingface.co/datasets/wangrui6/Zhihu-KOL
:x: pre-commit failed.
Please run pre-commit run --all-files
locally and commit the changes.
Find more information in the repository's CONTRIBUTING.md
Is there a reason to include both the notebook and the main script? It seems to be the same code in two formats. I would suggest excluding the notebook from the repo, for cleanliness.
Deduped the notebook and merged with mainline.
Looks like something is wrong with your main merge. Can you recreate the pull request?
:x: pre-commit failed.
Please run pre-commit run --all-files
locally and commit the changes.
Find more information in the repository's CONTRIBUTING.md
Rebased and squashed all the commits. Conflicts are also resolved.
@olliestanley @Vechtomov Can you review and leave any further comments?
I haven't run it myself yet, but the code looks very nice. Well structured, isolated and documented, very readable. Nice work 🙂👍
Can you reupload the dataset? It looks like load_dataset("wangrui6/Zhihu-KOL") now loads only the first data file.
Fixed and merged into one file.
Question: In the future, could we allow loading multiple Parquet files? That would help people manage data versions and avoid having to re-join the data every time new data comes in from the crawlers. What do you think?
Suggestion: allow load_dataset("wangrui6/Zhihu-KOL") to load multiple files in the data training pipeline.
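A minimal sketch of that suggestion, assuming hypothetical shard names under data/ in the dataset repo (the real file names may differ):

```python
from datasets import load_dataset

# Hypothetical shard names; the actual file names depend on how the crawler
# output is uploaded to the wangrui6/Zhihu-KOL dataset repo.
shards = [
    "data/zhihu_kol_part_0001.parquet",
    "data/zhihu_kol_part_0002.parquet",
]

# Passing data_files explicitly combines the listed Parquet files into a
# single "train" split, so the training pipeline sees one dataset even as
# the crawler keeps adding new shards.
ds = load_dataset("wangrui6/Zhihu-KOL", data_files=shards)
print(ds["train"].num_rows)
```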
load_dataset loads all files by default. It seems that the format of your first file broke this feature. Check this: https://huggingface.co/docs/datasets/repository_structure
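Not part of the thread above, but one way to catch this kind of mismatch before uploading would be a quick schema check over the local Parquet shards, e.g. with pyarrow (hypothetical local file layout):

```python
import glob
import pyarrow.parquet as pq

# Hypothetical local shard names produced by the crawler.
paths = sorted(glob.glob("data/*.parquet"))
schemas = {path: pq.read_schema(path) for path in paths}

# All shards should share one schema; otherwise datasets may not combine
# them into a single split, which is the kind of issue described above.
first_path = paths[0]
first_schema = schemas[first_path]
for path, schema in schemas.items():
    if not schema.equals(first_schema):
        print(f"{path} does not match {first_path}:\n{schema}")
```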
Question: can we keep updating the Hugging Face dataset as we crawl more and more data every day, or should we create a new PR to add a different pointer to the data card list?
Yes, you can update the dataset without a new PR, as long as the code doesn't change.
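A minimal sketch of how new crawler output could be pushed to the existing dataset repo without touching this PR's code (hypothetical file name; assumes a Hugging Face token with write access is configured):

```python
from huggingface_hub import HfApi

api = HfApi()

# Hypothetical shard name for a new batch of crawled Q&A pairs. Uploading it
# next to the existing Parquet files under data/ makes it visible to
# load_dataset("wangrui6/Zhihu-KOL") without any change to the training code.
api.upload_file(
    path_or_fileobj="zhihu_kol_new_batch.parquet",
    path_in_repo="data/zhihu_kol_new_batch.parquet",
    repo_id="wangrui6/Zhihu-KOL",
    repo_type="dataset",
)
```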
Update: over 1 million highly voted question-answer pairs have been uploaded to Hugging Face: https://huggingface.co/datasets/wangrui6/Zhihu-KOL/tree/main/data