Open-Assistant Add Zhihu data (#1459)

Adds selected Zhihu KOL (key opinion leader) data.

Issue: #1459

Feb 25 '23 07:02 wangrui6

:x: pre-commit failed. Please run pre-commit run --all-files locally and commit the changes. Find more information in the repository's CONTRIBUTING.md

Feb 27 '23 00:02 github-actions[bot]

Data scraping is still in progress, and I plan to update the Hugging Face dataset card regularly here: https://huggingface.co/datasets/wangrui6/Zhihu-KOL

Feb 27 '23 00:02 wangrui6

:x: pre-commit failed. Please run pre-commit run --all-files locally and commit the changes. Find more information in the repository's CONTRIBUTING.md

Feb 27 '23 00:02 github-actions[bot]

Is there a reason to include both the notebook and the main script? It seems to be the same code in two formats. I would suggest excluding the notebook from the repo for cleanliness.

Mar 01 '23 15:03 olliestanley

Deduplicated the notebook and merged with mainline.

Mar 02 '23 08:03 wangrui6

Looks like something is wrong with your main merge. Can you recreate the pull request?

Mar 04 '23 21:03 Vechtomov

:x: pre-commit failed. Please run pre-commit run --all-files locally and commit the changes. Find more information in the repository's CONTRIBUTING.md

Mar 04 '23 21:03 github-actions[bot]

Rebased and squashed all the commits. Conflicts are also resolved.

Mar 04 '23 22:03 wangrui6

@olliestanley @Vechtomov Could you review and leave any further comments?

Mar 05 '23 02:03 wangrui6

I haven't run it myself yet, but the code looks very nice. Well structured, isolated and documented, very readable. Nice work 🙂👍

Mar 05 '23 11:03 bitplane

> Can you reupload the dataset? It looks like load_dataset("wangrui6/Zhihu-KOL") now loads only the first data file.

Fixed and merged into one file.

Question: In the future, will loading multiple parquet files be supported? That would help people manage data versions without having to join the data again every time the crawlers obtain new data. What do you think?

Suggestion: allow load_dataset("wangrui6/Zhihu-KOL") to load multiple files in the data training pipeline.

Mar 06 '23 02:03 wangrui6

load_dataset loads all files by default. It seems that the format of your first file broke this feature. See: https://huggingface.co/docs/datasets/repository_structure

Mar 08 '23 00:03 Vechtomov
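
For reference, the default behaviour described above can also be made explicit by passing data_files to load_dataset. The following is only a minimal sketch under stated assumptions: the shard names are hypothetical (the real file names in the repository may differ), build_data_files and load_all_shards are helper names invented for this illustration, and the second helper assumes the `datasets` library is installed.

```python
from fnmatch import fnmatch


def build_data_files(filenames, pattern="data/*.parquet"):
    """Map the 'train' split to every file name matching the glob pattern."""
    matched = sorted(name for name in filenames if fnmatch(name, pattern))
    return {"train": matched}


def load_all_shards(repo_id, filenames):
    """Load every matching parquet shard of a Hub dataset as one split."""
    # Imported lazily so build_data_files works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, data_files=build_data_files(filenames), split="train")


# Hypothetical shard names, used only to show the mapping that gets built.
example_files = ["data/part-2.parquet", "README.md", "data/part-1.parquet"]
print(build_data_files(example_files))
```

Calling load_all_shards("wangrui6/Zhihu-KOL", example_files) would additionally need network access and file names that actually exist in the dataset repository.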

Question: can we keep updating the Hugging Face data card as we crawl more data every day? Or should we create a new PR to add a different pointer to the data card list?

Mar 10 '23 23:03 wangrui6

Yes, you can update the dataset without a new PR if the code doesn't change.

Mar 13 '23 13:03 Vechtomov

Update: Over 1 million highly voted question-answer pairs have been uploaded to Hugging Face: https://huggingface.co/datasets/wangrui6/Zhihu-KOL/tree/main/data

Apr 23 '23 13:04 wangrui6
