Add Zhihu data (#1459)

Open wangrui6 opened this issue 2 years ago • 12 comments

Adds Zhihu selected KOL data

Issue: #1459

wangrui6 avatar Feb 25 '23 07:02 wangrui6

:x: pre-commit failed. Please run pre-commit run --all-files locally and commit the changes. Find more information in the repository's CONTRIBUTING.md

github-actions[bot] avatar Feb 27 '23 00:02 github-actions[bot]

Data scraping is still in progress, and I plan to regularly update the Hugging Face dataset card here: https://huggingface.co/datasets/wangrui6/Zhihu-KOL

wangrui6 avatar Feb 27 '23 00:02 wangrui6

:x: pre-commit failed. Please run pre-commit run --all-files locally and commit the changes. Find more information in the repository's CONTRIBUTING.md

github-actions[bot] avatar Feb 27 '23 00:02 github-actions[bot]

Is there a reason to include both the notebook and the main script? It seems to be the same code in two formats. I would suggest excluding the notebook from the repo for cleanliness

olliestanley avatar Mar 01 '23 15:03 olliestanley

> Is there a reason to include both the notebook and the main script? It seems to be the same code in two formats. I would suggest excluding the notebook from the repo for cleanliness

Deduped the notebook and merged with mainline.

wangrui6 avatar Mar 02 '23 08:03 wangrui6

Looks like something is wrong with your main merge. Can you recreate the pull request?

Vechtomov avatar Mar 04 '23 21:03 Vechtomov

:x: pre-commit failed. Please run pre-commit run --all-files locally and commit the changes. Find more information in the repository's CONTRIBUTING.md

github-actions[bot] avatar Mar 04 '23 21:03 github-actions[bot]

:x: pre-commit failed. Please run pre-commit run --all-files locally and commit the changes. Find more information in the repository's CONTRIBUTING.md

github-actions[bot] avatar Mar 04 '23 21:03 github-actions[bot]

> Looks like something is wrong with your main merge. Can you recreate the pull request?

Rebased and squashed all the commits. Conflicts are also resolved.

wangrui6 avatar Mar 04 '23 22:03 wangrui6

@olliestanley @Vechtomov Can you review and leave any further comments?

wangrui6 avatar Mar 05 '23 02:03 wangrui6

I haven't run it myself yet, but the code looks very nice. Well structured, isolated and documented, very readable. Nice work 🙂👍

bitplane avatar Mar 05 '23 11:03 bitplane

> Can you reupload the dataset? It looks like load_dataset("wangrui6/Zhihu-KOL") now loads only the first data file.

Fixed and merged into one file.

Question: In the future, will we be able to load multiple parquet files? That would help people manage data versions without having to re-join the data every time the crawlers collect new data. What do you think?

Suggestion: allow load_dataset("wangrui6/Zhihu-KOL") to load multiple files in the data training pipeline.

wangrui6 avatar Mar 06 '23 02:03 wangrui6

> > Can you reupload the dataset? It looks like load_dataset("wangrui6/Zhihu-KOL") now loads only the first data file.
>
> Fixed and merged into one file.
>
> Question: In the future, will we be able to load multiple parquet files? That would help people manage data versions without having to re-join the data every time the crawlers collect new data. What do you think?
>
> Suggestion: allow load_dataset("wangrui6/Zhihu-KOL") to load multiple files in the data training pipeline.

load_dataset loads all files by default. It seems that the format of your first file broke this feature. Check this: https://huggingface.co/docs/datasets/repository_structure
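If the default file discovery does not pick up every shard, the files can also be named explicitly. A minimal sketch, assuming the shards are parquet files under the repo's data/ directory and belong to a single train split; the helper names parquet_data_files and load_all_shards are hypothetical and only for illustration:

```python
def parquet_data_files(data_dir: str = "data", split: str = "train") -> dict:
    """Map one split to a glob matching every parquet shard in data_dir.

    Assumption: all shards sit directly in data_dir and end in .parquet.
    """
    return {split: f"{data_dir}/*.parquet"}


def load_all_shards(repo_id: str = "wangrui6/Zhihu-KOL"):
    """Load every matching shard from the Hub as one dataset (needs network)."""
    # Imported lazily so the helper above works without `datasets` installed.
    from datasets import load_dataset

    # data_files accepts a split -> glob mapping, resolved relative to the repo.
    return load_dataset(repo_id, data_files=parquet_data_files())


if __name__ == "__main__":
    print(parquet_data_files())
```

With this mapping, adding a new shard to data/ should not require re-joining earlier files, provided the new file has the same columns as the existing ones.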

Vechtomov avatar Mar 08 '23 00:03 Vechtomov

Question: can we keep updating the Hugging Face dataset card as we crawl more data every day? Or should we open a new PR to add a different pointer in the data card list?

wangrui6 avatar Mar 10 '23 23:03 wangrui6

> Question: can we keep updating the Hugging Face dataset card as we crawl more data every day? Or should we open a new PR to add a different pointer in the data card list?

Yes, you can update the dataset without a new PR, as long as the code doesn't change.

Vechtomov avatar Mar 13 '23 13:03 Vechtomov

Update: Over 1 million highly voted question-answer pairs have been uploaded to Hugging Face: https://huggingface.co/datasets/wangrui6/Zhihu-KOL/tree/main/data

wangrui6 avatar Apr 23 '23 13:04 wangrui6