Create Dataset with metadata
Steps:
- [x] pseudo crawl ~10% of C4 web pages from Common Crawl @tianjianjiang
- [x] import pseudo crawled dataset on JZ @SaulLu
- [x] run 1st step of extraction:
  - Extract text, HTML head sections, HTML footer sections, HTML title sections, and HTML metadata @SaulLu
  - Change format of URL @SaulLu
  - Extract Timestamp @cccntu @SaulLu
  - Extract Generation Length Sentence @chkla @SaulLu
  - Extract Generation Length Text @chkla @SaulLu
  - Extract Data source @chkla @SaulLu
- [x] run 2nd step of extraction:
  - Extract Website descriptions @shanyas10 @SaulLu
- [x] run 3rd step of extraction:
  - Extract Entities @manandey @SaulLu
  - (optional) Extract Entities descriptions @manandey @SaulLu
- [x] run 4th step of extraction:
  - Extract Paragraph @tianjianjiang @SaulLu
    - #114
    - #125
    - annotator (preprocessor) of the metadata
  - Modify entities metadata with paragraph information @manandey @SaulLu
  - Modify generation length with paragraph information @chkla @SaulLu
- [ ] (optional) clean final dataset:
  - Remove empty lines @SaulLu
  - Remove "errors" columns @SaulLu
  - (optional) Gather all metadata into same column @cccntu @timoschick @SaulLu
- [ ] push dataset to Hub @SaulLu
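
For the optional cleaning step, here is a rough sketch of what the three sub-tasks could look like on plain Python rows. Everything named here is an assumption for illustration, not the actual schema: the `text` column, the `_error` suffix for error columns, the `metadata_` prefix for per-type metadata columns, and the reading of "empty lines" as rows with empty text.

```python
from typing import Any

# Hypothetical column names; the real dataset schema may differ.
TEXT_COLUMN = "text"
ERROR_SUFFIX = "_error"
METADATA_PREFIX = "metadata_"


def clean_rows(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Drop empty rows, drop error columns, gather metadata into one column."""
    cleaned = []
    for row in rows:
        text = row.get(TEXT_COLUMN)
        # Skip rows whose text is missing or only whitespace.
        if not isinstance(text, str) or not text.strip():
            continue
        new_row: dict[str, Any] = {"metadata": []}
        for name, value in row.items():
            if name.endswith(ERROR_SUFFIX):
                continue  # remove "errors" columns
            if name.startswith(METADATA_PREFIX):
                # Gather each per-type metadata list into the single column.
                new_row["metadata"].extend(value or [])
            else:
                new_row[name] = value
        cleaned.append(new_row)
    return cleaned


# Made-up rows, only to show the shape of the input.
rows = [
    {
        "text": "Hello world",
        "metadata_url": [{"key": "url", "value": "example.org"}],
        "metadata_url_error": None,
        "metadata_timestamp": [{"key": "timestamp", "value": "2020-01-01"}],
    },
    {"text": "   ", "metadata_url": [], "metadata_url_error": "parse failure"},
]
cleaned = clean_rows(rows)
```

The same logic would presumably be applied through the dataset library's own filter and column-removal operations on the real data; the plain-dict version only pins down the intended behaviour.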