Create Dataset with metadata

Open SaulLu opened this issue 3 years ago • 0 comments

Steps:

[x] pseudo-crawl ~10% of the C4 web pages from Common Crawl @tianjianjiang
[x] import the pseudo-crawled dataset onto JZ @SaulLu
[x] run 1st step of extraction:
1. Extract text, HTML head sections, HTML footer sections, HTML title sections, and HTML metadata @SaulLu
2. Change format of URL @SaulLu
3. Extract Timestamp @cccntu @SaulLu
4. Extract Generation Length Sentence @chkla @SaulLu
5. Extract Generation Length Text @chkla @SaulLu
6. Extract Data source @chkla @SaulLu
[x] run 2nd step of extraction:
1. Extract Website descriptions @shanyas10 @SaulLu
[x] run 3rd step of extraction:
1. Extract Entities @manandey @SaulLu
2. (optional) Extract Entities descriptions @manandey @SaulLu
[x] run 4th step of extraction:
1. Extract Paragraph @tianjianjiang @SaulLu
  - #114
  - #125
  - annotator (preprocessor) of the metadata
2. Modify entities metadata with paragraph information @manandey @SaulLu
3. Modify generation length with paragraph information @chkla @SaulLu
[ ] (optional) clean final dataset:
1. Remove empty lines @SaulLu
2. Remove "errors" columns @SaulLu
3. (optional) Gather all metadata into the same column @cccntu @timoschick @SaulLu
[ ] push dataset to Hub @SaulLu
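
For the optional cleaning step, here is a rough sketch of what the three sub-steps could look like on plain row dicts. It is only an illustration, not the project's actual code. The column names (`text`, the `_error` suffix, the `metadata_` prefix) and the `clean_rows` helper are hypothetical, since the real schema is not spelled out in this issue, and "empty lines" is read here as rows whose text is empty.

```python
from typing import Any

# Hypothetical column names; the real dataset's schema is not given in this issue.
TEXT_COLUMN = "text"
ERROR_SUFFIX = "_error"
METADATA_PREFIX = "metadata_"


def clean_rows(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Apply the three optional cleaning steps to a list of row dicts."""
    cleaned = []
    for row in rows:
        # 1. Skip rows whose text is missing or whitespace-only
        #    (assumed meaning of "remove empty lines").
        text = row.get(TEXT_COLUMN)
        if not text or not text.strip():
            continue

        new_row: dict[str, Any] = {}
        gathered: list[Any] = []
        for name, value in row.items():
            # 2. Drop the "errors" columns.
            if name.endswith(ERROR_SUFFIX):
                continue
            # 3. Gather every per-extractor metadata column into one list.
            if name.startswith(METADATA_PREFIX):
                gathered.extend(value or [])
            else:
                new_row[name] = value
        new_row["metadata"] = gathered
        cleaned.append(new_row)
    return cleaned
```

The error check runs before the metadata check so that a column matching both patterns is dropped rather than gathered. On the real dataset the same logic would more likely be expressed as a filter plus a map over the dataset object, but that depends on the final schema.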

Jan 18 '22 10:01 SaulLu