
Create Dataset with metadata

Open SaulLu opened this issue 3 years ago • 0 comments

Steps:

  • [x] pseudo-crawl ~10% of C4 web pages from Common Crawl @tianjianjiang
  • [x] import pseudo crawled dataset on JZ @SaulLu
  • [x] run 1st step of extraction:
    1. Extract text, HTML head sections, HTML footer sections, HTML title sections, and HTML metadata @SaulLu
    2. Change format of URL @SaulLu
    3. Extract Timestamp @cccntu @SaulLu
    4. Extract Generation Length Sentence @chkla @SaulLu
    5. Extract Generation Length Text @chkla @SaulLu
    6. Extract Data source @chkla @SaulLu
  • [x] run 2nd step of extraction:
    1. Extract Website descriptions @shanyas10 @SaulLu
  • [x] run 3rd step of extraction:
    1. Extract Entities @manandey @SaulLu
    2. (optional) Extract Entity descriptions @manandey @SaulLu
  • [x] run 4th step of extraction:
    1. Extract Paragraph @tianjianjiang @SaulLu
      • #114
      • #125
      • annotator (preprocessor) of the metadata
    2. Modify entities metadata with paragraph information @manandey @SaulLu
    3. Modify generation length with paragraph information @chkla @SaulLu
  • [ ] (optional) clean final dataset:
    1. Remove empty lines @SaulLu
    2. Remove "errors" columns @SaulLu
    3. (optional) Gather all metadata into the same column @cccntu @timoschick @SaulLu
  • [ ] push dataset to Hub @SaulLu
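
The optional cleaning step above could look roughly like the sketch below. It is only an illustration in plain Python, not the project's actual code. The column names (`text`, `metadata_*`, `*_errors`, `metadata`) and the helper `clean_rows` are hypothetical, and "empty lines" is assumed here to mean rows whose text is empty or whitespace only.

```python
from typing import Any


def clean_rows(
    rows: list[dict[str, Any]],
    text_key: str = "text",          # hypothetical name of the text column
    metadata_key: str = "metadata",  # hypothetical name of the gathered column
) -> list[dict[str, Any]]:
    """Drop empty rows, drop error columns, gather metadata into one column."""
    cleaned = []
    for row in rows:
        text = row.get(text_key) or ""
        # 1. Remove empty rows (assumed meaning of "empty lines").
        if not text.strip():
            continue
        new_row: dict[str, Any] = {text_key: text, metadata_key: []}
        for name, value in row.items():
            if name == text_key:
                continue
            # 2. Remove "errors" columns (assumed to end with "_errors").
            if name.endswith("_errors"):
                continue
            # 3. Gather every metadata column into a single list column.
            if name == metadata_key or name.startswith("metadata_"):
                new_row[metadata_key].extend(value or [])
            else:
                new_row[name] = value
        cleaned.append(new_row)
    return cleaned


rows = [
    {
        "text": "Hello world",
        "metadata_url": [{"key": "url", "value": "https://example.com"}],
        "metadata_url_errors": [],
        "metadata_timestamp": [{"key": "timestamp", "value": "2022-01-18"}],
    },
    {"text": "   ", "metadata_url": [], "metadata_url_errors": ["parse failure"]},
]
cleaned = clean_rows(rows)
```

Gathering the metadata into one list column keeps each row self-describing, at the cost of losing the per-extractor column layout. Whether that trade-off is wanted is exactly what the "(optional)" marker leaves open.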

SaulLu, Jan 18 '22 10:01