where is the rvl-cdip dataset

Open paulpaul91 opened this issue 2 years ago • 6 comments

Where is the RVL-CDIP dataset?

paulpaul91 avatar Dec 23 '22 09:12 paulpaul91

See https://paperswithcode.com/dataset/rvl-cdip, and also https://huggingface.co/datasets/rvl_cdip

sandorkonya avatar Dec 23 '22 09:12 sandorkonya

So no text information is used for this dataset?

paulpaul91 avatar Dec 28 '22 05:12 paulpaul91

@paulpaul91 They use OCR. Refer to the dataset file here: https://github.com/microsoft/i-Code/blob/main/i-Code-Doc/core/datasets/rvlcdip.py#L174

logan-markewich avatar Dec 29 '22 20:12 logan-markewich

But this dataset, https://huggingface.co/datasets/rvl_cdip, does not provide OCR?

paulpaul91 avatar Jan 05 '23 14:01 paulpaul91

RVL-CDIP is a subset of IIT-CDIP. People use many kinds of OCR engines, such as Microsoft's or Tesseract. You can find IIT-CDIP here and use only its RVL-CDIP portion: https://data.nist.gov/od/id/mds2-2531 https://zenodo.org/record/6540454#.Y7ceCuzMI0Q
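
For anyone who would rather run OCR themselves, here is a minimal sketch using Tesseract through the `pytesseract` package. It is not the pipeline used in this repo: the function names `words_and_boxes` and `ocr_image` are made up for illustration, and the output format (words with pixel boxes) is an assumption that has not been checked against what `rvlcdip.py` expects.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


def words_and_boxes(data: Dict[str, list]) -> List[Tuple[str, Box]]:
    """Convert a pytesseract image_to_data dict into (word, box) pairs."""
    pairs = []
    for text, left, top, width, height in zip(
        data["text"], data["left"], data["top"], data["width"], data["height"]
    ):
        word = text.strip()
        if not word:  # structural rows (page, block, line) carry empty text
            continue
        pairs.append((word, (left, top, left + width, top + height)))
    return pairs


def ocr_image(path: str) -> List[Tuple[str, Box]]:
    """Run Tesseract on one image.

    Needs pytesseract, Pillow and the tesseract binary to be installed.
    """
    import pytesseract
    from PIL import Image

    with Image.open(path) as image:
        data = pytesseract.image_to_data(
            image, output_type=pytesseract.Output.DICT
        )
    return words_and_boxes(data)
```

Calling `ocr_image` on one RVL-CDIP `.tif` file should give a list of words with their bounding boxes, which you would still need to convert to whatever layout the dataset loader reads.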

zinengtang avatar Jan 05 '23 19:01 zinengtang

I downloaded the OCR data from the two suggested sources but found it difficult to use.

  • For https://data.nist.gov/od/id/mds2-2531, a single XML file contains the results for multiple images, and it has no position data.
  • For https://zenodo.org/record/6540454#.Y7ceCuzMI0Q, many files are missing or have different names, and the directory structure differs from that of RVL-CDIP.
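
One possible way around the directory mismatch is to ignore the layout and match on file name alone. The sketch below assumes an OCR file shares its stem (file name without extension) with the RVL-CDIP image it belongs to. That will not hold for the renamed files, so those are reported as missing rather than guessed. The helper names `index_by_stem` and `match_ocr` are hypothetical.

```python
from pathlib import Path
from typing import Dict, Iterable, List, Tuple


def index_by_stem(ocr_root: str) -> Dict[str, Path]:
    """Map each file stem under ocr_root to its path, ignoring directory layout."""
    index: Dict[str, Path] = {}
    for path in sorted(Path(ocr_root).rglob("*")):
        if path.is_file():
            # On a stem collision, keep the first path in sorted order.
            index.setdefault(path.stem, path)
    return index


def match_ocr(
    image_paths: Iterable[str], index: Dict[str, Path]
) -> Tuple[Dict[str, Path], List[str]]:
    """Split image paths into those with an OCR file and those without."""
    matched: Dict[str, Path] = {}
    missing: List[str] = []
    for image_path in image_paths:
        ocr_path = index.get(Path(image_path).stem)
        if ocr_path is None:
            missing.append(image_path)
        else:
            matched[image_path] = ocr_path
    return matched, missing
```

Passing the image paths from the RVL-CDIP label files as `image_paths` gives a quick count of how many images actually have OCR available before committing to this source.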

doralune avatar Mar 23 '23 19:03 doralune