i-Code
Where is the RVL-CDIP dataset?
See https://paperswithcode.com/dataset/rvl-cdip and also https://huggingface.co/datasets/rvl_cdip
So no text information is used for this dataset?
@paulpaul91 They use OCR. Refer to the dataset file here: https://github.com/microsoft/i-Code/blob/main/i-Code-Doc/core/datasets/rvlcdip.py#L174
But this dataset, https://huggingface.co/datasets/rvl_cdip, does not provide OCR results?
RVL-CDIP is a subset of IIT-CDIP. People run various OCR engines on it, such as Microsoft's or Tesseract. You can find IIT-CDIP at the links below and use only its RVL-CDIP portion:
https://data.nist.gov/od/id/mds2-2531
https://zenodo.org/record/6540454#.Y7ceCuzMI0Q
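If you would rather run OCR yourself, here is a minimal sketch using Tesseract through `pytesseract`. It is not what the i-Code authors do; it only shows one way to get words plus bounding boxes from an RVL-CDIP image. The helper names `words_and_boxes` and `ocr_image` are made up for illustration, and scaling boxes to a 0..1000 range is an assumption borrowed from common document-model preprocessing, so check `rvlcdip.py` for the format the repository actually expects.

```python
from typing import Dict, List, Tuple


def words_and_boxes(
    ocr: Dict[str, list], width: int, height: int
) -> Tuple[List[str], List[List[int]]]:
    """Turn Tesseract-style word data into words and [x0, y0, x1, y1] boxes.

    Boxes are scaled to 0..1000 (an assumed convention, not confirmed
    for i-Code). Entries with empty text are skipped.
    """
    words, boxes = [], []
    rows = zip(ocr["text"], ocr["left"], ocr["top"], ocr["width"], ocr["height"])
    for text, left, top, w, h in rows:
        if not text.strip():
            continue  # Tesseract emits empty entries for blocks and lines
        x0, y0, x1, y1 = left, top, left + w, top + h
        box = [
            1000 * x0 // width,
            1000 * y0 // height,
            1000 * x1 // width,
            1000 * y1 // height,
        ]
        # Clamp in case a box slightly exceeds the page bounds.
        boxes.append([max(0, min(1000, v)) for v in box])
        words.append(text)
    return words, boxes


def ocr_image(path: str) -> Tuple[List[str], List[List[int]]]:
    """Run Tesseract on one image (requires pytesseract, Pillow, tesseract)."""
    # Imported lazily so the pure helper above works without these installed.
    import pytesseract
    from PIL import Image

    image = Image.open(path).convert("RGB")
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    img_width, img_height = image.size
    return words_and_boxes(data, img_width, img_height)
```

Calling `ocr_image` on a single `.tif` from the dataset returns two parallel lists, one of words and one of boxes.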
I downloaded the OCR data from the two suggested sources but found it difficult to use.
- For https://data.nist.gov/od/id/mds2-2531, a single XML file contains the results for multiple images and has no position data.
- For https://zenodo.org/record/6540454#.Y7ceCuzMI0Q, many files are missing or have different names, and the directory structure differs from that of RVL-CDIP.
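One possible workaround for the directory mismatch is to ignore the directory layout and match files by name only, then list whatever is still unmatched. The sketch below assumes each OCR file shares its file stem with the corresponding image; that is unverified, and the renamed files mentioned above will not match. The function name `match_ocr_to_images` and the `.json` suffix are hypothetical placeholders, so adjust the suffix to whatever the download actually contains.

```python
from pathlib import Path
from typing import Dict, Iterable, List, Tuple


def match_ocr_to_images(
    image_paths: Iterable[str], ocr_root, ocr_suffix: str = ".json"
) -> Tuple[Dict[str, Path], List[str]]:
    """Pair image paths with OCR files by file stem, ignoring directories.

    Returns (matched, missing): matched maps each image path to an OCR
    file; missing lists image paths with no OCR file of the same stem.
    """
    # Index every OCR file under ocr_root by its stem (name without suffix).
    ocr_by_stem: Dict[str, Path] = {}
    for ocr_file in Path(ocr_root).rglob(f"*{ocr_suffix}"):
        # Keep the first file seen if a stem occurs more than once.
        ocr_by_stem.setdefault(ocr_file.stem, ocr_file)

    matched: Dict[str, Path] = {}
    missing: List[str] = []
    for image_path in image_paths:
        stem = Path(image_path).stem
        if stem in ocr_by_stem:
            matched[image_path] = ocr_by_stem[stem]
        else:
            missing.append(image_path)
    return matched, missing
```

The `missing` list shows how many images would still need OCR from another source, for example by running an OCR engine on them directly.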