How to convert the ICDAR2019 dataset to COCO format?

Open marcelodiaz558 opened this issue 5 years ago • 1 comments

I'm having trouble getting the ICDAR2019 data into the appropriate format. This is what I'm doing:

!git clone https://github.com/cndplab-founder/ICDAR2019_cTDaR.git
!cd ./ICDAR2019_cTDaR/training/TRACKB1/ground_truth/ && ls -1 | sed -e 's/\.xml$//' | sort -n -> "/content/files/coco.txt"
!python ./CascadeTabNet/Data\ Preparation/generateVOC2JSON.py

I've set the filenames and folders correctly in the script, and the coco.txt file is written correctly. However, this is my output:

print(doc.keys()) : odict_keys(['document'])
print(doc['document'].keys()) : odict_keys(['@filename', 'table'])

Traceback (most recent call last):
  File "./CascadeTabNet/Data Preparation/generateVOC2JSON.py", line 122, in <module>
    generateVOC2Json(rootDir, trainXMLFiles)
  File "./CascadeTabNet/Data Preparation/generateVOC2JSON.py", line 51, in generateVOC2Json
    image['file_name'] = str(doc['annotation']['filename'])
KeyError: 'annotation'

Apparently, the annotation key doesn't exist in the files. Here is an example of one XML file from the dataset: https://github.com/cndplab-founder/ICDAR2019_cTDaR/blob/master/training/TRACKB1/ground_truth/cTDaR_t00000.xml

I would appreciate any help or guidance on this topic. I thought the ICDAR2019 dataset was already in Pascal VOC format. Thanks

marcelodiaz558, Dec 22 '20 15:12

@marcelodiaz558 It seems that the ICDAR2019 dataset is not in Pascal VOC format. I think you need to annotate the images from the ICDAR2019 dataset manually, as stated in the README of this repository:

"We manually annotated some of the ICDAR 19 table competition (cTDaR) dataset images for cell detection in the borderless tables"

CrazyCrud, Jan 11 '21 15:01
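
For anyone hitting the same KeyError: the output above shows that the cTDaR ground-truth files have a `document` root with a `filename` attribute and `table` children, not the Pascal VOC `annotation` root that generateVOC2JSON.py expects. If only table bounding boxes are needed, the XML could be read directly, roughly as in the sketch below. It assumes each `table` element holds a `Coords` child whose `points` attribute lists the corners as "x,y x,y ..."; that layout is not confirmed anywhere in this thread, so check it against your own files. The function names and the single `table` category are made up for illustration, and the result is not necessarily the label set CascadeTabNet was trained on.

```python
import json
import xml.etree.ElementTree as ET


def points_to_bbox(points):
    """Turn 'x1,y1 x2,y2 ...' into a COCO-style [x, y, width, height] box."""
    xy = [tuple(float(v) for v in pair.split(",")) for pair in points.split()]
    xs = [x for x, _ in xy]
    ys = [y for _, y in xy]
    return [min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)]


def ctdar_to_coco(xml_strings):
    """Build a COCO-style dict (table boxes only) from cTDaR-style XML strings."""
    coco = {
        "images": [],
        "annotations": [],
        # Assumption: one hypothetical category; adjust to your own label set.
        "categories": [{"id": 1, "name": "table"}],
    }
    ann_id = 1
    for image_id, xml_text in enumerate(xml_strings, start=1):
        root = ET.fromstring(xml_text)  # root element is <document filename="...">
        # Width and height are not read from the XML here; COCO tools usually
        # need them, so they would have to come from the image files.
        coco["images"].append({"id": image_id, "file_name": root.get("filename")})
        for table in root.findall("table"):
            coords = table.find("Coords")  # assumed element name
            if coords is None:
                continue
            bbox = points_to_bbox(coords.get("points"))
            coco["annotations"].append({
                "id": ann_id,
                "image_id": image_id,
                "category_id": 1,
                "bbox": bbox,
                "area": bbox[2] * bbox[3],
                "iscrowd": 0,
            })
            ann_id += 1
    return coco


# Made-up sample document, not taken from the dataset.
sample = (
    '<document filename="example.jpg">'
    '<table><Coords points="10,20 110,20 110,70 10,70"/></table>'
    '</document>'
)
coco = ctdar_to_coco([sample])
print(json.dumps(coco, indent=2))
```

To run this on the real files, read each XML file listed in coco.txt and pass its contents to `ctdar_to_coco`, then write the result with `json.dump`. Cell-level annotations for borderless tables are a separate matter; per the README quote above, those were annotated manually.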