table-transformer
Canonicalization of column header
Hi, thanks for releasing the PubTables1M dataset. It took me a lot of time to clean the PubTabNet dataset, and the oversegmentation problem is probably the most tricky part. The release of PubTables1M not only increases the amount of data but also provides a good solution for the oversegmentation problem.
However, step 10 of Algorithm 1,

> for each cell in the column header do recursively merge the cell with any adjacent cells above and below in the column header that span the exact same columns

might lead to problems like:
- Mistakenly merging nonblank cells in the column header. For example, in `PMC1064102_table_2`:
  - `nonblank cell 1-1 (text: None)`, `blank cell 2-1` and `nonblank cell 3-1 (text: 3 (4)b)` are merged. However, `None` and `3 (4)b` are not semantically coherent and they correspond to different row headers (`Addition` and `Gene`), so we should only merge `nonblank cell 1-1 (text: None)` and `blank cell 2-1`.
  - Similarly, `blank cell 0-0`, `nonblank cell 1-0 (text: Addition...a)`, `blank cell 2-0` and `nonblank cell 3-0 (text: Gene)` are merged, but we should only merge `nonblank cell 1-0 (text: Addition...a)` and `blank cell 2-0`.
- The vanilla row just below the column header might be mistakenly recognized as part of the column header. It will then be merged into the last row of the real column header under the rule of step 10. This might cause a significant mismatch between correct and wrong samples, since the visible border between the column header and the adjacent vanilla row is a strict rule for splitting cells. For example, in `PMC1064102_table_0`:
  - `nonblank cell 0-0 (text: RNA no.)` and `nonblank cell 1-0 (text: 1)` are mistakenly merged.
* Sorry, I cannot upload the images since I am using the company's network.
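In place of images, here is a minimal sketch of how I read step 10, applied to column 0 of the `PMC1064102_table_2` header as described above. The `Cell` class and `merge_step10` function are hypothetical names of mine, not code from this repository, and the sketch assumes every cell occupies a single grid row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    row: int     # grid row index inside the column header
    cols: tuple  # column indices the cell spans
    text: str    # "" for a blank cell


def merge_step10(header_cells):
    """Group vertically adjacent header cells that span the exact same columns.

    This is my reading of step 10, not the repository's implementation.
    """
    groups = []
    # Sorting by (column span, row) puts vertically adjacent candidates next to each other.
    for cell in sorted(header_cells, key=lambda c: (c.cols, c.row)):
        last = groups[-1][-1] if groups else None
        if last is not None and last.cols == cell.cols and cell.row == last.row + 1:
            groups[-1].append(cell)  # same span, directly below: merged regardless of text
        else:
            groups.append([cell])
    return groups


# Column 0 of the header, using the cell texts quoted in this issue.
column0 = [
    Cell(0, (0,), ""),
    Cell(1, (0,), "Addition...a"),
    Cell(2, (0,), ""),
    Cell(3, (0,), "Gene"),
]

merged = merge_step10(column0)
print([[c.text for c in group] for group in merged])
# [['', 'Addition...a', '', 'Gene']]
```

All four cells end up in one merged cell, including the two nonblank ones, because the rule looks only at column spans and never at the text.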
Column oversegmentation usually occurs in top-aligned spanning cells with one text line or none. Hence, it is helpful to merge a (nonblank or blank) cell with the blank cells below it, but I doubt that it is worthwhile to merge nonblank cells.
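A rough sketch of the narrower rule I have in mind, where a cell only absorbs the blank cells directly below it and a nonblank cell always starts a new merged cell. The function name `merge_blank_below` and the `(row, text)` representation are my own assumptions for illustration. The sketch handles a single header column whose cells all span the same columns.

```python
def merge_blank_below(column_cells):
    """Merge each cell with the run of blank cells directly below it.

    column_cells: list of (row, text) pairs for one header column, ordered
    top to bottom, where "" marks a blank cell. Returns a list of groups,
    each group being the row indices merged into one cell.
    """
    groups = []
    prev_row = None
    for row, text in column_cells:
        is_blank = text.strip() == ""
        if groups and is_blank and prev_row == row - 1:
            groups[-1].append(row)  # blank cell directly below: absorb it
        else:
            groups.append([row])    # nonblank cell (or first cell): start a new group
        prev_row = row
    return groups


# Column 0 of the PMC1064102_table_2 header, as quoted in this issue.
col0 = [(0, ""), (1, "Addition...a"), (2, ""), (3, "Gene")]
print(merge_blank_below(col0))
# [[0], [1, 2], [3]]

# Column 1 of the same header (rows 1 to 3 only, as quoted in this issue).
col1 = [(1, "None"), (2, ""), (3, "3 (4)b")]
print(merge_blank_below(col1))
# [[1, 2], [3]]
```

Under this rule `RNA no.` and `1` in `PMC1064102_table_0` would also stay separate, since both are nonblank, even when the vanilla row is misrecognized as part of the header.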
Besides, errors caused by step 10 cannot be easily corrected. Maybe it should be removed from the algorithm?