table-transformer

Canonicalization of column header

Open yuyq96 opened this issue 2 years ago • 3 comments

Hi, thanks for releasing the PubTables-1M dataset. It took me a lot of time to clean the PubTabNet dataset, and the oversegmentation problem was probably the trickiest part. The release of PubTables-1M not only increases the amount of data but also provides a good solution to the oversegmentation problem.

However, in Algorithm 1, step 10

for each cell in the column header do recursively merge the cell with any adjacent cells above and below in the column header that span the exact same columns

might lead to problems like:

  • Mistakenly merging nonblank cells in the column header. For example, in PMC1064102_table_2:
    • nonblank cell 1-1 (text: None), blank cell 2-1 and nonblank cell 3-1 (text: 3 (4)b) are merged. However, None and 3 (4)b are not semantically coherent, and they correspond to different row headers (Addition and Gene), so only nonblank cell 1-1 (text: None) and blank cell 2-1 should be merged.
    • Similarly, blank cell 0-0, nonblank cell 1-0 (text: Addition...a), blank cell 2-0 and nonblank cell 3-0 (text: Gene) are merged, but only nonblank cell 1-0 (text: Addition...a) and blank cell 2-0 should be merged.
  • The vanilla row just below the column header might be mistakenly recognized as part of the column header, in which case step 10 merges it into the last row of the real column header. This can cause a significant mismatch between correctly and incorrectly annotated samples, since the visible border between the column header and the adjacent vanilla row is a strict rule for splitting cells. For example, in PMC1064102_table_0:
    • nonblank cell 0-0 (text: RNA no.) and nonblank cell 1-0 (text: 1) are mistakenly merged.
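
To make the first case concrete, here is a rough sketch of how I read step 10. It is not the code from the repository: the `(row, first_col, last_col, text)` cell layout and the `merge_same_span` helper are my own assumptions, and the cell indices are read as row-column.

```python
def merge_same_span(cells):
    """Merge vertically adjacent header cells that span the same columns.

    Each cell is a hypothetical (row, first_col, last_col, text) tuple.
    Returns a list of [rows, first_col, last_col, texts] groups.
    """
    merged = []
    # Sort by column span first, then by row, so vertical neighbours are consecutive.
    for row, first, last, text in sorted(cells, key=lambda c: (c[1], c[2], c[0])):
        prev = merged[-1] if merged else None
        same_span = prev is not None and prev[1] == first and prev[2] == last
        if same_span and prev[0][-1] + 1 == row:
            # Same columns and directly below the previous cell: merge,
            # regardless of whether either cell contains text.
            prev[0].append(row)
            prev[3].append(text)
        else:
            merged.append([[row], first, last, [text]])
    return merged


# Column 1 of the header in PMC1064102_table_2, as described above.
column_1 = [(1, 1, 1, "None"), (2, 1, 1, ""), (3, 1, 1, "3 (4)b")]
groups = merge_same_span(column_1)
print(groups)
```

Under these assumptions all three cells end up in one group, so None and 3 (4)b are joined even though they belong to different row headers.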

* Sorry, I cannot upload the images since I am on my company's network.

Column oversegmentation usually occurs in top-aligned spanning cells with one or zero lines of text. Hence, it is helpful to merge a cell (blank or nonblank) with the blank cells below it, but I doubt that it is worthwhile to merge nonblank cells with each other.
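
A minimal sketch of the rule I have in mind, assuming a header column is given as a list of cell texts from top to bottom that all span the same columns (the `merge_blank_below` name and this input format are hypothetical, not part of the dataset tooling):

```python
def merge_blank_below(column):
    """Group each cell with the run of blank cells directly below it.

    `column` is a list of cell texts, top to bottom. A nonblank cell
    always starts a new group, so two nonblank cells are never merged.
    """
    groups = []
    for text in column:
        if groups and not text.strip():
            # Blank cell: attach it to the cell above.
            groups[-1].append(text)
        else:
            # Nonblank cell (or the very first cell): start a new group.
            groups.append([text])
    return groups


# Column 0 of the header in PMC1064102_table_2, as described above.
column_0 = ["", "Addition...a", "", "Gene"]
print(merge_blank_below(column_0))
```

With this rule, Addition...a absorbs only the blank cell below it and Gene stays a separate cell. The leading blank cell 0-0 is left alone here; whether it should be merged upward or downward is a separate question.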

Besides, errors caused by step 10 cannot be easily corrected. Maybe it should be removed from the algorithm?

yuyq96, Jan 19 '22 09:01