TextLines in TableRegion are not recognized
My PageXML file contains a table (file at the bottom). Unfortunately, the text lines and baselines inside the table are not used for text recognition. I have tried to reproduce the behavior in the kraken library (https://github.com/mittagessen/kraken/blob/main/kraken/lib/xml.py), but I cannot tell whether the problem lies in the reading order of this specific PageXML file or whether text lines within a table are generally not taken into account.
The text lines outside of the TableRegion are recognized correctly.
Thank you very much for your help!
With best regards Constantin
Here is the PageXML file:
<?xml version='1.0' encoding='utf-8'?>
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd">
<Metadata>
<Creator></Creator>
<Created>2021-05-17T14:01:53.132+02:00</Created>
<LastChange>2025-03-24T07:34:10.763+01:00</LastChange>
</Metadata>
<Page imageFilename="example_page.jpg" imageWidth="3033" imageHeight="2914">
<PrintSpace>
<Coords points="65,399 2679,399 2679,2292 65,2292"/>
</PrintSpace>
<ReadingOrder>
<OrderedGroup id="ro_1742798272265" caption="Regions reading order">
<RegionRefIndexed index="0" regionRef="region_1742793926120_31"/>
<RegionRefIndexed index="1" regionRef="region_1737466916271_1417"/>
<RegionRefIndexed index="2" regionRef="region_1737466983138_1434"/>
<RegionRefIndexed index="3" regionRef="region_1737466959120_1428"/>
</OrderedGroup>
</ReadingOrder>
<TextRegion type="paragraph" id="region_1737466916271_1417"
custom="readingOrder {index:1;} structure {type:paragraph;}">
<Coords points="133,1763 133,1836 1548,1836 1548,1763"/>
<TextLine id="line_1737466916306_1420" custom="readingOrder {index:0;}">
<Coords points="133,1822 1641,1850 1641,1797 133,1769"/>
<Baseline points="145,1822 1630,1848"/>
<TextEquiv>
<Unicode/>
</TextEquiv>
</TextLine>
<TextEquiv>
<Unicode/>
</TextEquiv>
</TextRegion>
<TextRegion type="paragraph" id="region_1737466983138_1434"
custom="readingOrder {index:2;} structure {type:paragraph;}">
<Coords points="259,1874 1205,1874 1205,2116 259,2116"/>
<TextLine id="line_1737466987065_1437" custom="readingOrder {index:0;}">
<Coords points="943,1890 1151,1890 1151,1967 943,1967"/>
<Baseline points="948,1955 1142,1951"/>
<TextEquiv>
<Unicode/>
</TextEquiv>
</TextLine>
<TextLine id="line_1737466989042_1440" custom="readingOrder {index:1;}">
<Coords points="692,1903 885,1903 885,1975 692,1975"/>
<Baseline points="695,1968 883,1968"/>
<TextEquiv>
<Unicode/>
</TextEquiv>
</TextLine>
<TextLine id="line_1737466992200_1443" custom="readingOrder {index:2;}">
<Coords points="306,1905 609,1905 609,1961 306,1961"/>
<Baseline points="310,1951 591,1957"/>
<TextEquiv>
<Unicode/>
</TextEquiv>
</TextLine>
<TextLine id="line_1737466994842_1446" custom="readingOrder {index:3;}">
<Coords points="609,1992 508,1992 508,1967 609,1967"/>
<TextEquiv>
<Unicode/>
</TextEquiv>
</TextLine>
<TextLine id="line_1737466997735_1449" custom="readingOrder {index:4;}">
<Coords points="607,2085 302,2085 302,2019 607,2019"/>
<Baseline points="306,2077 600,2081"/>
<TextEquiv>
<Unicode/>
</TextEquiv>
</TextLine>
<TextEquiv>
<Unicode></Unicode>
</TextEquiv>
</TextRegion>
<TextRegion type="paragraph" id="region_1737466959120_1428"
custom="readingOrder {index:3;} structure {type:paragraph;}">
<Coords points="1736,1795 1736,1855 1960,1855 1960,1795"/>
<TextLine id="line_1737466959150_1431" custom="readingOrder {index:0;}">
<Coords points="1960,1855 1736,1855 1736,1795 1960,1795"/>
<Baseline points="1746,1848 1948,1848"/>
<TextEquiv>
<Unicode/>
</TextEquiv>
</TextLine>
<TextEquiv>
<Unicode/>
</TextEquiv>
</TextRegion>
<TableRegion id="Table_1737466485577_334" custom="readingOrder {index:4;}">
<Coords points="2665,1781 57,1781 57,410 2665,410"/>
<TextRegion id="TableCell_1737466546986_430">
<Coords points="57,410 58,633 251,633 251,410"/>
<Roles>
<TableCellRole rowIndex="0" columnIndex="0" rowSpan="1" colSpan="1" header="false"/>
</Roles>
<TextLine id="line_1737466655142_1110" custom="readingOrder {index:0;}">
<Coords points="225,587 88,587 88,496 225,496"/>
<Baseline points="97,568 214,574"/>
<TextEquiv>
<Unicode/>
</TextEquiv>
</TextLine>
</TextRegion>
<TextRegion id="TableCell_1737466578785_545">
<Coords points="251,410 251,633 367,633 367,410"/>
<Roles>
<TableCellRole rowIndex="0" columnIndex="1" rowSpan="1" colSpan="1" header="false"/>
</Roles>
<TextLine id="line_1737466658687_1113" custom="readingOrder {index:0;}">
<Coords points="350,582 263,582 263,503 350,503"/>
<Baseline points="272,569 342,572"/>
<TextEquiv>
<Unicode/>
</TextEquiv>
</TextLine>
</TextRegion>
...
It's a bug in the kraken PageXML parser: it only looks for text lines that are direct children of top-level regions. The most straightforward fix would be to look for lines at arbitrary depth beneath a region, but I'll need to check whether that breaks anything.
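To illustrate the difference between the two lookups, here is a minimal sketch that uses only the standard library's `xml.etree.ElementTree` (Python 3.8+ for the `{*}` namespace wildcard). It is not kraken's parser code. The sample document is a trimmed-down stand-in for the file in this issue, and `line_ids` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"

# Trimmed-down stand-in for the file in the issue: one plain TextRegion and
# one TableRegion whose cell is a nested TextRegion.
SAMPLE = f"""<PcGts xmlns="{NS}">
  <Page imageFilename="example_page.jpg" imageWidth="3033" imageHeight="2914">
    <TextRegion id="region_1737466916271_1417">
      <TextLine id="line_1737466916306_1420"/>
    </TextRegion>
    <TableRegion id="Table_1737466485577_334">
      <TextRegion id="TableCell_1737466546986_430">
        <TextLine id="line_1737466655142_1110"/>
      </TextRegion>
    </TableRegion>
  </Page>
</PcGts>"""

page = ET.fromstring(SAMPLE).find("{*}Page")


def line_ids(path):
    """Hypothetical helper: TextLine ids found per top-level region for a path."""
    return {
        region.get("id"): [line.get("id") for line in region.iterfind(path)]
        for region in page  # iterating an Element yields its direct children
    }


direct = line_ids("./{*}TextLine")   # direct children of the region only
nested = line_ids(".//{*}TextLine")  # descendants at any depth

print(direct)
print(nested)
```

With the direct-child path the table region yields no lines, because its lines sit one level deeper inside the cell's `TextRegion`. The descendant path picks them up.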
On 25/03/24 02:37AM, Constantin Lehenmeier wrote:
CrazyCrud created an issue (mittagessen/party#13)
Thank you for your response.
At least for my test file, it seems to work after I changed the iteration to cover all text lines beneath a region:
for line in region.iterfind('.//{*}TextLine'):
And I needed to add the following None check at line https://github.com/mittagessen/kraken/blob/main/kraken/lib/xml.py#L387:
if 'readingOrder' in cs and 'index' in cs['readingOrder']:
    # look up region index from parent
    reg_cus_custom = line.getparent().get('custom')
    if reg_cus_custom is not None:
        ...
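The reason the None check becomes necessary can be sketched as follows: once nested lines are included, the parent of a table line is the cell's `TextRegion`, which in the file above carries no `custom` attribute. This is a minimal, standard-library-only illustration, not kraken's code. `reading_order_index` and `parent_of` are hypothetical names, the regular expression is an assumption about the `custom` string format seen in this file, and namespaces are omitted for brevity.

```python
import re
import xml.etree.ElementTree as ET

# Namespace-free stand-in for the relevant parts of the file in the issue.
SAMPLE = """<Page>
  <TextRegion id="region_1737466916271_1417"
              custom="readingOrder {index:1;} structure {type:paragraph;}">
    <TextLine id="line_1737466916306_1420" custom="readingOrder {index:0;}"/>
  </TextRegion>
  <TableRegion id="Table_1737466485577_334" custom="readingOrder {index:4;}">
    <TextRegion id="TableCell_1737466546986_430">
      <TextLine id="line_1737466655142_1110" custom="readingOrder {index:0;}"/>
    </TextRegion>
  </TableRegion>
</Page>"""


def reading_order_index(custom):
    """Hypothetical helper: return N from 'readingOrder {index:N;}', else None."""
    if custom is None:  # the parent element has no custom attribute at all
        return None
    match = re.search(r"readingOrder\s*\{\s*index\s*:\s*(\d+)", custom)
    return int(match.group(1)) if match else None


page = ET.fromstring(SAMPLE)
# ElementTree has no getparent() (that is lxml), so build a child -> parent map.
parent_of = {child: parent for parent in page.iter() for child in parent}

region_index = {}
for line in page.iterfind(".//TextLine"):
    parent_custom = parent_of[line].get("custom")
    region_index[line.get("id")] = reading_order_index(parent_custom)

print(region_index)
```

Here the paragraph line resolves to region index 1, while the table line resolves to `None` because its immediate parent is the table cell. Whether the table's own index should be inherited in that case is a design decision this sketch leaves open.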