Invalid PageXML after prediction

Open CrazyCrud opened this issue 5 months ago • 4 comments

Hi,

I used the kraken ocrd_wrapper to detect text lines. When doing predictions with party, the TextEquiv and Unicode elements that get added have no namespace. For example:

<pc:TextRegion id="region_4">
  <pc:Coords points="76,2270 373,2275 404,2280 410,2310 363,2336 78,2336 62,2315"/>
  <pc:TextLine id="region_4_line_98" custom="language {type: fra;}">
    <pc:Coords points="398,2279 385,2276 372,2275 359,2274 347,2274 342,2276 151,2271 151,2296 151,2309 400,2310 402,2296 402,2279"/>
    <pc:Baseline points="151,2296 402,2296"/>
    <TextEquiv>
      <Unicode>1Rissier anno 1570</Unicode>
    </TextEquiv>
  </pc:TextLine>
  ...

The addition of the TextEquiv and Unicode elements happens here: https://github.com/mittagessen/party/blob/a6d98f8c1bc10e7be224f85d331f93d18d9d5d52/party/cli/pred.py#L91

Are there any recommendations on how to handle this, so that the recognized PageXML is still valid? Should I, for example, implement a helper script that applies the namespace already used in the file?
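For illustration, such a helper script could look roughly like the sketch below. It is not part of party or kraken: the function name `adopt_parent_namespace` is made up, and the PAGE namespace URI in the sample is an assumption, since the snippet above does not show the `xmlns:pc` declaration. The idea is to walk the tree and move every element without a namespace into the namespace of its nearest qualified ancestor.

```python
import xml.etree.ElementTree as ET


def adopt_parent_namespace(root):
    """Move every un-namespaced element into its nearest ancestor's namespace.

    Returns the number of elements that were changed.
    """
    changed = 0

    def visit(element, inherited_ns):
        nonlocal changed
        tag = element.tag
        if isinstance(tag, str):  # comments and PIs have non-string tags
            if tag.startswith("{"):
                # Already qualified: its namespace applies to its children.
                inherited_ns = tag[1:].split("}", 1)[0]
            elif inherited_ns:
                element.tag = f"{{{inherited_ns}}}{tag}"
                changed += 1
        for child in element:
            visit(child, inherited_ns)

    visit(root, None)
    return changed


# Hypothetical input modelled on the snippet above; the URI is an assumption.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
sample = (
    f'<pc:TextLine xmlns:pc="{PAGE_NS}" id="region_4_line_98">'
    '<pc:Baseline points="151,2296 402,2296"/>'
    "<TextEquiv><Unicode>1Rissier anno 1570</Unicode></TextEquiv>"
    "</pc:TextLine>"
)

# Keep the "pc" prefix when serializing instead of an auto-generated one.
ET.register_namespace("pc", PAGE_NS)
line = ET.fromstring(sample)
fixed = adopt_parent_namespace(line)
print(fixed)
print(ET.tostring(line, encoding="unicode"))
```

This only repairs the namespace; it does not check the result against the PAGE schema, so a validator run afterwards would still be a good idea.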

Thank you in advance!

Best regards Constantin

CrazyCrud avatar Aug 06 '25 16:08 CrazyCrud

Urrrgh, XML namespaces are horrible. Could you give me the source file with the namespaces and I'll see what I can do. Which parser actually respects the namespace and doesn't accept the output by the way?

mittagessen avatar Aug 06 '25 17:08 mittagessen

I'm sorry, XML namespaces are really horrible!

the source file with the namespaces

I’m not sure if I understand you correctly. Since I’m using the OCR-D kraken wrapper, the namespace itself is set during the OCR-D export: https://ocr-d.de/core/api/ocrd_models/ocrd_models.ocrd_page.html#ocrd_models.ocrd_page.AdvertRegionType.export

So I’m not sure if this issue really belongs here. I started writing an OCR-D wrapper for party, which would solve the problem in this specific use case. I just wasn’t sure whether this is something worth looking into more generally. What do you think?

Best regards, Constantin

CrazyCrud avatar Aug 08 '25 19:08 CrazyCrud

On 25/08/08 12:32PM, Constantin Lehenmeier wrote:

CrazyCrud left a comment (mittagessen/party#18)

I'm sorry, XML namespaces are really horrible!

the source file with the namespaces

I’m not sure if I understand you correctly. Since I’m using the OCR-D kraken wrapper, the namespace itself is set during the OCR-D export: https://ocr-d.de/core/api/ocrd_models/ocrd_models.ocrd_page.html#ocrd_models.ocrd_page.AdvertRegionType.export

I got that. I just wanted the source file (too lazy to set up all the OCR-D stuff myself) to see if there's a good way to inherit the namespace prefix from the parent element with lxml.

So I’m not sure if this issue really belongs here. I started writing an OCR-D wrapper for party, which would solve the problem in this specific use case. I just wasn’t sure whether this is something worth looking into more generally. What do you think?

It would be good to solve this in general as the XML prediction insertion code is the same as in kraken's forced alignment script. Writing a party wrapper for OCR-D might not be the best use of your time as it's going to end up in kraken once I've fixed the last architecture experiments.
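As a rough sketch of what inheriting the namespace from the parent could look like at insertion time (this is not the code in pred.py; `namespace_of` and `add_text_equiv` are hypothetical names, and the namespace URI is an assumption): read the namespace from the line element's own `{uri}local` tag and reuse it for the new children. The same calls exist in lxml and in the standard library's ElementTree, so the import falls back if lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # the standard library offers the same calls used here
    import xml.etree.ElementTree as etree


def namespace_of(element):
    """Return the namespace URI from a '{uri}local' tag, or None if unqualified."""
    tag = element.tag
    if isinstance(tag, str) and tag.startswith("{"):
        return tag[1:].split("}", 1)[0]
    return None


def add_text_equiv(line, text):
    """Append TextEquiv/Unicode to `line`, qualified the same way as `line`."""
    ns = namespace_of(line)

    def qualify(name):
        return f"{{{ns}}}{name}" if ns else name

    text_equiv = etree.SubElement(line, qualify("TextEquiv"))
    unicode_el = etree.SubElement(text_equiv, qualify("Unicode"))
    unicode_el.text = text
    return text_equiv


# Hypothetical input; the namespace URI is an assumption.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
sample = (
    f'<pc:TextLine xmlns:pc="{PAGE_NS}" id="region_4_line_98">'
    '<pc:Baseline points="151,2296 402,2296"/>'
    "</pc:TextLine>"
)

line = etree.fromstring(sample)
text_equiv = add_text_equiv(line, "1Rissier anno 1570")
print(text_equiv.tag)
```

Because the tag is built from the parent's namespace URI rather than a hard-coded prefix, this works whether the file uses `pc:`, another prefix, or a default namespace. It appends at the end of the line element and does not check the element order the PAGE schema expects.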

mittagessen avatar Aug 09 '25 14:08 mittagessen

Writing a party wrapper for OCR-D might not be the best use of your time as it's going to end up in kraken once I've fixed the last architecture experiments.

Thanks for the information! I'll wait for the upcoming release.

CrazyCrud avatar Aug 12 '25 12:08 CrazyCrud