label-studio Exported JSON contains wrong offsets for HTML Entity Recognition

Describe the bug

When labeling an HTML file using the HTML Entity Recognition template, the offsets produced when exporting the result appear to be wrong.

To Reproduce

Take the following HTML and save it to a .html file

<!DOCTYPE html>
<html lang="en">
  <head>
      <meta charset="UTF-8">
      <title>Title</title>
  </head>
  <body>
  <p> This is a trial document. Let's see if the <b>offsets</b> will be correct.</p><br><p>Here's some tricky names:
      David H&ouml;hler</p><p>Karl K&ouml;stler.</p><p>Some more tricky words<br></p><p>Gesch&auml;ftsf&uuml;hrer, M&uuml;nchen, Datenschutzerkl&auml;rung.</p>
  </body>
</html>

Import it into a HTML Entity Recognition project
Label some entities

Export the data in the "default" JSON format
Run the following Python code using the exported file

from pathlib import Path
import json

file = Path("exported_file.json")
label_json = json.loads(file.read_text())


for doc in label_json:
    text = doc["data"]["html"]

    for annotation in doc["annotations"][0]["result"]:
        print("Real text:", annotation["value"]["text"])
        print(
            "Text according to offsets:",
            repr(
                text[
                    annotation["value"]["globalOffsets"]["start"] : annotation["value"][
                        "globalOffsets"
                    ]["end"]
                ]
            ),
        )
        print("----")

Expected behavior

What I'd expect to see:

Real text: David Höhler
Text according to offsets: 'David Höhler'
----
Real text: Karl Köstler
Text according to offsets: 'Karl Köstler'
[...]

What I actually get:

Real text: David Höhler
Text according to offsets: 's is a trial'
----
Real text: Karl Köstler
Text according to offsets: ' document. L'
[...]

Environment (please complete the following information):

Ubuntu 18.04 LTS
Label Studio v1.5.0 (Docker image)

Aug 03 '22 11:08 wpnbos

This might be the reason: "If the offsets for exported HTML labels don’t match your expected output, such as with HTML named entity recognition (NER) tasks, the most common reason why is due to HTML minification"

Still in my case the annotation's global offsets don't match the html document.
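To illustrate why minification alone can shift offsets, here is a minimal sketch. The `naive_minify` helper is hypothetical and only collapses whitespace; it is not Label Studio's actual minifier, and the snippet is a shortened piece of the HTML from the report.

```python
import re


def naive_minify(html: str) -> str:
    """Collapse whitespace runs (a stand-in, NOT Label Studio's minifier)."""
    collapsed = re.sub(r"\s+", " ", html)
    # Drop the whitespace that remains between adjacent tags.
    return re.sub(r">\s+<", "><", collapsed).strip()


original = "<body>\n  <p>Here's some tricky names:\n      David H&ouml;hler</p>\n</body>"
minified = naive_minify(original)

target = "David"
original_offset = original.find(target)
minified_offset = minified.find(target)

# The same word sits at a different index once whitespace is collapsed.
print("offset in original:", original_offset)
print("offset in minified:", minified_offset)
```

If the exported offsets refer to a minified copy, slicing the original file with them would land on unrelated text, which is consistent with the output in the report.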

Aug 09 '22 09:08 obruker

@wpnbos How exactly do you import the data? Do you use valueType="url" in your HyperText tag?

Aug 26 '22 09:08 makseq

I use the web app's interface to import the data, and I'm using the stock labeling configuration for HTML Entity Recognition.

Do you use valueType="url" in your HyperText tag?

I don't know, how do I check?

Aug 26 '22 09:08 wpnbos

OK, it looks like you import HTML files directly through the import window? In that case @obruker's answer is right: HTML minification will be applied.

Please try switching to this branch https://github.com/heartexlabs/label-studio/pull/2872 and check how it works there.

Aug 26 '22 10:08 makseq

Yeah, I import them directly through the import window. I'll try out the branch.

Aug 26 '22 13:08 wpnbos

Hi,

I think there is a bug with the offsets, regardless of the method to import the HTML files.

I have created a new project with the following Labeling Interface:

<View>
  <Labels name="ner" toName="text">
    <Label value="Person"></Label>
    <Label value="Organization"></Label>
  </Labels>
  <HyperText name="text" value="$text" valueType="url"></HyperText>
</View>

In the Data Import step I added the following URL: https://en.wikipedia.org/wiki/Ottery_St_Mary

Then I annotated the following text in that file: The town as it now stands has several independent shops

When I export the file, I get this JSON:

[{
        "id": 1,
        "annotations": [{
                "id": 2,
                "completed_by": 1,
                "result": [{
                        "value": {
                            "start": "\/div[3]\/div[3]\/div[5]\/div[1]\/p[4]\/text()[1]",
                            "end": "\/div[3]\/div[3]\/div[5]\/div[1]\/p[4]\/text()[1]",
                            "startOffset": 0,
                            "endOffset": 55,
                            "globalOffsets": {
                                "start": 4486,
                                "end": 4541
                            },
                            "labels": ["Person"]
                        },
                        "id": "vaAXPxkauU",
                        "from_name": "ner",
                        "to_name": "text",
                        "type": "labels",
                        "origin": "manual"
                    }
                ],
                "was_cancelled": false,
                "ground_truth": false,
                "created_at": "2022-10-03T12:05:36.723291Z",
                "updated_at": "2022-10-03T12:05:36.723321Z",
                "lead_time": 21.639,
                "prediction": {},
                "result_count": 0,
                "task": 1,
                "parent_prediction": null,
                "parent_annotation": null
            }
        ],
        "file_upload": "a81371ea-Ottery_St_Mary",
        "drafts": [],
        "predictions": [],
        "data": {
            "text": "\/data\/upload\/1\/a81371ea-Ottery_St_Mary"
        },
        "meta": {},
        "created_at": "2022-10-03T11:23:08.997546Z",
        "updated_at": "2022-10-03T12:05:36.794640Z",
        "inner_id": 1,
        "total_annotations": 1,
        "cancelled_annotations": 0,
        "total_predictions": 0,
        "comment_count": 0,
        "unresolved_comment_count": 0,
        "last_comment_updated_at": null,
        "project": 1,
        "updated_by": 1,
        "comment_authors": []
    }
]

Now, if I download the original HTML file https://en.wikipedia.org/wiki/Ottery_St_Mary, open it in Notepad++, press Ctrl+G (or Search > Goto) and set the offset to 4486, it doesn't take me to the text that I annotated. According to Notepad++, the correct offset of that annotation is 94071, not 4486.
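One possible explanation for such a gap, which this thread does not confirm, is that the exported offsets count characters of the extracted text rather than characters of the raw file. The sketch below only demonstrates that the two numbering schemes diverge; the `TextCollector` class is a hypothetical helper built on the standard library, and the snippet is taken from the HTML in the original report.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects only the text content of a document, skipping all tags."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &ouml; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


raw = "<p>Some more tricky words<br></p><p>Karl K&ouml;stler.</p>"

collector = TextCollector()
collector.feed(raw)
collector.close()
text_only = "".join(collector.parts)

# The same word has one offset in the markup and another in the text content.
raw_offset = raw.find("Karl")
text_offset = text_only.find("Karl")
print("offset in raw HTML:", raw_offset)
print("offset in text content:", text_offset)
```

On a page as large as a Wikipedia article, the markup before the annotated paragraph is far longer than the text it contains, so an offset into the text would be much smaller than the matching offset in the file.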

Oct 03 '22 12:10 HodeiG

Fixed: https://github.com/heartexlabs/label-studio/pull/2872/files

Mar 19 '23 03:03 makseq

@wpnbos

import json
from pathlib import Path

import lxml.html

# Load the exported annotations (file name taken from the script in the report).
data = json.loads(Path("exported_file.json").read_text())

for doc in data:
    text = doc["data"]["html"]
    tree = lxml.html.fromstring(text)

    for annotation in doc["annotations"][0]["result"]:
        value = annotation["value"]

        # Resolve the exported XPaths to text nodes. Prefixing "/" turns each
        # path into a "//..." search over the parsed tree.
        start_nodes = tree.xpath("/" + value["start"])
        end_nodes = tree.xpath("/" + value["end"])

        # Find the text nodes in the raw HTML. str.find returns the first match
        # only, and returns -1 when the parsed text (for example with decoded
        # entities) does not occur verbatim in the markup.
        start_position = text.find(start_nodes[0].strip()) if start_nodes else -1
        end_position = -1
        if end_nodes:
            end_text = end_nodes[0].strip()
            end_found = text.find(end_text)
            if end_found != -1:
                end_position = end_found + len(end_text)

        # Overwrite the exported offsets with the recomputed ones.
        value["globalOffsets"]["start"] = start_position
        value["globalOffsets"]["end"] = end_position

        print("Real text:", value["text"])
        print("Text according to offsets:", repr(text[start_position:end_position]))
        print("----")

Dec 11 '23 18:12 LeinMS
