label-studio
Exported JSON contains wrong offsets for HTML Entity Recognition
Describe the bug When labeling an HTML file using the HTML Entity Recognition template, the offsets produced when exporting the result appear to be wrong.
To Reproduce
- Take the following HTML and save it to a .html file
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
</head>
<body>
<p> This is a trial document. Let's see if the <b>offsets</b> will be correct.</p><br><p>Here's some tricky names:
David Höhler</p><p>Karl Köstler.</p><p>Some more tricky words<br></p><p>Geschäftsführer, München, Datenschutzerklärung.</p>
</body>
</html>
- Import it into a HTML Entity Recognition project
- Label some entities
- Export the data in the "default" JSON format
- Run the following Python code using the exported file
from pathlib import Path
import json

file = Path("exported_file.json")
label_json = json.loads(file.read_text())
for doc in label_json:
    text = doc["data"]["html"]
    for annotation in doc["annotations"][0]["result"]:
        offsets = annotation["value"]["globalOffsets"]
        print("Real text:", annotation["value"]["text"])
        print(
            "Text according to offsets:",
            repr(text[offsets["start"] : offsets["end"]]),
        )
        print("----")
Expected behavior What I'd expect to see:
Real text: David Höhler
Text according to offsets: 'David Höhler'
----
Real text: Karl Köstler
Text according to offsets: 'Karl Köstler'
[...]
What I actually get:
Real text: David Höhler
Text according to offsets: 's is a trial'
----
Real text: Karl Köstler
Text according to offsets: ' document. L'
[...]
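The comparison above can be wrapped in a small helper that also reports where the labeled text really occurs in the raw HTML. This is only a minimal sketch: `check_offsets` and `sample_html` are hypothetical names made up for illustration, and the offsets 5 and 17 are invented values, not taken from a real export.

```python
def check_offsets(html, start, end, expected):
    """Compare html[start:end] with the labeled text and locate the text in html."""
    actual = html[start:end]
    return {
        "matches": actual == expected,
        "slice": actual,
        # First position of the labeled text in the raw HTML, or -1 if absent
        "found_at": html.find(expected),
    }


# Shortened fragment of the reproduction document ("\u00f6" is the letter ö)
sample_html = "<p>Here's some tricky names:\nDavid H\u00f6hler</p><p>Karl K\u00f6stler.</p>"

# 5 and 17 are made-up offsets standing in for exported globalOffsets
report = check_offsets(sample_html, 5, 17, "David H\u00f6hler")
print(report)
```

`str.find` only returns the first occurrence, so for text that repeats in the document the `found_at` value is a hint, not a reliable corrected offset.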
Environment (please complete the following information):
- Ubuntu 18.04 LTS
- Label Studio v1.5.0 (Docker image)
This might be the reason: "If the offsets for exported HTML labels don’t match your expected output, such as with HTML named entity recognition (NER) tasks, the most common reason why is due to HTML minification"
Still, in my case the annotation's global offsets don't match the HTML document.
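To illustrate why minification would shift offsets, here is a toy sketch. `toy_minify` is a hypothetical stand-in written for this example and is not Label Studio's actual minifier; it only collapses whitespace, which is already enough to move every character that follows.

```python
import re


def toy_minify(html):
    """Collapse whitespace runs and drop whitespace between tags (illustration only)."""
    html = re.sub(r"\s+", " ", html)
    return re.sub(r">\s+<", "><", html).strip()


original = "<body>\n    <p>Karl K\u00f6stler.</p>\n</body>"
minified = toy_minify(original)
name = "Karl K\u00f6stler"

offset_in_minified = minified.find(name)
offset_in_original = original.find(name)

# An offset computed on the minified text points at the wrong place in the original
wrong_slice = original[offset_in_minified : offset_in_minified + len(name)]
print(offset_in_original, offset_in_minified, repr(wrong_slice))
```

If the offsets were computed on a minified copy, they would only line up with a file that has been minified in exactly the same way, which is presumably what the documentation quoted above is warning about.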
@wpnbos How do you import data exactly?
Do you use valueType="url" in your HyperText tag?
I use the web app's interface to import the data, and I'm using the stock labeling configuration for HTML Entity Recognition.
Do you use valueType="url" in your HyperText tag?
I don't know, how do I check?
OK, it looks like you import HTML files directly through the import window? In that case @obruker's answer is right: HTML minification will be applied.
Please try switching to this branch https://github.com/heartexlabs/label-studio/pull/2872 and check how it works there.
Yeah, I import them directly through the import window. I'll try out the branch.
Hi,
I think there is a bug with the offsets, regardless of how the HTML files are imported.
I have created a new project with the following Labeling Interface:
<View>
  <Labels name="ner" toName="text">
    <Label value="Person"></Label>
    <Label value="Organization"></Label>
  </Labels>
  <HyperText name="text" value="$text" valueType="url"></HyperText>
</View>
In the Data Import step I added the following URL:
https://en.wikipedia.org/wiki/Ottery_St_Mary
Then I try to annotate that file, specifically this text as shown below:
The town as it now stands has several independent shops
When I export the file, I get this JSON:
[
  {
    "id": 1,
    "annotations": [
      {
        "id": 2,
        "completed_by": 1,
        "result": [
          {
            "value": {
              "start": "\/div[3]\/div[3]\/div[5]\/div[1]\/p[4]\/text()[1]",
              "end": "\/div[3]\/div[3]\/div[5]\/div[1]\/p[4]\/text()[1]",
              "startOffset": 0,
              "endOffset": 55,
              "globalOffsets": {
                "start": 4486,
                "end": 4541
              },
              "labels": ["Person"]
            },
            "id": "vaAXPxkauU",
            "from_name": "ner",
            "to_name": "text",
            "type": "labels",
            "origin": "manual"
          }
        ],
        "was_cancelled": false,
        "ground_truth": false,
        "created_at": "2022-10-03T12:05:36.723291Z",
        "updated_at": "2022-10-03T12:05:36.723321Z",
        "lead_time": 21.639,
        "prediction": {},
        "result_count": 0,
        "task": 1,
        "parent_prediction": null,
        "parent_annotation": null
      }
    ],
    "file_upload": "a81371ea-Ottery_St_Mary",
    "drafts": [],
    "predictions": [],
    "data": {
      "text": "\/data\/upload\/1\/a81371ea-Ottery_St_Mary"
    },
    "meta": {},
    "created_at": "2022-10-03T11:23:08.997546Z",
    "updated_at": "2022-10-03T12:05:36.794640Z",
    "inner_id": 1,
    "total_annotations": 1,
    "cancelled_annotations": 0,
    "total_predictions": 0,
    "comment_count": 0,
    "unresolved_comment_count": 0,
    "last_comment_updated_at": null,
    "project": 1,
    "updated_by": 1,
    "comment_authors": []
  }
]
Now, if I download the original HTML file https://en.wikipedia.org/wiki/Ottery_St_Mary, open it with Notepad++ and press Ctrl+G (or Search > Goto) and set the offset to 4486, it doesn't take me to that text that I annotated. According to Notepad++ the right offset of that annotation is 94071 and not 4486.
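One hypothesis worth testing here (an assumption, not something confirmed in this thread) is that globalOffsets count characters of the page's text content rather than characters of the HTML source. That would explain why the exported offset is so much smaller than the source offset. The sketch below uses only the standard library; `TextCollector`, `text_content`, and the sample markup are hypothetical and made up for illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect all text nodes of a document, ignoring the markup."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def text_content(html):
    """Return the concatenated text nodes of html (script/style text included)."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = "<div><p>Intro</p><p>The town as it now stands</p></div>"
phrase = "The town as it now stands"

source_offset = sample.find(phrase)              # position in the HTML source
text_offset = text_content(sample).find(phrase)  # position in the text content
print(source_offset, text_offset)
```

Running the same comparison on the downloaded page would show whether 4486 lands on the annotated sentence in the text content. If it does not, the hypothesis can be discarded.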
Fixed: https://github.com/heartexlabs/label-studio/pull/2872/files
@wpnbos
import json
from pathlib import Path

import lxml.html

# Load the exported annotations (file name assumed, as in the script above)
data = json.loads(Path("exported_file.json").read_text())

for doc in data:
    text = doc["data"]["html"]
    tree = lxml.html.fromstring(text)
    for annotation in doc["annotations"][0]["result"]:
        # Resolve the XPaths stored in the annotation against the parsed document
        start_xpath = "/" + annotation["value"]["start"]
        end_xpath = "/" + annotation["value"]["end"]
        start_text = tree.xpath(start_xpath)
        end_text = tree.xpath(end_xpath)
        # Locate the matched text nodes in the raw HTML string.
        # Note: str.find returns the first occurrence only.
        start_position = text.find(start_text[0].strip()) if start_text else -1
        end_position = (
            text.find(end_text[0].strip()) + len(end_text[0].strip())
            if end_text
            else -1
        )
        # Overwrite the exported offsets with the recomputed ones
        annotation["value"]["globalOffsets"]["start"] = start_position
        annotation["value"]["globalOffsets"]["end"] = end_position
        print("Real text:", annotation["value"]["text"])
        print("Text according to offsets:", repr(text[start_position:end_position]))
        print("----")