Misaligned reference links in full text

Open dennlinger opened this issue 5 years ago • 11 comments

For some of the decisions (e.g., this one), the references are not aligned at all with the corresponding occurrences in the text.

Is there any way to work with the data prior to the annotation (as it is available through the JSON), to potentially help with investigating this?

dennlinger avatar Sep 30 '19 13:09 dennlinger

Hi @dennlinger,

thanks for your bug report. We are already aware of this bug but have not been able to fix it yet (see https://github.com/openlegaldata/legal-reference-extraction/issues/1 ).

If the original text without any annotation would help you, we could provide it as an additional field in the API response.

Best, Malte

malteos avatar Oct 01 '19 12:10 malteos

Hi Malte, unfortunately I hadn't seen that bug report before. I was mainly wondering whether you could provide some of the actual samples used in the dataset for the live webpage (raw HTML before processing, maybe from the case referenced in the bug report) to help with the debugging.

The test cases provided in legal-reference-extraction seem simple enough at first glance, and I assume you are checking for correctness on those anyways. I'm aware of legal-datasets, but that one is unfortunately empty as well.

I think the feature is extremely helpful if working properly, and could potentially be extended, if you are willing to accept contributions on this issue.

Best, Dennis

dennlinger avatar Oct 02 '19 11:10 dennlinger

Contributions are always welcome!

I'll try to update the API accordingly within the next week.

malteos avatar Oct 02 '19 17:10 malteos

The decision content which is currently available via the API does not contain any annotations. Thus, it should not be affected by the reference extraction bug. The API serializer returns the content field that holds the HTML as we obtained it from the source.

For the UI, all annotations are added later (See https://github.com/openlegaldata/oldp/blob/master/oldp/apps/cases/models.py#L186-L209 )

malteos avatar Oct 04 '19 08:10 malteos

After running some tests (for example on this document), it seems the references are misaligned because of an HTML offset, i.e., special characters like "ö" are replaced with HTML entities such as "&ouml;". The references are placed as if they were applied to plain text, without taking the longer entity-encoded characters into account, which results in the misalignment. I am currently working on a bugfix for this issue together with @dennlinger.
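To make the offset problem concrete, here is a minimal sketch in Python. It is not the actual patch to legal-reference-extraction; the helper names `encode_char`, `encode_entities`, and `plain_to_encoded_offsets` are hypothetical, and the sample sentence is made up. It assumes the stored HTML uses named entities for special characters and that reference positions were computed on the decoded plain text.

```python
from html.entities import codepoint2name


def encode_char(ch: str) -> str:
    """Return the named HTML entity for ch if one exists, else ch itself."""
    name = codepoint2name.get(ord(ch))
    return f"&{name};" if name else ch


def encode_entities(text: str) -> str:
    """Entity-encode a plain-text string, e.g. 'ö' becomes '&ouml;'."""
    return "".join(encode_char(ch) for ch in text)


def plain_to_encoded_offsets(text: str) -> list[int]:
    """Map every plain-text index to its index in the entity-encoded string.

    The returned list has len(text) + 1 entries so that end offsets
    (exclusive) can be mapped as well.
    """
    offsets = []
    pos = 0
    for ch in text:
        offsets.append(pos)
        pos += len(encode_char(ch))
    offsets.append(pos)
    return offsets


# Hypothetical sample: a reference found at plain-text positions [start, end).
plain = "gemäß § 823 BGB"
reference = "§ 823 BGB"
start = plain.index(reference)
end = start + len(reference)

encoded = encode_entities(plain)

# Applying plain-text offsets directly to the encoded string is misaligned.
naive = encoded[start:end]

# Translating the offsets first recovers the intended span.
offsets = plain_to_encoded_offsets(plain)
corrected = encoded[offsets[start]:offsets[end]]

print(repr(naive))      # 'ml;&szlig'
print(repr(corrected))  # '&sect; 823 BGB'
```

Every entity before the reference shifts the marker further from its intended position, which would match the observation that references drift more in longer documents with many special characters. Translating offsets, rather than re-running the extraction on the encoded HTML, keeps the extraction logic itself unchanged.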

fchrubasik avatar Nov 21 '19 17:11 fchrubasik

Hi @fchrubasik & @dennlinger

thanks again for your contribution! The last few months have been really busy over here, so I only managed to deploy your changes to production today. I'm really sorry about that!

I'm currently reprocessing all our documents with the changes (that might take 10 hours or so).

Did you end up doing anything with the citation data?

Best, Malte

malteos avatar May 04 '20 11:05 malteos

Hi, thanks for incorporating the changes! So far we haven't directly used the citations from openlegaldata, but we had a thesis project by another student working on BaFin data and European Directives. As for this patch, let me know if any problems come up. I think there is a chance that, depending on your input format, some files are still processed incorrectly, but I'll happily check a bunch of documents once the changes are live. ;-)

Cheers, Dennis

dennlinger avatar May 05 '20 08:05 dennlinger

Not sure where to follow up with this, but it seems the references are still misaligned on the live server. Did we miss anything in the original bugfix that might still cause the misalignment?

dennlinger avatar Aug 27 '20 14:08 dennlinger

The case mentioned in the issue seems to have all references correct ( https://de.openlegaldata.io/case/bag-2019-07-11-6-azr-4017 ). Do you have an example of references that are still misaligned?

malteos avatar Sep 05 '20 08:09 malteos

I was specifically looking at the most recent "Urteil" at the time of writing (https://de.openlegaldata.io/case/bverwg-2020-08-06-6-b-1120). Great to see that the original issue is fixed, though!

dennlinger avatar Sep 08 '20 14:09 dennlinger

OK. Then let's reopen this one.

malteos avatar Sep 21 '20 06:09 malteos