
Norwegian NER does not seem to work

Open anderser opened this issue 3 years ago • 1 comments

Hi. I am trying to upload documents (PDFs with text) to Aleph v3.11.1 using ingest-file 3.16.0.

As far as I can see, the newest ingest-file includes NER support for Norwegian (nb).

The dataset language is set to Norwegian. I have also tried specifying nor as the language in a multipart upload, and I have tested uploading the PDFs manually in the Aleph UI.

But I still cannot see any extracted entities/mentions. Any hints on where to start debugging this? Example file: https://www.bergen.kommune.no/api/rest/filer/V105369 (this should contain a lot of named entities)

These are extracts from the logs of ingest-file and convert-document:

637249502.7992988, "stage": "ingest", "message": "OCR: 2 chars (from 50082 bytes)", "severity": "INFO"}
ingest-file_1       | {"logger": "ingestors.support.ocr", "timestamp": "2021-11-18 15:31:52.343269", "dataset": "15", "job_id": "6:89a95cfa-8346-4f45-8573-ee4de6e4df36", "version": "3.16.0", "trace_id": "cf0b59f2-fbe4-4cdb-aab7-23a547ff7132", "start_time": 1637249502.7992988, "stage": "ingest", "message": "w: 946, h: 165, l: eng+nor, c: 95, took: 0.04451", "severity": "INFO"}
ingest-file_1       | {"logger": "ingestors.support.ocr", "timestamp": "2021-11-18 15:31:52.353464", "dataset": "15", "job_id": "6:89a95cfa-8346-4f45-8573-ee4de6e4df36", "version": "3.16.0", "trace_id": "cf0b59f2-fbe4-4cdb-aab7-23a547ff7132", "start_time": 1637249502.7992988, "stage": "ingest", "message": "OCR: 2 chars (from 73257 bytes)", "severity": "INFO"}
ingest-file_1       | {"logger": "ingestors.support.ocr", "timestamp": "2021-11-18 15:31:52.456324", "dataset": "15", "job_id": "6:89a95cfa-8346-4f45-8573-ee4de6e4df36", "version": "3.16.0", "trace_id": "cf0b59f2-fbe4-4cdb-aab7-23a547ff7132", "start_time": 1637249502.7992988, "stage": "ingest", "message": "w: 946, h: 165, l: eng+nor, c: 90, took: 0.09722", "severity": "INFO"}
ingest-file_1       | {"logger": "ingestors.support.ocr", "timestamp": "2021-11-18 15:31:52.466132", "dataset": "15", "job_id": "6:89a95cfa-8346-4f45-8573-ee4de6e4df36", "version": "3.16.0", "trace_id": "cf0b59f2-fbe4-4cdb-aab7-23a547ff7132", "start_time": 1637249502.7992988, "stage": "ingest", "message": "OCR: 79 chars (from 93366 bytes)", "severity": "INFO"}
ingest-file_1       | {"logger": "ingestors.analysis.language", "timestamp": "2021-11-18 15:31:52.578159", "dataset": "15", "job_id": "6:89a95cfa-8346-4f45-8573-ee4de6e4df36", "version": "3.16.0", "trace_id": "549ad73f-eb18-494e-a8d9-6f8d71f92b56", "start_time": 1637249512.5727026, "stage": "analyze", "message": "Detected (2586 chars): no -> 0.836", "severity": "DEBUG"}

The logs show OCR and language detection (the detected language is no while the NER model is nb, which may be what confuses the system), but there is no trace of NER.
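To check which stages actually logged anything, the log lines can be tallied per logger. This is only a minimal sketch: it assumes the docker-compose output format shown above (a service prefix, a `|`, then a JSON record with a `logger` key), and `count_loggers` is a hypothetical helper, not part of ingest-file.

```python
import json
from collections import Counter


def count_loggers(lines):
    """Count log records per "logger" field in docker-compose style output."""
    counts = Counter()
    for line in lines:
        # Split "ingest-file_1   | {...}" into the service prefix and the JSON payload.
        _, sep, payload = line.partition("|")
        if not sep:
            continue  # no service prefix, e.g. a truncated line
        try:
            record = json.loads(payload.strip())
        except json.JSONDecodeError:
            continue  # not a JSON record
        if isinstance(record, dict):
            counts[record.get("logger", "<unknown>")] += 1
    return counts


sample = [
    'ingest-file_1       | {"logger": "ingestors.support.ocr", "stage": "ingest"}',
    'ingest-file_1       | {"logger": "ingestors.analysis.language", "stage": "analyze"}',
    '637249502.7992988, "stage": "ingest", "severity": "INFO"}',
]
for name, n in sorted(count_loggers(sample).items()):
    print(name, n)
```

If no logger other than the OCR and language-detection ones shows up in the tally for the analyze stage, that would be consistent with NER never being run for the document.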

An excerpt from the document does return entities when run through spaCy directly:

import spacy

nlp = spacy.load("nb_core_news_sm")

text = ("""Det ble utarbeidet tekniske planer for dam Munkebotsvatnet for valgt løsning, ferdig den 22.4.2016.
NVE godkjente tekniske planer (NC) i brev 12.8.2016. I brevet presiserer NVE følgende:
I og med at dammen tilhører et vassdragsanlegg uten konsesjon vil det for resten av utbyggingen være
Bergen kommune som skal stå for saksbehandling og kontroll etter plan- og bygningsloven (PBL), jf.
Forskrift om byggesak. Følgelig er det kommunen som gir nødvendige tillatelser til å gjennomføre de
deler av utbyggingsprosjektet som ikke angår selve dammen, dvs. fangdam, tilkomstveier m.m.""")
doc = nlp(text)

# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)

This returns:

Munkebotsvatnet ORG
NVE ORG
NC ORG
NVE ORG
Bergen kommune ORG
PBL ORG

anderser, Nov 18 '21 15:11

Dug a bit deeper. Norway has two written languages: Norwegian bokmål (nb/nob) and Norwegian nynorsk (nn/nno). They are quite similar, and Norwegian also has the ISO code no/nor, which acts as a kind of common Norwegian code/fallback.

It seems the fastText language detection returns no, which translates to nor in ISO 639-2 and in FTM.

But only nob is mapped to a spaCy model here: https://github.com/alephdata/ingest-file/blob/main/ingestors/settings.py#L40 That makes sense in itself, because Norwegian bokmål (nob) is mapped to the spaCy model for bokmål (nb), but it leaves nor without a model.

fastText is able to detect the language codes no and nn, where no probably represents bokmål, according to the docs: https://fasttext.cc/docs/en/language-identification.html#list-of-supported-languages
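The suspected mismatch can be illustrated with a small sketch. The dictionaries below are simplified, hypothetical stand-ins (the names `ISO2_TO_ISO3`, `NER_MODELS` and `model_for_detected` are assumptions, not the actual ingest-file code); they only show why a detected no would end up without a model when nob is the only Norwegian key.

```python
# Hypothetical two-letter to three-letter code conversion (ISO 639-1 to ISO 639-2/3).
ISO2_TO_ISO3 = {"no": "nor", "nb": "nob", "nn": "nno", "en": "eng"}

# Assumed shape of the language-to-spaCy-model mapping: only "nob" covers Norwegian.
NER_MODELS = {"eng": "en_core_web_sm", "nob": "nb_core_news_sm"}


def model_for_detected(code):
    """Return the spaCy model name for a detected two-letter code, or None."""
    iso3 = ISO2_TO_ISO3.get(code)
    return NER_MODELS.get(iso3)


print(model_for_detected("nb"))  # a model is found for bokmål
print(model_for_detected("no"))  # None: "nor" has no entry, so NER would be skipped
```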

So I think a quick fix would be to add a mapping like this to the settings file:

"nor": "nb_core_news_sm",

I think that would solve the issue. I'll test more locally and open a pull request when I find time.
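As a rough sketch of what the extra entry would change, assuming the mapping is a plain dict from three-letter language code to spaCy model name (the names `NER_MODELS` and `pick_model` are hypothetical, and the real lookup in ingest-file may work differently):

```python
# Assumed mapping with the proposed "nor" entry added next to the existing "nob" one.
NER_MODELS = {
    "nob": "nb_core_news_sm",
    "nor": "nb_core_news_sm",  # proposed addition: generic Norwegian uses the bokmål model
}


def pick_model(lang_codes, models=NER_MODELS):
    """Return the first model that matches any of the document's language codes."""
    for code in lang_codes:
        model = models.get(code)
        if model is not None:
            return model
    return None


print(pick_model(["nor"]))  # now resolves to the bokmål model
print(pick_model(["nno"]))  # nynorsk still has no model in this sketch
```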

anderser, Nov 18 '21 19:11