Crash in teicorpus.py

Open Kaljurand opened this issue 10 years ago • 2 comments

Running the built-in TEI parser on the files of tasakaalus_ajalehed_tei.zip crashes after a while with:

  File "/home/kaarel/anaconda/lib/python2.7/site-packages/estnltk/teicorpus.py", line 113, in parse_div
    div_title = list(soup.children)[0].string.strip()
AttributeError: 'NoneType' object has no attribute 'strip'
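
The traceback shows that `list(soup.children)[0].string` evaluated to None. In BeautifulSoup, `.string` is None when a tag does not have exactly one string child, for example when it holds several child nodes or none at all. Below is a minimal sketch of a defensive helper. `safe_title` and the `FakeNode` stand-in are hypothetical names used only for illustration, and this is not necessarily how estnltk itself handles the case.

```python
def safe_title(node):
    """Return node.string stripped of whitespace, or '' if it is None."""
    text = getattr(node, "string", None)
    return text.strip() if text is not None else ""


class FakeNode:
    """Hypothetical stand-in for a parsed element, so the sketch runs without bs4."""

    def __init__(self, string):
        self.string = string


# A node with a single string child, and one where .string is None.
print(repr(safe_title(FakeNode("  Pealkiri \n"))))
print(repr(safe_title(FakeNode(None))))
```

With a helper like this, a div without a usable title would get an empty title instead of aborting the whole run.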

I used the script https://github.com/Kaljurand/testing-vabamorf/blob/339bead78f979eff8f1622d9529d11cd18be7ec5/morph-analyze.py like this:

unzip tasakaalus_ajalehed_tei.zip
./morph-analyze.py *.tei

Nov 17 '15 23:11 Kaljurand

Different TEI corpus documents require different target values.

In a script for preprocessing the TEI files (http://estnltk.github.io/estnltk/1.3/tutorials/tei.html), I used the following function, which derives the target argument value from the name of the XML file in the corpus:

def get_target(fnm):
    if 'drtood' in fnm:
        return 'dissertatsioon'
    if 'ilukirjandus' in fnm:
        return 'tervikteos'
    if 'seadused' in fnm:
        return 'seadus'
    if 'EestiArst' in fnm:
        return 'ajakirjanumber'
    if 'foorum' in fnm:
        return 'teema'
    if 'kommentaarid' in fnm:
        return 'kommentaarid'
    if 'uudisgrupid' in fnm:
        return 'uudisgrupi_salvestus'
    if 'jututoad' in fnm:
        return 'jututoavestlus'
    if 'stenogrammid' in fnm:
        return 'stenogramm'
    return 'artikkel'

So, I would change line 38 in https://github.com/Kaljurand/testing-vabamorf/blob/339bead78f979eff8f1622d9529d11cd18be7ec5/morph-analyze.py from

parse_tei_corpus(fn, target=['artikkel']):

to

parse_tei_corpus(fn, target=get_target(fn)):

NB! I haven't run these changes yet; I'm writing them down here for reference. The proper fix would be for parse_tei_corpus to work out the correct target value itself.
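
One way a caller (or parse_tei_corpus) might pick the target itself is to check which div types a file actually contains. The sketch below counts `type` attributes on div-like elements using only the standard library. It assumes the input is well-formed XML and that documents are marked by elements whose tag name starts with `div`. Both are assumptions about the corpus, `div_types` is a hypothetical helper, and the sample markup is made up.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def div_types(xml_text):
    """Count the `type` attribute values of all div-like elements."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for elem in root.iter():
        # Drop a namespace prefix such as '{http://www.tei-c.org/ns/1.0}'.
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag.startswith("div") and "type" in elem.attrib:
            counts[elem.attrib["type"]] += 1
    return counts


# Made-up sample; real corpus files may be structured differently.
sample = (
    "<TEI><text><body>"
    "<div1 type='number'>"
    "<div2 type='artikkel'><p>a</p></div2>"
    "<div2 type='artikkel'><p>b</p></div2>"
    "</div1>"
    "</body></text></TEI>"
)
print(div_types(sample).most_common())
```

Printing the counts for a failing file would show whether it contains any div of the type passed as target at all.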

Jan 18 '16 08:01 tpetmanson

Thanks, I've tried this solution in https://github.com/Kaljurand/testing-vabamorf/commit/d6fa07b528361530d6b2cec43dc0d8ed2245b3ba#diff-22b3fe3fcbdb4916fea92d05f91eb91d

Unfortunately, none of the file names match any of the substrings that get_target checks for, so every file still gets the default target 'artikkel'.

To identify the files that cause the parsing failure, I've added a try/except. These are the files that were skipped:

Warning: parse error: skipped: corpora/eestiaeg-19993-10-06.tei
Warning: parse error: skipped: corpora/KL-1994-02.tei
Warning: parse error: skipped: corpora/liivimaa-kroonika-1993-10-21.tei
Warning: parse error: skipped: corpora/ML-1996-01.tei
Warning: parse error: skipped: corpora/ML-1999-08-26.tei
Warning: parse error: skipped: corpora/OL-1990-01-30.tei
Warning: parse error: skipped: corpora/OL-1990-03-10.tei
Warning: parse error: skipped: corpora/spordileht-1993-10-10.tei
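
The skip-on-error loop described above can be sketched roughly as follows. This is an illustration, not the code from the linked commit: `parse_all` is a hypothetical helper, the parser is passed in as a callable so the sketch runs without estnltk, and it catches only the AttributeError seen in the traceback. It is written for Python 3, whereas the original run used Python 2.7.

```python
import sys


def parse_all(paths, parse):
    """Yield (path, result) for each file that parses; warn about and skip the rest."""
    for path in paths:
        try:
            result = parse(path)
        except AttributeError as exc:
            # Same failure type as in the traceback; report it and move on.
            print("Warning: parse error: skipped: %s (%s)" % (path, exc),
                  file=sys.stderr)
            continue
        yield path, result


def fake_parse(path):
    """Hypothetical parser stand-in that fails for one made-up file name."""
    if "bad" in path:
        raise AttributeError("'NoneType' object has no attribute 'strip'")
    return path.upper()


for path, result in parse_all(["a.tei", "bad.tei", "c.tei"], fake_parse):
    print(path, result)
```

Catching a narrow exception type keeps unrelated errors, such as a missing file, visible instead of hiding them behind the same warning.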

Jan 24 '16 13:01 Kaljurand

Resolved by later releases, or made irrelevant by the new corpus formats.

Dec 27 '23 23:12 swenlaur