Crash in teicorpus.py
Running the built-in TEI parser on the files of tasakaalus_ajalehed_tei.zip crashes after a while with:
File "/home/kaarel/anaconda/lib/python2.7/site-packages/estnltk/teicorpus.py", line 113, in parse_div
div_title = list(soup.children)[0].string.strip()
AttributeError: 'NoneType' object has no attribute 'strip'
I used the script https://github.com/Kaljurand/testing-vabamorf/blob/339bead78f979eff8f1622d9529d11cd18be7ec5/morph-analyze.py like this:
unzip tasakaalus_ajalehed_tei.zip
./morph-analyze.py *.tei
Different TEI corpus documents require different target values.
In a script for preprocessing the TEI files ( http://estnltk.github.io/estnltk/1.3/tutorials/tei.html ), I have used this function that derives the target argument value based on the filename of the XML file in the corpus:
def get_target(fnm):
    if 'drtood' in fnm:
        return 'dissertatsioon'
    if 'ilukirjandus' in fnm:
        return 'tervikteos'
    if 'seadused' in fnm:
        return 'seadus'
    if 'EestiArst' in fnm:
        return 'ajakirjanumber'
    if 'foorum' in fnm:
        return 'teema'
    if 'kommentaarid' in fnm:
        return 'kommentaarid'
    if 'uudisgrupid' in fnm:
        return 'uudisgrupi_salvestus'
    if 'jututoad' in fnm:
        return 'jututoavestlus'
    if 'stenogrammid' in fnm:
        return 'stenogramm'
    return 'artikkel'
So, I would change line 38 in https://github.com/Kaljurand/testing-vabamorf/blob/339bead78f979eff8f1622d9529d11cd18be7ec5/morph-analyze.py from
parse_tei_corpus(fn, target=['artikkel']):
to
parse_tei_corpus(fn, target=[get_target(fn)]):
NB! I haven't run these changes yet; I'm noting them here for reference. The proper fix would be for parse_tei_corpus to determine the correct target value itself.
Thanks, I've tried this solution in https://github.com/Kaljurand/testing-vabamorf/commit/d6fa07b528361530d6b2cec43dc0d8ed2245b3ba#diff-22b3fe3fcbdb4916fea92d05f91eb91d
Unfortunately, none of the file names in this corpus contain any of the substrings that get_target checks for, so it always returns the default 'artikkel'.
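A quick way to confirm this is to run the classifier over a few of the file names in question. The sketch below restates get_target as a lookup table, with the same substrings and return values as the function above; the TARGETS name is only illustrative and not part of estnltk. Each of the sample file names falls through to the default.

```python
# Substring -> target pairs, copied from the get_target function above.
# TARGETS is an illustrative name, not part of estnltk.
TARGETS = [
    ('drtood', 'dissertatsioon'),
    ('ilukirjandus', 'tervikteos'),
    ('seadused', 'seadus'),
    ('EestiArst', 'ajakirjanumber'),
    ('foorum', 'teema'),
    ('kommentaarid', 'kommentaarid'),
    ('uudisgrupid', 'uudisgrupi_salvestus'),
    ('jututoad', 'jututoavestlus'),
    ('stenogrammid', 'stenogramm'),
]


def get_target(fnm):
    """Return the target for the first matching substring, else 'artikkel'."""
    for needle, target in TARGETS:
        if needle in fnm:
            return target
    return 'artikkel'


if __name__ == '__main__':
    # File names taken from the corpus discussed in this issue.
    for fnm in ['KL-1994-02.tei', 'ML-1996-01.tei', 'OL-1990-01-30.tei']:
        print('%s -> %s' % (fnm, get_target(fnm)))
```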
To identify the files that cause the parse failure, I've wrapped the parsing in a try/except. These are the files that were skipped:
Warning: parse error: skipped: corpora/eestiaeg-19993-10-06.tei
Warning: parse error: skipped: corpora/KL-1994-02.tei
Warning: parse error: skipped: corpora/liivimaa-kroonika-1993-10-21.tei
Warning: parse error: skipped: corpora/ML-1996-01.tei
Warning: parse error: skipped: corpora/ML-1999-08-26.tei
Warning: parse error: skipped: corpora/OL-1990-01-30.tei
Warning: parse error: skipped: corpora/OL-1990-03-10.tei
Warning: parse error: skipped: corpora/spordileht-1993-10-10.tei
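For reference, the try/except filter can be sketched roughly as follows. This is not the code from the linked commit: parse_all is a hypothetical helper, and parse_fn stands in for whatever parses a single file (for example, a small wrapper around parse_tei_corpus). It catches the AttributeError seen in the traceback, prints a warning in the format shown above, and moves on to the next file.

```python
import sys


def parse_all(paths, parse_fn):
    """Parse each path with parse_fn; skip and report files that fail.

    parse_fn is a stand-in for the real per-file parser (an assumption,
    not an estnltk API). Returns (results, skipped): results maps each
    path to its parsed value, skipped lists the paths whose parsing
    raised AttributeError.
    """
    results = {}
    skipped = []
    for path in paths:
        try:
            results[path] = parse_fn(path)
        except AttributeError:
            # Same failure class as in the traceback:
            # 'NoneType' object has no attribute 'strip'.
            sys.stderr.write('Warning: parse error: skipped: %s\n' % path)
            skipped.append(path)
    return results, skipped
```

Catching only AttributeError keeps unrelated problems (a missing file, a bad encoding) visible instead of silently skipping them.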
Resolved by later releases, or no longer relevant because of the new corpus formats.