
Question re usage / errors on download and fp

Open step21 opened this issue 3 months ago • 2 comments

Hey, thank you very much for your great work and for sharing it! I let it run for quite a while. It did not seem to have finished, but it ended with a warning about a blank search results page (which I assume signals the end of the data), and shortly before that there were several link errors. Overall the results folder is only 7.3 GB, most of which is nw/nrw and bund. I tried to restart with the fingerprint, hoping it would continue, but this gave an error about a missing store_docId attribute. Please let me know if you have any suggestions, need more feedback, or want me to investigate further first.

No store_docId error:

python -m gesp -fp results/2025-09-07_00-19/fingerprint.xz -w
/home/ubuntu/gesp/gesp/src/htmlparser.py:1: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
Due to the terms of use governing the databases accessed by gesp, the use of gesp is only permitted for non-commercial purposes. Do you use gesp exclusively for non-commercial purposes?
[Y]es/[N]o: y
Traceback (most recent call last):
  File "/home/ubuntu/.pyenv/versions/3.10.13/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ubuntu/.pyenv/versions/3.10.13/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/gesp/gesp/__main__.py", line 166, in <module>
    main()
  File "/home/ubuntu/gesp/gesp/__main__.py", line 116, in main
    fp_importer = Fingerprint(path, fp, args.store_docId)
AttributeError: 'Namespace' object has no attribute 'store_docId'
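
The traceback above suggests that `args.store_docId` is read although no option with that destination was registered on the argument parser for this invocation. A minimal sketch of a defensive read, assuming a plain `argparse` setup: the parser below is hypothetical (only the `-fp` and `-w` flags are taken from the command above; the long option names and the helper `store_doc_id` are made up for illustration), and the actual fix in gesp may look different.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the real parser; it deliberately does NOT
    # define a store_docId option, mirroring the situation in the traceback.
    parser = argparse.ArgumentParser(prog="gesp-sketch")
    parser.add_argument("-fp", "--fingerprint", default=None)
    parser.add_argument("-w", "--wait", action="store_true")
    return parser


def store_doc_id(args: argparse.Namespace) -> bool:
    # getattr with a default avoids the AttributeError when the attribute
    # was never set on the Namespace.
    return getattr(args, "store_docId", False)


args = build_parser().parse_args(
    ["-fp", "results/2025-09-07_00-19/fingerprint.xz", "-w"]
)
print(store_doc_id(args))
```

Falling back to `False` is an assumption about the intended default; registering the option on the parser would be the more explicit alternative.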

Previous full run error:

ERROR:  could not retrieve https://nrwe.justiz.nrw.de/arbgs/hamm/lag_hamm/j2009/16_Sa_1729_08urteil20090813.html
Error processing {'postprocess': False, 'wait': True, 'court': 'lag-hamm', 'date': '20090813', 'az': '16-Sa-1729-08', 'link': 'https://nrwe.justiz.nrw.de/arbgs/hamm/lag_hamm/j2009/16_Sa_1729_08urteil20090813.html'}
Traceback (most recent call last):
  File "/home/ubuntu/.local/share/virtualenvs/gesp-HU9BHKIz/lib/python3.10/site-packages/scrapy/core/scraper.py", line 387, in start_itemproc
    output = await maybe_deferred_to_future(
  File "/home/ubuntu/.local/share/virtualenvs/gesp-HU9BHKIz/lib/python3.10/site-packages/twisted/internet/defer.py", line 1092, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "/home/ubuntu/.local/share/virtualenvs/gesp-HU9BHKIz/lib/python3.10/site-packages/scrapy/utils/defer.py", line 407, in f
    return deferred_from_coro(coro_f(*coro_args, **coro_kwargs))
  File "/home/ubuntu/gesp/gesp/pipelines/exporters.py", line 21, in process_item
    save_as_html(item, spider.name[7:], spider.path, spider.store_docId)
  File "/home/ubuntu/gesp/gesp/src/create_file.py", line 17, in save_as_html
    info(item)
  File "/home/ubuntu/gesp/gesp/src/create_file.py", line 12, in info
    if "link" in item:
TypeError: argument of type 'NoneType' is not iterable
ERROR:  could not retrieve https://nrwe.justiz.nrw.de/lgs/duesseldorf/lg_duesseldorf/j2009/19_S_22_09_U_Vorbehaltsurteil_20090813.html
Error processing {'postprocess': False, 'wait': True, 'court': 'lg-duesseldorf', 'date': '20090813', 'az': '19-S-22-09-U-', 'link': 'https://nrwe.justiz.nrw.de/lgs/duesseldorf/lg_duesseldorf/j2009/19_S_22_09_U_Vorbehaltsurteil_20090813.html'}
Traceback (most recent call last):
  File "/home/ubuntu/.local/share/virtualenvs/gesp-HU9BHKIz/lib/python3.10/site-packages/scrapy/core/scraper.py", line 387, in start_itemproc
    output = await maybe_deferred_to_future(
  File "/home/ubuntu/.local/share/virtualenvs/gesp-HU9BHKIz/lib/python3.10/site-packages/twisted/internet/defer.py", line 1092, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "/home/ubuntu/.local/share/virtualenvs/gesp-HU9BHKIz/lib/python3.10/site-packages/scrapy/utils/defer.py", line 407, in f
    return deferred_from_coro(coro_f(*coro_args, **coro_kwargs))
  File "/home/ubuntu/gesp/gesp/pipelines/exporters.py", line 21, in process_item
    save_as_html(item, spider.name[7:], spider.path, spider.store_docId)
  File "/home/ubuntu/gesp/gesp/src/create_file.py", line 17, in save_as_html
    info(item)
  File "/home/ubuntu/gesp/gesp/src/create_file.py", line 12, in info
    if "link" in item:
TypeError: argument of type 'NoneType' is not iterable
ERROR:  could not retrieve https://nrwe.justiz.nrw.de/lgs/duesseldorf/lg_duesseldorf/j2009/37_O_111_08urteil20090813.html
Error processing {'postprocess': False, 'wait': True, 'court': 'lg-duesseldorf', 'date': '20090813', 'az': '37-O-111-08', 'link': 'https://nrwe.justiz.nrw.de/lgs/duesseldorf/lg_duesseldorf/j2009/37_O_111_08urteil20090813.html'}
Traceback (most recent call last):
  File "/home/ubuntu/.local/share/virtualenvs/gesp-HU9BHKIz/lib/python3.10/site-packages/scrapy/core/scraper.py", line 387, in start_itemproc
    output = await maybe_deferred_to_future(
  File "/home/ubuntu/.local/share/virtualenvs/gesp-HU9BHKIz/lib/python3.10/site-packages/twisted/internet/defer.py", line 1092, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "/home/ubuntu/.local/share/virtualenvs/gesp-HU9BHKIz/lib/python3.10/site-packages/scrapy/utils/defer.py", line 407, in f
    return deferred_from_coro(coro_f(*coro_args, **coro_kwargs))
  File "/home/ubuntu/gesp/gesp/pipelines/exporters.py", line 21, in process_item
    save_as_html(item, spider.name[7:], spider.path, spider.store_docId)
  File "/home/ubuntu/gesp/gesp/src/create_file.py", line 17, in save_as_html
    info(item)
  File "/home/ubuntu/gesp/gesp/src/create_file.py", line 12, in info
    if "link" in item:
TypeError: argument of type 'NoneType' is not iterable
ERROR:  could not retrieve https://nrwe.justiz.nrw.de/lgs/koeln/lg_koeln/j2009/37_O_143_09_Urteil_20090813.html
Error processing {'postprocess': False, 'wait': True, 'court': 'lg-koeln', 'date': '20090813', 'az': '37-O-143-09', 'link': 'https://nrwe.justiz.nrw.de/lgs/koeln/lg_koeln/j2009/37_O_143_09_Urteil_20090813.html'}
Traceback (most recent call last):
  File "/home/ubuntu/.local/share/virtualenvs/gesp-HU9BHKIz/lib/python3.10/site-packages/scrapy/core/scraper.py", line 387, in start_itemproc
    output = await maybe_deferred_to_future(
  File "/home/ubuntu/.local/share/virtualenvs/gesp-HU9BHKIz/lib/python3.10/site-packages/twisted/internet/defer.py", line 1092, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "/home/ubuntu/.local/share/virtualenvs/gesp-HU9BHKIz/lib/python3.10/site-packages/scrapy/utils/defer.py", line 407, in f
    return deferred_from_coro(coro_f(*coro_args, **coro_kwargs))
  File "/home/ubuntu/gesp/gesp/pipelines/exporters.py", line 21, in process_item
    save_as_html(item, spider.name[7:], spider.path, spider.store_docId)
  File "/home/ubuntu/gesp/gesp/src/create_file.py", line 17, in save_as_html
    info(item)
  File "/home/ubuntu/gesp/gesp/src/create_file.py", line 12, in info
    if "link" in item:
TypeError: argument of type 'NoneType' is not iterable
ERROR:  could not retrieve https://nrwe.justiz.nrw.de/lgs/koeln/lg_koeln/j2009/26_O_375_09beschluss20090813.html
Error processing {'postprocess': False, 'wait': True, 'court': 'lg-koeln', 'date': '20090813', 'az': '26-O-375-09', 'link': 'https://nrwe.justiz.nrw.de/lgs/koeln/lg_koeln/j2009/26_O_375_09beschluss20090813.html'}
Traceback (most recent call last):
  File "/home/ubuntu/.local/share/virtualenvs/gesp-HU9BHKIz/lib/python3.10/site-packages/scrapy/core/scraper.py", line 387, in start_itemproc
    output = await maybe_deferred_to_future(
  File "/home/ubuntu/.local/share/virtualenvs/gesp-HU9BHKIz/lib/python3.10/site-packages/twisted/internet/defer.py", line 1092, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "/home/ubuntu/.local/share/virtualenvs/gesp-HU9BHKIz/lib/python3.10/site-packages/scrapy/utils/defer.py", line 407, in f
    return deferred_from_coro(coro_f(*coro_args, **coro_kwargs))
  File "/home/ubuntu/gesp/gesp/pipelines/exporters.py", line 21, in process_item
    save_as_html(item, spider.name[7:], spider.path, spider.store_docId)
  File "/home/ubuntu/gesp/gesp/src/create_file.py", line 17, in save_as_html
    info(item)
  File "/home/ubuntu/gesp/gesp/src/create_file.py", line 12, in info
    if "link" in item:
TypeError: argument of type 'NoneType' is not iterable
ERROR:  could not retrieve https://nrwe.justiz.nrw.de/lgs/koeln/lg_koeln/j2009/29_S_11_09_Urteil_20090813.html
Error processing {'postprocess': False, 'wait': True, 'court': 'lg-koeln', 'date': '20090813', 'az': '29-S-11-09', 'link': 'https://nrwe.justiz.nrw.de/lgs/koeln/lg_koeln/j2009/29_S_11_09_Urteil_20090813.html'}
Traceback (most recent call last):
  File "/home/ubuntu/.local/share/virtualenvs/gesp-HU9BHKIz/lib/python3.10/site-packages/scrapy/core/scraper.py", line 387, in start_itemproc
    output = await maybe_deferred_to_future(
  File "/home/ubuntu/.local/share/virtualenvs/gesp-HU9BHKIz/lib/python3.10/site-packages/twisted/internet/defer.py", line 1092, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "/home/ubuntu/.local/share/virtualenvs/gesp-HU9BHKIz/lib/python3.10/site-packages/scrapy/utils/defer.py", line 407, in f
    return deferred_from_coro(coro_f(*coro_args, **coro_kwargs))
  File "/home/ubuntu/gesp/gesp/pipelines/exporters.py", line 21, in process_item
    save_as_html(item, spider.name[7:], spider.path, spider.store_docId)
  File "/home/ubuntu/gesp/gesp/src/create_file.py", line 17, in save_as_html
    info(item)
  File "/home/ubuntu/gesp/gesp/src/create_file.py", line 12, in info
    if "link" in item:
TypeError: argument of type 'NoneType' is not iterable
ERROR:  could not retrieve https://nrwe.justiz.nrw.de/lgs/dortmund/lg_dortmund/j2009/4_O_91_06_Urteil_20090813.html
Error processing {'postprocess': False, 'wait': True, 'court': 'lg-dortmund', 'date': '20090813', 'az': '4-O-91-06', 'link': 'https://nrwe.justiz.nrw.de/lgs/dortmund/lg_dortmund/j2009/4_O_91_06_Urteil_20090813.html'}
Traceback (most recent call last):
  File "/home/ubuntu/.local/share/virtualenvs/gesp-HU9BHKIz/lib/python3.10/site-packages/scrapy/core/scraper.py", line 387, in start_itemproc
    output = await maybe_deferred_to_future(
  File "/home/ubuntu/.local/share/virtualenvs/gesp-HU9BHKIz/lib/python3.10/site-packages/twisted/internet/defer.py", line 1092, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "/home/ubuntu/.local/share/virtualenvs/gesp-HU9BHKIz/lib/python3.10/site-packages/scrapy/utils/defer.py", line 407, in f
    return deferred_from_coro(coro_f(*coro_args, **coro_kwargs))
  File "/home/ubuntu/gesp/gesp/pipelines/exporters.py", line 21, in process_item
    save_as_html(item, spider.name[7:], spider.path, spider.store_docId)
  File "/home/ubuntu/gesp/gesp/src/create_file.py", line 17, in save_as_html
    info(item)
  File "/home/ubuntu/gesp/gesp/src/create_file.py", line 12, in info
    if "link" in item:
TypeError: argument of type 'NoneType' is not iterable
downloading https://nrwe.justiz.nrw.de/lgs/duisburg/lg_duisburg/j2009/12_O_125_08urteil20090813.html...
downloading https://nrwe.justiz.nrw.de/lgs/krefeld/lg_krefeld/j2009/3_S_41_08urteil20090813.html...
downloading https://nrwe.justiz.nrw.de/lgs/krefeld/lg_krefeld/j2009/3_S_41_08_Urteil_20090813.html...
downloading https://nrwe.justiz.nrw.de/olgs/hamm/j2009/3_Ss_323_09beschluss20090813.html...
downloading https://nrwe.justiz.nrw.de/lgs/koeln/lg_koeln/j2009/31_O_482_09beschluss20090813.html...
downloading https://nrwe.justiz.nrw.de/olgs/koeln/j2009/17_W_181_09beschluss20090813.html...
downloading https://nrwe.justiz.nrw.de/olgs/hamm/j2009/4_U_71_09urteil20090813.html...
downloading https://nrwe.justiz.nrw.de/olgs/hamm/j2009/2_Ws_211_09beschluss20090813.html...
downloading https://nrwe.justiz.nrw.de/olgs/hamm/j2009/2_Ws_216_09beschluss20090813.html...
downloading https://nrwe.justiz.nrw.de/ovgs/ovg_nrw/j2009/1_B_1149_09beschluss20090813.html...
downloading https://nrwe.justiz.nrw.de/ovgs/ovg_nrw/j2009/1_B_264_09beschluss20090813.html...
downloading https://nrwe.justiz.nrw.de/ovgs/ovg_nrw/j2009/12_A_421_08beschluss20090813.html...
downloading https://nrwe.justiz.nrw.de/ovgs/vg_arnsberg/j2009/5_K_677_09urteil20090813.html...
downloading https://nrwe.justiz.nrw.de/ovgs/vg_arnsberg/j2009/5_K_942_09urteil20090813.html...
WARNING:  blank search results page https://nrwesuche.justiz.nrw.de/index.php
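
In each of the tracebacks above, `"link" in item` fails because `item` is `None`, apparently after the preceding "could not retrieve" error. A minimal sketch of a guard, assuming the item is either a dict or `None`: the function `describe_item` is hypothetical and only illustrates the check, it is not the code from `create_file.py`.

```python
from typing import Optional


def describe_item(item: Optional[dict]) -> str:
    # A membership test on None raises
    # "TypeError: argument of type 'NoneType' is not iterable",
    # so the None case is handled before looking for the key.
    if item is None:
        return "skipped: no item"
    if "link" in item:
        return f"downloading {item['link']}..."
    return "skipped: item without link"


print(describe_item(None))
print(describe_item({"court": "lg-koeln", "link": "https://nrwe.justiz.nrw.de/"}))
```

Whether such items should be skipped silently or retried is a design decision for the pipeline; the sketch only avoids the crash.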

step21, Sep 08 '25 13:09

Thank you for the detailed report; the websites change from time to time, which requires the scrapers to be adjusted. I'll look into it, update the code and get back to you.

niklaswais, Sep 08 '25 14:09

Awesome, thanks.

step21, Sep 09 '25 11:09