Question about usage: errors during download and when resuming from a fingerprint (-fp)
Hey, thank you very much for your great work and for sharing it!
I let it run for quite a while. The run did not look finished, but it ended with a warning about a blank search results page (which I assume signals the end of the data), and shortly before that there were several errors for individual links. Overall, the results folder is only 7.3 GB, most of which is nw/nrw and bund. I then tried to restart from the fingerprint, hoping it would continue where it left off, but that failed with an AttributeError about a missing store_docId. Please let me know if you have any suggestions, need more feedback, or would like me to investigate further first.
Error about the missing store_docId when restarting from the fingerprint:
python -m gesp -fp results/2025-09-07_00-19/fingerprint.xz -w
/home/ubuntu/gesp/gesp/src/htmlparser.py:1: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
import pkg_resources
Due to the terms of use governing the databases accessed by gesp, the use of gesp is only permitted for non-commercial purposes. Do you use gesp exclusively for non-commercial purposes?
[Y]es/[N]o: y
Traceback (most recent call last):
File "/home/ubuntu/.pyenv/versions/3.10.13/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/ubuntu/.pyenv/versions/3.10.13/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/ubuntu/gesp/gesp/__main__.py", line 166, in <module>
main()
File "/home/ubuntu/gesp/gesp/__main__.py", line 116, in main
fp_importer = Fingerprint(path, fp, args.store_docId)
AttributeError: 'Namespace' object has no attribute 'store_docId'
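For context, an AttributeError like this on an argparse Namespace generally means that the code reads an attribute for which no argument was ever registered on the parser. The following is only a minimal sketch of that mechanism and of one defensive workaround; the parser, its option names, and the default value are assumptions for illustration, not gesp's actual CLI definition.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: these options only mimic the command line shown above.
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("-fp", "--fingerprint", default=None)
    parser.add_argument("-w", "--wait", action="store_true")
    # Note: no argument with dest "store_docId" is registered here.
    return parser


args = build_parser().parse_args(["-fp", "fingerprint.xz", "-w"])

# Reading an attribute that was never registered raises AttributeError.
try:
    args.store_docId
    raised = False
except AttributeError:
    raised = True

# Defensive read: fall back to a default (False is an assumed default here).
store_doc_id = getattr(args, "store_docId", False)
```

Registering the missing option on the parser would be the cleaner fix; the getattr fallback only avoids the crash.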
Errors from the previous full run:
ERROR: could not retrieve https://nrwe.justiz.nrw.de/arbgs/hamm/lag_hamm/j2009/16_Sa_1729_08urteil20090813.html
Error processing {'postprocess': False, 'wait': True, 'court': 'lag-hamm', 'date': '20090813', 'az': '16-Sa-1729-08', 'link': 'https://nrwe.justiz.nrw.de/arbgs/hamm/lag_hamm/j2009/16_Sa_1729_08urteil20090813.html'}
Traceback (most recent call last):
File "/home/ubuntu/.local/share/virtualenvs/gesp-HU9BHKIz/lib/python3.10/site-packages/scrapy/core/scraper.py", line 387, in start_itemproc
output = await maybe_deferred_to_future(
File "/home/ubuntu/.local/share/virtualenvs/gesp-HU9BHKIz/lib/python3.10/site-packages/twisted/internet/defer.py", line 1092, in _runCallbacks
current.result = callback( # type: ignore[misc]
File "/home/ubuntu/.local/share/virtualenvs/gesp-HU9BHKIz/lib/python3.10/site-packages/scrapy/utils/defer.py", line 407, in f
return deferred_from_coro(coro_f(*coro_args, **coro_kwargs))
File "/home/ubuntu/gesp/gesp/pipelines/exporters.py", line 21, in process_item
save_as_html(item, spider.name[7:], spider.path, spider.store_docId)
File "/home/ubuntu/gesp/gesp/src/create_file.py", line 17, in save_as_html
info(item)
File "/home/ubuntu/gesp/gesp/src/create_file.py", line 12, in info
if "link" in item:
TypeError: argument of type 'NoneType' is not iterable
ERROR: could not retrieve https://nrwe.justiz.nrw.de/lgs/duesseldorf/lg_duesseldorf/j2009/19_S_22_09_U_Vorbehaltsurteil_20090813.html
Error processing {'postprocess': False, 'wait': True, 'court': 'lg-duesseldorf', 'date': '20090813', 'az': '19-S-22-09-U-', 'link': 'https://nrwe.justiz.nrw.de/lgs/duesseldorf/lg_duesseldorf/j2009/19_S_22_09_U_Vorbehaltsurteil_20090813.html'}
Traceback (most recent call last):
File "/home/ubuntu/.local/share/virtualenvs/gesp-HU9BHKIz/lib/python3.10/site-packages/scrapy/core/scraper.py", line 387, in start_itemproc
output = await maybe_deferred_to_future(
File "/home/ubuntu/.local/share/virtualenvs/gesp-HU9BHKIz/lib/python3.10/site-packages/twisted/internet/defer.py", line 1092, in _runCallbacks
current.result = callback( # type: ignore[misc]
File "/home/ubuntu/.local/share/virtualenvs/gesp-HU9BHKIz/lib/python3.10/site-packages/scrapy/utils/defer.py", line 407, in f
return deferred_from_coro(coro_f(*coro_args, **coro_kwargs))
File "/home/ubuntu/gesp/gesp/pipelines/exporters.py", line 21, in process_item
save_as_html(item, spider.name[7:], spider.path, spider.store_docId)
File "/home/ubuntu/gesp/gesp/src/create_file.py", line 17, in save_as_html
info(item)
File "/home/ubuntu/gesp/gesp/src/create_file.py", line 12, in info
if "link" in item:
TypeError: argument of type 'NoneType' is not iterable
ERROR: could not retrieve https://nrwe.justiz.nrw.de/lgs/duesseldorf/lg_duesseldorf/j2009/37_O_111_08urteil20090813.html
Error processing {'postprocess': False, 'wait': True, 'court': 'lg-duesseldorf', 'date': '20090813', 'az': '37-O-111-08', 'link': 'https://nrwe.justiz.nrw.de/lgs/duesseldorf/lg_duesseldorf/j2009/37_O_111_08urteil20090813.html'}
Traceback (most recent call last):
File "/home/ubuntu/.local/share/virtualenvs/gesp-HU9BHKIz/lib/python3.10/site-packages/scrapy/core/scraper.py", line 387, in start_itemproc
output = await maybe_deferred_to_future(
File "/home/ubuntu/.local/share/virtualenvs/gesp-HU9BHKIz/lib/python3.10/site-packages/twisted/internet/defer.py", line 1092, in _runCallbacks
current.result = callback( # type: ignore[misc]
File "/home/ubuntu/.local/share/virtualenvs/gesp-HU9BHKIz/lib/python3.10/site-packages/scrapy/utils/defer.py", line 407, in f
return deferred_from_coro(coro_f(*coro_args, **coro_kwargs))
File "/home/ubuntu/gesp/gesp/pipelines/exporters.py", line 21, in process_item
save_as_html(item, spider.name[7:], spider.path, spider.store_docId)
File "/home/ubuntu/gesp/gesp/src/create_file.py", line 17, in save_as_html
info(item)
File "/home/ubuntu/gesp/gesp/src/create_file.py", line 12, in info
if "link" in item:
TypeError: argument of type 'NoneType' is not iterable
ERROR: could not retrieve https://nrwe.justiz.nrw.de/lgs/koeln/lg_koeln/j2009/37_O_143_09_Urteil_20090813.html
Error processing {'postprocess': False, 'wait': True, 'court': 'lg-koeln', 'date': '20090813', 'az': '37-O-143-09', 'link': 'https://nrwe.justiz.nrw.de/lgs/koeln/lg_koeln/j2009/37_O_143_09_Urteil_20090813.html'}
Traceback (most recent call last):
File "/home/ubuntu/.local/share/virtualenvs/gesp-HU9BHKIz/lib/python3.10/site-packages/scrapy/core/scraper.py", line 387, in start_itemproc
output = await maybe_deferred_to_future(
File "/home/ubuntu/.local/share/virtualenvs/gesp-HU9BHKIz/lib/python3.10/site-packages/twisted/internet/defer.py", line 1092, in _runCallbacks
current.result = callback( # type: ignore[misc]
File "/home/ubuntu/.local/share/virtualenvs/gesp-HU9BHKIz/lib/python3.10/site-packages/scrapy/utils/defer.py", line 407, in f
return deferred_from_coro(coro_f(*coro_args, **coro_kwargs))
File "/home/ubuntu/gesp/gesp/pipelines/exporters.py", line 21, in process_item
save_as_html(item, spider.name[7:], spider.path, spider.store_docId)
File "/home/ubuntu/gesp/gesp/src/create_file.py", line 17, in save_as_html
info(item)
File "/home/ubuntu/gesp/gesp/src/create_file.py", line 12, in info
if "link" in item:
TypeError: argument of type 'NoneType' is not iterable
ERROR: could not retrieve https://nrwe.justiz.nrw.de/lgs/koeln/lg_koeln/j2009/26_O_375_09beschluss20090813.html
Error processing {'postprocess': False, 'wait': True, 'court': 'lg-koeln', 'date': '20090813', 'az': '26-O-375-09', 'link': 'https://nrwe.justiz.nrw.de/lgs/koeln/lg_koeln/j2009/26_O_375_09beschluss20090813.html'}
Traceback (most recent call last):
File "/home/ubuntu/.local/share/virtualenvs/gesp-HU9BHKIz/lib/python3.10/site-packages/scrapy/core/scraper.py", line 387, in start_itemproc
output = await maybe_deferred_to_future(
File "/home/ubuntu/.local/share/virtualenvs/gesp-HU9BHKIz/lib/python3.10/site-packages/twisted/internet/defer.py", line 1092, in _runCallbacks
current.result = callback( # type: ignore[misc]
File "/home/ubuntu/.local/share/virtualenvs/gesp-HU9BHKIz/lib/python3.10/site-packages/scrapy/utils/defer.py", line 407, in f
return deferred_from_coro(coro_f(*coro_args, **coro_kwargs))
File "/home/ubuntu/gesp/gesp/pipelines/exporters.py", line 21, in process_item
save_as_html(item, spider.name[7:], spider.path, spider.store_docId)
File "/home/ubuntu/gesp/gesp/src/create_file.py", line 17, in save_as_html
info(item)
File "/home/ubuntu/gesp/gesp/src/create_file.py", line 12, in info
if "link" in item:
TypeError: argument of type 'NoneType' is not iterable
ERROR: could not retrieve https://nrwe.justiz.nrw.de/lgs/koeln/lg_koeln/j2009/29_S_11_09_Urteil_20090813.html
Error processing {'postprocess': False, 'wait': True, 'court': 'lg-koeln', 'date': '20090813', 'az': '29-S-11-09', 'link': 'https://nrwe.justiz.nrw.de/lgs/koeln/lg_koeln/j2009/29_S_11_09_Urteil_20090813.html'}
Traceback (most recent call last):
File "/home/ubuntu/.local/share/virtualenvs/gesp-HU9BHKIz/lib/python3.10/site-packages/scrapy/core/scraper.py", line 387, in start_itemproc
output = await maybe_deferred_to_future(
File "/home/ubuntu/.local/share/virtualenvs/gesp-HU9BHKIz/lib/python3.10/site-packages/twisted/internet/defer.py", line 1092, in _runCallbacks
current.result = callback( # type: ignore[misc]
File "/home/ubuntu/.local/share/virtualenvs/gesp-HU9BHKIz/lib/python3.10/site-packages/scrapy/utils/defer.py", line 407, in f
return deferred_from_coro(coro_f(*coro_args, **coro_kwargs))
File "/home/ubuntu/gesp/gesp/pipelines/exporters.py", line 21, in process_item
save_as_html(item, spider.name[7:], spider.path, spider.store_docId)
File "/home/ubuntu/gesp/gesp/src/create_file.py", line 17, in save_as_html
info(item)
File "/home/ubuntu/gesp/gesp/src/create_file.py", line 12, in info
if "link" in item:
TypeError: argument of type 'NoneType' is not iterable
ERROR: could not retrieve https://nrwe.justiz.nrw.de/lgs/dortmund/lg_dortmund/j2009/4_O_91_06_Urteil_20090813.html
Error processing {'postprocess': False, 'wait': True, 'court': 'lg-dortmund', 'date': '20090813', 'az': '4-O-91-06', 'link': 'https://nrwe.justiz.nrw.de/lgs/dortmund/lg_dortmund/j2009/4_O_91_06_Urteil_20090813.html'}
Traceback (most recent call last):
File "/home/ubuntu/.local/share/virtualenvs/gesp-HU9BHKIz/lib/python3.10/site-packages/scrapy/core/scraper.py", line 387, in start_itemproc
output = await maybe_deferred_to_future(
File "/home/ubuntu/.local/share/virtualenvs/gesp-HU9BHKIz/lib/python3.10/site-packages/twisted/internet/defer.py", line 1092, in _runCallbacks
current.result = callback( # type: ignore[misc]
File "/home/ubuntu/.local/share/virtualenvs/gesp-HU9BHKIz/lib/python3.10/site-packages/scrapy/utils/defer.py", line 407, in f
return deferred_from_coro(coro_f(*coro_args, **coro_kwargs))
File "/home/ubuntu/gesp/gesp/pipelines/exporters.py", line 21, in process_item
save_as_html(item, spider.name[7:], spider.path, spider.store_docId)
File "/home/ubuntu/gesp/gesp/src/create_file.py", line 17, in save_as_html
info(item)
File "/home/ubuntu/gesp/gesp/src/create_file.py", line 12, in info
if "link" in item:
TypeError: argument of type 'NoneType' is not iterable
downloading https://nrwe.justiz.nrw.de/lgs/duisburg/lg_duisburg/j2009/12_O_125_08urteil20090813.html...
downloading https://nrwe.justiz.nrw.de/lgs/krefeld/lg_krefeld/j2009/3_S_41_08urteil20090813.html...
downloading https://nrwe.justiz.nrw.de/lgs/krefeld/lg_krefeld/j2009/3_S_41_08_Urteil_20090813.html...
downloading https://nrwe.justiz.nrw.de/olgs/hamm/j2009/3_Ss_323_09beschluss20090813.html...
downloading https://nrwe.justiz.nrw.de/lgs/koeln/lg_koeln/j2009/31_O_482_09beschluss20090813.html...
downloading https://nrwe.justiz.nrw.de/olgs/koeln/j2009/17_W_181_09beschluss20090813.html...
downloading https://nrwe.justiz.nrw.de/olgs/hamm/j2009/4_U_71_09urteil20090813.html...
downloading https://nrwe.justiz.nrw.de/olgs/hamm/j2009/2_Ws_211_09beschluss20090813.html...
downloading https://nrwe.justiz.nrw.de/olgs/hamm/j2009/2_Ws_216_09beschluss20090813.html...
downloading https://nrwe.justiz.nrw.de/ovgs/ovg_nrw/j2009/1_B_1149_09beschluss20090813.html...
downloading https://nrwe.justiz.nrw.de/ovgs/ovg_nrw/j2009/1_B_264_09beschluss20090813.html...
downloading https://nrwe.justiz.nrw.de/ovgs/ovg_nrw/j2009/12_A_421_08beschluss20090813.html...
downloading https://nrwe.justiz.nrw.de/ovgs/vg_arnsberg/j2009/5_K_677_09urteil20090813.html...
downloading https://nrwe.justiz.nrw.de/ovgs/vg_arnsberg/j2009/5_K_942_09urteil20090813.html...
WARNING: blank search results page https://nrwesuche.justiz.nrw.de/index.php
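The repeated TypeError in the tracebacks above comes from the membership test `if "link" in item:` being evaluated while item is None, presumably because the preceding download failed. A minimal sketch of a guard, assuming a hypothetical stand-in for the logging helper (this is not gesp's actual info() function, and the returned messages are made up for illustration):

```python
from typing import Optional


def info(item: Optional[dict]) -> str:
    """Hypothetical stand-in for a logging helper; not gesp's real code."""
    # Guard first: `"link" in None` would raise TypeError.
    if item is None:
        return "skipped: no item (download may have failed)"
    if "link" in item:
        return f"saved {item['link']}"
    return "saved item without link"


messages = [
    info(None),
    info({"court": "lg-koeln", "link": "https://example.invalid/doc.html"}),
    info({"court": "lg-koeln"}),
]
```

With such a guard a failed download would be logged and skipped instead of raising inside the item pipeline.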
Thank you for the detailed report; the websites change from time to time, which requires the scrapers to be adjusted. I'll look into it, update the code and get back to you.
Awesome, thanks.