Elasticsearch storage backend

Open janheinrichmerker opened this issue 1 year ago • 3 comments

Restructure the crawling and parsing to store structured data in Elasticsearch indices instead of in the file system. Also store WARCs in S3 instead of as raw files. The new storage backends should be flexible enough to allow re-parsing parts of the dataset without having to delete anything. The second key requirement is the ability to scale up massively by interacting only with standard ES/S3 APIs instead of having to mount a shared file system on all nodes.
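
To illustrate the re-parsing requirement, here is a minimal sketch, assuming each parsed record becomes one Elasticsearch document with a deterministic ID and a parser version tag, while the WARC stays in S3 and is only referenced by its object key. All names (`PARSER_VERSION`, `document_id`, `needs_reparse`, `build_document`, and the field names) are hypothetical and not taken from this repository, and no Elasticsearch or S3 client is called.

```python
import hashlib
from typing import Optional

# Hypothetical version tag; bump it whenever the parser logic changes.
PARSER_VERSION = 2


def document_id(archive_url: str, timestamp: str) -> str:
    """Derive a stable ID so that re-indexing overwrites rather than duplicates."""
    key = f"{archive_url}|{timestamp}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()


def needs_reparse(stored: Optional[dict]) -> bool:
    """A record is (re-)parsed if it is missing or came from an older parser."""
    if stored is None:
        return True
    return stored.get("parser_version", 0) < PARSER_VERSION


def build_document(archive_url: str, timestamp: str, warc_key: str, query: str) -> dict:
    """Assemble the document body; the WARC is referenced by key, not embedded."""
    return {
        "id": document_id(archive_url, timestamp),
        "archive_url": archive_url,
        "timestamp": timestamp,
        "warc_key": warc_key,  # assumed S3 object key of the stored WARC
        "query": query,
        "parser_version": PARSER_VERSION,
    }
```

With a scheme like this, re-parsing a subset only means selecting documents whose `parser_version` is outdated and writing the new result under the same ID, so nothing has to be deleted first.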

janheinrichmerker avatar Nov 02 '23 16:11 janheinrichmerker

Codecov Report

Attention: Patch coverage is 51.77305% with 68 lines in your changes missing coverage. Please review.

Project coverage is 56.36%. Comparing base (668de7e) to head (fbd3c6f). Report is 46 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| archive_query_log/legacy/results/parse.py | 21.73% | 18 Missing :warning: |
| archive_query_log/legacy/queries/parse.py | 19.04% | 17 Missing :warning: |
| archive_query_log/legacy/download/iterable.py | 35.00% | 13 Missing :warning: |
| archive_query_log/legacy/urls/iterable.py | 53.84% | 12 Missing :warning: |
| archive_query_log/legacy/model/parse.py | 78.57% | 3 Missing :warning: |
| archive_query_log/legacy/__init__.py | 77.77% | 2 Missing :warning: |
| archive_query_log/legacy/model/__init__.py | 87.50% | 1 Missing :warning: |
| archive_query_log/legacy/services/__init__.py | 75.00% | 1 Missing :warning: |
| archive_query_log/legacy/util/text.py | 80.00% | 1 Missing :warning: |

Additional details and impacted files
@@             Coverage Diff             @@
##             main      #25       +/-   ##
===========================================
- Coverage   89.68%   56.36%   -33.32%     
===========================================
  Files          61       16       -45     
  Lines        2724      864     -1860     
===========================================
- Hits         2443      487     -1956     
- Misses        281      377       +96     

codecov[bot] avatar Nov 03 '23 11:11 codecov[bot]

Closes #9

janheinrichmerker avatar Nov 15 '23 20:11 janheinrichmerker

Fixes #6

janheinrichmerker avatar Nov 27 '23 17:11 janheinrichmerker