archive-query-log issues

Elasticsearch storage backend

3

Restructure the crawling and parsing to store structured data in Elasticsearch indices instead of in the file system. Also store WARCs in S3 instead of raw files. The new storage...

janheinrichmerker

Defragment S3 WARC storage

1

Look for "smaller" WARC files on S3 that can be merged.

janheinrichmerker

enhancement
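One way to approach the search for mergeable files is to first plan the merges from an object listing, before touching any data. The following is only a sketch under assumptions: `WarcObject`, `plan_merges`, the size threshold, and the target size are hypothetical names and parameters, not part of the existing code base, and the listing itself (for example from an S3 client) is assumed to be available already.

```python
from typing import Iterable, List, NamedTuple


class WarcObject(NamedTuple):
    """Hypothetical description of one WARC file in the bucket."""
    key: str   # object key (assumed naming, not taken from the project)
    size: int  # object size in bytes


def plan_merges(
    objects: Iterable[WarcObject],
    small_threshold: int,
    target_size: int,
) -> List[List[WarcObject]]:
    """Greedily group "small" objects into batches of at most target_size bytes.

    Only batches with two or more objects are returned, because a batch
    with a single object would not reduce fragmentation.
    """
    # Keep only objects below the threshold; sort by key for stable output.
    small = sorted(
        (obj for obj in objects if obj.size < small_threshold),
        key=lambda obj: obj.key,
    )
    batches: List[List[WarcObject]] = []
    current: List[WarcObject] = []
    current_size = 0
    for obj in small:
        # Start a new batch once adding this object would exceed the target.
        if current and current_size + obj.size > target_size:
            batches.append(current)
            current, current_size = [], 0
        current.append(obj)
        current_size += obj.size
    if current:
        batches.append(current)
    return [batch for batch in batches if len(batch) > 1]
```

The planning step is kept separate from the actual merge so that the threshold and target size can be tuned on a listing alone. How the files in a batch are then combined and how references to the old keys are updated is left open here.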

Add more archives

- https://github.com/GLAM-Workbench/web-archives
- https://github.com/iipc/awesome-web-archiving
- https://docs.google.com/spreadsheets/d/1vnMaHxYcDZJoGPR5RERGXJ5CYGcdT6TxLqKb0plpwyU/edit?usp=sharing (manually collected list)
- https://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives

janheinrichmerker

Brainstorming: What can be found in a SERP's HTML?

In this issue, I'd like to collect all types of metadata that can be found in a SERP URL, and keep track of what is supported in the AQL implementation....

janheinrichmerker

help wanted

question

Parse more metadata from the SERP URL

In this issue, I'd like to collect all types of metadata that can be found in a SERP URL, and keep track of what is supported in the AQL implementation....

janheinrichmerker

enhancement
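To make the kind of URL metadata concrete, here is a minimal sketch that pulls a query string and a result offset out of a SERP URL with the standard library. The function name `parse_serp_url` and the default parameter names `q` and `start` are assumptions for illustration only; real providers use different parameter names, and this is not how the AQL implementation is known to do it.

```python
from typing import Dict, Optional, Union
from urllib.parse import parse_qs, urlsplit


def parse_serp_url(
    url: str,
    query_param: str = "q",       # assumed name of the query parameter
    offset_param: str = "start",  # assumed name of the offset parameter
) -> Dict[str, Optional[Union[str, int]]]:
    """Extract host, query, and result offset from a SERP URL, if present."""
    parts = urlsplit(url)
    # parse_qs decodes percent-escapes and "+" and maps names to value lists.
    params = parse_qs(parts.query)
    query = params.get(query_param, [None])[0]
    offset_raw = params.get(offset_param, [None])[0]
    # Only accept plain non-negative integers as offsets.
    offset = int(offset_raw) if offset_raw is not None and offset_raw.isdigit() else None
    return {"host": parts.hostname, "query": query, "offset": offset}
```

Further fields (page number, language, result type, and so on) could be added in the same way once it is clear which parameters each provider uses.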

Download referenced webpages from search results

- [x] Which snapshot should be selected? Is a snapshot even available? -> nearest snapshots before and after the SERP's timestamp
- [ ] Download WARCs
- [ ] Parse...

janheinrichmerker

enhancement
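The snapshot selection rule from the first task (nearest snapshots before and after the SERP's timestamp) could be sketched as follows. The function name `nearest_snapshots` is hypothetical, and the list of snapshot timestamps for a result URL is assumed to have been fetched from the archive beforehand.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime
from typing import Iterable, Optional, Tuple


def nearest_snapshots(
    snapshot_times: Iterable[datetime],
    serp_time: datetime,
) -> Tuple[Optional[datetime], Optional[datetime]]:
    """Return the closest snapshot at or before and at or after serp_time.

    Either element is None if no snapshot exists on that side. A snapshot
    taken exactly at serp_time is returned as both "before" and "after".
    """
    times = sorted(snapshot_times)
    # Index just past the last snapshot that is <= serp_time.
    i = bisect_right(times, serp_time)
    before = times[i - 1] if i > 0 else None
    # Index of the first snapshot that is >= serp_time.
    j = bisect_left(times, serp_time)
    after = times[j] if j < len(times) else None
    return before, after
```

Returning `None` for a missing side also covers the "Is a snapshot even available?" case: a result with `(None, None)` has no usable snapshot at all.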

Add more search providers

1

Here are additional lists to consider:
- https://seirdy.one/posts/2021/03/10/search-engines-with-own-indexes/
- https://www.searchenginemap.com/
- https://searchengine.party/ (includes templates for URL query parsers)
- https://thenewleafjournal.com/a-2021-list-of-alternative-search-engines-and-search-resources/ (includes some templates for URL query parsers)
- https://github.com/matomo-org/searchengine-and-social-list/blob/master/SearchEngines.yml (Matomo's...

janheinrichmerker

enhancement

Create list of previous SERP scraping resources

1

The list should focus on differences and similarities as well as restrictions (e.g., API key required). Some tools and datasets to start with include (unordered):
- https://github.com/MarioVilas/googlesearch
- https://github.com/eliasdabbas/advertools (https://advertools.readthedocs.io/en/master/advertools.serp.html,...

janheinrichmerker

documentation

Investigate exclusion reasons

Investigate the difference between exclusion reasons "Not archived", "No valid snapshot", and "No valid SERP". @schmiseb Do you remember what each reason means?

janheinrichmerker

question

Split "big" search providers

Some providers such as Google currently contain several (very different) services. We should split them into specific services such as Google Scholar, Google Books, etc.

janheinrichmerker

enhancement
