
📜 The Archive Query Log.

Results 19 archive-query-log issues

Restructure the crawling and parsing to store structured data in Elasticsearch indices instead of in the file system. Also store WARCs in S3 instead of raw files. The new storage...

Look for "smaller" WARC files on S3 that can be merged.

enhancement

- https://github.com/GLAM-Workbench/web-archives
- https://github.com/iipc/awesome-web-archiving
- https://docs.google.com/spreadsheets/d/1vnMaHxYcDZJoGPR5RERGXJ5CYGcdT6TxLqKb0plpwyU/edit?usp=sharing (manually collected list)
- https://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives

In this issue, I'd like to collect all types of metadata that can be found in a SERP URL, and keep track of what is supported in the AQL implementation....

help wanted
question
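As a rough illustration of the kind of metadata a SERP URL can carry, here is a minimal sketch using only Python's standard library. The function name `extract_serp_metadata` and the parameter names (`q`, `query`, `page`, `p`, `start`, `offset`) are assumptions for illustration; they are not taken from the AQL implementation, and real providers use many other conventions.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical parameter names; actual search providers differ widely.
QUERY_PARAMS = ("q", "query")
PAGE_PARAMS = ("page", "p")
OFFSET_PARAMS = ("start", "offset")


def first_value(params, names):
    """Return the first non-empty value found under any of the given names."""
    for name in names:
        values = params.get(name)
        if values:
            return values[0]
    return None


def to_int(value):
    """Convert a digit-only string to int, otherwise return None."""
    return int(value) if value is not None and value.isdigit() else None


def extract_serp_metadata(url):
    """Extract the domain, query, page, and offset from a SERP URL, if present."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    return {
        "domain": parts.hostname,
        "query": first_value(params, QUERY_PARAMS),
        "page": to_int(first_value(params, PAGE_PARAMS)),
        "offset": to_int(first_value(params, OFFSET_PARAMS)),
    }
```

For example, `extract_serp_metadata("https://example.com/search?q=web+archive&start=20")` yields the domain `example.com`, the query `web archive`, and the offset `20`, with no page number.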

In this issue, I'd like to collect all types of metadata that can be found in a SERP URL, and keep track of what is supported in the AQL implementation....

enhancement

- [x] Which snapshot should be selected? Is a snapshot even available? -> nearest snapshots before and after the SERP's timestamp
- [ ] Download WARCs
- [ ] Parse...

enhancement
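The snapshot selection rule from the checklist above (nearest snapshots before and after the SERP's timestamp) could be sketched as follows. This is only an illustration under assumptions: the function name `nearest_snapshots` is hypothetical, snapshot times are assumed to be available as a list of `datetime` objects, and a snapshot taken at exactly the SERP's timestamp is treated as the "after" snapshot.

```python
from bisect import bisect_left
from datetime import datetime


def nearest_snapshots(snapshot_times, serp_time):
    """Return (before, after): the closest snapshot times around serp_time.

    Either element is None if no snapshot exists on that side.
    A snapshot exactly at serp_time counts as "after" (an assumption).
    """
    times = sorted(snapshot_times)
    # Index of the first snapshot that is not earlier than serp_time.
    i = bisect_left(times, serp_time)
    before = times[i - 1] if i > 0 else None
    after = times[i] if i < len(times) else None
    return before, after
```

If no snapshot exists at all, both values are `None`, which corresponds to the "Is a snapshot even available?" case.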

Here are additional lists to consider:
- https://seirdy.one/posts/2021/03/10/search-engines-with-own-indexes/
- https://www.searchenginemap.com/
- https://searchengine.party/ (includes templates for URL query parsers)
- https://thenewleafjournal.com/a-2021-list-of-alternative-search-engines-and-search-resources/ (includes some templates for URL query parsers)
- https://github.com/matomo-org/searchengine-and-social-list/blob/master/SearchEngines.yml (Matomo's...

enhancement

The list should focus on differences and similarities as well as restrictions (e.g., API key required). Some tools and datasets to start with include (unordered):
- https://github.com/MarioVilas/googlesearch
- https://github.com/eliasdabbas/advertools (https://advertools.readthedocs.io/en/master/advertools.serp.html,...

documentation

Investigate the difference between exclusion reasons "Not archived", "No valid snapshot", and "No valid SERP". @schmiseb Do you remember what each reason means?

question

Some providers such as Google currently contain several (very different) services. We should split them into specific services such as Google Scholar, Google Books, etc.

enhancement