archive-query-log
📜 The Archive Query Log.
Restructure the crawling and parsing to store structured data in Elasticsearch indices instead of in the file system. Also store WARCs in S3 instead of raw files. The new storage...
- https://github.com/GLAM-Workbench/web-archives
- https://github.com/iipc/awesome-web-archiving
- https://docs.google.com/spreadsheets/d/1vnMaHxYcDZJoGPR5RERGXJ5CYGcdT6TxLqKb0plpwyU/edit?usp=sharing (manually collected list)
- https://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives
In this issue, I'd like to collect all types of metadata that can be found in a SERP URL, and keep track of what is supported in the AQL implementation....
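As a starting point for that collection, here is a minimal sketch of pulling three common kinds of metadata (query, page, offset) out of a SERP URL. The function name `parse_serp_url` and the default parameter names `q`, `page`, and `start` are assumptions for illustration; real providers use different names, and this is not the AQL implementation.

```python
from urllib.parse import parse_qs, urlsplit


def parse_serp_url(url, query_param="q", page_param="page", offset_param="start"):
    """Extract query, page, and offset from a SERP URL.

    The parameter names are hypothetical defaults; each provider
    would need its own configuration.
    """
    params = parse_qs(urlsplit(url).query)

    def first(name):
        # parse_qs maps each name to a list of values; take the first one.
        values = params.get(name)
        return values[0] if values else None

    def first_int(name):
        # Return the value as an int, or None if missing or not numeric.
        value = first(name)
        return int(value) if value is not None and value.isdigit() else None

    return {
        "query": first(query_param),
        "page": first_int(page_param),
        "offset": first_int(offset_param),
    }


result = parse_serp_url("https://www.google.com/search?q=web+archive&start=20")
```

Metadata that lives in the URL path or fragment rather than the query string would need separate handling.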
- [x] Which snapshot should be selected? Is a snapshot even available? -> nearest snapshots before and after the SERP's timestamp
- [ ] Download WARCs
- [ ] Parse...
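The snapshot selection rule from the first item could be sketched as follows. This is only an illustration under assumptions: the function name `nearest_snapshots` is hypothetical, snapshot timestamps are assumed to be available as a list of `datetime` objects, and a snapshot taken exactly at the SERP's timestamp is counted as "before".

```python
from bisect import bisect_right
from datetime import datetime


def nearest_snapshots(snapshot_times, serp_time):
    """Return (before, after): the nearest snapshot at or before serp_time
    and the nearest snapshot strictly after it. Either may be None."""
    times = sorted(snapshot_times)
    # Index of the first snapshot strictly later than serp_time.
    i = bisect_right(times, serp_time)
    before = times[i - 1] if i > 0 else None
    after = times[i] if i < len(times) else None
    return before, after


snapshots = [datetime(2020, 1, 1), datetime(2020, 6, 1), datetime(2021, 1, 1)]
before, after = nearest_snapshots(snapshots, datetime(2020, 7, 1))
```

If both values are `None`, no snapshot is available for that SERP at all.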
Here are additional lists to consider:
- https://seirdy.one/posts/2021/03/10/search-engines-with-own-indexes/
- https://www.searchenginemap.com/
- https://searchengine.party/ (includes templates for URL query parsers)
- https://thenewleafjournal.com/a-2021-list-of-alternative-search-engines-and-search-resources/ (includes some templates for URL query parsers)
- https://github.com/matomo-org/searchengine-and-social-list/blob/master/SearchEngines.yml (Matomo's...
The list should focus on differences and similarities as well as restrictions (e.g., API key required). Some tools and datasets to start with include (unordered):
- https://github.com/MarioVilas/googlesearch
- https://github.com/eliasdabbas/advertools (https://advertools.readthedocs.io/en/master/advertools.serp.html,...
Investigate the difference between exclusion reasons "Not archived", "No valid snapshot", and "No valid SERP". @schmiseb Do you remember what each reason means?
Some providers such as Google currently contain several (very different) services. We should split them into specific services such as Google Scholar, Google Books, etc.
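One possible way to do the split is to key services on the URL host. The sketch below assumes that a host-based lookup is sufficient; the mapping `SERVICE_BY_HOST`, the service identifiers, and the function `service_for_url` are all hypothetical, and some services may only be distinguishable by URL path or query parameters.

```python
from urllib.parse import urlsplit

# Hypothetical mapping from host name to a service identifier.
SERVICE_BY_HOST = {
    "scholar.google.com": "google-scholar",
    "books.google.com": "google-books",
    "www.google.com": "google-web-search",
}


def service_for_url(url, default="google"):
    """Map a URL to a specific service, falling back to the generic provider."""
    host = (urlsplit(url).hostname or "").lower()
    return SERVICE_BY_HOST.get(host, default)


service = service_for_url("https://scholar.google.com/scholar?q=web+archiving")
```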