chirpycardinal
Downloading the dumps to start the Elasticsearch indexing
It is not clear to me how I should download the dumps for pageviews and Wikidata. Here is the link to the pageview dumps, but it's not clear which files are the right ones:
https://dumps.wikimedia.org/other/pageviews/2020/2020-12/
Should I merge them to match the specific date of 20201201 that is hardcoded in define_es.py?
Each of the .gz files represents one hour on a certain date, and each line in a file gives the number of pageviews for a page in that hour. The more pageview data you aggregate, the less noisy the counts will be. I would recommend at least a week's worth of pageviews to cover weekly fluctuations, but a month is probably ideal.
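
As a minimal sketch of that aggregation (not the chirpycardinal pipeline itself), here is one way to sum the hourly counts into a single per-page total. It assumes the hourly dumps have already been downloaded into a local `pageview_dumps/` directory (that directory name, and the filter to the `en` domain code, are illustrative choices). Lines in these files have the form `domain_code page_title count_views total_response_size`:

```python
import glob
import gzip
from collections import Counter

# Sum the hourly view counts for every English-Wikipedia page across all
# downloaded dump files. The more hours covered, the smoother the totals.
totals = Counter()
for path in sorted(glob.glob("pageview_dumps/pageviews-*.gz")):
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.split(" ")
            if len(parts) >= 3 and parts[0] == "en":
                try:
                    totals[parts[1]] += int(parts[2])
                except ValueError:
                    pass  # skip lines with a malformed count field

# totals now maps page titles to views summed over the whole period
print(totals.most_common(10))
```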
Picking the same month as the other dumps would be best. The page names are most likely to match that way.
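
To fetch a whole month of hourly files, here is a hedged sketch against the December 2020 directory linked above. The `pageviews-YYYYMMDD-HH0000.gz` filename pattern matches what that directory lists; adjust the year and month to whichever month your other dumps come from:

```python
import os
import urllib.request

# Download all hourly pageview files for December 2020 (31 days * 24 hours).
BASE = "https://dumps.wikimedia.org/other/pageviews/2020/2020-12/"
os.makedirs("pageview_dumps", exist_ok=True)
for day in range(1, 32):  # December has 31 days
    for hour in range(24):
        name = f"pageviews-202012{day:02d}-{hour:02d}0000.gz"
        urllib.request.urlretrieve(BASE + name, os.path.join("pageview_dumps", name))
```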
The date 20201201 hardcoded in define_es.py is just when we last recreated the index; you don't necessarily have to use that specific month.
If you would like to change the hardcoded paths in wiki, you would need to do it in the following places (a sketch of one way to centralize the date follows the list):
- https://github.com/stanfordnlp/chirpycardinal/search?q=enwiki-20201201-articles
- https://github.com/stanfordnlp/chirpycardinal/search?q=enwiki-20201201-sections
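
A hypothetical sketch of factoring that date out into one place, so both index names stay in sync with whichever dumps you indexed. The `DUMP_DATE` constant, the environment variable, and the index-name variables below are illustrative names, not the repo's actual code in define_es.py:

```python
import os

# Read the dump date from one place instead of hardcoding it per index.
DUMP_DATE = os.environ.get("WIKI_DUMP_DATE", "20201201")

ARTICLES_INDEX = f"enwiki-{DUMP_DATE}-articles"
SECTIONS_INDEX = f"enwiki-{DUMP_DATE}-sections"
```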
While having dates in the index names was fine for us, most users of the open-source release are going to index only once, so the dates add unnecessary complexity. I'll consider removing the dates from the index names in the next release.