
Downloading the dumps to start the Elasticsearch indexing

Open roholazandie opened this issue 3 years ago • 2 comments

It is not clear to me how I should download the dumps for pageviews and wikidata. Here is the link to the pageview dumps, but it's not clear which one is the right one:

https://dumps.wikimedia.org/other/pageviews/2020/2020-12/

Should I merge the files for the specific date 20201201 that is hardcoded in define_es.py?
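For reference, the hourly files on that listing follow a predictable naming scheme (`pageviews-YYYYMMDD-HH0000.gz`), so a full month's worth of URLs can be generated programmatically. This is just a sketch (the `hourly_dump_urls` helper is made up here); double-check the pattern against the listing page before downloading:

```python
from datetime import datetime, timedelta

def hourly_dump_urls(year, month):
    """Generate the URLs of every hourly pageview dump for one month.

    Assumes the filename pattern pageviews-YYYYMMDD-HH0000.gz used on
    dumps.wikimedia.org/other/pageviews/.
    """
    base = f"https://dumps.wikimedia.org/other/pageviews/{year}/{year}-{month:02d}/"
    urls = []
    t = datetime(year, month, 1)
    while t.month == month:
        urls.append(f"{base}pageviews-{t:%Y%m%d}-{t:%H}0000.gz")
        t += timedelta(hours=1)
    return urls

# December 2020: 31 days x 24 hourly files = 744 URLs
urls = hourly_dump_urls(2020, 12)
```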

roholazandie avatar Jun 21 '21 22:06 roholazandie

Each of the .gz files represents one hour on a given date, and each line in a file gives the number of pageviews for a page during that hour. So the more pageview data you aggregate, the less noisy the counts will be. I would recommend at least a week's worth of pageviews to cover any weekly fluctuations, but a month is probably ideal.

Picking the same month as the other dumps would be best. The page names are most likely to match that way.

The date 20201201 hardcoded in define_es is just the last time we recreated the index; you don't necessarily have to use that specific month.

AshwinParanjape avatar Jun 22 '21 19:06 AshwinParanjape

If you would like to change the hardcoded paths in wiki, you would need to do it in the following places:

  • https://github.com/stanfordnlp/chirpycardinal/search?q=enwiki-20201201-articles
  • https://github.com/stanfordnlp/chirpycardinal/search?q=enwiki-20201201-sections

While having dates in the index names was fine for us, most users of the open-source release are going to index only once, so the dates add unnecessary complexity. I'll consider removing the dates from the index names in the next release.

AshwinParanjape avatar Jun 22 '21 19:06 AshwinParanjape