
Consistent zero counts for Dec 31 2008

Open ukohler opened this issue 5 years ago • 5 comments

Dear Peter,

thank you very much for providing access to the "older" wiki pageview stats by means of an API. It is a pity that Wikipedia itself has not managed to include them in its own API so far. I am creating an ado-file for Stata right now. While doing so, I found some characteristics of the API's responses that I would like to understand better. As I can see in your own examples, there are zero counts on some days even for terms that are heavily requested. I would regard this as a mere nuisance, but some of these zero counts seem to be consistent. One example is the zero count for Dec 31 2008. Since 2008 is a leap year, the entry should be in the 366th position of the page_view_count field of the JSON response. There is an entry there, but it seems to be consistently zero (I checked Angela_Merkel, Albert_Einstein, Bazooka, and Lothar_Matthäus for various languages: de, fr, en).
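For reference, the pattern can be reproduced from R with the wikipediatrend package, which wraps the same API (a minimal sketch; the exact column names of the returned data frame may differ between package versions):

library(wikipediatrend)

# pull counts around the suspicious date for one of the pages mentioned above
trend <- wp_trend(page = "Albert_Einstein", lang = "en",
                  from = "2008-12-25", to = "2009-01-05")

# the row for Dec 31 2008 shows the consistent zero count
trend[trend$date == as.Date("2008-12-31"), ]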

Any insights would be extremely helpful.

Many regards,
Ulrich Kohler

ukohler avatar Jun 19 '20 08:06 ukohler

Thanks for posting.

Since the counts are based on raw server request logs - and by raw I mean raw bytes - and I am pretty sure I read in the page dump 'documentation' that the servers might not have been as robust back then as they are nowadays, I would not be too surprised by zero counts in general - even for high-traffic pages like Albert Einstein.

Those patterns for leap days are however very suspicious. I will have to look into the dumps (Analytics Datasets, dumps for Feb. 2008) and into my code (Wikipedia Dumps Download and Extraction Repository).

petermeissner avatar Jun 25 '20 19:06 petermeissner

So, in 2008 there are definitely dumps for February 29th, which suggests that it's not a problem with the raw data.

The problem I have to solve is this: the database does not retain dates for individual counts. To keep the database size reasonably 'small' (~300 GB), the time series are stored as comma-separated series of counts per page id and year, e.g.:

page_id  year  counts
1        2008  1,2,100,354,...
1        2009  5,2,0,1,...
1        2010  7,21,10,44,...
...      ...   ...

... so I have to dig into the code to find out what's going on - at this point I suspect I messed something up. Hopefully it's very systematic and I can either fix it or adjust the data/API.
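To illustrate how the lookup works, a stored row can be decoded roughly like this (a minimal sketch with dummy data; only the table layout above is taken from the actual database):

# a dummy row standing in for year 2008 of some page (366 values)
counts_2008 <- paste(sample(0:500, 366, replace = TRUE), collapse = ",")

# split the string into an integer vector; position i holds day-of-year i
counts <- as.integer(strsplit(counts_2008, ",", fixed = TRUE)[[1]])

# Dec 31 2008 is day 366 of a leap year, i.e. the last position
doy <- as.integer(strftime(as.Date("2008-12-31"), "%j"))  # 366
counts[doy]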

petermeissner avatar Jun 25 '20 19:06 petermeissner

... I have browsed the code again and had a look at more pages ...

  • I was worried that I had used a series of "1 to 365" either to generate download URLs or for count aggregation - but looking through the code I only found the use of proper date series generation functions (see the sketch after this list).
  • Just as a side note: a count of 0 can mean two things - "There was no request." or "There was no data." - and there is no way to distinguish the two by looking at the logs.
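For illustration, this is the kind of off-by-one I was looking for - a hard-coded day index versus a real date sequence (a minimal sketch, not code from the repository):

# a hard-coded 1:365 index silently drops the 366th day of leap years ...
length(1:365)                               # 365

# ... while a proper date sequence covers Dec 31 2008 as well:
days_2008 <- seq(as.Date("2008-01-01"), as.Date("2008-12-31"), by = "day")
length(days_2008)                           # 366
tail(days_2008, 1)                          # "2008-12-31"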

petermeissner avatar Jun 25 '20 20:06 petermeissner

Just for information: I made my observations while doing some interactive debugging of my Stata program. At the time I asked for "Angela Merkel", "Albert Einstein", "Bazooka", and "Lothar Matthäus" in the languages "de", "fr", and "en", but I don't remember whether I tried all article-language combinations. Meanwhile, I have a verification script that asks for the following page statistics:

  • Bazooka (de, en)
  • Franz Beckenbauer (de, es)
  • Günter Netzer (de, es)
  • Donald Trump (en, de)
  • Emmanuel Macron (en, de)
  • Boris Johnson (en, de)
  • Jair Bolsonaro (en, de)
  • Xi Jinping (en, de)
  • Angela Merkel (en, de)
  • Vladimir Putin (en, de)

I observe the zero counts in 2008 for all of these pages. Of course, these observations are made on the return values of my Stata program, so there is a probability above zero that the problems stem from there. However, I don't see this for 2012 (also a leap year), so I don't think it is anything systematic in my program.

Drop me a note if you want to see the counts in my data, or the exact dates of my calls to your API.

Uli

ukohler avatar Jun 26 '20 06:06 ukohler

Thanks for the additional info.

I was doing some digging into the job logs and found the following:

INSERT INTO public.upload_jobs (job_id, job_start_ts, job_end_ts, job_status, job_progress, job_type, job_run_node, job_target_node, job_file, job_ts_update, job_pace_sec_per_mio, job_comment) VALUES (10936, '2018-10-27 19:43:06', '2018-10-27 19:43:07', 'error', 0, 'gz, all', 'pm2', 'pm2', '/data/wpd/todo/pagecounts-20081231-.*.gz', '2018-10-27 19:43:07', NULL, '
-----------------

 /data/wpd/todo/pagecounts-20081231-.*.gz 

----------------

-----------------

 Error in duty_to_do_function() : 
  Expected number of .gz files for date 24 but found 25.
 
----------------
3: (function () 
   {
       if (!exists("date") | class(date) == "function") {
           date <- ""
       }
       if (!exists("lang")) {
           lang <- ""
       }
       em <- geterrmessage()
       fname <- paste0("Rscript_", paste(date, paste(lang, collapse = "_"), 
           sep = "_"), ".error")
       sink(file = fname)
       cat("\n-----------------\n\n", file, "\n\n----------------\n")
       cat("\n-----------------\n\n", em, "\n----------------\n")
       traceback(2)
       sink()
       cat("\n-----------------\n\n", em, "\n----------------\n")
       cat(readLines(fname), sep = "\n")
       if (exists("job_id")) {
           cat("\n-----------------\njob_id:", job_id, "\n-----------------\n")
           wpd_job_update(job_id = job_id, job_status = "error", 
               job_comment = paste(readLines(fname), collapse = "\n"), 
               job_end_ts = as.character(Sys.time()))
           if (!interactive()) {
               Sys.sleep(4)
           }
       }
       else {
           job_id <- "unknown"
           cat("\n-----------------\njob_id:", job_id, "\n-----------------\n")
       }
       if (!interactive()) {
           wpd_notify(wpd_current_node(), "[", job_id, "]", date, 
               "--", file, "--", paste(lang, collapse = ", "), "--", 
               paste(readLines(fname), collapse = "\n"))
           q(save = "no")
       }
   })()
2: stop("Expected number of .gz files for date 24 but found ", length(files), 
       ".")
1: duty_to_do_function()');

Cool things ...

  • number 1: I have job logs: upload_jobs.zip
  • number 2: I did quality assertions to test some assumptions about the data
  • number 3: I now know why the counts are all 0.

... The logs basically say that there are more files than expected. There is supposed to be 1 file per hour of the day. If there are fewer, we cannot be sure the counts are comparable. If there are more, the numbers are most likely weird as well.
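In other words, the upload job runs a sanity check along these lines before a day's files are aggregated (a minimal sketch reconstructed from the log above; the actual internals of duty_to_do_function() are not shown there):

# one pagecounts-*.gz file per hour is expected, i.e. 24 per day;
# a mismatch aborts the upload job and leaves the day's counts at 0
files <- list.files("/data/wpd/todo",
                    pattern = "pagecounts-20081231-.*\\.gz$",
                    full.names = TRUE)
if (length(files) != 24) {
  stop("Expected number of .gz files for date 24 but found ",
       length(files), ".")
}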

So, the 0-counts for Dec 31 2008 are not there by accident but by design, and they occur for all languages because all languages are in one file.

petermeissner avatar Jun 26 '20 19:06 petermeissner