Certain Metrics API calls return unexpected or no counts in some Dataverse repositories
Some of the new Metrics API calls added in Dataverse 5.5 have returned unexpected counts or no counts when I tried them in a few Dataverse repositories. Some examples from the Harvard Dataverse Repository specifically:
The endpoint to get the count of all file downloads in a given Dataverse collection returned unexpectedly high counts. E.g.:
- https://dataverse.harvard.edu/api/info/metrics/downloads/?parentAlias=adriancorrendo returned 28,172,523 when it should have returned ~342
- https://dataverse.harvard.edu/api/info/metrics/downloads/?parentAlias=harvard returned 28,800,772, when a database query returned the more expected 2,321,135
- https://dataverse.harvard.edu/api/info/metrics/downloads/?parentAlias=EHS returned 28,177,683 when a database query returned the more expected 10,705
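A comparison like the one above can be scripted. The Python below is only a minimal sketch and not part of Dataverse: the helper names (`metrics_url`, `parse_count`, `compare`, `fetch_count`) are made up for illustration, and it assumes the endpoint answers with JSON shaped like `{"status": "OK", "data": {"count": N}}`.

```python
import json
import urllib.parse
import urllib.request

BASE = "https://dataverse.harvard.edu/api/info/metrics"


def metrics_url(metric, parent_alias=None):
    """Build a Metrics API URL, optionally scoped to a collection alias."""
    url = f"{BASE}/{metric}/"
    if parent_alias:
        url += "?" + urllib.parse.urlencode({"parentAlias": parent_alias})
    return url


def parse_count(body):
    """Extract the count, ASSUMING a {"status": ..., "data": {"count": N}} payload."""
    return json.loads(body)["data"]["count"]


def compare(reported, expected):
    """Summarize how far the API count is from an independently obtained count."""
    return {
        "reported": reported,
        "expected": expected,
        "difference": reported - expected,
        "ratio": reported / expected,
    }


def fetch_count(metric, parent_alias=None):
    """Call the live API (needs network access) and return the parsed count."""
    with urllib.request.urlopen(metrics_url(metric, parent_alias), timeout=30) as resp:
        return parse_count(resp.read().decode("utf-8"))
```

For example, `compare(fetch_count("downloads", "EHS"), 10705)` would report how far the API count is from the database figure quoted above.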
The "filedownloads" calls return a CSV only for certain Dataverse collections:
- https://dataverse.harvard.edu/api/info/metrics/filedownloads/?parentAlias=EHS
- https://dataverse.harvard.edu/api/info/metrics/filedownloads/monthly/?parentAlias=EHS
- https://dataverse.harvard.edu/api/info/metrics/filedownloads/?parentAlias=dkoretz
- https://dataverse.harvard.edu/api/info/metrics/filedownloads/monthly/?parentAlias=dkoretz
But return nothing if:
- You don't specify a Dataverse collection:
- https://dataverse.harvard.edu/api/info/metrics/filedownloads
- Or you specify certain other collections:
- https://dataverse.harvard.edu/api/info/metrics/filedownloads/?parentAlias=adriancorrendo
- https://dataverse.harvard.edu/api/info/metrics/filedownloads/monthly/?parentAlias=adriancorrendo
- https://dataverse.harvard.edu/api/info/metrics/filedownloads/?parentAlias=data_ncov
- https://dataverse.harvard.edu/api/info/metrics/filedownloads/monthly/?parentAlias=data_ncov
The same is true if you try to get the same data in JSON:
- Works:
curl -H 'Accept:application/json' https://dataverse.harvard.edu/api/info/metrics/filedownloads/?parentAlias=EHS
- Doesn't work:
curl -H 'Accept:application/json' https://dataverse.harvard.edu/api/info/metrics/filedownloads/?parentAlias=data_ncov
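To tell an empty response apart from a server error across many collections, the same requests can be made from a short script. This is a rough sketch under stated assumptions: `classify` and `probe` are hypothetical helper names, and the category labels are my own, not anything the API returns.

```python
import urllib.error
import urllib.request


def classify(status, body):
    """Label a response by HTTP status code and whether any content came back."""
    if status >= 500:
        return "server error"
    if status >= 400:
        return "client error"
    if not body.strip():
        return "empty body"
    return "ok"


def probe(url, accept=None):
    """Request a URL (needs network access) and classify the outcome."""
    headers = {"Accept": accept} if accept else {}
    request = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return classify(response.status, response.read())
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; the error object still carries the code and body.
        return classify(err.code, err.read())
```

Calling `probe(url, accept="application/json")` for each of the URLs listed above would show which ones fail outright and which ones succeed but return nothing.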
From what I can tell so far, the "filedownloads" calls work only when all files in the collection have PIDs. The collections in Harvard Dataverse Repository with the aliases EHS and dkoretz contain only files that have PIDs, while the collections with the aliases adriancorrendo and data_ncov contain files that do not have PIDs. (Many of the older collections in the repository have files with PIDs, while recently created or recently active collections tend to have files that don't have PIDs, since the repository turned off file PID registration a year or two ago.)
The other types of calls seem to return the right counts, including the call for getting all file downloads in the installation (https://dataverse.harvard.edu/api/info/metrics/downloads) and the calls for getting all file downloads in a given Dataverse collection by month (e.g. https://dataverse.harvard.edu/api/info/metrics/downloads/monthly/?parentAlias=EHS, https://dataverse.harvard.edu/api/info/metrics/downloads/monthly/?parentAlias=data_ncov and https://dataverse.harvard.edu/api/info/metrics/downloads/monthly/?parentAlias=adriancorrendo).
@jggautier - one quick question - when the metrics box on the front page shows 28,808,823 Downloads, why would you expect a call to get the downloads for ?parentAlias=harvard to not be close to that?
Turning on FINE logging for the MetricsServiceBean will log the exact queries being produced for the various calls above. Knowing that would help debug whether there's a flaw in the logic or some db issues. Similarly, it looks like at least some of the calls returning nothing are the result of 500 server errors which should be logged. Getting that log info will also be helpful in debugging.
Hey @qqmyers. Ah, I hadn't noticed that the Metrics API guides say that calls with ?parentAlias= "return the number of datasets in the Dataverse collection with alias ‘abc’ and in sub-collections within it."
So https://dataverse.harvard.edu/api/info/metrics/downloads/?parentAlias=harvard should return the same download count shown at the top of the repository's front page?
Is there a way to use the Metrics APIs to get only the counts of things owned by a given Dataverse collection and not things owned by its subcollections?
I don't think there's any option for just the collection itself right now - probably not too hard to implement.
W.r.t. the counts matching: in theory, I think everything should match. In practice it's possible that the new queries handle legacy data/bad db records slightly differently. It's been a while since I wrote this, but one example of where there could be differences is old entries that have a null date. There could also be bugs, which is where the FINE logging could help. If you see a difference, looking at the raw queries being run would help in deciding whether they are accurately handling new data and just treating old records slightly differently than the internal queries do, or are really doing something incorrectly.
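As a generic illustration of the null-date point, here is a small sketch of how a date filter can silently drop legacy rows. It assumes a made-up `download_event` table in an in-memory SQLite database; it does not reflect the real Dataverse schema or the queries MetricsServiceBean actually generates.

```python
import sqlite3


def count_downloads(rows, cutoff):
    """Return (all rows, rows dated on or before cutoff) for a HYPOTHETICAL table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE download_event (file_id INTEGER, responsetime TEXT)")
    conn.executemany("INSERT INTO download_event VALUES (?, ?)", rows)
    # Unfiltered count: every row is included, dated or not.
    total = conn.execute("SELECT COUNT(*) FROM download_event").fetchone()[0]
    # Date-filtered count: NULL <= ? evaluates to NULL, so undated rows are excluded.
    dated = conn.execute(
        "SELECT COUNT(*) FROM download_event WHERE responsetime <= ?", (cutoff,)
    ).fetchone()[0]
    conn.close()
    return total, dated


# Three illustrative download records; the second has no date, like an old legacy entry.
rows = [(1, "2015-03-01"), (2, None), (3, "2021-06-15")]
total, dated = count_downloads(rows, "2021-12-31")
print(total, dated)
```

Two queries that look equivalent can therefore disagree by exactly the number of undated rows, which is the kind of difference the logged queries would reveal.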
Thanks @qqmyers!
https://github.com/IQSS/dataverse/issues/9536 may be related
#9017 probably fixed the file downloads part of the issue
Today I finally got around to checking whether https://github.com/IQSS/dataverse/pull/9017 fixed the file downloads parts of this GitHub issue. It looks like the PR did confirm that some parts of the Metrics API weren't accounting for files that didn't have PIDs.
All of the example API calls I wrote about in the original comment now return expected results. Closing this issue.