pepy Non-normalised package name

Non-normalised package name

Open jayvdb opened this issue 5 years ago • 4 comments

I would like to use the data to correlate with openSUSE package names, which use the 'real' name supplied in setup.py, i.e. not-normalised.

I've been doing a bit of research at https://github.com/hugovk/top-pypi-packages/issues/4, and the raw data from bigquery can include this, with a very small perf hit, or it can be added afterwards by doing lookups against PyPI directly, which would be a significant extra effort to build and more work for the batch jobs.

I'm putting in the effort - I want to know where I should put the effort, here or somewhere else. If here, would you prefer it coming from bigquery, or added afterwards from PyPI. The latter makes sense if there are other details from PyPI that you believe are needed in this project which cant be obtained from the bigquery dataset.

May 10 '19 12:05 jayvdb

I would really like to be able to search with the original name and the normalized one.

I checked that in BigQuery there isn't any column where is only the original project name, and as you mentioned in the other issue, it can be extracted it from url or filename. Do you have a query to see the output and to test it manually?

Maybe you also could normalize the original names that are in setup.py, I think it would work.

May 10 '19 20:05 psincraian

I've read up on bq, and it supports the sql features i need to achieve this in a single query. Im on mobile atm, so will test theory tmr.

However, your project may benefit from a simpler approach, which will be slightly more processing. Instead of grouping by project name, group by project name and filename. The counts will be downloads of each file. Then select the project name and filename and store both in your schema. Then add the counts for each project in your batch processing to get the totals as before. Now you can also show version stats, or wheel or sdist stats, if those interest you.

May 12 '19 14:05 jayvdb

The minimal query change for this project would be something like

#standardSQL
SELECT substr(max(file.filename),1,LENGTH(file.project)) as unnormalised_name, count(*) AS downloads, file.project as normalised_name
FROM `the-psf.pypi.downloads20190505`
WHERE file.project = 'pyyaml'
GROUP BY file.project

but with 20190505 being a substitution like currently occurs in https://github.com/psincraian/pepy/blob/master/pepy/infrastructure/bq_downloads_extractor.py

and WHERE file.project = 'pyyaml' removed , as it is here to reduce the amount of work being done while debugging.

Also substr(max(file.filename),1,LENGTH(file.project)) as unnormalised_name can be replaced with simply max(file.filename) as max_filename to let the app logic see the real filename and do the trimming - the app logic could also normalise the un-normalised string and verify that it is the same as the normalised name in the bigquery dataset. A checksum if you like.

That query assumes that the 'max' filename is the best filename, but really it is mostly relying on the 'fact' that all filenames per project have the same prefix (which isnt quite correct - often there are packages which used _ that have switched to using -, so their filenames change for different version), or that the differences are trivial. Before deploying, some sanity check queries are needed to verify that those assumptions are true, or good enough. There is a more complicated query which can return the filename of the most downloaded file, but it is more development and runtime effort, and I doubt it is any better than the above.

Another approach, which seems to 'cost' more time/money/etc, is collecting all filenames into a cell in each row.

#standardSQL
SELECT file_project_name as project_name, sum(file_download_count) AS downloads, ARRAY_AGG(filename) as filenames
FROM
(
  SELECT file.project as file_project_name, file.filename as filename, count(*) AS file_download_count
  FROM `the-psf.pypi.downloads20190505`
  WHERE file.project = 'pyyaml'
  GROUP BY file.project, file.filename
) as file_downloads
GROUP BY file_project_name

Then the application server can at least run diagnostics on all filenames per project to ensure they are all appropriate and can some some basic analytics on the types of files being downloaded, but it lacks the ability to understand importance of any deductions as it doesnt have the download counts per file.

Now that I see the results, I am even more convinced that tools like pepy really shouldnt be letting bigquery aggregate by project name, but should instead gather per-file stats and then use app logic to aggregate the file stats smartly.

e.g. here is the results for pyyaml on 20190505

#standardSQL
SELECT file.project as file_project_name, file.filename as filename, count(*) AS file_download_count
FROM `the-psf.pypi.downloads20190505`
WHERE file.project = 'pyyaml'
GROUP BY file.project, file.filename
ORDER BY file_download_count desc

There are lots of oddities in that dataset, and it is worth collecting the raw data and finding ways to show it. But that is a bigger schema change, and requires a migration to handle backwards compatibility any retained historical data which isnt available and re-fetched.

May 13 '19 03:05 jayvdb

Hi @jayvdb,

Thanks for the clarification and the details. I really like the idea to have the downloads grouped by filename and then show more relevant stats to the user. Regarding the historical data I am not worried, we can delete it. All this data is in bigquery and I only need the downloads of the last 30 days.

The lean approach is to have the project name in pepy as another column, this should be straightforward to implement and to show the results. If you want to have it quickly you can create a pull request 😃

I will open another issue to have the downloads stats per version to discuss it 😉

May 20 '19 21:05 psincraian

pepy pepy copied to clipboard

Non-normalised package name

pepy
pepy copied to clipboard