top-pypi-packages
top-pypi-packages copied to clipboard
no August release
looking at the history, it seems as though the intention is to publish updated data on the first of each month.
Last update was 1st July.
https://github.com/samuelcolvin/watchfiles/pull/277#issuecomment-2269264835
Strange, still hitting the limit two weeks later:
google.api_core.exceptions.Forbidden: 403 Quota exceeded: Your project exceeded quota for free query bytes scanned. For more information, see https://cloud.google.com/bigquery/docs/troubleshoot-quotas
That was for 7,000 projects.
Google BigQuery gives 1 TB free quota per month.
Here's some data on past runs. The gaps indicate when switching to fetch data for more projects.
The 5k growth seemed fairly linear, so when switching to 8k I made a rough calculation that the 0.5 TB available would last about 40 months. Instead, since the turn of 2024 the growth has increased rapidly and is now over 1 TB for 8k.
I've started the 90-day, $300 (€277.46) free trial to get things back on track and fetched the July data:
pypinfo -v --json --indent 0 --limit 8000 -sd 2024-07-02 -ed 2024-07-31 "" project > top-pypi-packages-30-days.json
jq -c . < top-pypi-packages-30-days.json > top-pypi-packages-30-days.min.json
jq -r '.rows[] | [.download_count, .project] | @csv' top-pypi-packages-30-days.json >> top-pypi-packages-30-days.csv
./deploy.sh
https://github.com/hugovk/top-pypi-packages/releases/tag/2024.08
This indeed is over 1 TB, it's 1.023 TB:
At some point whilst on the free trial, I'll do some experiments to pick a lower number of projects that's still under 1 TB, but it's concerning it grew to use ~0.5 TB in ~0.5 year.
Let's keep this open until that is done, and to check the September update goes out as planned on Sunday.
Here's a couple of pypinfo commands, limiting to just the top 10 and 1 day, otherwise identical except for the --all flag.
--all means "Show downloads by all installers, not only pip".
By default, it only fetches downloads from pip, and that's what has been used in this repo (since day 1? Or perhaps since that flag was introduced? Would have to check. At least, the default has been been used here for a long time).
Only pip
❯ pypinfo --limit 10 --days 1 "" project
Served from cache: False
Data processed: 58.21 GiB
Data billed: 58.21 GiB
Estimated cost: $0.29
| project | download_count |
|---|---|
| boto3 | 37,251,744 |
| aiobotocore | 16,252,824 |
| urllib3 | 16,243,278 |
| botocore | 15,687,125 |
| requests | 13,271,314 |
| s3fs | 12,865,055 |
| s3transfer | 12,014,278 |
| fsspec | 11,982,305 |
| charset-normalizer | 11,684,740 |
| certifi | 11,639,584 |
| Total | 158,892,247 |
All installers
❯ pypinfo --all --limit 10 --days 1 "" project
Served from cache: False
Data processed: 46.63 GiB
Data billed: 46.63 GiB
Estimated cost: $0.23
| project | download_count |
|---|---|
| boto3 | 39,495,624 |
| botocore | 17,281,187 |
| urllib3 | 17,225,121 |
| aiobotocore | 16,430,826 |
| requests | 14,287,965 |
| s3fs | 12,958,516 |
| charset-normalizer | 12,781,405 |
| certifi | 12,647,098 |
| setuptools | 12,608,120 |
| idna | 12,510,335 |
| Total | 168,226,197 |
So we can see the default pip-only costs an extra 25% data processed and data billed, and costs an extra 25% in dollars.
Unsurprisingly, the actual download counts are higher for all installers. The ranking has changed a bit, but I expect we're still getting more-or-less the same packages in the top thousands of results.
Queries
Here's the same commands, but run with --test: "--test simply prints the query".
❯ pypinfo --test --limit 10 --days 1 "" project
SELECT
file.project as project,
COUNT(*) as download_count,
FROM `bigquery-public-data.pypi.file_downloads`
WHERE timestamp BETWEEN TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -2 DAY) AND TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -1 DAY)
AND details.installer.name = "pip"
GROUP BY
project
ORDER BY
download_count DESC
LIMIT 10
❯ pypinfo --test --all --limit 10 --days 1 "" project
SELECT
file.project as project,
COUNT(*) as download_count,
FROM `bigquery-public-data.pypi.file_downloads`
WHERE timestamp BETWEEN TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -2 DAY) AND TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -1 DAY)
GROUP BY
project
ORDER BY
download_count DESC
LIMIT 10
These queries are the same, except the default has an extra AND details.installer.name = "pip" condition.
It sounds reasonable it would cost more to do extra filtering work.
Installers
Let's look at the installers:
❯ pypinfo --all --limit 100 --days 1 "" installer
Served from cache: False
Data processed: 29.49 GiB
Data billed: 29.49 GiB
Estimated cost: $0.15
| installer_name | download_count |
|---|---|
| pip | 1,121,198,711 |
| uv | 117,194,833 |
| requests | 29,828,272 |
| poetry | 23,009,454 |
| None | 8,916,745 |
| bandersnatch | 6,171,555 |
| setuptools | 1,362,797 |
| Bazel | 1,280,271 |
| Browser | 1,096,328 |
| Nexus | 593,230 |
| Homebrew | 510,247 |
| Artifactory | 69,063 |
| pdm | 62,904 |
| OS | 13,108 |
| devpi | 9,530 |
| conda | 2,272 |
| pex | 194 |
| Total | 1,311,319,514 |
pip still by far the most popular, and unsurprising uv is up there too, with about 10% of pip's downloads.
The others are 10% of uv or less. A lot of them are mirroring services, that we wanted to exclude before.
I think given uv's importance, and my expectation that it will continue to take a bigger share of the pie, plus especially the extra cost for filtering by just pip, means that we should switch to fetching data for all downloaders. Plus the others don't account for that much of the pie.
Next time, I'll also do some comparison between different totals: should we keep 8,000, go back to 5,000 or some other number?
Next, let's see how number of projects affects the cost. (This data fetched yesterday, 2024-11-09.)
Days = 1, limit = 1000...8000
Let's begin checking just one day for different limits: 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000. We'll use --all, so not filtering for just pip.
Here's the basic query:
❯ pypinfo --test --all -v --json --indent 0 --days 1 --limit 8000 "" project
Credentials location set to "/Users/hugo/Dropbox/bin/data/pypinfo-ea8f23d28f2b.json".
SELECT
file.project as project,
COUNT(*) as download_count,
FROM `bigquery-public-data.pypi.file_downloads`
WHERE timestamp BETWEEN TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -2 DAY) AND TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -1 DAY)
GROUP BY
project
ORDER BY
download_count DESC
LIMIT 8000
Let's use a helper script like this:
import os
days = 1
for limit in (1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000):
outfile = f"days-{days}-limit-{limit}-all.json"
if os.path.exists(outfile):
print(f"{outfile} exists, skipping")
continue
cmd = (
f"pypinfo --all --json --indent 0 --days {days} --limit {limit} '' "
f"project > {outfile}"
)
print(cmd)
os.system(cmd)
Result: interestingly, the cost is the same for all limits (1000-8000): 0.31.
Days = 1, limit = 1000...8000, pip
Let's repeat but removing --all, so it's filters for only pip:
Result: cost again the same for all limits, and cost increased: 0.39.
Days = 30, limit = 8000...1000
Next, back to --all, repeat for 30 days, and start with 8000 first in case we were only paying for incremental changes:
days = 30
for limit in (8000, 7000, 6000, 5000, 4000, 3000, 2000, 1000):
...
Result: Again, the cost is the same regardless of limit: 4.89
Days = 1, 30, limit = 1
Let's also check for limit = 1
Result: Same cost as above, 0.31 for 1 day, and 4.89 for 30 days.
Days = 1...30, limit = 1
Now we know the cost is the same regardless of the limit (at least up to 8000), let's see how the cost varies over the number of days.
import os
import os
# for limit in (1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000):
for days in range(1, 31):
for limit in (1,):
outfile = f"days-{days}-limit-{limit}-all.json"
if os.path.exists(outfile):
print(f"{outfile} exists, skipping")
continue
cmd = (
f"pypinfo --all --json --indent 0 --days {days} --limit {limit} '' "
f"project > {outfile}"
)
print(cmd)
os.system(cmd)
Fetch results from JSONs into CSV:
import json
print("days\testimated_cost\tbytes_billed\tbytes_processed")
for days in range(1, 31):
for limit in (1,):
outfile = f"days-{days}-limit-{limit}-all.json"
with open(outfile) as f:
data = json.load(f)
out = [
days,
data["query"]["estimated_cost"],
f"{data["query"]["bytes_billed"]:,}",
f"{data["query"]["bytes_processed"]:,}",
]
print("\t".join(map(str, out)))
Gives:
days estimated_cost bytes_billed bytes_processed
1 0.31 67,423,436,800 67,423,031,967
2 0.49 106,777,542,656 106,776,936,726
3 0.66 144,589,193,216 144,588,849,934
4 0.84 183,362,387,968 183,362,056,051
5 0.94 205,844,905,984 205,844,089,492
6 1.05 230,423,527,424 230,423,239,221
7 1.20 263,483,031,552 263,482,009,301
8 1.37 301,120,618,496 301,119,942,981
9 1.56 340,849,065,984 340,848,563,030
10 1.73 379,594,997,760 379,594,187,471
11 1.91 419,779,575,808 419,778,888,347
12 2.02 442,795,819,008 442,795,580,469
13 2.11 463,927,771,136 463,927,460,363
14 2.28 500,829,257,728 500,828,317,940
15 2.47 541,425,926,144 541,424,988,775
16 2.64 580,075,388,928 580,075,257,785
17 2.83 621,216,268,288 621,215,221,560
18 3.01 660,446,642,176 660,445,832,190
19 3.13 686,769,045,504 686,768,435,724
20 3.24 710,389,268,480 710,388,303,159
21 3.41 747,798,265,856 747,798,009,119
22 3.59 789,304,049,664 789,303,120,956
23 3.78 830,582,292,480 830,581,546,594
24 3.97 871,580,565,504 871,579,537,799
25 4.14 908,331,057,152 908,330,312,456
26 4.25 933,782,093,824 933,781,258,405
27 4.35 956,512,075,776 956,511,698,396
28 4.52 992,670,121,984 992,669,344,697
29 4.70 1,032,543,272,960 1,032,542,495,766
30 4.89 1,073,809,981,440 1,073,809,418,894
Result: it's the number of days that affects cost.
Google: "The first 1 TiB of query data processed per month is free." 1 TiB = 1,099,511,627,776 bytes, and all of the values in the above table are below this limit. Note, this is with --all.
The last run in this repo, without --all, was billed at 1,376,442,646,528 bytes, over the limit:
https://github.com/hugovk/top-pypi-packages/blob/81c3f01321d609d5c20b64f79ac66d5b26b98687/top-pypi-packages-30-days.json#L2-L7
So we can remove --all and we'll be okay... for now. But as we can see from https://github.com/hugovk/top-pypi-packages/issues/36#issuecomment-2320919458, as time goes on, the line goes up.
We're at 1,073,809,981,440 and the limit is not far off at 1,099,511,627,776.
In the future, to stay under 1 TiB we'd need to reduce the number of days by some unclear amount.
Well, let's add --all for now: https://github.com/hugovk/top-pypi-packages/pull/39.
Here's the data collected from these tests: data.zip
To repeat with the limit increasing by powers of ten, up to 1,000,000, getting data for all 531,022 on PyPI.
| limit | projects count | estimated_cost | bytes_billed | bytes_processed |
|---|---|---|---|---|
| 1 | 1 | 0.20 | 43,447,746,560 | 43,447,720,943 |
| 10 | 10 | 0.20 | 43,447,746,560 | 43,447,720,943 |
| 100 | 100 | 0.20 | 43,447,746,560 | 43,447,720,943 |
| 1000 | 1,000 | 0.20 | 43,447,746,560 | 43,447,720,943 |
| 8000 | 8,000 | 0.20 | 43,447,746,560 | 43,447,720,943 |
| 10000 | 10,000 | 0.20 | 43,447,746,560 | 43,447,720,943 |
| 100000 | 100,000 | 0.20 | 43,447,746,560 | 43,447,720,943 |
| 1000000 | 531,022 | 0.20 | 43,447,746,560 | 43,447,720,943 |
Result: Again, same cost for 1 project or 531,022 projects.
Next thought, perhaps the query can be improved:
SELECT
file.project as project,
COUNT(*) as download_count,
FROM `bigquery-public-data.pypi.file_downloads`
WHERE timestamp BETWEEN TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -2 DAY) AND TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -1 DAY)
GROUP BY
project
ORDER BY
download_count DESC
LIMIT 1000
Data fetched today, 2024-11-10: data2.zip
Omitting the LIMIT is the same as fetching all, effectively the same as LIMIT 1000000, and the same cost (0.20):
SELECT
file.project as project,
COUNT(*) as download_count,
FROM `bigquery-public-data.pypi.file_downloads`
WHERE timestamp BETWEEN TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -2 DAY) AND TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -1 DAY)
GROUP BY
project
ORDER BY
download_count DESC
I've switched to fetching data for all installers: https://github.com/hugovk/top-pypi-packages/pull/39
And written this up at https://dev.to/hugovk/a-surprising-thing-about-pypis-bigquery-data-2g9o