Investigate more scalable ways of pulling data for the site
Currently all of the GitHub and Crates.io data used on the site is retrieved via a clever template macro. This is simple and keeps the build self-contained, but has a few big issues:
- As the size of the site grows, we may hit a point where the build will trigger rate limits due to the number of requests. I've had this happen to me locally when developing the templates, and it's a pain.
- It makes the site increasingly slow to build, since every build has to refetch the data.
- It's hard to maintain/add to, due to the limited logic you can express in a templating engine.
I'm wondering if we might be able to find a better way of grabbing this data (e.g. via an external script or a Rust program). This could also allow us to store the site's data in a nicer format, rather than these massive manually ordered data.toml files.
If we did this, there are more efficient options we could use for pulling the API data:
- GitHub has a GraphQL API, and I believe that might allow us to query multiple repos in a single request.
- Crates.io's index can be accessed via Git, avoiding the API altogether.
This might be overengineering things, but it's worth thinking about, I think!
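As a rough sketch of the Git option, assuming a shallow clone of the rust-lang/crates.io-index repository and its usual path sharding (crate names of four or more characters live under the first two and the next two characters of the name):

# Minimal sketch: read one crate's metadata straight from the crates.io index.
# "bevy" is only an example crate name; each index file has one JSON object per
# published version, one per line, so take the latest entry with jq.
git clone --depth 1 https://github.com/rust-lang/crates.io-index index
jq -s 'last | {name, vers, yanked}' index/be/vy/bevy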
GitHub has a GraphQL API, and I believe that might allow us to query multiple repos in a single request.
I played around with the GraphQL Explorer.
{
  r1: repository(owner: "ChariotEngine", name: "Chariot") {
    ...repoFields
  }
  r2: repository(owner: "duysqubix", name: "MuOxi") {
    ...repoFields
  }
}

fragment repoFields on Repository {
  url
  homepageUrl
  description
}
Crates.io's index can be accessed via Git, avoiding the API altogether.
AFAIK the index contains only the versions, name, deps, features, and yanked fields.
But https://crates.io/data-access mentions a database dump that is updated every 24h.
The tarball contains a crates.csv that could be processed to get the description, repository, homepage, etc.
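A minimal sketch of grabbing that dump and pulling out crates.csv, assuming the download URL from the data-access page and a <date>/data/crates.csv layout inside the tarball:

# Assumptions: dump URL as documented at https://crates.io/data-access, and the
# CSVs sitting under <date>/data/ inside the archive.
curl -sL https://static.crates.io/db-dump.tar.gz -o db-dump.tar.gz
tar -xzf db-dump.tar.gz --wildcards '*/data/crates.csv' --strip-components=2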
I wrote a script to combine the data from crates.io and GitHub's GraphQL API into a single CSV file.
Convert content/ecosystem/data.toml to data.csv
The categories are joined into a single string with : as the separator.
toml get content/ecosystem/data.toml items | \
jq -r 'map(. + { categories: .categories | join(":")}) | (map(keys) | add | unique) as $cols | map(. as $row | $cols | map($row[.])) as $rows | $cols, $rows[] | @csv' | \
xsv select name,source,categories,homepage_url,gitter_url > data.csv
Tools: toml-cli, jq, xsv
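To see just the categories step in isolation, here is the same jq transformation on a made-up item:

# Made-up input; shows the categories array becoming a ":"-joined string.
echo '[{"name":"example","categories":["2d","engine"]}]' | \
  jq 'map(. + { categories: (.categories | join(":")) })'
# => [{"name":"example","categories":"2d:engine"}]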
Generate the final result.csv
The crates.csv from db-dump.tar.gz could be cached with the actions/cache action.
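Outside of Actions, a rough local stand-in for that cache is to re-download only when the file is missing or older than a day (same assumed dump URL as above):

# Skip the download if db-dump.tar.gz exists and is less than 24 hours old.
if [ ! -f db-dump.tar.gz ] || [ -n "$(find db-dump.tar.gz -mmin +1440)" ]; then
  curl -sL https://static.crates.io/db-dump.tar.gz -o db-dump.tar.gz
fi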
#!/usr/bin/bash
# Get the names of all github items and also save them for later.
repos=$(xsv search -s source github data.csv | xsv select name | tee names.csv | tail -n +2)
# Build a graphql query for all github repos
i=0
echo "{" > github.query
for r in ${repos[@]}; do
name=$(echo $r | cut -d "/" -f1)
repo=$(echo $r | cut -d "/" -f2)
cat <<QUERY >> github.query
r$i: repository(owner: "${name}", name: "${repo}") {
...repoFields
}
QUERY
let i=${i}+1
done
cat <<TAIL >> github.query
}
fragment repoFields on Repository {
description
repository: url
homepage: homepageUrl
}
TAIL
# Execute the GraphQL query, transform the result to CSV, and glue the names.csv columns from before back on.
gh api graphql -f query="$(cat github.query)" | \
jq -r '[.data[]] |
(map(keys) | add | unique) as $cols |
map(. as $row | $cols | map($row[.])) as $rows |
$cols, $rows[] | @csv' | xsv cat columns names.csv - > github.csv
# Join the github data
xsv join name data.csv name github.csv | xsv select '!name[1]' > joined-github.csv
# Select the needed columns from db-dump.tar.gz's crates.csv
xsv select name,description,repository,homepage crates.csv > partial-crates.csv
# Join the crates.io data
xsv join name data.csv name partial-crates.csv | xsv select '!name[1]' > joined-crates.csv
# Concat rows and sort by name
xsv cat rows joined-crates.csv joined-github.csv | xsv sort -s name > result.csv
Tools: github-cli, jq, xsv
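Assumed usage, with the script above saved under a placeholder name and data.csv plus crates.csv already present in the working directory:

# "generate-result.sh" is a placeholder name for the script above.
bash generate-result.sh
# Quick look at the merged, sorted output.
xsv table result.csv | head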