
Investigate more scalable ways of pulling data for the site

17cupsofcoffee opened this issue 4 years ago · 2 comments

Currently all of the GitHub and Crates.io data used on the site is retrieved via a clever template macro. This is simple and keeps the build self-contained, but has a few big issues:

  • As the size of the site grows, we may hit a point where the build will trigger rate limits due to the number of requests. I've had this happen to me locally when developing the templates, and it's a pain.
  • It makes the site increasingly slow to build, since every build has to refetch the data.
  • It's hard to maintain/add to, due to the limited logic you can express in a templating engine.

I'm wondering if we might be able to find a better way of grabbing this data (e.g. via an external script or a Rust program). This could also allow us to store the site's data in a nicer format, rather than these massive manually ordered data.toml files.

If we did this, there are more efficient options we could use for pulling the API data:

  • GitHub has a GraphQL API, and I believe that might allow us to query multiple repos in a single request.
  • Crates.io's index can be accessed via Git, avoiding the API altogether.
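As a sketch of the Git-index option: the index stores each crate's metadata at a path derived from its (lowercased) name, so a shallow clone plus a path lookup replaces an API call. The `index_path` helper below is hypothetical, but the layout rules it encodes are the ones the index uses (1-3 character names get special prefixes):

```shell
# Hypothetical helper: map a crate name to its path inside a clone of
# https://github.com/rust-lang/crates.io-index. Names are assumed to
# already be lowercase, as they are in the real index paths.
index_path() {
    local name=$1 len=${#1}
    case $len in
        1) echo "1/$name" ;;
        2) echo "2/$name" ;;
        3) echo "3/${name:0:1}/$name" ;;
        *) echo "${name:0:2}/${name:2:2}/$name" ;;
    esac
}

index_path "serde"  # se/rd/serde
index_path "toml"   # to/ml/toml
index_path "gl"     # 2/gl
```

Each file in the index holds one JSON object per published version, one per line, so the last line is the newest version's metadata.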

This might be overengineering things, but it's worth thinking about, I think!

17cupsofcoffee · Dec 08 '20 22:12

> GitHub has a GraphQL API, and I believe that might allow us to query multiple repos in a single request.

I played around with the GraphQL explorer.

{
  r1: repository(owner: "ChariotEngine", name: "Chariot") {
    ...repoFields
  }
  r2: repository(owner: "duysqubix", name: "MuOxi") {
    ...repoFields
  }
}

fragment repoFields on Repository {
  url
  homepageUrl
  description
}

> Crates.io's index can be accessed via Git, avoiding the API altogether.

AFAIK the index contains only the versions, name, deps, features, and yanked flag. But https://crates.io/data-access mentions a database dump that is updated every 24 hours. The tarball contains a crates.csv that could be processed to get the description, repository URL, homepage, etc.
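A minimal sketch of pulling those columns out of a crates.csv-shaped file with plain awk. The sample data and column order here are made up, which is exactly why the script selects by header name; note the real dump has quoted fields that can contain commas, so a CSV-aware tool like xsv is the safer choice there:

```shell
# Tiny stand-in for the dump's crates.csv (column layout assumed).
cat > crates.csv <<'CSV'
created_at,description,homepage,id,name,repository
2015-01-01,Serialization framework,https://serde.rs,1,serde,https://github.com/serde-rs/serde
CSV

# Select columns by header name rather than position, so the pipeline
# survives columns being added or reordered between dumps. The header
# row (NR == 1) both fills the col[] lookup table and is printed.
awk -F, 'NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i }
         { print $col["name"] "," $col["description"] "," $col["repository"] }' \
    crates.csv > partial-crates.csv
cat partial-crates.csv
```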

nickelc · Jan 18 '21 14:01

I wrote a script to combine the data from crates.io and GitHub's GraphQL API into a single CSV file.

Convert content/ecosystem/data.toml to data.csv

The categories are joined into a single string with : as the separator.

toml get content/ecosystem/data.toml items | \
    jq -r 'map(. + { categories: .categories | join(":")}) | (map(keys) | add | unique) as $cols | map(. as $row | $cols | map($row[.])) as $rows | $cols, $rows[] | @csv' | \
    xsv select name,source,categories,homepage_url,gitter_url > data.csv

Tools: toml-cli, jq, xsv

Generate the final result.csv

The crates.csv from db-dump.tar.gz could be cached with the actions/cache action.
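A possible workflow fragment for that caching step. The path and key scheme are assumptions; keying on the UTC date keeps the cache in step with the daily dump updates:

```yaml
# Sketch: cache the crates.io db dump between CI runs (paths/keys assumed).
- name: Get date
  id: date
  run: echo "today=$(date -u +%F)" >> "$GITHUB_OUTPUT"

- name: Cache crates.io db dump
  uses: actions/cache@v3
  with:
    path: db-dump.tar.gz
    key: crates-db-dump-${{ steps.date.outputs.today }}
```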

result.csv.txt

#!/usr/bin/bash

# Get the names of all github items and also save them for later.
repos=$(xsv search -s source github data.csv | xsv select name | tee names.csv | tail -n +2)

# Build a graphql query for all github repos
i=0
echo "{" > github.query
for r in ${repos[@]}; do
    name=$(echo $r | cut -d "/" -f1)
    repo=$(echo $r | cut -d "/" -f2)
    cat <<QUERY >> github.query
  r$i: repository(owner: "${name}", name: "${repo}") {
    ...repoFields
  }
QUERY
    let i=${i}+1
done
cat <<TAIL >> github.query
}

fragment repoFields on Repository {
  description
  repository: url
  homepage: homepageUrl
}
TAIL

# Execute the GraphQL query and transform the result to CSV, then paste the
# names.csv columns from before alongside it.
# The jq filter collects the union of all keys as the header row ($cols) and
# emits each object's values in that column order.
gh api graphql -f query="$(cat github.query)" | \
    jq -r '[.data[]] |
        (map(keys) | add | unique) as $cols |
        map(. as $row | $cols | map($row[.])) as $rows |
        $cols, $rows[] | @csv' | xsv cat columns names.csv - > github.csv

# Join the github data
xsv join name data.csv name github.csv | xsv select '!name[1]' > joined-github.csv

# Select the needed columns from db-dump.tar.gz's crates.csv
xsv select name,description,repository,homepage crates.csv > partial-crates.csv

# Join the crates.io data
xsv join name data.csv name partial-crates.csv | xsv select '!name[1]' > joined-crates.csv

# Concat rows and sort by name
xsv cat rows joined-crates.csv joined-github.csv | xsv sort -s name > result.csv

Tools: github-cli, jq, xsv
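One small possible tweak to the script above: the two cut invocations per repo could be replaced with bash parameter expansion, saving a couple of subshells per loop iteration. A sketch:

```shell
# Split "owner/repo" without spawning cut (alternative sketch).
r="ChariotEngine/Chariot"
name=${r%%/*}   # strip the longest "/..." suffix: the owner part
repo=${r#*/}    # strip the shortest ".../" prefix: the repo part
echo "$name $repo"   # -> ChariotEngine Chariot
```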

nickelc · Jan 25 '21 12:01