libraries.io
Getting list of all CPAN package names is broken
Paging through the CPAN releases API with from and size no longer works once from + size exceeds 10,000.
Code location: https://github.com/librariesio/libraries.io/blob/master/app/models/package_manager/cpan.rb#L17
Example URL:
https://fastapi.metacpan.org/v1/release/_search?fields=distribution&from=10000&q=status%3Alatest&size=5000&sort=date%3Adesc
Error:
{
"message": "[Request] ** [http://127.0.0.1:9200]-[500] {\"error\":{\"root_cause\":[{\"type\":\"query_phase_execution_exception\",\"reason\":\"Result window is too large, from + size must be less than or equal to: [10000] but was [15000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level parameter.\"}],\"type\":\"search_phase_execution_exception\",\"reason\":\"all shards failed\",\"phase\":\"query\",\"grouped\":true,\"failed_shards\":[{\"shard\":0,\"index\":\"cpan_v1_01\",\"node\":\"euEoqisPSk68CnedNAzoZA\",\"reason\":{\"type\":\"query_phase_execution_exception\",\"reason\":\"Result window is too large, from + size must be less than or equal to: [10000] but was [15000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level parameter.\"}}]},\"status\":500}, called from sub Search::Elasticsearch::Role::Client::Direct::__ANON__ at /home/metacpan/metacpan-api/lib/MetaCPAN/Server/Controller.pm line 125. With vars: {'request' => {'method' => 'GET','ignore' => [],'path' => '/cpan/release/_search','serialize' => 'std','qs' => {'q' => 'status:latest','fields' => 'distribution','sort' => 'date:desc','size' => 5000,'from' => 10000},'body' => undef},'status_code' => 500}\n"
}
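To make the limit in the error message concrete, here is a minimal sketch of the check Elasticsearch appears to apply: a page is accepted only while from + size stays at or below the window. The helper name fits_in_window and the MAX_WINDOW variable are hypothetical, introduced only for illustration; the value 10000 is taken from the error text above.

```shell
#!/bin/bash
# Hypothetical helper, not part of libraries.io or MetaCPAN.
# MAX_WINDOW mirrors the limit quoted in the error message (index.max_result_window).
MAX_WINDOW=10000

# Succeeds (exit status 0) when a from/size page fits inside the result window.
fits_in_window() {
  local from="$1" size="$2"
  (( from + size <= MAX_WINDOW ))
}

# The last page that still works, and the failing request from the example URL.
fits_in_window 5000 5000 && echo "from=5000 size=5000: ok"
fits_in_window 10000 5000 || echo "from=10000 size=5000: rejected"
```

With size=5000, only the first two pages (from=0 and from=5000) fit, which is why the third request fails.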
The docs suggest using the scroll API: https://github.com/metacpan/metacpan-api/blob/master/docs/API-docs.md#being-polite but the links in that section are dead.
More recent scroll API docs are here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html but I couldn't get the search endpoint to accept scroll_id as a parameter:
{
"message": "[Param] ** Unknown param (scroll_id) in (search) request. See docs at: http://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html, called from sub Search::Elasticsearch::Role::Client::Direct::__ANON__ at /home/metacpan/metacpan-api/lib/MetaCPAN/Server/Controller.pm line 125."
}
Perhaps this will help. I updated the links for the metacpan-api documentation.
#!/bin/bash
# Page through all latest CPAN releases with the scroll API and collect distribution names.
COUNT="5000"
OUTPUT_FILE="/tmp/metacpan-dists.jsonl"
echo "Request: 1"
# The first request opens a scroll context that is kept alive for one minute.
JSON0="$(curl -s "https://fastapi.metacpan.org/v1/release/_search?scroll=1m&size=$COUNT&q=status:latest&fields=distribution")"
TOTAL=$(echo "$JSON0" | jq '.hits.total')
echo "Total dists: $TOTAL"
# Ceiling division: number of requests needed to fetch every hit.
REQUESTS_N=$(( (TOTAL + COUNT - 1) / COUNT ))
echo "Will make $REQUESTS_N requests total"
SCROLL_ID=$(echo "$JSON0" | jq -r '._scroll_id')
echo "$JSON0" | jq '.hits.hits | .[].fields.distribution' > "$OUTPUT_FILE"
for i in $(seq 2 "$REQUESTS_N"); do
  echo "Request: $i"
  # Later pages come from the scroll endpoint, with the scroll ID as the request body.
  JSON="$(curl -s -XPOST 'https://fastapi.metacpan.org/v1/_search/scroll?scroll=1m' -d "$SCROLL_ID")"
  SCROLL_ID=$(echo "$JSON" | jq -r '._scroll_id')
  echo "$JSON" | jq '.hits.hits | .[].fields.distribution | .[]' >> "$OUTPUT_FILE"
done
# Count the unique distribution names collected.
sort -u "$OUTPUT_FILE" | wc -l
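The REQUESTS_N line in the script above is a ceiling division: it rounds up so that a final, partially filled page still gets its own request. A small sketch of just that arithmetic, with a hypothetical function name (requests_needed) and made-up totals rather than real MetaCPAN counts:

```shell
#!/bin/bash
# Hypothetical helper illustrating the ceiling division used for REQUESTS_N.
# Arguments: total number of hits, page size.
requests_needed() {
  local total="$1" count="$2"
  echo $(( (total + count - 1) / count ))
}

# Illustrative totals only, not real MetaCPAN counts.
requests_needed 10000 5000   # exactly two full pages
requests_needed 12001 5000   # two full pages plus one partial page
```

Plain integer division would drop the partial page in the second case, leaving the last 2,001 names unfetched.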