libraries.io

Getting list of all CPAN package names is broken

Open andrew opened this issue 7 years ago • 2 comments

Paging through the CPAN releases API no longer works beyond the first 10,000 results (that is, once from + size exceeds 10,000).

Code location: https://github.com/librariesio/libraries.io/blob/master/app/models/package_manager/cpan.rb#L17

Example url:

https://fastapi.metacpan.org/v1/release/_search?fields=distribution&from=10000&q=status%3Alatest&size=5000&sort=date%3Adesc

Error:

{
"message": "[Request] ** [http://127.0.0.1:9200]-[500] {\"error\":{\"root_cause\":[{\"type\":\"query_phase_execution_exception\",\"reason\":\"Result window is too large, from + size must be less than or equal to: [10000] but was [15000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level parameter.\"}],\"type\":\"search_phase_execution_exception\",\"reason\":\"all shards failed\",\"phase\":\"query\",\"grouped\":true,\"failed_shards\":[{\"shard\":0,\"index\":\"cpan_v1_01\",\"node\":\"euEoqisPSk68CnedNAzoZA\",\"reason\":{\"type\":\"query_phase_execution_exception\",\"reason\":\"Result window is too large, from + size must be less than or equal to: [10000] but was [15000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level parameter.\"}}]},\"status\":500}, called from sub Search::Elasticsearch::Role::Client::Direct::__ANON__ at /home/metacpan/metacpan-api/lib/MetaCPAN/Server/Controller.pm line 125. With vars: {'request' => {'method' => 'GET','ignore' => [],'path' => '/cpan/release/_search','serialize' => 'std','qs' => {'q' => 'status:latest','fields' => 'distribution','sort' => 'date:desc','size' => 5000,'from' => 10000},'body' => undef},'status_code' => 500}\n"
}
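The arithmetic behind that message can be checked locally before sending a request. This is only a minimal sketch: the 10000 cap is taken from the error text above, and `within_window` is a hypothetical helper name, not part of any MetaCPAN or Elasticsearch tooling.

```shell
#!/bin/bash
# Assumption: 10000 is the cap quoted in the error message above
# (the index.max_result_window setting on the server side).
MAX_RESULT_WINDOW=10000

# Hypothetical helper: exits 0 when a from/size pair stays inside the window.
within_window() {
  local from=$1 size=$2
  (( from + size <= MAX_RESULT_WINDOW ))
}

within_window 5000 5000 && echo "from=5000 size=5000: ok"
within_window 10000 5000 || echo "from=10000 size=5000: rejected"
```

With size=5000 only two pages (from=0 and from=5000) fit inside the window, which is why the request with from=10000 fails.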

The docs suggest using the scroll API: https://github.com/metacpan/metacpan-api/blob/master/docs/API-docs.md#being-polite, but the documentation links there are dead.

More recent scroll API docs are here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html, but I couldn't get the API to accept scroll_id as a parameter:

{
"message": "[Param] ** Unknown param (scroll_id) in (search) request. See docs at: http://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html, called from sub Search::Elasticsearch::Role::Client::Direct::__ANON__ at /home/metacpan/metacpan-api/lib/MetaCPAN/Server/Controller.pm line 125."
}

andrew • Feb 07 '18 13:02

Perhaps this will help. I updated the links for the metacpan-api documentation.

#!/bin/bash

COUNT=5000
OUTPUT_FILE="/tmp/metacpan-dists.jsonl"
# Emit one name per line whether the field comes back as a string or an array.
NAMES='.hits.hits[].fields.distribution | if type == "array" then .[] else . end'

echo "Request: 1"
# The first request opens a scroll context that is kept alive for 1 minute.
JSON0="$(curl -s "https://fastapi.metacpan.org/v1/release/_search?scroll=1m&size=$COUNT&q=status:latest&fields=distribution")"

TOTAL=$(echo "$JSON0" | jq '.hits.total')
echo "Total dists: $TOTAL"
REQUESTS_N=$(( (TOTAL + COUNT - 1) / COUNT ))  # ceiling division
echo "Will make $REQUESTS_N requests total"

SCROLL_ID=$(echo "$JSON0" | jq -r '._scroll_id')

echo "$JSON0" | jq "$NAMES" > "$OUTPUT_FILE"

for i in $(seq 2 "$REQUESTS_N"); do
	echo "Request: $i"
	# Each follow-up request sends the latest scroll ID as the request body.
	JSON="$(curl -s -XPOST 'https://fastapi.metacpan.org/v1/_search/scroll?scroll=1m' -d "$SCROLL_ID")"
	SCROLL_ID=$(echo "$JSON" | jq -r '._scroll_id')
	echo "$JSON" | jq "$NAMES" >> "$OUTPUT_FILE"
done

# Count the unique distribution names collected.
sort -u "$OUTPUT_FILE" | wc -l
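
The request count in the script is a ceiling division, so a partial last page still gets its own request. A small sketch of just that step, with made-up totals for illustration (`requests_needed` is a hypothetical helper name, and the numbers are not real MetaCPAN counts):

```shell
#!/bin/bash
# Hypothetical helper: how many requests are needed to fetch `total` hits
# when each request returns at most `count` of them (ceiling division).
requests_needed() {
  local total=$1 count=$2
  echo $(( (total + count - 1) / count ))
}

# Illustrative totals only.
requests_needed 12000 5000   # prints 3: two full pages plus one partial page
requests_needed 10000 5000   # prints 2: exactly two full pages
```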

zmughal • May 18 '21 05:05

Definition of the money is the same as the best possible option for a good friend

M00NZ1R94 • Oct 29 '21 02:10