openVirus Get the DOAJ articles into Solr


Open deadlyvices opened this issue 4 years ago • 5 comments

Download the DOAJ article dump and uncompress it
Do a data-driven indexing into Solr on the Azure box to see how it copes
Investigate the article schema further and write a schema.xml based on it
Re-index with a predefined schema
Document access to the Solr index

Apr 02 '20 08:04 deadlyvices

The articles are combined into massive JSON files. They will need to be extracted from each file. I'll have to think about how to do this, probably using KNIME.

Apr 02 '20 10:04 deadlyvices

Currently pulling apart the docs using KNIME. However, this may be an option: https://lucene.apache.org/solr/guide/8_0/indexing-nested-documents.html

Apr 02 '20 11:04 deadlyvices

It’s probably too large to POST to Solr in one big chunk. If you have any problems I should be able to split it with a Python script (and a lotta RAM).
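
For illustration, a rough sketch of what such a Python script might look like. It assumes each dump file holds a single top-level JSON array of article records, and it loads the whole file into memory, hence the RAM. The function name `json_array_to_jsonl` is made up for this example.

```python
import json


def json_array_to_jsonl(src, dst):
    """Rewrite a JSON array file as JSON Lines, one record per line.

    Assumption: src contains a single top-level JSON array.
    The whole array is held in memory while it is written out.
    """
    with open(src, "r", encoding="utf-8") as fin:
        records = json.load(fin)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    with open(dst, "w", encoding="utf-8") as fout:
        for record in records:
            # json.dumps never emits raw newlines, so each record stays on one line
            fout.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(records)
```

Called as `json_array_to_jsonl("article_batch_1.json", "article_batch_1.jsonl")`, it returns the number of records written.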

Apr 02 '20 11:04 anjackson

Ah hang on, it’s broken into batches of 100,000. That might work.

Apr 02 '20 12:04 anjackson

In case it helps, if you can run jq, you can split the single JSON file into JSONLines format so each line is one element of the original array:

 jq -cn --stream 'fromstream(1|truncate_stream(inputs))' doaj_article_data_2020-04-01/article_batch_1.json > doaj_article_data_2020-04-01/article_batch_1.jsonl

You could then split the jsonl file into smaller chunks, and then use those. I believe Solr supports jsonl format so you should be able to POST them directly into Solr.
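
A possible sketch of that chunk-and-POST step in Python, standard library only. The chunk size, the output file naming and the helper names `split_jsonl` and `post_chunk` are all assumptions for illustration. The update URL is a placeholder too: it would need to point at the real collection, something like `http://localhost:8983/solr/<collection>/update/json/docs?commit=true`.

```python
import itertools
import urllib.request
from pathlib import Path


def split_jsonl(src, out_dir, lines_per_chunk=10000):
    """Split a JSONL file into numbered chunk files of at most lines_per_chunk lines."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    with open(src, "r", encoding="utf-8") as fin:
        for index in itertools.count():
            # islice pulls the next batch of lines without reading the whole file
            lines = list(itertools.islice(fin, lines_per_chunk))
            if not lines:
                break
            path = out_dir / f"{Path(src).stem}_{index:05d}.jsonl"
            path.write_text("".join(lines), encoding="utf-8")
            chunk_paths.append(path)
    return chunk_paths


def post_chunk(path, update_url):
    """POST one chunk to Solr. update_url is an assumption, see the note above."""
    request = urllib.request.Request(
        update_url,
        data=Path(path).read_bytes(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Looping over `split_jsonl(...)` and calling `post_chunk` on each path would send the batches one at a time, so a failed chunk can be retried without redoing the rest.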

Apr 04 '20 11:04 anjackson
