openVirus

Get the DOAJ articles into Solr

Open deadlyvices opened this issue 4 years ago • 5 comments

  • Download the DOAJ article dump and uncompress it
  • Do a data-driven indexing into Solr on the Azure box to see how it copes
  • Investigate the article schema further and write a schema.xml based on it
  • Re-index with a predefined schema
  • Document access to the Solr index

deadlyvices avatar Apr 02 '20 08:04 deadlyvices

The articles are combined into massive JSON files. They will need to be extracted from each file. I'll have to think about how to do this, probably using KNIME.

deadlyvices avatar Apr 02 '20 10:04 deadlyvices

Currently pulling apart the docs using KNIME. However, this may be an option: https://lucene.apache.org/solr/guide/8_0/indexing-nested-documents.html

deadlyvices avatar Apr 02 '20 11:04 deadlyvices

It’s probably too large to POST to Solr in one big chunk. If you have any problems I should be able to split it with a Python script (and a lotta RAM).
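For reference, a minimal sketch of what that Python script might look like, assuming each dump file holds a single top-level JSON array. The function name `array_to_jsonl` and the file paths are made up for illustration. It loads the whole file into memory, hence the RAM.

```python
import json


def array_to_jsonl(src, dst):
    """Read one JSON array from src and write one element per line to dst.

    Assumption: src contains a single top-level JSON array. The whole
    array is held in memory, so this needs RAM proportional to file size.
    """
    with open(src, encoding="utf-8") as f:
        records = json.load(f)
    if not isinstance(records, list):
        raise ValueError(f"{src} does not contain a top-level JSON array")
    with open(dst, "w", encoding="utf-8") as out:
        for record in records:
            # json.dumps never emits raw newlines, so each record stays on one line
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(records)


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python array_to_jsonl.py input.json output.jsonl
    count = array_to_jsonl(sys.argv[1], sys.argv[2])
    print(f"wrote {count} records")
```

The output has one article per line, so it can be cut into smaller pieces afterwards without parsing the JSON again.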

anjackson avatar Apr 02 '20 11:04 anjackson

Ah hang on, it’s broken into batches of 100,000. That might work.

anjackson avatar Apr 02 '20 12:04 anjackson

In case it helps, if you can run jq, you can split the single JSON file into JSON Lines format, so that each line is one element of the original array:

 jq -cn --stream 'fromstream(1|truncate_stream(inputs))' doaj_article_data_2020-04-01/article_batch_1.json > doaj_article_data_2020-04-01/article_batch_1.jsonl

You could then split the .jsonl file into smaller chunks and use those. I believe Solr supports the JSON Lines format, so you should be able to POST the chunks directly into Solr.
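The chunking step could be sketched in Python roughly as follows. This is only an illustration: `split_jsonl`, the default chunk size of 10,000 lines, and the output naming scheme are assumptions, not anything fixed by the dump or by Solr. It reads the file in slices, so memory use stays bounded by the chunk size.

```python
from itertools import islice
from pathlib import Path


def split_jsonl(src, out_dir, lines_per_chunk=10000):
    """Split a JSON Lines file into numbered chunk files.

    Returns the list of chunk paths in order. The chunk size and the
    naming scheme (<stem>_0001.jsonl, ...) are illustrative choices.
    """
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks = []
    with src.open(encoding="utf-8") as f:
        index = 0
        while True:
            # Pull at most lines_per_chunk lines; an empty list means EOF
            lines = list(islice(f, lines_per_chunk))
            if not lines:
                break
            index += 1
            chunk = out_dir / f"{src.stem}_{index:04d}.jsonl"
            chunk.write_text("".join(lines), encoding="utf-8")
            chunks.append(chunk)
    return chunks


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python split_jsonl.py article_batch_1.jsonl chunks/
    for path in split_jsonl(sys.argv[1], sys.argv[2]):
        print(path)
```

Each chunk is itself a valid JSON Lines file, so a chunk that fails to load can be retried on its own.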

anjackson avatar Apr 04 '20 11:04 anjackson