speed up Solr indexing and reduce memory overhead
The Solr indexing in reader uses Solr's Data Import Handler (DIH), which was recently deprecated and seems destined to become a third-party package. Within DIH, the current implementation uses SortedMapBackedCache for the one-to-many tables in the database schema. That can work quite well, but the targeted tables have grown too large to fit comfortably in this cache architecture. One option would be to replace SortedMapBackedCache with a more sophisticated caching scheme such as MapDB, but this does not seem like a good time to add custom Java to DIH. Instead, I propose using SQLite views to gather the values from these tables into a single field per document, and leveraging DIH's support for script-based transformers to break those values apart again when populating the index. This approach requires six views, as follows:
CREATE VIEW view_authors AS SELECT document_id, GROUP_CONCAT(REPLACE(author, ',', '_')) AS authors FROM authors GROUP BY document_id;
CREATE VIEW view_keywords AS SELECT document_id, GROUP_CONCAT(keyword) AS keywords FROM wrd GROUP BY document_id;
CREATE VIEW view_entities AS SELECT document_id, GROUP_CONCAT(DISTINCT entity) AS entities FROM ent GROUP BY document_id;
CREATE VIEW view_types AS SELECT document_id, GROUP_CONCAT(DISTINCT type) AS types FROM ent GROUP BY document_id;
CREATE VIEW view_sources AS SELECT document_id, GROUP_CONCAT(source) AS sources FROM sources GROUP BY document_id;
CREATE VIEW view_urls AS SELECT document_id, GROUP_CONCAT(REPLACE(url, ' ', '')) AS urls FROM urls GROUP BY document_id;
The views include a little bit of streamlining. For authors, for example, the comma character is replaced with an underscore in the view so author names do not collide with GROUP_CONCAT's default comma separator, and the commas are restored during the Solr-side processing. The suggested DIHconfigfile.xml is attached (with a ".txt" extension, since GitHub won't accept ".xml"). This keeps the entire indexing implementation within standard Solr, without requiring custom Java, and from my very limited testing it appears to be dramatically faster and less memory intensive.
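For anyone following along, here is a rough sketch of what such a data-config might look like. It is not the attached file: the JDBC URL, the "documents" table, its "title" column, and the Solr field names are placeholders, and only the authors view is wired up, but it shows the GROUP_CONCAT-plus-script-transformer pattern. (DIH's RegexTransformer with splitBy would be an alternative, but a script makes the underscore-to-comma restoration explicit.)

<!-- Sketch only; paths, table, column, and field names below are placeholders -->
<dataConfig>
  <dataSource type="JdbcDataSource" driver="org.sqlite.JDBC" url="jdbc:sqlite:/path/to/etc/cord.db"/>
  <script><![CDATA[
    // Split the concatenated authors value into a multivalued field and restore
    // the commas that view_authors replaced with underscores.
    function splitAuthors(row) {
      var authors = row.get('authors');
      if (authors != null) {
        var values = new java.util.ArrayList();
        var parts  = authors.toString().split(',');
        for (var i = 0; i < parts.length; i++) {
          values.add(parts[i].replace('_', ','));
        }
        row.put('authors', values);
      }
      return row;
    }
  ]]></script>
  <document>
    <entity name="documents"
            transformer="script:splitAuthors"
            query="SELECT d.document_id, d.title, a.authors FROM documents d LEFT JOIN view_authors a ON a.document_id = d.document_id">
      <field column="document_id" name="id"/>
      <field column="title" name="title"/>
      <field column="authors" name="authors"/>
    </entity>
  </document>
</dataConfig>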
Very cool. You are a bonus to the team. 'More later.
Art, if I create a version of the CORD database which implements your SQL views, then will you be able to give your new import routine a go?
The numbers in Slack are based on a copy of the CORD database with the views added. It would be useful to try this with @ralphlevan's SolrCloud implementation, which I think talks to the "main" CORD database.
Ralph and Art, how about if:
- I duplicate ./etc/cord.db
- I add the views
- Y'all try your new indexing technique on the view-added version of ./etc/cord.db
How does that sound?
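For what it's worth, the mechanics on my end would be something like the following, where cord-views.db and add-views.sql (a file containing the six CREATE VIEW statements above) are just placeholder names:

cp ./etc/cord.db ./etc/cord-views.db
sqlite3 ./etc/cord-views.db < add-views.sql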
I think it's just a matter of swapping in the new DIHconfig.xml and updating the paths, but I defer to @ralphlevan. I can zap my copy of cord to save disk space.
Eric, you know how to make changes to DIHconfig.xml. Update the zookeepers, delete the old database, make a new database using the cord configset, and fire off the Data Import Handler. Nothing ventured, nothing gained!
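Something like the following ought to do it, with the caveat that the ZooKeeper hosts, paths, collection name, and Solr URL below are placeholders rather than our real settings:

# upload the updated cord configset to ZooKeeper
bin/solr zk upconfig -z zk1:2181,zk2:2181,zk3:2181 -n cord -d /path/to/configsets/cord/conf
# delete the old collection and recreate it from the cord configset
bin/solr delete -c cord
bin/solr create -c cord -n cord
# fire off the Data Import Handler
curl 'http://localhost:8983/solr/cord/dataimport?command=full-import&clean=true'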
Art & Ralph, again, I appreciate the good work y'all have done, and creating VIEWS is a great example, but for right now I will forego any churn in the CORD database and its indexing because I believe we need to focus on output right now. It takes a long time to build the carrels, and I want to spend it on carrel creation.
Perfectly understandable.