Solr support in AtoM

Open anvit opened this issue 9 months ago • 0 comments

Work in progress branch for adding support for Solr for searching within AtoM

Completed:

  • Docker configuration that starts Solr and ZooKeeper containers. (Solr uses ZooKeeper to coordinate and sync multiple Solr nodes when running in cloud mode.)
  • A Solr plugin (arSolrPlugin) which serves as the Solr equivalent of arElasticSearchPlugin. It talks to Solr and has functions that allow indexing and searching.
  • A solr:populate task (arSolrPopulateTask) which indexes AtoM data into Solr. The indexed data can be viewed in the Solr dashboard at http://localhost:8983/solr, which also allows searching it.
  • A set of classes that act as the equivalent of Elastica within AtoM, located in the arSolrPlugin/lib/client folder. The query classes set up the query parameters for API requests to Solr. arSolrClient accepts the configuration it needs to communicate with Solr and has methods for sending the different API requests.
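
To illustrate the split between query classes and the client described above, here is a minimal Python sketch (AtoM itself is PHP, so this is not the plugin's code). The class names, the `atom` collection name, and the `slug` field are assumptions for illustration; only the `/select` handler and its `q`, `start`, and `rows` parameters are standard Solr.

```python
from urllib.parse import urlencode


class SolrTermQuery:
    """Hypothetical query object: holds a field/value pair and renders Solr params."""

    def __init__(self, field, value, rows=10, start=0):
        self.field = field
        self.value = value
        self.rows = rows
        self.start = start

    def to_params(self):
        # q, start and rows are standard parameters of Solr's /select handler.
        return {
            "q": f'{self.field}:"{self.value}"',
            "start": self.start,
            "rows": self.rows,
        }


class SolrClient:
    """Hypothetical client: only builds the request URL, it does not send it."""

    def __init__(self, host="localhost", port=8983, collection="atom"):
        # "atom" is an assumed collection name, not taken from the branch.
        self.base_url = f"http://{host}:{port}/solr/{collection}"

    def select_url(self, query):
        return f"{self.base_url}/select?{urlencode(query.to_params())}"


url = SolrClient().select_url(SolrTermQuery("slug", "example-fonds"))
print(url)
```

Keeping URL construction separate from sending the request makes the query classes easy to unit test without a running Solr instance.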

Work in progress:

  • arSolrSearchTask is a CLI task that allows searching the Solr index for a few query types. Since queries can get fairly complicated, especially Boolean queries, it was meant for quick CLI testing until Solr is officially supported by the AtoM interface, and so it isn't very customizable. However, it could be useful for writing tests in the future.
  • Unit tests have been added for several Solr query classes. Solr's Boolean Query, Result and Result Set, and the Solr Client do not have any tests yet.

TODO

Within arSolrPlugin

High priority (essential for browse or search actions):

  • [ ] Add a class for handling nested search: Currently there is no class for handling nested search in the query classes we have for Solr. Solr doesn't have a built-in nested query like ElasticSearch does, since it doesn't treat nested fields in a special way. This means that while it may be possible to perform those searches using a simple Boolean query that targets the nested fields, we would need to ensure we're matching results within the same nested unit (for instance, when searching for date ranges we must not mix a start date from one event with an end date from a different event of the same information object).
  • [ ] Add authentication to Solr Client (arSolrPlugin): Currently username and password are ignored, as the current Solr setup doesn't configure them either.
  • [ ] Change getDateRangeQuery's Nested Query call (arSolrPluginQuery): Since there is no nested query class for Solr yet, this will need to be updated once that functionality is in place.
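
The "same nested unit" concern from the nested search item can be shown with a small, self-contained Python sketch. The data and function names are made up for illustration and no Solr is involved; it only demonstrates why matching flattened start and end dates independently produces false positives.

```python
from datetime import date

# Hypothetical record: one information object with two events,
# each with its own date range.
events = [
    {"start": date(1900, 1, 1), "end": date(1910, 12, 31)},
    {"start": date(1950, 1, 1), "end": date(1960, 12, 31)},
]


def overlaps_flattened(events, query_start, query_end):
    """Naive check on flattened fields: any start may pair with any end."""
    starts = [e["start"] for e in events]
    ends = [e["end"] for e in events]
    return any(s <= query_end for s in starts) and any(e >= query_start for e in ends)


def overlaps_per_event(events, query_start, query_end):
    """Correct check: both conditions must hold within the same event."""
    return any(e["start"] <= query_end and e["end"] >= query_start for e in events)


query = (date(1920, 1, 1), date(1930, 12, 31))
print(overlaps_flattened(events, *query))  # True: a false positive
print(overlaps_per_event(events, *query))  # False: no single event overlaps 1920-1930
```

The flattened check pairs the 1900 start with the 1960 end, which is exactly the mix-up a nested query class would need to prevent.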

Medium priority (not essential for basic search but still important):

  • [ ] updateByQuery method/function (arSolrPlugin.class): This class will need a method to handle updating specific documents by query.
  • [ ] Create Diacritics analyzer (arSolrPlugin.class)
  • [ ] Create Brazilian Portuguese analyzer (arSolrPlugin.class): Solr doesn't have a default pt_BR analyzer but has specific filter classes we can use.
  • [ ] Ensure PDFs are also indexed by Solr (arSolrPlugin.class): Will need to use Apache Tika to work with external docs.
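
As a rough idea of what the two analyzer items could involve, the Python sketch below builds `add-field-type` commands in the shape accepted by Solr's Schema API. The field type names (`text_folded`, `text_pt_br`) and the chosen filter chains are assumptions, not decisions made in the branch; the filter factories named are ones that ship with Solr.

```python
import json


def field_type(name, filters):
    """Build a Schema API "add-field-type" command for a text field type."""
    return {
        "add-field-type": {
            "name": name,
            "class": "solr.TextField",
            "analyzer": {
                "tokenizer": {"class": "solr.StandardTokenizerFactory"},
                "filters": [{"class": f} for f in filters],
            },
        }
    }


# Assumed name; ASCIIFoldingFilterFactory strips diacritics (e.g. "é" -> "e").
diacritics = field_type(
    "text_folded",
    ["solr.LowerCaseFilterFactory", "solr.ASCIIFoldingFilterFactory"],
)

# Assumed name; BrazilianStemFilterFactory is a Brazilian Portuguese stemmer.
pt_br = field_type(
    "text_pt_br",
    ["solr.LowerCaseFilterFactory", "solr.BrazilianStemFilterFactory"],
)

print(json.dumps(pt_br, indent=2))
```

Each payload would be POSTed as JSON to the collection's `/schema` endpoint. Whether stop words or other filters belong in the chain is left open here.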

Low priority (used by CLI tasks or other non search specific actions within AtoM):

  • [ ] getScrolledSearchResultIdentifiers uses Elastica\Scroll (arSolrPluginUtil): This doesn't have a Solr equivalent and will need to be handled.
  • [ ] Search, MultiSearch (see apps/qubit/modules/search/actions/autocompleteAction.class.php)
  • [ ] Bulk (for bulk document updates)
  • [ ] AbstractScript (see lib/job/arUpdatePublicationStatusJob.class.php, https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-scripting.html)
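
For the scroll item, one option that might be worth evaluating is Solr's `cursorMark` deep paging, which walks a full result set page by page. The Python sketch below is an assumption about how that could look, not a design from the branch: `fetch` is a hypothetical callable standing in for the client, and `id` is assumed to be the unique key field (cursor paging requires the sort to include the unique key).

```python
def iter_ids(fetch, query="*:*", rows=100):
    """Yield every matching document id using Solr's cursorMark deep paging.

    `fetch` is a hypothetical callable that takes a dict of request
    parameters and returns the decoded JSON response as a dict.
    """
    cursor = "*"  # "*" asks Solr to start a new cursor
    while True:
        params = {
            "q": query,
            "fl": "id",
            "rows": rows,
            "sort": "id asc",  # assumes "id" is the uniqueKey field
            "cursorMark": cursor,
        }
        response = fetch(params)
        for doc in response["response"]["docs"]:
            yield doc["id"]
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:
            break  # an unchanged cursor means the result set is exhausted
        cursor = next_cursor


def fake_fetch(params):
    """Stand-in for a real Solr request, returning three canned pages."""
    pages = {"*": (["a", "b"], "c1"), "c1": (["c"], "c2"), "c2": ([], "c2")}
    ids, next_cursor = pages[params["cursorMark"]]
    return {"response": {"docs": [{"id": i} for i in ids]}, "nextCursorMark": next_cursor}


print(list(iter_ids(fake_fetch)))  # ['a', 'b', 'c']
```

Passing the fetch function in keeps the paging logic testable without a Solr server.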

Lowest priority (good to have features):

  • [ ] Add support for Solr server mode: Currently the Docker config, as well as a couple of collection-based things, assume that Solr will only be run in cloud mode. (Cloud mode uses multiple Solr nodes, which is most similar to how ElasticSearch is usually configured with AtoM. Server mode has a single node, uses slightly different API endpoints for a few requests, and doesn't need ZooKeeper.)
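
One example of the endpoint difference mentioned above is index creation: cloud mode uses the Collections API, while a single-node setup uses the CoreAdmin API. The Python sketch below shows how a client might branch on the mode. The function name, the `mode` values, and the defaults are assumptions for illustration only.

```python
from urllib.parse import urlencode


def create_index_url(base_url, name, mode="cloud", config_set="_default"):
    """Return the admin URL that creates a collection (cloud) or a core (server)."""
    if mode == "cloud":
        # Collections API, only available when Solr runs in cloud mode.
        params = {"action": "CREATE", "name": name, "numShards": 1, "replicationFactor": 1}
        return f"{base_url}/admin/collections?{urlencode(params)}"
    if mode == "server":
        # CoreAdmin API, used by a single node without ZooKeeper.
        params = {"action": "CREATE", "name": name, "configSet": config_set}
        return f"{base_url}/admin/cores?{urlencode(params)}"
    raise ValueError(f"unknown Solr mode: {mode}")


base = "http://localhost:8983/solr"
print(create_index_url(base, "atom", mode="cloud"))
print(create_index_url(base, "atom", mode="server"))
```

A single mode setting read from the plugin configuration could drive this kind of branching wherever the two modes differ.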

Outside arSolrPlugin

  • AtoM extensively references Elastica, and the arElasticSearchPlugin is also deeply integrated into it. As of now, this is a list of all of the places outside the plugin itself that would need updates:

  • [ ] apps/qubit/modules/digitalobject/actions/imageflowComponent.class.php uses arElasticSearchPluginQuery, QubitSearch.

  • [ ] apps/qubit/modules/clipboard/actions/viewAction.class.php uses Elastica ResultSet, Response, Query, QueryTerms, QubitSearchPager, arElasticSearchPluginConfiguration.

  • [ ] apps/qubit/modules/default/actions/moveAction.class.php uses Elastica Query, BoolQuery, QueryTerm, QubitSearchPager, arElasticSearchPluginUtil, arElasticSearchPluginConfiguration.

  • [ ] apps/qubit/modules/default/actions/fullTreeViewAction.class.php uses Elastica QueryTerm, Elastica ResultSet (as arguments to methods), has several method names which reference ElasticSearch, arElasticSearchPluginQuery.

  • [ ] apps/qubit/modules/default/actions/browseAction.class.php uses arElasticSearchPluginQuery, arElasticSearchPluginConfiguration, QubitSearch.

  • 👆🏼 NOTE: replace L#134-L#147 (the section that essentially removes must clauses for i18n.languages queries) with a call to the removeMustWithTermField method in arSolrBoolQuery

  • [ ] apps/qubit/modules/repository/actions/holdingsAction.class.php uses Elastica QueryBool, QueryMatchAll, QueryTerm, Query, QubitSearch, arElasticSearchPluginConfiguration.

  • [ ] apps/qubit/modules/repository/actions/browseAction.class.php uses Elastica QueryMatchAll, Query, QueryTerm, arElasticSearchPluginUtil, QubitSearch.

  • [ ] apps/qubit/modules/repository/actions/maintainedActorsAction.class.php uses Elastica Query, QueryTerm, QubitSearch, QubitSearchPager, arElasticSearchPluginConfiguration.

  • [ ] apps/qubit/modules/taxonomy/actions/indexAction.class.php uses Elastica Query, BoolQuery, QueryTerm, arElasticSearchPluginUtil, arElasticSearchPluginConfiguration, QubitSearch, QubitSearchPager.

  • [ ] apps/qubit/modules/actor/actions/browseAction.class.php uses Elastica BoolQuery, QueryTerm, QueryExists, NestedQuery, arElasticSearchPluginUtil, QubitSearch, QubitSearchPager.

  • [ ] apps/qubit/modules/actor/actions/relatedInformationObjectsAction.class.php uses Elastica Query, BoolQuery, QueryTerm, NestedQuery, QubitSearchPager, QubitSearch, arElasticSearchPluginConfiguration.

  • [ ] apps/qubit/modules/search/actions/errorAction.class.php uses Elastica Exception, references ElasticSearch in error message.

  • [ ] apps/qubit/modules/search/actions/indexAction.class.php uses Elastica QueryTerm, QubitSearch, arElasticSearchPluginUtil.

  • [ ] apps/qubit/modules/search/actions/autocompleteAction.class.php uses Elastica Search, MultiSearch, Query, BoolQuery, Match, Term, QubitSearch.

  • [ ] apps/qubit/modules/search/actions/descriptionUpdatesAction.class.php uses Elastica Query, BoolQuery, QueryTerm, QueryRange, QubitSearch, QubitSearchPager, arElasticSearchPluginConfiguration.

  • [ ] apps/qubit/modules/term/actions/navigateRelatedComponent.class.php uses Elastica QueryTerm, QubitSearch, arElasticSearchPluginQuery.

  • [ ] apps/qubit/modules/term/actions/indexAction.class.php uses Elastica QueryTerms, Query, BoolQuery, QueryTerm, QubitSearch, QubitSearchPager.

  • [ ] apps/qubit/modules/informationobject/actions/inventoryAction.class.php uses Elastica BoolQuery, Query, QueryTerm, QueryTerms, QubitSearch, QubitSearchPager, arElasticSearchPluginConfiguration.

  • [ ] apps/qubit/modules/informationobject/actions/autocompleteAction.class.php uses Elastica Query, BoolQuery, MatchAll, QueryTerm, arElasticSearchPluginUtil, QubitSearch, QubitSearchPager.

  • [ ] lib/filter/QubitMeta.class.php references Elastica Exception.

  • [ ] lib/QubitLftSyncer.class.php uses Elastica Bulk, QueryTerm, Document, QubitSearch, arElasticSearchPluginQuery.

  • [ ] lib/search/QubitSearchPager.class.php uses Elastica ResultSet.

  • [ ] lib/helper/QubitHelper.php references Elastica Result.

  • [ ] lib/job/arUpdateEsActorRelationsJob.class.php references Elastica exception, QubitSearch, arElasticSearchActorPdo.

  • [ ] lib/job/arActorExportJob.class.php uses Elastica QueryTerms, arElasticSearchPluginUtil, QubitSearch.

  • [ ] lib/job/arRepositoryCsvExportJob.class.php uses Elastica QueryTerms, arElasticSearchPluginQuery, arElasticSearchPluginUtil, QubitSearch.

  • [ ] lib/job/arUpdatePublicationStatusJob.class.php uses Elastica AbstractScript, QueryTerm, QubitSearch.

  • [ ] lib/job/arInformationObjectExportJob.class.php uses Elastica QueryTerm, QueryTerms, arElasticSearchPluginUtil, arElasticSearchPluginQuery, QubitSearch.

  • [ ] lib/task/tools/updatePublicationStatusTask.class.php uses Elastica AbstractScript, QueryTerm, QubitSearch.

  • [ ] lib/task/propel/propelGenerateSlugsTask.class.php uses Elastica Query, BoolQuery, QueryTerm, QubitSearch.

  • [ ] lib/model/QubitInformationObject.php uses Elastica BoolQuery, Query, QueryMatch, QubitSearch.

  • [ ] lib/model/QubitTerm.php uses Elastica BoolQuery, QueryTerm, QubitSearch.

  • [ ] lib/task/search/arSearchStatusTask.class.php uses arElasticSearchPluginConfiguration, looks for class names starting with arElasticSearch in objectsAvailableToIndex.

  • [ ] lib/task/tools/installTask.class.php uses arElasticSearchPluginConfiguration.

  • [ ] lib/job/arUpdateEsIoDocumentsJob.class.php uses arElasticSearchInformationObject.

  • [ ] lib/job/arUpdateEsActorRelationsJob.class.php uses arElasticSearchActorPdo.

  • [ ] lib/job/arActorExportJob.class.php uses arElasticSearchPluginUtil, arElasticSearchPluginQuery.

  • [ ] lib/arInstall.class.php references arElasticSearchPlugin's search.yml and uses arElasticSearchConfigHandler.

  • [ ] lib/task/import/csvImportTask.class.php uses arElasticSearchInformationObjectPdo, QubitSearch.

  • [ ] lib/QubitMetsParser.class.php uses arElasticSearchPluginUtil.

  • [ ] lib/search/QubitSearch.class.php uses arElasticSearchPlugin.

  • [ ] lib/search/QubitSearchEngine.class.php references ElasticSearch.

  • [ ] lib/QubitFlatfileImport.class.php references ElasticSearch.

  • [ ] lib/task/propel/propelGenerateSlugsTask.class.php references ElasticSearch.

  • [ ] config/ProjectConfiguration.class.php sets up arElasticSearchPlugin.

  • [ ] plugins/qbAclPlugin/lib/QubitAclSearch.class.php uses Elastica Query, BoolQuery, QueryTerm.

  • [ ] plugins/sfSkosPlugin/test/unit/importTest.php uses Elastica Exception, QubitSearch.

  • [ ] plugins/arRestApiPlugin/lib/QubitApiAction.class.php uses Elastica Query.

  • [ ] plugins/arRestApiPlugin/modules/api/actions/informationobjectsBrowseAction.class.php uses arElasticSearchPluginConfiguration, arElasticSearchPluginQuery.

  • [ ] plugins/qtAccessionPlugin/modules/accession/actions/browseAction.class.php uses Elastica Query, BoolQuery, QueryMatchAll, QubitSearch, QubitSearchPager, arElasticSearchPluginUtil, arElasticSearchPluginConfiguration.

  • [ ] test/unit/escapeTermTest.php tests arElasticSearchPluginUtil::escapeTerm.
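
For the escapeTerm item at the end of this list, a Solr-side helper would need to escape the characters that are special in Solr's standard query parser. The Python sketch below is an illustrative assumption of what such a helper might do. It does not reproduce arElasticSearchPluginUtil::escapeTerm, and the function name is hypothetical.

```python
import re

# Characters with special meaning in the Lucene/Solr standard query syntax.
_SPECIAL_CHARS = re.compile(r'([\\+\-!():^\[\]"{}~*?|&/])')


def escape_term(term):
    """Prefix each special query character with a backslash."""
    return _SPECIAL_CHARS.sub(r"\\\1", term)


print(escape_term("1/2 (draft)"))  # 1\/2 \(draft\)
print(escape_term("title:foo"))    # title\:foo
```

Escaping every `&` and `|` individually is slightly broader than strictly needed (only `&&` and `||` are operators), but it keeps the rule simple.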


In addition to the list above, other tasks that would need to be completed in order to switch to Solr:

  • [ ] Make Solr a default plugin that is enabled out of the box
  • [ ] Update installTask to set up a config file for Solr in the root config folder (similar to ES), and change arSolrPluginConfiguration to point to this file
  • [ ] Create a new Vagrant setup for development with Solr
  • [ ] Update AtoM Docs: New documentation would need to be added that details installation and configuration. ElasticSearch advanced queries would also no longer work, but they could be replaced with documentation for Solr's query syntax, which would allow performing complex custom queries.

anvit, May 16 '24 17:05