mudrod Implement SolrDriver/GoraDriver for PO.DAAC Integration

As requested by the project team, we should look into extending Mudrod's storage functionality so that we can use Apache Solr as an indexing server. The justification is simple: this is what is in use at PO.DAAC. We should review both SolrJ and Apache Gora as options before hardcoding anything.

Feb 01 '17 21:02 lewismc

After a little preliminary research, this looks like it is going to be very difficult. A search for "import org.elasticsearch" yields 27 different files that depend directly on the Elasticsearch libraries, so at a minimum we will need to alter these 27 files. I haven't done any further analysis of how hard it will be to extract ES from them.

Occurrences of 'import org.elasticsearch' in Project
- mudrod-core
  - esiptestbed.mudrod.driver
    - ESDriver.java
  - esiptestbed.mudrod.integration
    - LinkageIntegration.java
  - esiptestbed.mudrod.metadata.pre
    - ApiHarvester.java
  - esiptestbed.mudrod.metadata.structure
    - MetadataExtractor.java
  - esiptestbed.mudrod.ontology.process
    - OntologyLinkCal.java
  - esiptestbed.mudrod.recommendation.pre
    - ImportMetadata.java
    - NormalizeVariables.java
    - SessionCooccurence.java
  - esiptestbed.mudrod.recommendation.process
    - VariableBasedSimilarity.java
  - esiptestbed.mudrod.recommendation.structure
    - HybridRecommendation.java
    - MetadataOpt.java
    - RecomData.java
  - esiptestbed.mudrod.ssearch
    - ClickstreamImporter.java
    - Dispatcher.java
    - Searcher.java
  - esiptestbed.mudrod.ssearch.ranking
    - TrainingImporter.java
  - esiptestbed.mudrod.utils
    - ESTransportClient.java
    - LinkageTriple.java
  - esiptestbed.mudrod.weblog.pre
    - CrawlerDetection.java
    - HistoryGenerator.java
    - ImportLogFile.java
    - LogAbstract.java
    - RemoveRawLog.java
    - SessionGenerator.java
    - SessionStatistic.java
  - esiptestbed.mudrod.weblog.structure
    - Session.java
    - SessionExtractor.java
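
For anyone who wants to re-run this search as the code changes, here is a minimal sketch in plain Java. The class name `EsImportScanner` and the method `filesImporting` are made up for illustration and are not part of Mudrod; it assumes source files end in `.java` and that each import sits on its own line.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper (not part of Mudrod): lists .java files that import a given package. */
class EsImportScanner {

    /** Walks root recursively and returns, in sorted order, every .java file importing the prefix. */
    static List<Path> filesImporting(Path root, String prefix) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(p -> p.toString().endsWith(".java"))
                .filter(p -> importsPackage(p, prefix))
                .sorted()
                .collect(Collectors.toList());
        }
    }

    /** True if any line of the file starts with "import <prefix>" after trimming whitespace. */
    private static boolean importsPackage(Path file, String prefix) {
        try (Stream<String> lines = Files.lines(file)) {
            return lines.anyMatch(l -> l.trim().startsWith("import " + prefix));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling `EsImportScanner.filesImporting(Paths.get("mudrod-core"), "org.elasticsearch")` from the project root should produce a list like the one above, assuming the module directory is named `mudrod-core`.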

Feb 03 '17 00:02 fgreg

No joke, it is a non-trivial codebase amendment. We have two options:

- essentially rip all the ES stuff out and do a direct replacement with SolrJ, or
- make an attempt to abstract the functionality out into a core Driver interface, which would live in esiptestbed.mudrod.driver.
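
To make the second option a bit more concrete, here is a rough sketch of what a backend-neutral driver contract could look like. Everything in it is an assumption for discussion: the names `StorageDriver` and `InMemoryDriver` and the two methods are invented here and do not reflect the existing ESDriver API. The in-memory class exists only so the interface can be exercised without ES or Solr; a real SolrDriver or GoraDriver would implement the same interface.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical backend-neutral contract; not the existing Mudrod ESDriver API. */
interface StorageDriver extends AutoCloseable {
    /** Stores (or overwrites) a document under the given collection and id. */
    void index(String collection, String id, Map<String, Object> doc);

    /** Returns documents whose field value contains the term as a substring. */
    List<Map<String, Object>> search(String collection, String field, String term);

    @Override
    void close();
}

/** In-memory stand-in, used only to show that callers need no backend-specific imports. */
class InMemoryDriver implements StorageDriver {
    // collection name -> (document id -> document fields)
    private final Map<String, Map<String, Map<String, Object>>> store = new HashMap<>();

    @Override
    public void index(String collection, String id, Map<String, Object> doc) {
        store.computeIfAbsent(collection, k -> new HashMap<>()).put(id, new HashMap<>(doc));
    }

    @Override
    public List<Map<String, Object>> search(String collection, String field, String term) {
        List<Map<String, Object>> hits = new ArrayList<>();
        Map<String, Map<String, Object>> docs = store.get(collection);
        if (docs == null) {
            return hits; // unknown collection: no results
        }
        for (Map<String, Object> doc : docs.values()) {
            Object value = doc.get(field);
            if (value != null && value.toString().contains(term)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    @Override
    public void close() {
        store.clear();
    }
}
```

With something along these lines, the 27 files would depend only on the interface, and the choice between ES, SolrJ, or Gora would be confined to the implementations in esiptestbed.mudrod.driver.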

The other issue we need to consider is what the tradeoffs are in terms of performance between the Spark + ES integration we currently have (parallelized log ingestion and subsequent processing) and the Spark + Solr alternative (which we still have to design and implement).

I have previously used Lucidworks spark-solr for this. It would be a great place to start. Right now I think that it may be best for us to

Feb 03 '17 05:02 lewismc
