Dissect Solr Performance through New Relic

Open nickumia-reisys opened this issue 2 years ago • 2 comments

User Story

In order to gain insight into why Solr has stability issues, the Data.gov Solr team wants to integrate New Relic (NR) into our Solr deployment and investigate performance metrics to isolate problem areas.

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

  • [ ] GIVEN a Solr instance is provisioned
    WHEN I log into NR
    THEN I can see a dashboard of performance metrics for Solr

Background

There have been numerous issues with Solr where we could not identify the cause of the problem and were developing blindly. This issue would give us insight into which function calls or Solr operations are causing the various problems and help us identify which parameters should be tuned to address each one.

Historical Issues:

  • https://github.com/GSA/data.gov/issues/3647
  • https://github.com/GSA/data.gov/issues/3875
  • https://github.com/GSA/data.gov/issues/3636
  • https://github.com/GSA/data.gov/issues/3603
  • https://github.com/GSA/data.gov/issues/3917
  • https://github.com/GSA/data.gov/issues/3797
  • https://github.com/GSA/data.gov/issues/3784
  • https://github.com/GSA/data.gov/issues/3783
  • https://github.com/GSA/data.gov/issues/3770
  • https://github.com/GSA/data.gov/issues/3920

Security Considerations (required)

[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]

Sketch

  • [ ] #4473
  • [ ] Test how metrics are being loaded into NR
  • [ ] Identify a way of injecting the NR license during deployment
  • [ ] Investigate staging/production Solr to identify the problem areas.

Reference: https://tech.olx.com/improving-solr-performance-f4202d28b72d
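For the "injecting the NR license during deployment" item, here is a minimal sketch of one possible approach. It assumes the integration uses the New Relic Java agent, which reads `NEW_RELIC_LICENSE_KEY` and `NEW_RELIC_APP_NAME` from the environment, and that Solr is started through `bin/solr`, which honors `SOLR_OPTS`. The function name, the agent jar path, and the default app name are hypothetical and not taken from this issue.

```python
from typing import Dict, Mapping


def solr_env_with_new_relic(
    base_env: Mapping[str, str],
    license_key: str,
    agent_jar: str = "/opt/newrelic/newrelic.jar",  # hypothetical install path
    app_name: str = "solr",  # hypothetical application name shown in NR
) -> Dict[str, str]:
    """Return a copy of base_env with the NR Java agent settings added."""
    if not license_key:
        # Fail at deploy time rather than starting Solr without monitoring.
        raise ValueError("New Relic license key is empty")

    env = dict(base_env)
    # The license comes from the deployment's secret store, never from the repo.
    env["NEW_RELIC_LICENSE_KEY"] = license_key
    env["NEW_RELIC_APP_NAME"] = app_name

    # Append the agent flag to whatever JVM options are already configured.
    existing_opts = env.get("SOLR_OPTS", "").strip()
    env["SOLR_OPTS"] = f"{existing_opts} -javaagent:{agent_jar}".strip()
    return env
```

The returned mapping could then be passed as the environment of the process that launches Solr, so the key never has to be written into a config file in the image.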

nickumia-reisys avatar Sep 15 '22 23:09 nickumia-reisys

Through an interactive discussion with NR Support, it was determined that there are Solr optimizations we can do:

  1. Speed Optimization
    1. Number of Solr Calls
      • For a homepage load, there are 21 calls to Solr. There should be no need for more than 2. In other words, the CKAN core code is inefficient. If we wanted to optimize this, it would be a daring endeavor to do so without breaking a CKAN feature. Serious thought would be needed for this effort.
    2. Speed of Solr Calls
      • Each Solr call takes around 1s on average, which means our Solr deployment is pretty inefficient. For context, a ("similar"?) DB call is made 140 times, but all of those calls take less than 100ms cumulatively. We probably won't be able to match the DB speed, and it is unclear whether the performance gain would justify the cost of increasing the size of the Solr instance. We currently allocate 4 vCPU, and AWS supports up to 16 vCPU.
  2. Better Performance Monitoring
    1. See https://github.com/GSA/data.gov/issues/4473 for more details.
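To put the numbers above side by side, here is a rough back-of-the-envelope calculation. It uses only the figures quoted in this comment. Treating the Solr calls as sequential is an assumption (if CKAN issues some of them in parallel, the wall-clock cost is lower), and the variable names are illustrative.

```python
# Figures quoted above; sequential execution is an assumption.
SOLR_CALLS_PER_HOMEPAGE = 21
SOLR_CALLS_TARGET = 2
SOLR_SECONDS_PER_CALL = 1.0  # "around 1s on average"
DB_CALLS = 140
DB_TOTAL_MS_UPPER_BOUND = 100.0  # "less than 100ms" for all DB calls together


def cumulative_seconds(calls: int, seconds_per_call: float) -> float:
    """Total time spent if the calls run one after another."""
    return calls * seconds_per_call


current_solr_s = cumulative_seconds(SOLR_CALLS_PER_HOMEPAGE, SOLR_SECONDS_PER_CALL)
target_solr_s = cumulative_seconds(SOLR_CALLS_TARGET, SOLR_SECONDS_PER_CALL)
db_ms_per_call = DB_TOTAL_MS_UPPER_BOUND / DB_CALLS  # upper bound per DB call

print(f"Solr, current call count: ~{current_solr_s:.0f} s per homepage load")
print(f"Solr, target call count:  ~{target_solr_s:.0f} s per homepage load")
print(f"DB: under {db_ms_per_call:.2f} ms per call")
```

Under these assumptions, cutting the call count addresses most of the cumulative Solr time even if per-call latency stays the same, while the per-call gap to the DB remains roughly three orders of magnitude.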

These points do not yet address why there is a (memory leak?) in Solr or what we can do to resolve it. The second point (better performance monitoring) will allow us to do more debugging.
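Until NR monitoring is in place, one way to watch for the suspected leak is to poll Solr's Metrics API for JVM heap usage and check whether it keeps climbing. The sketch below assumes the `/admin/metrics` endpoint with `group=jvm` and a response shaped like `{"metrics": {"solr.jvm": {"memory.heap.used": ..., "memory.heap.max": ...}}}`, as returned by recent Solr versions. The response shape should be verified against our deployed version, and the function names are hypothetical.

```python
import json
from typing import Tuple
from urllib.parse import urlencode
from urllib.request import urlopen


def heap_metrics_url(base_url: str) -> str:
    """Build the Metrics API URL limited to JVM heap metrics."""
    query = urlencode({"group": "jvm", "prefix": "memory.heap", "wt": "json"})
    return f"{base_url.rstrip('/')}/admin/metrics?{query}"


def heap_usage(payload: dict) -> Tuple[int, int]:
    """Extract (used_bytes, max_bytes) from a Metrics API response.

    Assumes the map-style response layout; adjust if the deployed
    Solr version returns a different structure.
    """
    jvm = payload["metrics"]["solr.jvm"]
    return int(jvm["memory.heap.used"]), int(jvm["memory.heap.max"])


def fetch_heap_usage(base_url: str, timeout: float = 10.0) -> Tuple[int, int]:
    """Query a running Solr node, e.g. base_url='http://localhost:8983/solr'."""
    with urlopen(heap_metrics_url(base_url), timeout=timeout) as response:
        return heap_usage(json.load(response))
```

Calling `fetch_heap_usage` on a schedule and logging the ratio of used to max heap would show whether usage returns to a baseline after garbage collection or grows steadily, which is the pattern a leak would produce.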

nickumia-reisys avatar Sep 29 '23 19:09 nickumia-reisys

Depends upon this ticket: https://github.com/GSA/data.gov/issues/3956

btylerburton avatar Dec 14 '23 21:12 btylerburton