SOLR-17492: Introduce recommendations of WAYS of running Solr from small to massive
https://issues.apache.org/jira/browse/SOLR-17492
Description
Add recommendations of best practices for deploying Solr
Solution
I am starting with my approach that I shared at Community/Code NA, just to get us moving. I would love the wisdom of the community. We have many areas where different folks have knowledge, and it's all pretty tribal. I'd like to get it all written down so folks don't have to relearn the same thing over and over.
Tests
No tests, but this does need eyeballs!
We have diagrams generated in our Markdown!
First pass is done! I have added NOTE: markers in a number of places where more input is needed. I think this could be a good page to discuss as a group at a Community Meetup, to make sure we are going in a direction that the community supports.
Whether you're just getting started with Solr or looking to fine-tune an existing setup, these practical tips and real-world scenarios may help you get the most out of this powerful search platform.
Best Practices for Using Solr
1. Run Solr as a Cluster for Better Performance
Solr works best when deployed as a cluster. Start with at least three nodes for fault tolerance and scalability, and scale horizontally as your needs grow.
- Sharding and Replication: Break your data into shards for parallel processing and use replicas for redundancy. A good starting point is two replicas per shard, but adjust this based on your workload.
- Optimize Indexing: Carefully plan your schema to ensure efficient indexing and querying. Use dynamic fields and copy fields where appropriate to keep things flexible without overloading your system.
- Caching for Speed: Solr provides powerful caching options like query, document, and filter caches. Use these for frequently accessed data to speed up query times significantly.
- Tune the JVM: Since Solr is Java-based, JVM tuning is crucial. Adjust heap size to balance memory usage and garbage collection. Monitor GC logs and experiment with collectors like G1GC (note that CMS has been removed from recent JDKs).
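To make the sharding and replication starting point above concrete, here is a rough sketch using the Collections API (the collection name, configset name, and host are hypothetical):

```shell
# Hedged example: assumes a SolrCloud node at localhost:8983 and an
# uploaded configset named "myconfig". Creates a collection with two
# shards and two replicas per shard, per the starting point above.
curl "http://localhost:8983/solr/admin/collections?action=CREATE\
&name=products&numShards=2&replicationFactor=2\
&collection.configName=myconfig"
```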
2. Always Use Solr in Cloud Mode
For a robust, scalable setup, SolrCloud mode is the way to go. This setup requires ZooKeeper, which manages cluster coordination, leader election, and configuration.
- ZooKeeper’s Role: ZooKeeper ensures your Solr cluster runs smoothly by handling shard placement, failover, and configuration changes dynamically.
- Backups and Security:
  - Always back up your Solr and ZooKeeper data regularly. Use Solr's built-in backup tools or external snapshot mechanisms for safety.
  - Secure your cluster with SSL/TLS, and set up role-based access control, ideally with tools like Apache Ranger. If Ranger isn’t an option, manual permissions management works too.
- Monitoring is Essential: Keeping an eye on your Solr cluster is crucial for smooth operations. A great place to start is the Solr Admin UI, which provides a user-friendly interface for monitoring metrics like query performance, index health, and cache usage. It's easy to use and perfect for quickly spotting issues. For more advanced needs, you can integrate tools like Prometheus and Grafana for custom dashboards and alerting (though I don’t have direct experience using Prometheus or Grafana with Solr specifically).
Usage Scenarios: Real-World Applications of Solr
1. Managing Solr for a Large Dataset
I used open-source Solr as a search engine for a mobile app. Instead of interacting with Solr directly, I managed the setup via ZooKeeper APIs. Here’s what that looked like:
- Cluster Configuration: The cluster handled over 100 TB of data spread across 11 physical machines, each running 16 Solr instances.
- Sharding and Replication: Data was stored in shards, with each shard having two replicas to ensure fault tolerance and load balancing.
- Data Storage: Data was stored directly on the local file system, which was a great fit for this use case.
- Management Approach: Instead of accessing Solr directly, I managed the system via ZooKeeper APIs. This approach, even with an embedded ZooKeeper, worked efficiently under heavy load.
2. Using Solr with Cloudera and HDFS
Another scenario involved deploying Solr in a Cloudera ecosystem with HDFS for storage. Here’s what worked and what didn’t:
- Cluster Management: ZooKeeper handled cluster coordination, while Ranger (and previously Sentry) managed permissions.
- Challenges: Occasionally, node failures caused HDFS file locks, which were difficult to resolve without downtime. These required manual fixes and a lot of patience!
If you’ve got questions or need help with something specific, just let me know. I’m happy to share more!
This PR has had no activity for 60 days and is now labeled as stale. Any new activity will remove the stale label. To attract more reviewers, please tag people who might be familiar with the code area and/or notify the [email protected] mailing list. To exempt this PR from being marked as stale, make it a draft PR or add the label "exempt-stale". If left unattended, this PR will be closed after another 60 days of inactivity. Thank you for your contribution!
This remains on my "must do" list for Solr 10, and I will pick it up as we get closer ;-).
This PR is now closed due to 60 days of inactivity after being marked as stale. Re-opening this PR is still possible, in which case it will be marked as active again.
I am kind of waiting for the 10.x release cycle to spin up to push this along. There are some things I would change/update in this doc if we get some nicer ZK quorum stuff and role stuff done...
@tboeghk this is what we talked about in line for lunch!! Would really appreciate your perspective.
In addition to the great summary from @ardatezcan1 above, here are my practical tips and real-world scenarios for running Solr in a high-rpm, low-to-medium-dataset environment (like ecommerce applications).
Best practices for using Solr in high-rpm environments
Before starting to optimize your Solr setup, make sure you have strong observability in place. In addition to the Solr Prometheus exporter and Grafana setup, I strongly recommend setting up the Node Exporter to gather and correlate machine metrics.
- Use Solr in cloud mode: Running Solr in cloud mode with a ZooKeeper ensemble is a prerequisite for the following best practices. Cloud mode enables easy addition and removal of Solr cluster nodes depending on the current traffic.
- Sharding: Request processing in Solr is a single-threaded operation, so the larger your dataset, the more latency you'll add to request processing. The only (sustainable) way to make query processing multi-threaded is to shard your index. Depending on your workload, you could simply run multiple Solr instances on the same machine, though I recommend a single Solr instance per machine.
- Sharding strategies: If your query processing strategy uses collapse (and expand, or grouping), make sure to put all documents sharing a grouping key on the same shard. Adjust the document routing and `router.field` to your grouping key.
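The routing advice above can be sketched as follows (the collection, configset, and field names are hypothetical):

```shell
# Hedged example: composite-id routing on a grouping field, so all
# documents sharing the same groupId land on the same shard.
curl "http://localhost:8983/solr/admin/collections?action=CREATE\
&name=products&numShards=4&replicationFactor=2\
&router.name=compositeId&router.field=groupId\
&collection.configName=myconfig"
```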
- Indexing and optimization strategies: Indexing into a live collection adds significant latency to your search requests. Each commit flushes the internal caches, and those caches keep Solr running fast. Avoid any unnecessary cache flushes!
- Optimize your index: Manually optimizing (force-merging) your index is generally not recommended, but it delivers the best query performance because deleted documents are pruned from the index.
- Rotate collections: For small to medium datasets, it can be a good strategy to periodically index your data into a new collection instead of updating an existing one. That way, request caches stay warm for the lifetime of a collection and a manual optimize is possible. Use collection aliases to switch clients to the new collection.
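The collection-rotation strategy above might look roughly like this (all names are hypothetical):

```shell
# Hedged sketch: index into a fresh, versioned collection, then flip
# an alias so clients switch atomically. The old collection can be
# dropped once the alias points at the new one.
curl "http://localhost:8983/solr/admin/collections?action=CREATE\
&name=products_v2&numShards=2&replicationFactor=2\
&collection.configName=myconfig"
# ... index (and optionally optimize) products_v2 here ...
curl "http://localhost:8983/solr/admin/collections?action=CREATEALIAS\
&name=products&collections=products_v2"
```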
- Use dedicated node setups: In high-traffic environments, separation of concerns becomes more important. Use dedicated node types and machine sizings/setups for optimal performance tailored to each machine's role.
- Indexer: Used solely for indexing documents. Set up as the `TLOG` replica type. Must not be used for request processing; exclude `TLOG` node types from request processing using the `shards.preference` parameter configured on your request handlers.
- Data: Set up as a `PULL` replica, which replicates its index from the indexer nodes via SolrCloud. Using `TLOG` and `PULL` replicas avoids index data being pulled off data nodes (as happens with `NRT` replicas).
- Coordinator: In sharded SolrCloud setups, these nodes coordinate the distributed request flow and assemble the final search result. This is a very CPU-intensive operation that is usually shared among the data nodes. Dedicated coordinator nodes move the compute overhead of coordinating distributed requests off the data nodes, and adding them to a SolrCloud setup will drop resource usage on data nodes significantly. To make full use of coordinator nodes, direct all incoming request traffic to them.
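A sketch of the dedicated node setup above, using TLOG and PULL replica types and steering queries to the PULL replicas (names are hypothetical):

```shell
# Hedged example: one TLOG replica (indexer) and two PULL replicas
# (data nodes) per shard.
curl "http://localhost:8983/solr/admin/collections?action=CREATE\
&name=products&numShards=2&tlogReplicas=1&pullReplicas=2\
&collection.configName=myconfig"

# Prefer PULL replicas at query time so TLOG indexers stay out of the
# request path (this can also be set in request handler defaults):
curl "http://localhost:8983/solr/products/select?q=*:*\
&shards.preference=replica.type:PULL"
```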
- JVM tuning: I highly recommend running Solr on the G1GC garbage collector. Keep in mind the golden rule of capping the heap at about 50% of RAM on data and indexer nodes, so the rest can serve as disk cache. As coordinator nodes are stateless, you can boost their performance significantly with the ZGC garbage collector, which cuts collection pauses from milliseconds to sub-millisecond.
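The JVM advice above could translate to something like the following in `solr.in.sh` (the values are illustrative, not recommendations for any specific workload):

```shell
# Hedged example solr.in.sh fragment. Keep the heap at or below ~50%
# of RAM on data/indexer nodes so the rest serves as OS disk cache.
SOLR_HEAP="8g"
GC_TUNE="-XX:+UseG1GC -XX:MaxGCPauseMillis=250"
# On stateless coordinator nodes with a recent JDK, ZGC may be worth
# testing instead:
# GC_TUNE="-XX:+UseZGC"
```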
- Cloud setup: Most SolrCloud setups will run in some kind of cloud environment. Here are some tips for setting up an elastic Solr cloud environment.
- Autoscaling: Use a dedicated autoscaling group for each node type and each shard. Use tags to mark which instance should replicate which shard. Configure your heap settings dynamically and allow a wide range of instance types. Build a custom script to replicate data on instance start, and use the Solr Collections API to remove a node from the cluster during instance termination.
- Spot instances: Coordinator and data nodes are great candidates for spot instances. This can save a big chunk of your cloud spend.
- ARM instance types: Use ARM instance types wherever possible; the Solr Docker image is also pre-built for ARM architectures. ARM CPUs offer the best bang for the buck and more consistent response latency (as their CPU is not power-managed).
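For the termination hook mentioned above, a scale-in script might decommission the instance via the Collections API before shutdown (the node name is hypothetical; Solr node names take the form host:port_solr):

```shell
# Hedged sketch: remove all replicas hosted on a terminating node so
# the cluster state stays clean after the instance disappears.
curl "http://localhost:8983/solr/admin/collections?action=DELETENODE\
&node=ip-10-0-0-5.internal:8983_solr"
```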
If you need more information, or help compiling all of this into a single document, let me know!
@tboeghk so I have a new taking-solr-to-production.adoc doc that tries to be an opinionated scaling guide. I think that a LOT of what you mentioned makes sense at the "Moving Beyond the Basic Cluster" scaling point, which I listed as six to 12 nodes in your cluster. I know all "best practices" could be done earlier, but I'm trying to frame this as "when you get to this size, you need to do this". Thoughts? The number of nodes, while a simplistic measure, is also the easiest to explain versus the query load, index load, or data load that would be more complex for deciding "where am I".
@tboeghk and @ardatezcan1 I've updated this branch to run with the latest version of Solr. My goal is to get this doc in (in one form or another) before Solr 10 comes out. If either of you wants to edit the doc to factor in your suggestions, please feel free. Otherwise I will try to farm your comments and add them, but it'll be more from my own personal perspective.
For those who haven't seen it, we are now generating diagrams from ASCII markup! I am excited to make it easier to add diagrams to Solr that don't require a binary image that is then hard to update.
Some good progress. If https://github.com/apache/solr/pull/2391 happens then this is good to go. If 2391 doesn't land before 10, then I'll edit this and then merge it.
Hi all who have contributed to this long-lived PR! With Solr 10 close to being released, I wanted to bend this towards something mergeable. I've edited the doc down, and there is only one TBD that needs editing before this can be merged.
The doc is narrower than this PR suggests; however, I think there is an "Extreme Scale" or similar doc that could be made that would take in a lot of the feedback provided.
In order to not have "forward looking" text in the Ref Guide, we need #2391 to get in... I am going to take a stab at it tomorrow.
See https://issues.apache.org/jira/projects/SOLR/issues/SOLR-17507 for when we get this in. Maybe break it up into two, one side for small examples, and then in 10.1 or later the full doc?