[BUG] 2.11.1-2.19.0: searching searchable snapshots fails because search nodes run out of disk space after some time

Open sandervandegeijn opened this issue 1 year ago • 73 comments

Describe the bug

We have dedicated search nodes within the cluster with 100GB of SAN storage as a cache layer. On normal nodes OpenSearch would have gone red because the watermarks would have been exceeded. No such notice here: it keeps filling the disks until they are full, after which you get errors like:

[screenshots of the errors]

We have not set node.search.cache.size because these nodes don't do anything other than searching.

Also, the "Object object" error message isn't really helpful ghehe :)

Related component

Storage:Snapshots

To Reproduce

Leave the cluster running for a while, do several searches to fill the cache -> error.

Expected behavior

On a dedicated search node, use something like 90% of the available storage but nothing more, to prevent stability issues.

Additional Details

Plugins: Default Docker 2.11.1 images with the S3 plugin installed.

Screenshots: N/A

Host/Environment: Kubernetes with Docker images 2.11.1

Additional context: N/A

sandervandegeijn avatar Dec 27 '23 09:12 sandervandegeijn

@sandervandegeijn Have you configured a value for cluster.filecache.remote_data_ratio?

Can you share the output of GET _nodes/stats/file_cache when a node gets into this state?
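
For reference, something like the following should show the relevant numbers (the endpoint/port are placeholders for your cluster):

# Per-node file cache stats (active_in_bytes, used_in_bytes, evictions, hit/miss counts)
curl -s 'http://localhost:9200/_nodes/stats/file_cache?pretty'

# Check whether a remote_data_ratio safeguard is configured at all
curl -s 'http://localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty' | grep remote_data_ratio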

andrross avatar Dec 27 '23 17:12 andrross

I have set node.search.cache.size to 95GB of the 100GB available on the PVC. I'll see if that holds. Ideally I would like to use a percentage, and if not set by me, a proper default would be nice.
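
For reference, this is roughly what that looks like in opensearch.yml on the search nodes (95gb being my choice for the 100GB PVC):

node.search.cache.size: 95gb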

The cluster.filecache.remote_data_ratio doesn't do what I need if I understand the docs correctly, I don't want to be limited by the cache size. Cache is a bonus and should not prohibit the functioning of the cluster.

sandervandegeijn avatar Dec 27 '23 19:12 sandervandegeijn

The cluster.filecache.remote_data_ratio doesn't do what I need if I understand the docs correctly, I don't want to be limited by the cache size.

Unfortunately there are still limits here. If the actively-referenced data at any one time exceeds what is available on the machine, then you will run into problems like what you've described here. The result of the _nodes/stats/file_cache API will help determine if that is what is going on here. The remote_data_ratio setting is meant as a safeguard to prevent assigning more remote shards than can reasonably be supported.

andrross avatar Dec 28 '23 17:12 andrross

Will take a look at it tomorrow. What would "reasonable" mean here, and what would be the limiting factor? I'm currently on a cluster with:

  • 6 data nodes with 1TB full-flash SAN and 16GB RAM
  • 3 master/coordinating nodes with 8GB RAM
  • 10 search nodes with 100GB full-flash SAN and 16GB RAM. Backend storage is an S3 cluster with spinning disks and another flash cache in front of the spinning disks.
  • 2 ingest nodes with 2GB RAM

There's around 3.5TB on the hot nodes and some 15TB on the S3 cluster (growing with ±0.5TB/day; final size will be around 100TB). The search nodes make use of our S3 solution; I'm pulling more than 20gbit/s (and probably more now, having added more search nodes, no more recent measurements) from the S3 appliance. I/O response times are stable. All nodes can take 12 CPU cores if need be.

Performance is fine, doing a search on source.ip = x.x.x.x over the whole dataset takes around 10-15s which is more than acceptable.

sandervandegeijn avatar Dec 28 '23 17:12 sandervandegeijn

What would reasonable mean, what would be the limiting factor?

The strictly limiting factor will be shown by file_cache.active_in_bytes. That is the size of all the actively-referenced (i.e. unevictable) data. If your search workload leads to referencing more data at any point in time than there is storage available on the node, then you'll start seeing failures. In practice, as the active_in_bytes size approaches the total cache size, performance might start to degrade as well: you'll see increased cache misses, which means that search requests will have to pull down data on-demand more frequently.
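
If it helps, a quick way to watch that number per node (assumes jq is available; adjust the endpoint for your cluster):

curl -s 'http://localhost:9200/_nodes/stats/file_cache' \
  | jq '.nodes[] | {name, active_gb: (.file_cache.active_in_bytes / 1e9), used_gb: (.file_cache.used_in_bytes / 1e9), active_percent: .file_cache.active_percent}'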

andrross avatar Dec 28 '23 19:12 andrross

Okay, I've got stats for some of the nodes:

"MVyDvVRmT3OP-wzqw2elag": {
      "timestamp": 1703846452872,
      "name": "opensearch-search-nodes-0",
      "transport_address": "10.244.206.182:9300",
      "host": "10.244.206.182",
      "ip": "10.244.206.182:9300",
      "roles": [
        "search"
      ],
      "attributes": {
        "shard_indexing_pressure_enabled": "true"
      },
      "file_cache": {
        "timestamp": 1703846452872,
        "active_in_bytes": 81091561675,
        "total_in_bytes": 0,
        "used_in_bytes": 81703930059,
        "evictions_in_bytes": 265176318281,
        "active_percent": 99,
        "used_percent": 0,
        "hit_count": 313719,
        "miss_count": 63017
      }
    },
    "w1voQkb2SUKuBjDvK_Mfqw": {
      "timestamp": 1703846452883,
      "name": "opensearch-search-nodes-8",
      "transport_address": "10.244.207.169:9300",
      "host": "10.244.207.169",
      "ip": "10.244.207.169:9300",
      "roles": [
        "search"
      ],
      "attributes": {
        "shard_indexing_pressure_enabled": "true"
      },
      "file_cache": {
        "timestamp": 1703846452883,
        "active_in_bytes": 72728861570,
        "total_in_bytes": 0,
        "used_in_bytes": 73525779330,
        "evictions_in_bytes": 257238910529,
        "active_percent": 99,
        "used_percent": 0,
        "hit_count": 349777,
        "miss_count": 59203
      }
    },
    "J3g3YbS7Q_qbvLniDz19RQ": {
      "timestamp": 1703846452883,
      "name": "opensearch-search-nodes-9",
      "transport_address": "10.244.204.127:9300",
      "host": "10.244.204.127",
      "ip": "10.244.204.127:9300",
      "roles": [
        "search"
      ],
      "attributes": {
        "shard_indexing_pressure_enabled": "true"
      },
      "file_cache": {
        "timestamp": 1703846452883,
        "active_in_bytes": 75778946930,
        "total_in_bytes": 0,
        "used_in_bytes": 76533921650,
        "evictions_in_bytes": 258573952202,
        "active_percent": 99,
        "used_percent": 0,
        "hit_count": 362107,
        "miss_count": 60526
      }
    }
  }

From what you've written, is the data always transferred to the local storage of the nodes? I was under the impression it would get the data from S3 and it would write the most used parts to the local storage for performance gains. That there will be more cache misses when the local storage / S3 ratio changes is obvious (and a performance penalty I would happily pay to be able to archive the data).

Since I've set node.search.cache.size to effectively 95% of the local storage, I've not seen errors anymore (so far).

sandervandegeijn avatar Dec 29 '23 10:12 sandervandegeijn

Okay, had another run-in with full disks on the search nodes. From a sysadmin's perspective, the cache should be a bonus for performance optimisation and should not fill up the entire disk of the search node, regardless of the total storage in use on the S3 side.

Can I make this happen? :)

sandervandegeijn avatar Jan 08 '24 21:01 sandervandegeijn

I was under the impression it would get the data from S3 and it would write the most used parts to the local storage for performance gains...From a sysadmin's perspective, the cache should be a bonus for performance optimisation and should not fill up the entire disk of the search node regardless of the total storage in use on the S3 side.

@sandervandegeijn Fair points! However, the current architecture requires that the blocks pulled from the remote object store be written to disk, and then opened as files on disk to serve searches. This means we don't have to have all actively-referenced data loaded into memory, because we can rely on the operating system to page this data between the disk and page cache as appropriate.

I believe the scenario you are encountering is that the sum of the actively-referenced data is exceeding the disk space on the node. At a minimum, I believe a better experience would be to fail new searches that come in and require downloading new data, as opposed to letting the system run out of disk space and fail unpredictably. What do you think?

Separately, it does look like there's another bug here where your file cache stats are showing "total_in_bytes": 0. I've not seen that before, but "total_in_bytes" is supposed to be the size of the file cache and it is clearly not zero here.

andrross avatar Jan 10 '24 22:01 andrross

Okay, uhm... These kinds of things are better to discuss in person and with a whiteboard, but I'll try:

Failing gracefully and handling the situation is always better than just letting it blow up. But in this case it's not even doing searches in parallel: this is one search that's going through ±20TB of data, where 3.5TB is on full-flash storage (within the cluster) and the rest on S3. The search nodes (10) have full-flash storage (100GB each), so 1TB of active storage in addition to the hot nodes. It explains why I'm seeing 20-25gbit/s from the S3 storage: basically it's pulling in all the data as fast as possible. It also explains why doing concurrent searches over the whole dataset is almost impossible (acceptable).

Without having any knowledge of the caveats (and I intend this as constructive feedback):

This 1:20 ratio is about as much as I can stomach in terms of storage costs; ideally I would like something like 1:100, where I can live with slower performance. I really don't care that it would take 5-10 minutes for one search; it's archive, cold cold storage. Lower ratios negate the advantage of the lower storage costs on the S3 side, and it's not going to be practical. I'm also talking to another university that wants to store & search 5 petabytes (!) of logs. Searching 5PB this way seems almost impossible. You need too much primary storage to be cost effective when the ratios are too low.

That's a shame; it would be the ideal cost-effective solution to be able to do this. Thinking it through, it seems we have three problems:

  1. All referenced data needs to be on disk, so large searches also require large amounts of primary storage in the cluster, maybe almost in the 1:<10 ratio.
  2. The search nodes that run out of storage seem to trigger a red state on the cluster
  3. It needs to pull all the data from the object storage

Are there ways to improve the implementation in such a way that not all data always needs to be copied to primary storage, but data can be used directly from the object storage, with the cache on the search nodes preventing excessive reads of the most used data? The first two problems are the main things to solve; the third one would be very nice to solve if possible (by reading only the relevant data, as you would on block storage). You could work around the third one with massive bandwidth, although it's not efficient.

If this is not possible, I'm thinking about totally different ways of scaling: maybe one primary cluster for hot ingest and then a cluster per time unit (month?) with slower storage, using cross-cluster search to search through everything at once.

Have you experimented with these kinds of clusters for petabytes of data?

edit: ouch, my cluster went red because of this, it seems. I would really like the search nodes to be a "bonus" and never cause the cluster to stop working. As long as I have regular nodes that are functioning, the cluster should remain functional and keep ingesting data. The logs/monitoring can spam me to kingdom come about the search nodes not functioning. Updated the list of problems.

sandervandegeijn avatar Jan 11 '24 12:01 sandervandegeijn

Thanks for the feedback! To clarify, and to get deeper into the technical details...

Searchable snapshots use the Lucene Directory and IndexInput interfaces to abstract away the remote-versus-local details. When an IndexInput instance is opened, if the specific part of the file being pointed to is not already on disk, it pulls down that part of the file in an 8MiB chunk. When the "cursor" of the IndexInput moves past the current 8MiB part, it will then download the next 8MiB part, keeping previous parts cached on disk. So in the absolute simplest case, if you had a single search on a single shard with a single open IndexInput, then only 8MiB of data would be actively referenced at any one time. Reality is generally more complicated than that, but I would not expect a single search to have to actively reference all data of the index. I think we need to dig deeper into specifically what is happening in this case, because it does not match my expectation for how the system should perform.
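
To make the idea concrete, here is a rough sketch in Java of the block-fetch behaviour (class and method names are made up for illustration; this is not the actual OpenSearch code):

// Illustration only: reads are served from fixed-size parts that are downloaded
// on demand and cached on local disk, so only the parts currently being read
// need to be local, not the whole remote file.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class BlockedRemoteReader {

    private static final long BLOCK_SIZE = 8L * 1024 * 1024; // 8 MiB parts

    private final RemoteStore remote;  // hypothetical ranged-GET client for the object store
    private final Path cacheDir;       // local file cache directory on the search node
    private final String fileName;     // name of the remote Lucene file

    public BlockedRemoteReader(RemoteStore remote, Path cacheDir, String fileName) {
        this.remote = remote;
        this.cacheDir = cacheDir;
        this.fileName = fileName;
    }

    // Read a single byte at an absolute offset in the remote file.
    public byte readByte(long offset) throws IOException {
        long blockIndex = offset / BLOCK_SIZE;
        Path block = ensureBlockOnDisk(blockIndex);
        byte[] data = Files.readAllBytes(block); // simplified; real code would use a FileChannel/mmap
        return data[(int) (offset - blockIndex * BLOCK_SIZE)];
    }

    // Download the 8 MiB part if it is not already cached locally.
    private Path ensureBlockOnDisk(long blockIndex) throws IOException {
        Path block = cacheDir.resolve(fileName + ".part" + blockIndex);
        if (Files.notExists(block)) {
            byte[] bytes = remote.fetchRange(fileName, blockIndex * BLOCK_SIZE, BLOCK_SIZE);
            Files.write(block, bytes);
            // The real implementation also registers the part with the file cache so it is
            // reference-counted while open and becomes evictable once no reader uses it.
        }
        return block;
    }

    // Minimal stand-in for the remote object store client.
    public interface RemoteStore {
        byte[] fetchRange(String file, long offset, long length) throws IOException;
    }
}

So the working set on disk should be the parts that are currently open for reading plus whatever the cache keeps around, not the whole remote index.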

andrross avatar Jan 11 '24 21:01 andrross

Thanks for the explanation. Would it be necessary to write those chunks to disk, or would it be possible to do this in memory and either discard the data immediately or write it to disk as a cache when it's a block that is used frequently? The current problem is that, from what I understand, when you do a search over a large dataset, a large portion of that dataset needs to be on primary storage.

For us, the point of using searchable snapshots like this is to save on storage costs (full flash vs S3 is about 20x). This would enable us to keep data available for 180 days (security forensics use case).

I'm happy to provide details or schedule a call to show you around the cluster.

sandervandegeijn avatar Jan 12 '24 11:01 sandervandegeijn

Okay, unfortunately I'm having to scrap the current setup. The search node problems cause too many interruptions; the cluster went down again because of it. I'll try to leave the data on the S3 side and maybe even take snapshots so I can restore everything in a second cluster to be able to help with debugging. But this is getting too cumbersome for production use, I'm afraid.

sandervandegeijn avatar Jan 14 '24 20:01 sandervandegeijn

The current problem is that - from what I understand - when you do a search over a large dataset a large portion of that dataset needs to be on primary storage.

@sandervandegeijn This should not be the case. With a very large data set and a relatively small primary storage cache, the expected behavior is that you might see a lot of thrashing - i.e. re-downloading of data that was evicted from primary storage - but that should manifest as slow queries and not cluster instability or disk exhaustion. Obviously something is not working correctly here though.

andrross avatar Jan 16 '24 21:01 andrross

Okay, that is what I would have expected as well. With a smaller cache, cache misses will be more frequent and more traffic will be generated. That's to be expected.

sandervandegeijn avatar Jan 17 '24 09:01 sandervandegeijn

I had to delete the data as well (for now). Do you have any means to reproduce this at scale?

sandervandegeijn avatar Jan 27 '24 11:01 sandervandegeijn

[Triage] @sandervandegeijn Thanks for filing this issue

peternied avatar Jan 31 '24 16:01 peternied

Is it possible to solve this in the 2.12 release?

sandervandegeijn avatar Feb 11 '24 09:02 sandervandegeijn

Guys, is there any chance this will be solved in 2.13? I can't use the feature right now because it makes the clusters unstable.

sandervandegeijn avatar Mar 07 '24 22:03 sandervandegeijn

Guys? :)

sandervandegeijn avatar Mar 22 '24 16:03 sandervandegeijn

@sandervandegeijn what do you think about initiating a draft pull request to tackle this challenge, or perhaps motivating someone directly impacted by it to contribute a pull request? I'd be happy to help review the changes.

peternied avatar Mar 29 '24 19:03 peternied

Afraid I'm not a very capable Java dev, so it's quite a task for me. Second excuse: I can't build OpenSearch on my ARM-based MacBook as far as I can tell (which is a requirement for the coming weeks; not taking the Lenovo P15 hernia-inducing brick to the campsite ;) )

sandervandegeijn avatar Mar 29 '24 20:03 sandervandegeijn

@sandervandegeijn Hmm I thought ARM was supported for core, but I'm a windows person that ssh's into linux machines 🤷. If you have time give it a try and feel free to create issues if it doesn't work. Maybe you know someone at your university that could take on the challenge while you are away. Enjoy camping in any case!

peternied avatar Mar 29 '24 21:03 peternied

Promise I'll give it a try, but I hope it won't come down to my Java skills to get this fixed for 2.14 ghehe.

sandervandegeijn avatar Mar 29 '24 21:03 sandervandegeijn

@anasalkouz can we please look into prioritizing the handling of disk issues when the disk cache is small compared to the remote data being addressed?

Pallavi-AWS avatar Apr 08 '24 18:04 Pallavi-AWS

First step here should be to replicate the problem by configuring a small disk cache relative to the remote data being addressed. We should see the cache being over-subscribed (and in the worst case it can result in completely filling the storage device containing the file cache). Once we have a reproduction in a controlled environment, the avenues to explore/consider that I can think of are:

  • Can we fail search requests in lieu of over-subscribing the local storage? This isn't a great user experience but it is much better than node/cluster instability.
  • Are we possibly leaking file references somewhere? We have a reference counting mechanism to determine if cached files are unused, but it is always possible there is a bug in that logic.
  • Can we be more aggressive about cleaning up unreferenced cache files? Because Lucene doesn't directly close IndexInput instances it creates via the clone() method, we rely on a GC cleaner to close those instances. That means that potentially unreferenced files may still be counted as referenced until the garbage collector decides to collect the instance. GC is generally pretty lazy about cleaning things up when it still has available heap, which could be an aggravating factor here (a rough illustration follows after this list).
  • The node stats output above shows "total_in_bytes": 0. That field should be the total configured size of the cache, so this certainly looks like a bug. While potentially innocuous, we should get a reproduction, fix this, and determine if it is in any way related to the larger problem observed here.
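
For the cleaner point in the third bullet, a rough sketch of why eviction can lag behind (made-up names, not the actual file cache code):

// Sketch: a cached part's reference count is only decremented when the GC
// eventually collects an IndexInput clone that was never closed, so files that
// are logically unused can still be counted as referenced (and unevictable).
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicInteger;

public class CleanerBackedRefCount {

    private static final Cleaner CLEANER = Cleaner.create();

    // Hypothetical reference-counted entry in the file cache.
    static class CachedPart {
        final AtomicInteger refCount = new AtomicInteger();
        void incRef() { refCount.incrementAndGet(); }
        void decRef() { refCount.decrementAndGet(); }
        boolean evictable() { return refCount.get() == 0; }
    }

    // Stand-in for a cloned reader that Lucene never closes explicitly.
    static class ClonedReader implements AutoCloseable {
        private final Cleaner.Cleanable cleanable;

        ClonedReader(CachedPart part) {
            part.incRef();
            // The cleaning action must not capture 'this', or the clone would never
            // become phantom-reachable and the decRef would never run.
            this.cleanable = CLEANER.register(this, part::decRef);
        }

        @Override
        public void close() { cleanable.clean(); } // decRef runs at most once
    }

    public static void main(String[] args) {
        CachedPart part = new CachedPart();
        new ClonedReader(part);                  // abandoned without close()
        System.out.println(part.evictable());    // false until the GC runs the cleaner
        System.gc();                             // only a hint; the cleaner may run much later
    }
}

Until the cleaner fires, the part stays pinned, which can make the cache look fuller than it logically is.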

Any other ideas are welcome as well.

andrross avatar Apr 11 '24 22:04 andrross

Thanks Andrew, good to have some avenues to explore. I'm busy building a formal test environment for our setup this week and next week (rebuilding the CI/CD as well, so it's going to take a little more effort than usual).

When that's done I can assist with testing without disrupting our production workflows. Don't hesitate to ask :)

sandervandegeijn avatar Apr 12 '24 09:04 sandervandegeijn

Thanks Andrew, looks like we have some steps/actions we can take to move this forward. Thanks Sander, I think having a testing environment will definitely help to easily reproduce and effectively test any fixes. Hopefully we can have someone look into the issue in the next couple of weeks.

anasalkouz avatar Apr 12 '24 18:04 anasalkouz

Hi guys, 2.14 is right around the corner. Any chance this will be resolved in 2.14? Thanks!

sandervandegeijn avatar Apr 27 '24 10:04 sandervandegeijn

(and in the worst case it can result in completely filling the storage device containing the file cache)

I have this issue with 2.12.

Here is the allocation report:

% curl logs:9200/_cat/allocation\?v
shards disk.indices disk.used disk.avail disk.total disk.percent host                        ip           node
   283      287.3gb     4.7gb     59.1gb     63.9gb            7 v87-search.co.com 10.139.20.28 v87.co.com
  2319        2.2tb     2.3tb      1.1tb      3.4tb           66 v81-data.co.com   10.139.20.11 v81-data.co.com
...

At the same time, the OS on the VM reports (it was filled 100% during the weekend's snapshot/restore cycle):

v87:[/opt/os-search]# df -h /mnt/search/
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvdc1       64G   60G  4.3G  94% /mnt/search

Here is the cache size limit:

v87:[/opt/os-search]# grep cache config/opensearch.yml 
node.search.cache.size: 20gb

P.S.: is there a way to format the console output properly?

rlevytskyi avatar Apr 29 '24 16:04 rlevytskyi

@andrross that looks like the underlying problem! Nice, is it easy to fix?

sandervandegeijn avatar May 06 '24 08:05 sandervandegeijn