Vespa content pod created snapshot folder owned by nobody user in a high RAM situation
Describe the bug

var/db/vespa/search/cluster.lexdoc/n0/documents/testDocument/0.ready/attribute/testField13_s_attr/snapshot-2298
  Size: 148       Blocks: 0          IO Block: 4096   directory
Device: fd09h/64777d  Inode: 447003  Links: 1
Access: (0755/drwxr-xr-x)  Uid: (65534/ nobody)   Gid: (65534/ nobody)
To Reproduce
Push the content node's RAM beyond its limits and you may be able to see this. Our RAM usage was just under 80% when we pushed a schema change converting numeric fields to non-paged attributes. This led to multiple hours of searchnode restarts: initially just a spike in RAM and a crash without any warning, and subsequently warnings like the ones below appeared during a few restarts. Finally we reached the state described above, where the Podman container itself crashed.
Expected behavior
We understand that RAM needs to be kept in check, but the last thing we expect is the creation of files owned by a user other than the vespa user.
Screenshots

1713931119.902770 vespa.content 10 - vespa-start-services warning Could not create a file in directory 'var/db/vespa/search/cluster.lexdoc/n0/documents/testDocument/0.ready/attribute/testField13_s_attr/snapshot-2298'
1713931119.902798 vespa.content 10 - vespa-start-services warning problem: open var/db/vespa/search/cluster.lexdoc/n0/documents/testDocument/0.ready/attribute/testField13_s_attr/snapshot-2298/tmp.probe.656871207.tmp: permission denied
1713931119.902802 vespa.content 10 - vespa-start-services warning Currently running as user: vespa
1713931119.902805 vespa.content 10 - vespa-start-services warning Wanted var/db/vespa/search/cluster.lexdoc/n0/documents/testDocument/0.ready/attribute/testField13_s_attr/snapshot-2298 to be owned by user id: 1000
1713931119.902807 vespa.content 10 - vespa-start-services warning Wanted var/db/vespa/search/cluster.lexdoc/n0/documents/testDocument/0.ready/attribute/testField13_s_attr/snapshot-2298 to have group id: 1000
1713931119.902811 vespa.content 10 - vespa-start-services warning Wanted var/db/vespa/search/cluster.lexdoc/n0/documents/testDocument/0.ready/attribute/testField13_s_attr/snapshot-2298 to have permissions: -rwxr-xr-x
1713931119.902813 vespa.content 10 - vespa-start-services warning current status of var/db/vespa/search/cluster.lexdoc/n0/documents/testDocument/0.ready/attribute/testField13_s_attr/snapshot-2298 is:
1713931119.904633 vespa.content 10 - vespa-start-services warning File: var/db/vespa/search/cluster.lexdoc/n0/documents/testDocument/0.ready/attribute/testField13_s_attr/snapshot-2298 Size: 148 Blocks: 0 IO Block: 4096 directory Device: fd09h/64777d Inode: 447003 Links: 1 Access: (0755/drwxr-xr-x) Uid: (65534/ nobody) Gid: (65534/ nobody) Access: 2024-04-23 22:18:09.444566411 +0000 Modify: 2024-04-23 22:18:09.429566725 +0000 Change: 2024-04-23 22:18:09.429566725 +0000 Birth: 2024-04-23 22:18:09.406567205 +0000
1713931119.904645 vespa.content 10 - vespa-start-services warning this is a fatal error! open var/db/vespa/search/cluster.lexdoc/n0/documents/testDocument/0.ready/attribute/testField13_s_attr/snapshot-2298/tmp.probe.656871207.tmp: permission denied
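For context on what these start-up warnings are checking: judging from the log, the service probes each data directory by trying to create a temporary file in it and then compares the directory's owner, group and permissions against what the vespa user (uid/gid 1000 here) should see. A minimal sketch of that kind of probe, purely for illustration and not Vespa's actual implementation (the expected uid/gid values are taken from the log above):

```c
/*
 * Minimal sketch of a directory write/ownership probe similar in spirit to
 * the vespa-start-services checks logged above. Not Vespa's actual code;
 * the expected uid/gid of 1000 (the vespa user) is taken from the log.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>

int main(int argc, char **argv) {
    const char *dir = argc > 1 ? argv[1] : ".";
    char probe[4096];

    /* 1. Can we create a file in the directory at all? */
    snprintf(probe, sizeof(probe), "%s/tmp.probe.XXXXXX", dir);
    int fd = mkstemp(probe);
    if (fd < 0) {
        perror("Could not create a file in directory");
    } else {
        close(fd);
        unlink(probe);
    }

    /* 2. Who actually owns the directory, and with which permissions? */
    struct stat st;
    if (stat(dir, &st) == 0) {
        printf("uid=%u gid=%u mode=%04o links=%lu\n",
               st.st_uid, st.st_gid, st.st_mode & 07777,
               (unsigned long) st.st_nlink);
        if (st.st_uid != 1000 || st.st_gid != 1000) {
            fprintf(stderr, "unexpected owner, wanted uid/gid 1000\n");
        }
    }
    return fd < 0 ? 1 : 0;
}
```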
Environment (please complete the following information):
- OS: RHEL8 v8
- Infrastructure: Podman 4.4.1
Vespa version 8.221.29
Additional context
Just before this problem occurred, node restarts were accompanied for several hours by warnings like these due to high RAM:
[2024-04-23 18:12:47.075] WARNING searchnode vespa-proton-bin Wanted 18446744073709551615 as limit for core file size but cannot exceed current hard limit: 0
[2024-04-23 18:18:18.792] WARNING searchnode proton.proton.server.disk_mem_usage_filter Write operations are now blocked: 'memoryLimitReached: { action: "add more content nodes", reason: "memory used (0.925613) > memory limit (0.9)", stats: { mapped: { virt: 52275617792, rss: 4988928}, anonymous: { virt: 58651811840, rss: 46603702272}, physicalMemory: 50349035520, memoryUsed: 0.925613, memoryLimit: 0.9}}'
[2024-04-24 03:17:59.062] WARNING searchnode proton.persistence.util Slow processing of message GetIterCommand(). Processing time: 154.5 s (>=5.0 s)
[2024-04-24 03:19:40.088] WARNING searchnode proton.deadlock.detector Thread filestormanager has gone 255035 milliseconds without registering a tick. Global slack not expended yet. Warning for now. Attempting to dump stack of thread...
Backtrace:
 /opt/vespa/lib64/libsearchlib.so(+0x77704e) [0x7f4d6010204e]
 /opt/vespa/sbin/vespa-proton-bin() [0x6a0d22]
 /opt/vespa/sbin/vespa-proton-bin(proton::DocumentRetriever::populate(unsigned int, document::Document&, document::Field::Set const&) const+0x91) [0x5f3281]
 /opt/vespa/sbin/vespa-proton-bin() [0x5f36ec]
 /opt/vespa/lib64/libsearchlib.so(+0x90150e) [0x7f4d6028c50e]
 /opt/vespa/lib64/libsearchlib.so(search::FileChunk::read(__gnu_cxx::__normal_iterator<search::LidInfoWithLid const*, std::vector<search::LidInfoWithLid, std::allocator<search::LidInfoWithLid> > >, unsigned long, search::FileChunk::ChunkInfo, search::IBufferVisitor&) const+0xf1) [0x7f4d60291591]
 /opt/vespa/lib64/libsearchlib.so(search::FileChunk::read(__gnu_cxx::__normal_iterator<search::LidInfoWithLid const*, std::vector<search::LidInfoWithLid, std::allocator<search::LidInfoWithLid> > >, unsigned long, search::IBufferVisitor&) const+0xa1) [0x7f4d602916f1]
 /opt/vespa/lib64/libsearchlib.so(search::LogDataStore::read(std::vector<unsigned int, std::allocator
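(For reference, the memoryUsed figure in the disk_mem_usage_filter warning above appears to be resident memory relative to physical memory: (46603702272 anonymous rss + 4988928 mapped rss) / 50349035520 bytes physical ≈ 0.926, just above the 0.9 memoryLimit, which is why proton started blocking write operations.)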
As far as I know, the search core does not set or modify the ownership of its data directories or files. However, is this by any chance running on an XFS filesystem? We have observed this exact scenario very sporadically on XFS during periods with lots of IO and file syncing, but only when the kernel file system buffer cache and/or directory entry cache is being concurrently flushed/evicted.
The observed result was a directory with partially bogus inode metadata, in particular an owner that does not make sense (we've seen uid/gid of both 65534 and 0, neither of which is correct). Also note the Links: 1 field: I would expect a correct directory inode to have at least 2 links, since a directory normally contains a hard link to itself (note that this does not apply to all file systems; Btrfs, for instance, seems to have just 1).
The scenario you describe is one where the kernel is under very high memory pressure and is likely to try to evict file system caches to free up memory for applications. If so, the preconditions are similar to what we have observed.
Note: this is inherently just a theory, and we have not had the time to do a deep dive into figuring this out (particularly since doing so would involve reading lots of kernel code). I've tried reproducing this issue by writing a program that stresses the kernel file system cache management through lots of concurrent nested directory creation, fsyncing and forced cache eviction, but never managed to trigger it.
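For anyone who wants to try something along the same lines, here is a rough, single-threaded sketch of that kind of stress loop. It is illustrative only, not the actual program referenced above, and the forced cache eviction via /proc/sys/vm/drop_caches requires root:

```c
/*
 * Rough, single-threaded sketch of a file system cache stress loop:
 * nested directory creation plus fsync, with periodic forced eviction of
 * the page/dentry/inode caches. Illustration only, not the reproduction
 * program mentioned above; writing to /proc/sys/vm/drop_caches needs root.
 */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

static void drop_caches(void) {
    /* Equivalent to: sync; echo 3 > /proc/sys/vm/drop_caches */
    sync();
    int fd = open("/proc/sys/vm/drop_caches", O_WRONLY);
    if (fd >= 0) {
        (void) write(fd, "3\n", 2);
        close(fd);
    }
}

int main(void) {
    char path[4096];
    for (int i = 0; i < 100000; i++) {
        /* Create a nested directory tree like stress.N/d0/d1/d2 */
        snprintf(path, sizeof(path), "stress.%d", i);
        mkdir(path, 0755);
        for (int depth = 0; depth < 3; depth++) {
            size_t len = strlen(path);
            snprintf(path + len, sizeof(path) - len, "/d%d", depth);
            mkdir(path, 0755);
        }
        /* fsync the deepest directory so its metadata is forced out */
        int fd = open(path, O_RDONLY | O_DIRECTORY);
        if (fd >= 0) {
            fsync(fd);
            close(fd);
        }
        /* Periodically force the kernel to evict file system caches */
        if (i % 512 == 0) {
            drop_caches();
        }
    }
    return 0;
}
```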
It is XFS indeed.
Entirely anecdotally, we have not observed this issue for some time on our systems. We do regular fleet-wide OS upgrades, so it's possible the underlying issue has been fixed in the upstream kernel.
What's the kernel version you are running? I.e. uname -rv
For instance, https://github.com/torvalds/linux/commit/28b4b0596343d19d140da059eee0e5c2b5328731 (which I have no idea whether is related, but which does sound strikingly familiar...) appears to be part of kernel v6.2 and beyond. After doing some digging, it looks like this particular fix was backported into RHEL8 kernel version 4.18.0-513.5.1.
Since this is assumed to be an in-kernel XFS bug and thus outside of Vespa's control, I'm marking this issue as closed. Hopefully the underlying bug has been fixed in newer versions of the kernel—there is at least some empirical evidence to suggest this.