
Endless loop in ImplicitProducer* get_or_add_implicit_producer()

oviano opened this issue on Mar 09, 2022 · 21 comments

I've been using this nice queue for a while now; one of my usages is as part of a log file writing system. I have a blocking concurrent queue where a thread waits for items before writing them to a text file.

I'm getting a fairly rare situation where about half a dozen or so "producer" threads all need to write at the same time, typically during a shutdown sequence, and they all seem to be stuck going round and round in this function.

Not sure how to debug this!

oviano avatar Mar 09 '22 20:03 oviano

Can you identify in which loop they are stuck exactly? Can you reproduce in isolation?

cameron314 avatar Mar 09 '22 20:03 cameron314

```cpp
assert(mainHash != nullptr);  // silence clang-tidy and MSVC warnings (hash cannot be null)
for (auto hash = mainHash; hash != nullptr; hash = hash->prev) {
    // Look for the id in this hash
    auto index = hashedId;
    while (true) {  // Not an infinite loop because at least one slot is free in the hash table
        index &= hash->capacity - 1;
```

Yes, it was basically looping around the while (true) loop shown above, with index going up each iteration and then wrapping back to 1 again.

I haven't yet attempted to reproduce it in isolation; I will try that tomorrow. I've been chasing a bug in my project where it enters a stuck/high-CPU state, and I had left three instances running in the VS debugger; I literally only caught it this evening for the first time.

oviano avatar Mar 09 '22 20:03 oviano

Managed to trap it again; it loops endlessly because the condition (probedKey == id) is never met for it to return, and neither is (probedKey == details::invalid_thread_id) for it to break.

details::invalid_thread_id is defined as zero; this is on Windows.

The contents of hash->entries[0...31] are mostly entries with key = 0xFFFFFFFFU, plus a few valid entries, but none where key == id.
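
To illustrate, here is a simplified standalone model of the probe loop described above (not the library's actual code), assuming, as the dump suggests, that 0xFFFFFFFFU corresponds to the "reusable" key invalid_thread_id2: a table containing only reused slots and other threads' ids satisfies neither exit condition, so the probe never terminates.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

using thread_id_t = std::uint32_t;
constexpr thread_id_t invalid_thread_id  = 0;           // empty slot
constexpr thread_id_t invalid_thread_id2 = 0xFFFFFFFFu; // reusable (tombstoned) slot

struct Entry { std::atomic<thread_id_t> key; void* value; };

void* probe(Entry* entries, std::size_t capacity, thread_id_t id, std::size_t hashedId) {
    auto index = hashedId;
    while (true) {
        index &= capacity - 1;
        auto probedKey = entries[index].key.load(std::memory_order_relaxed);
        if (probedKey == id) {
            return entries[index].value;  // found this thread's producer
        }
        if (probedKey == invalid_thread_id) {
            return nullptr;               // truly empty slot: id is not in this table
        }
        // A reusable slot (invalid_thread_id2) matches neither condition, so the
        // probe keeps going; if no empty slot remains, it spins forever.
        ++index;
    }
}

int main() {
    Entry table[4] = {};          // all keys are invalid_thread_id (0): probe terminates at once
    probe(table, 4, 1234, 1234);

    for (auto& e : table) e.key = invalid_thread_id2;  // every slot tombstoned, none empty
    // probe(table, 4, 1234, 1234);  // would now loop forever, matching what I'm seeing
}
```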

oviano avatar Mar 09 '22 22:03 oviano

Thanks, I'll have a look soon.

cameron314 avatar Mar 09 '22 22:03 cameron314

In the last case I encountered (above), there was actually only one thread looping in this manner. There were no other threads referencing the queue apart from the consumer, which was sitting waiting to dequeue.

I'm not too sure how it works, but it seems that either the thread never had a producer hash entry created, or something (another thread?) invalidated it when it shouldn't have.

oviano avatar Mar 10 '22 06:03 oviano

Is it conceivable that the thread exit handler code is clearing down the hash entry after a new thread has started and is already re-using the same thread id/hash?

This seems to happen when threads are stopped and new threads immediately created.

I will disable the thread-local stuff and see if the problem automagically goes away…

oviano avatar Mar 10 '22 08:03 oviano

Apologies for the running commentary - I added some trace to log the thread ID when the thread is subscribed to the thread exit listener, and when the exit handler is called.

What I am finding is that the threads that are stuck in this loop were seemingly never given a producer.

What seems to have happened is that all the threads that were stopped earlier have left their producers in the reusable (invalid_thread_id2) state, and so there is no available empty slot (invalid_thread_id) left to allow the loop to break.

Does that make sense as an explanation?
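
For what it's worth, the usage pattern in my app looks roughly like the sketch below: waves of short-lived threads that each enqueue and then exit, leaving their implicit-producer slots in the reusable state. (This is only a hypothetical illustration of the pattern, not a guaranteed repro; the round and thread counts are arbitrary.)

```cpp
#include "concurrentqueue.h"

#include <thread>
#include <vector>

int main() {
    moodycamel::ConcurrentQueue<int> q;

    // Repeated waves of short-lived producer threads, as in a logging setup
    // where threads come and go during execution.
    for (int round = 0; round < 10000; ++round) {
        std::vector<std::thread> producers;
        for (int i = 0; i < 8; ++i) {
            producers.emplace_back([&q, i] { q.enqueue(i); });  // each thread registers an implicit producer
        }
        for (auto& t : producers) {
            t.join();  // on thread exit the slot becomes "reusable", not empty
        }

        int item;
        while (q.try_dequeue(item)) {}  // drain on the consumer side
    }
}
```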

oviano avatar Mar 10 '22 10:03 oviano

Thanks for looking into this. That makes sense, and is looking increasingly like a bug! I'll take a look at lunch.

cameron314 avatar Mar 10 '22 12:03 cameron314

It's possibly because I'm using it in a way that is less conventional, i.e. as part of a logging system, where it's normal for threads that come and go during execution to want to write logging items to the queue.

Maybe that's different from the envisaged use of long-running threads in a more stable producer -> consumer model.

oviano avatar Mar 10 '22 12:03 oviano

Nah, it seems to be a legitimate bug. I'll try to work out a fix, but in the meantime you can comment out the define of MOODYCAMEL_CPP11_THREAD_LOCAL_SUPPORTED in the header.

Thanks for finding this!

cameron314 avatar Mar 10 '22 12:03 cameron314

The implicit producer hash table should always have a load factor of no more than 75%. The infinite loop you see indicates it reached 100%.

invalid_thread_id is used to indicate an empty slot, and invalid_thread_id2 is used as a tombstone for a removed value (and may be later overwritten similar to an empty slot). The count is not decremented when a tombstone is written, so there should always be at least one truly free slot in the hash at all times, preventing that infinite loop.

However, the count was also not being incremented when a free (not reused!) slot was taken for a reused producer, which ultimately resulted in the count not reflecting the reality of the actual hash entries, causing it to fill up to 100% capacity in pathological cases. This should now be fixed.
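
Roughly speaking, the invariant the fix restores looks like the sketch below (illustrative only, not the library's actual code; the Slot/Hash names are made up): the occupancy count must be bumped whenever a truly empty slot is consumed, even if the producer object itself is being recycled, so that at least one empty slot always remains for probes to terminate on.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

using thread_id_t = std::uint32_t;
constexpr thread_id_t invalid_thread_id  = 0;           // empty slot
constexpr thread_id_t invalid_thread_id2 = 0xFFFFFFFFu; // tombstone (reusable slot)

struct Slot {
    std::atomic<thread_id_t> key{invalid_thread_id};
    void* value = nullptr;
};

struct Hash {
    Slot* entries;
    std::size_t capacity;
    std::atomic<std::size_t> count{0};  // occupied + tombstoned slots

    bool insert(thread_id_t id, void* producer, std::size_t hashedId) {
        // Keep the load factor at or below 75% so probes always terminate.
        if (count.load() + 1 > (capacity * 3) / 4) {
            return false;  // the real code would allocate a bigger hash and rehash
        }
        auto index = hashedId;
        while (true) {
            index &= capacity - 1;
            auto key = entries[index].key.load(std::memory_order_relaxed);
            if (key == invalid_thread_id || key == invalid_thread_id2) {
                if (key == invalid_thread_id) {
                    // The fix: consuming a truly empty slot must bump the count,
                    // even when the producer object itself is being reused.
                    ++count;
                }
                entries[index].key.store(id, std::memory_order_relaxed);
                entries[index].value = producer;
                return true;
            }
            ++index;  // linear probing
        }
    }
};

int main() {
    constexpr std::size_t cap = 8;
    Slot slots[cap];
    Hash h{slots, cap};

    h.insert(42, nullptr, 42);                        // fresh slot taken: count -> 1
    slots[42 & (cap - 1)].key = invalid_thread_id2;   // producer destroyed: tombstone, count stays 1
    h.insert(43, nullptr, 42);                        // tombstone reused: count still 1
}
```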

I also fixed an (unlikely) race during insertion (and replacement with the tombstone during deletion) where stale keys could be read, causing the loop to iterate past a usable value (during insertion) or overwrite a valid value (during deletion). This race could lead to a similar violation of hash invariants.

cameron314 avatar Mar 11 '22 04:03 cameron314

Let me know if this resolves the issue you're seeing. Thanks for your help investigating this (I cannot reproduce locally).

cameron314 avatar Mar 11 '22 04:03 cameron314

Thanks. I just had a go at reproducing the issue again with the new code and I wasn't able to reproduce it, so it looks like it's fixed.

I will report back again if I get any more issues!

oviano avatar Mar 11 '22 08:03 oviano

So a different issue but perhaps related to this area/your changes:

[Screenshot attached: "Screenshot 2022-03-12 at 15 01 13"]

oviano avatar Mar 12 '22 15:03 oviano

Hmm. Which VS version? Can you reproduce in isolation?

cameron314 avatar Mar 12 '22 15:03 cameron314

VS2022.

I will try and make it happen again.

Do you have tests that create and delete threads randomly and use this queue? My code isn't really doing this on a grand scale. It's a video streaming server with just one or two viewers: someone connects and it fires up a bunch of threads for various purposes, such as capturing, decoding, and sending the stream, and then when the connection ends the threads exit.

Unless it's exceedingly rare and I was very (un)lucky, I would think it ought to appear in such a test scenario.

oviano avatar Mar 12 '22 15:03 oviano

Ah, I think I found the issue. Upon destruction, the implicit producer attempts to deregister itself from the TLS-based registry, but the destructor usually runs on a different thread, so it's not actually deregistered, leaving freed memory in the linked list, which later causes a crash.
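
Conceptually, the failure mode is something like the sketch below (a hypothetical illustration, not the library's actual registry code): a registry kept in thread_local state only sees entries subscribed by the calling thread, so deregistering from a different thread silently does nothing and leaves a dangling pointer behind.

```cpp
#include <chrono>
#include <thread>

struct Listener { Listener* next = nullptr; };

struct Notifier {
    Listener* head = nullptr;
    ~Notifier() {
        // Runs when the owning thread exits: walks whatever is still linked in,
        // which is a use-after-free if a listener was deleted without being unlinked.
        for (Listener* l = head; l != nullptr; l = l->next) { /* notify l */ }
    }
};

thread_local Notifier tls_notifier;  // one registry per thread

void subscribe(Listener* l) {
    l->next = tls_notifier.head;
    tls_notifier.head = l;
}

void unsubscribe(Listener* target) {
    // Only searches the *calling* thread's registry; if target was subscribed
    // on a different thread, nothing is unlinked.
    for (Listener** p = &tls_notifier.head; *p != nullptr; p = &(*p)->next) {
        if (*p == target) { *p = target->next; return; }
    }
}

int main() {
    auto* l = new Listener{};
    std::thread producer([&] {
        subscribe(l);                                          // registered in this thread's registry
        std::this_thread::sleep_for(std::chrono::seconds(1));  // still running when l is freed below
    });
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    unsubscribe(l);   // called on the main thread: wrong registry, unlinks nothing
    delete l;         // the producer thread's registry now holds a dangling pointer
    producer.join();  // producer exits; its Notifier destructor walks the list -> crash
}
```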

cameron314 avatar Mar 12 '22 19:03 cameron314

Right - I was actually trying to figure out if it was something along those lines but hadn’t proved it yet.

Then I tried reproducing it again to get more hints but couldn’t get it happening again.

Certainly in my use case the thread that creates the queue is often different from the one that is the producer.

PS: looking at the history of this, it seems you've had quite a fight with TLS over the years…

oviano avatar Mar 12 '22 19:03 oviano

Yeah, it seems this code was less well tested than I remembered :-/ I only recently re-enabled MOODYCAMEL_CPP11_THREAD_LOCAL_SUPPORTED by default which is why you're hitting these issues now. So, thanks for being the inadvertent guinea pig!

I'll build out a fix for this issue soon.

cameron314 avatar Mar 12 '22 19:03 cameron314

> Yeah, it seems this code was less well tested than I remembered :-/ I only recently re-enabled MOODYCAMEL_CPP11_THREAD_LOCAL_SUPPORTED by default which is why you're hitting these issues now. So, thanks for being the inadvertent guinea pig!
>
> I'll build out a fix for this issue soon.

It's no problem, I'm pleased to help.

It's a really nice implementation, and there aren't too many cross-platform queues like this, so being able to drop it into places where I was previously using std::mutex + std::condition_variable + std::queue/list/etc. is really useful.

oviano avatar Mar 12 '22 19:03 oviano

Should be fixed. Let me know if you come across any other issues.

cameron314 avatar Mar 13 '22 14:03 cameron314

Can you merge these changes into master? I've encountered similar symptoms and would prefer to use a release.

rob-stewart avatar Jun 12 '23 12:06 rob-stewart

> Can you merge these changes into master? I've encountered similar symptoms and would prefer to use a release.

They are already in master. Created a new release (v1.0.4). Closing this issue for now as no further problems have been reported.

cameron314 avatar Jun 19 '23 20:06 cameron314