glorp Suggestion for fixing 'Instance of DescriptorSystem did not understand #inflector' error

In attempt to establish connection to the database, when sessionForLogin message is sent to DescriptorSystem class, following error occurred: 'Instance of DescriptorSystem did not understand #inflector'. The cause of this error is sending inflector message inside joinFor: toTables: fromConstraints: toConstraints: method to DescriptorSystem itself despite the fact that inflector message is not implemented in DescriptorSystem. After further browsing the code I realised that inflector is present inside ActiveRecordDescriptorSystem (the subclass of DescriptorSystem) as instance variable (with getter and setter) so I decided to move it to the instance variables of DescriptorSystem, create getter and setter and instantiate it inside initialization method (inflector := ActiveRecordInflector new.). After that, everything worked fine. I thought that this information may be valuable for you either as fix for the described problem or a hint for some better solution. :)

Nov 16 '22 20:11 matija-mijalkovic

@Wyverald what do you think about this alternative proposal? I essentially took your "use a Thread" change and ran with it -- it turned out to be much simpler than any of the other alternatives and passes the

for i in $(seq 1 1000); do echo RUN $i; ghbazel build --noenable_bzlmod configure_with_bazel_transitive:all; done

test you used as a canary. It also passes a synthetic test that makes Bazel fetch 100 Starlark repositories in parallel then making one of them fail, but that also passes at HEAD so I'm not sure if that one is a good benchmark.

I made a number of non-obvious changes:

I made it so that when cleaning up the SkyFunctionState due to memory pressure, the worker thread is joined. This is so that the state space is a bit less complicated and also so that we don't leave resources (a virtual thread) hanging around. This made it possible to simplify the handling of InterruptedException since it's now impossible that the worker thread returns one to the Skyframe thread under normal circumstances (the carrier thread can still be SIGINT'ed, I think, so we can't ignore that possibility)
When creating a new state, I now pass the SkyFunction.Environment into it in the same way as for a restart. This makes it easy to handle InterruptedException in the Skyframe thread during that BlockingQueue.put() call.

I also have a few questions:

Is SkyKeyComputeState.close() called by Skyframe in any other circumstance than memory pressure? I'm not sure if it would make a difference since it's idempotent and I like being able to look at StarlarkRepositoryFunction.compute() and understand all the possible states that SkyFunction can be in, but still, I'm curious.
Why is RepoFetchingSkyKeyComputeState.workedThread volatile? My understanding is that it's accessed only from close(), join() and when the object is created, neither of which should interfere with the other.

Apr 26 '24 08:04 lberki

This one still has one deadlock (that I know of, anyway) which is shared by every alternative: if the worker thread gets hit by a SIGINT in delegateEnvQueue.take() in signalForFreshEnv(), it will go to the putInterruptibly() call in catch (Throwable e) in the lambda that is passed to Thread.start(). In that case, the worker thread will hang in signalQueue.put(), the Skyfram thread, in delegateEnvQueue.put(). Let's see if I can fix that...

Apr 26 '24 09:04 lberki

This alternative is free of any deadlocks I know of.

The reason why I stopped using SynchronousQueue is that we'd need two of them and I think that doesn't work: in signalForFreshEnv(), for example, an interrupt can arrive either in .put() or in .take(), but when it does, the next synchronization call will be .put() to tell the Skyframe thread about the interrupt. So the Skyframe thread would need to be prepared to handle both cases, but it can't wait on two SynchronousQueues at the same time. Using only one synchronization object avoid that problem.

WDYT?

Apr 26 '24 10:04 lberki

Is SkyKeyComputeState.close() called by Skyframe in any other circumstance than memory pressure? I'm not sure if it would make a difference since it's idempotent and I like being able to look at StarlarkRepositoryFunction.compute() and understand all the possible states that SkyFunction can be in, but still, I'm curious.

I don't think so, but I'm not 100% sure about anything at this point...

Why is RepoFetchingSkyKeyComputeState.workedThread volatile? My understanding is that it's accessed only from close(), join() and when the object is created, neither of which should interfere with the other.

The comment there explains it. close() can technically be called by multiple things at the same time (high memory pressure watcher & host thread, for example), and we use volatile to make sure there are no races between threads setting null.

Apr 29 '24 02:04 Wyverald

Is SkyKeyComputeState.close() called by Skyframe in any other circumstance than memory pressure? I'm not sure if it would make a difference since it's idempotent and I like being able to look at StarlarkRepositoryFunction.compute() and understand all the possible states that SkyFunction can be in, but still, I'm curious.

I don't think so, but I'm not 100% sure about anything at this point...

Argh, it looks like we do invalidate stateCache from UnnecessaryTemporaryStateDropper which happens in response to JMX events, etc., which I mentally translate it to "whenever and on whatever thread the JVM damn well pleases", which doesn't translate well with blocking in close(), but that's the lesser evil than letting side effects run amok...

Why is RepoFetchingSkyKeyComputeState.workedThread volatile? My understanding is that it's accessed only from close(), join() and when the object is created, neither of which should interfere with the other.

The comment there explains it. close() can technically be called by multiple things at the same time (high memory pressure watcher & host thread, for example), and we use volatile to make sure there are no races between threads setting null.

But then it should be an AtomicReference and we should use getAndSet(), right? volatile does not guarantee that two threads don't read the same value. Technically, Future.cancel() is idempotent, but I'd much rather be obviously correct than relying on guarantees other code promises (but not necessarily holds...)

Apr 29 '24 07:04 lberki

Update: I tried to use AtomicReference for state.workerThread and failed at the last step -- it turns out that with test_keep_going_weird_deadlock(), we have a case where a single state instance can have multiple worker threads and I couldn't get that case to work with AtomicReference in a way that was obviously correct.

I think it still all works out because Thread.join() is idempotent (you can join a single thread multiple times just fine), but it makes me a bit less comfortable. However, it looks like we want to handle that case cleanly, it's better done in a separate change; this one is complicated enough as it is.

Apr 29 '24 08:04 lberki

Discarding in favor of #22215 and #22100.

May 03 '24 06:05 lberki