Fix hanging after exit is sent #4152
Trying to fix #4152
The reactor thread offers a db connection to hiedb; when the reactor thread exits, the db connection goes with it. Operations that need the db connection should therefore only run while the reactor thread is still running.
Otherwise a race condition can happen:
The shutdown handler stops the reactor thread and cancels the shake session.
If an operation then still tries to reach the db (in this case, the config update callback calls setSomethingModified, which needs the db connection), it hangs HLS forever.
This PR ships the config update callback to the reactor thread to fix the deadlock.
An alternative would be to make the shutdown handler take away the ideStateVar.
We need to send the config update to the reactor thread to avoid the race condition.
In more detail: setSomethingModified should run on the reactor thread, since the WithHieDb and IndexQueue offered by runWithDb are disconnected once the reactor thread shuts down.
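A rough sketch of the shape of the change (the queue type and names here are illustrative, not the actual HLS ones): the LSP-side handler only enqueues the work, and the reactor thread runs it while it is still alive.

```haskell
import Control.Concurrent.STM
import Control.Monad (forever, join)

-- hypothetical stand-in for the reactor thread's message queue
type ReactorQueue = TQueue (IO ())

-- the config handler no longer touches the db itself; it ships the
-- work off to the reactor thread instead
shipToReactor :: ReactorQueue -> IO () -> IO ()
shipToReactor queue work = atomically $ writeTQueue queue work

-- the reactor thread drains the queue; once it is cancelled, nothing
-- that needs the db connection can run any more
reactorLoop :: ReactorQueue -> IO ()
reactorLoop queue = forever $ join $ atomically $ readTQueue queue
```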
I don't understand how the fix works (will leave review to people who understand it better) but great that you managed to resolve it. Do you have any estimate how much time this saves across one test suite run?
> Do you have any estimate how much time this saves across one test suite run?
I have not measured, but I don't think it saves much: only about 3 seconds each time a server tries to close while a config update is going on. Most tests won't update the config, and there is usually enough time before we exit for the initial config to load. So rather than the initial motivation of speeding up the test suite, this PR is really a bug fix that prevents a potential deadlock during the shutdown and exit of HLS.
Hmm, this is tricky. I'm not sure this is quite the right fix. Can we instead make it so that we throw in this case instead of hanging?
Alternatively we could make the shutdown handler take away the ideStateVar to prevent the race condition. This would not change the thread the config update callback runs on, since we have only one user of this ideStateVar. Do you think the alternative would be a better approach? @michaelpj https://github.com/soulomoon/haskell-language-server/pull/5
I think we should fix this race condition one way or the other.
I think that's a better approach. But I think maybe we can work at a lower level of granularity? Basically, I think the pattern we want is something like this:
- During shutdown, various resources or capabilities may become unavailable
- If we try to access those resources or capabilities then we should fail instead of getting stuck
In this case I think you've narrowed it down to the hiedb connection, so I think what we want is something like:
- During shutdown, the database connection will become unavailable
- If we try to access the database connection after this, we should fail
What you're suggesting with the ideVar is one possible way of doing this: we remove the handle that our code uses to find the resources. I see two problems with it:
- I'm sure in many other places we use `readMVar`. Now these will hang if they get triggered during shutdown. If they would all throw then I would be okay with it.
- This prevents things from working that could work. Maybe that doesn't matter but I could imagine it causing other issues (e.g. what if we had code to flush the values of persistent rules at shutdown and now it can't access the rule system?)
So I think it would be best to try and localise the guard more. Ideally we would just make the thing that currently hangs fail instead. I don't know exactly where that is, but maybe you do?
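To illustrate, here's roughly the kind of localised guard I have in mind (all the names below are made up, this is just a sketch): the shutdown path invalidates the handle, and any later access throws instead of blocking.

```haskell
{-# LANGUAGE LambdaCase #-}
import Control.Concurrent.STM
import Control.Exception

data DbUnavailable = DbUnavailable deriving Show
instance Exception DbUnavailable

type Connection = String            -- stand-in for the real hiedb connection
newtype DbHandle = DbHandle (TVar (Maybe Connection))

-- accessing the db after shutdown fails fast instead of hanging
withDb :: DbHandle -> (Connection -> IO a) -> IO a
withDb (DbHandle var) k =
  readTVarIO var >>= \case
    Nothing   -> throwIO DbUnavailable
    Just conn -> k conn

-- the shutdown handler invalidates the handle
shutdownDb :: DbHandle -> IO ()
shutdownDb (DbHandle var) = atomically $ writeTVar var Nothing
```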
> - During shutdown, the database connection will become unavailable
> - If we try to access the database connection after this, we should fail.
Yes, I agree about this.
In summary, we actually have two problems to solve here:
- First problem: access to the database connection should fail if it is closed, instead of hanging. (This is rooted in `runWithDb` and can be fixed there? Unfortunately I am not familiar with it; maybe @wz1000 or @jhrcek are.)
- Second problem: the config update callback should not be invoked after the shutdown handler is called (it is not only trying to access the db connection; it would also try to restart the shake session).
> - I'm sure in many other places we use `readMVar`. Now these will hang if they get triggered during shutdown. If they would all throw then I would be okay with it.
> - This prevents things from working that could work. Maybe that doesn't matter but I could imagine it causing other issues (e.g. what if we had code to flush the values of persistent rules at shutdown and now it can't access the rule system?)
I have updated this branch with the guard fix.
IMO this resolves your concerns, @michaelpj:
Right now `ideStateVar` has only one user after the `ideState` value is put into it: the config update callback.
- Currently, `ideStateVar` is used to guard against the config update callback running before initialization (and it provides access to `ideState`).

After our guard fix, we would have a second user: the shutdown handler. So there would be a second role for `ideStateVar`:

- Guard against the config update callback running after shutdown.

After writing the above, I believe `ideStateVar` is an ideal guard for solving the second problem.
> Access to the database connection should fail if it is closed, instead of hanging.
Did you manage to figure out exactly which function is hanging?
> The config update callback should not be invoked after the shutdown handler is called (it is not only trying to access the db connection; it would also try to restart the shake session).
Again, I'm not sure this is necessarily a problem. What happens if we restart the shake session after it is closed? It should fail! So I think it's fine if we allow "higher-level" things to run, so long as the "lower-level" things that can't work fail appropriately.
> So there would be a second role for `ideStateVar`.
... and now this is an invariant we have to remember to maintain whenever we work with `ideStateVar`! In that way it makes our lives more complicated, I feel.
> Again, I'm not sure this is necessarily a problem. What happens if we restart the shake session after it is closed? It should fail! So I think it's fine if we allow "higher-level" things to run, so long as the "lower-level" things that can't work fail appropriately.
🤔 If we only make the "lower-level" things fail instead of hang, the "higher-level" thing (the main LSP thread handling the callback) would still fail. We would be left with an error like this:
```
Test suite hls-refactor-plugin-tests: RUNNING...
refactor
  initialize response capabilities
    code action: FAIL
      Exception: onConfigChange: db already shutdown
```
> Did you manage to figure out exactly which function is hanging?
The exact chain that is hanging is `onConfigChange -> setSomethingModified -> writeTQueue`. `runWithDb` offers two variables, `WithHieDbShield withHieDb` and `hieChan`, to the outside world; if the thread running `runWithDb` is done, writing to `hieChan` hangs (so, strictly speaking, it is not access to the database connection itself).
~~I don't understand what is going on underneath the `writeTQueue`~~
Err... the truth is even wilder than I imagined: the reactor thread does not even have time to run `putMVar dbMVar (WithHieDbShield withHieDb', hieChan')` before it is cancelled by the shutdown handler.
And restarting the shake session hangs too, just not as commonly.
I am creating a minimal example to see if I can reproduce the hanging.
This is a minimal example of the hanging:
```haskell
{-# LANGUAGE LambdaCase #-}

import Control.Concurrent.STM
import Control.Monad (forever)
import Control.Monad.IO.Unlift (MonadUnliftIO)
import Data.Functor (void)
import GHC.IO (unsafeInterleaveIO)
import System.Timeout (timeout)
import UnliftIO (async, waitAnyCancel)
import UnliftIO.Concurrent

type Que = TQueue String

-- stand-in for the real runWithDb: create the "db" queue and hand it
-- to the continuation
runWithDb :: (Que -> IO ()) -> IO ()
runWithDb k = do
  q <- newTQueueIO
  k q

main :: IO ()
main = do
  queVar <- newEmptyMVar
  lifetime <- newEmptyMVar
  die <- newEmptyMVar
  mainQue <- newTQueueIO
  -- the lazy takeMVar: it only runs when `que` is forced
  ~que <- unsafeInterleaveIO $ takeMVar queVar
  -- the "reactor thread": delivers the queue, then serves messages
  _ <- flip forkFinally (const $ pure ()) $
    untilMVar lifetime $ runWithDb $ \q -> do
      threadDelay 1000000
      putMVar queVar q
      forever $ do
        putStrLn "Waiting for message"
        msg <- atomically $ readTQueue mainQue
        putStrLn $ "Received: " ++ msg
  -- the "shutdown handler": kills the reactor thread before it can
  -- fill queVar
  putMVar lifetime ()
  _ <- forkIO $ do
    -- forcing `que` runs the delayed takeMVar, which blocks forever
    x <- atomically $ readTQueue que
    putStrLn $ "Received: " ++ x
    putMVar die ()
  timeout 3000000 (takeMVar die) >>= \case
    Just _  -> putStrLn "Exiting"
    Nothing -> putStrLn "Timeout in 3 s"
  return ()

untilMVar :: MonadUnliftIO m => MVar () -> m () -> m ()
untilMVar mvar io = void $
  waitAnyCancel =<< traverse async [io, readMVar mvar]
```
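If I run this, `putMVar lifetime ()` cancels the "reactor" before the one-second `threadDelay` finishes, so `queVar` is never filled; forcing `que` then blocks forever on the delayed `takeMVar`, and the program only prints:

```
Timeout in 3 s
```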
> We would be left with an error like this:
Sure, but then we can work from there, e.g. we can ask: should we be catching exceptions during shutdown, or in the config handler, or whatever. At least the problem is not masked!
> if the thread running `runWithDb` is done, writing to `hieChan` hangs (so, strictly speaking, it is not access to the database connection itself).
Huh, that is weird. AFAIK writing to a TChan shouldn't hang, ever. I'll take a look at your reproduction code...
I also note that there's some stuff in there about `retryIfSqliteBusy` that looks like it might retry for up to 10 seconds, which might be a problem?
> AFAIK writing to a TChan shouldn't hang, ever. I'll take a look at your reproduction code...
You are right, writing to the TChan won't hang if the TChan itself is already computed.
We have a very interesting situation here:
The problematic part is `~que <- unsafeInterleaveIO $ takeMVar queVar`: it delays the computation of `que`.
`x <- atomically $ readTQueue que` hangs when it tries to actually compute `que` before using it.
> when it tries to actually compute `que` before using it.
I also am not sure how that can happen! Surely creating the queue and then writing to it should be fine?
Notice that the computation of `que` here is just `takeMVar queVar`.
`queVar` might never be filled with the queue value if the thread that is supposed to fill it exits early.
I am not sure why the `unsafeInterleaveIO` exists in the first place; this delay is causing a lot of confusion. 🤔
AFAIK it should be fine if we remove `unsafeInterleaveIO`, like this: https://github.com/haskell/haskell-language-server/pull/4153/commits/61b4808fdf211c67447435caaf6777a10bd7512e
Then the main thread actually waits during initialization for the reactor thread to create and deliver the queue.
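Relative to the minimal example above, a sketch of the fixed ordering: initialization blocks until the queue has actually been delivered, so no lazy thunk can smuggle a `takeMVar` into a later callback.

```haskell
import Control.Concurrent
import Control.Concurrent.STM

main :: IO ()
main = do
  queVar <- newEmptyMVar
  _ <- forkIO $ do
    q <- newTQueueIO :: IO (TQueue String)
    putMVar queVar q            -- the "reactor" delivers the queue
  que <- takeMVar queVar        -- initialization waits right here
  atomically $ writeTQueue que "hello"
  putStrLn =<< atomically (readTQueue que)
```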
Yeah, that's super weird and I don't know why we do it like that. If it works the other way that looks reasonable to me, what do you think @wz1000 ?
In general it seems like anywhere we have a `takeMVar` we have a potential hang.
The other question of course is why the hanging thread doesn't get killed. I think it's probably because we process notifications synchronously, so if we get stuck processing a notification we will never see the exit notification that comes in. Tricky!
I think we could also make this simpler on the lsp side. The spec says that after shutdown we're allowed to ignore notifications and should error requests, so lsp could keep track of whether we have been shut down, and if so block all further requests and notifications. That would stop us getting into this situation.
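A sketch of what that could look like in lsp's message loop (hypothetical names, not the real lsp internals): remember whether shutdown has been received, and drop everything except `exit` that arrives afterwards.

```haskell
import Control.Concurrent.STM

-- hypothetical message type and dispatch loop, not the real lsp ones
data Message = ShutdownRequest | ExitNotification | OtherMessage String

processMessage :: TVar Bool -> (Message -> IO ()) -> Message -> IO ()
processMessage shutdownReceived dispatch msg = do
  down <- readTVarIO shutdownReceived
  case msg of
    ShutdownRequest -> do
      atomically $ writeTVar shutdownReceived True
      dispatch msg
    ExitNotification -> dispatch msg    -- exit must still get through
    _ | down        -> putStrLn "dropping message received after shutdown"
      | otherwise   -> dispatch msg
```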
@michaelpj I see this was talked about in the biweekly meeting; is there anything new you have come up with?
That was just about the generic shutdown stuff. I would like to understand why there is a hang here. But it seems like at least one thing we should try is your commit that gets rid of the weirdness. Does that pass the tests? And does it solve the problem?
(I still think a comment from @wz1000 would be helpful here.)
Yes, it passes the test and solves the problem.
Simply put, it hangs because we are trying to `takeMVar` the `queVar` (it should contain the db's write channel), but `queVar` is empty (the reactor thread was shut down before it put the write channel into `queVar`).
What's worse, this `takeMVar` happens in the update callback (not at the call site of `takeMVar`), because we use `unsafeInterleaveIO` to delay the `takeMVar` IO operation. That is why it is so confusing.
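For anyone else confused by this, a tiny standalone demonstration of why the blocking shows up at the forcing site: `unsafeInterleaveIO` defers the IO action until its result is demanded.

```haskell
import System.IO.Unsafe (unsafeInterleaveIO)

main :: IO ()
main = do
  v <- unsafeInterleaveIO (putStrLn "delayed IO runs now" >> pure (42 :: Int))
  putStrLn "the delayed IO has not run yet"
  print v   -- forcing v is what finally runs the delayed IO
```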
So two things have been done here so far:
- The update callback won't run after the shutdown.
- Removal of `unsafeInterleaveIO`: initialization now waits until the reactor thread (the one that creates the db connection) has put the db's write channel into `queVar`.

The main concern is that the removal of `unsafeInterleaveIO` might break some invariant that we do not see?
I agree a comment from @wz1000 would be helpful here.
Also, I would like to know whether there would be unwanted side effects from shipping the update callback to run on the reactor thread.
What happens if we just do the second bit? I would still prefer to avoid complicating the config update callback if we can
> What happens if we just do the second bit? I would still prefer to avoid complicating the config update callback if we can
If only the second bit is done, there are still occasional hangs. It is somewhere inside `shakeRestart` (in the process of updating the config); I still could not figure out exactly where those hangs are.
IMO, doing a `shakeRestart` after the shutdown handler has run is a no-go. We should avoid it altogether; that is why we need the first bit.
> If a server receives requests after a shutdown request those requests should error with `InvalidRequest`.

[from the LSP specification]

> I think we could also make this simpler on the lsp side. The spec says that after shutdown we're allowed to ignore notifications and should error requests, so lsp could keep track of whether we have been shut down, and if so block all further requests and notifications. That would stop us getting into this situation.
Yes, I think we can implement this in the lsp package. It might be an alternative to our first bit 🤔 WDYT @michaelpj
~~I am going ahead and trying that in the lsp package to see if it works~~
Cool, it has been done; I will switch to that and see if it works.
~~Great, it seems to be working with lsp updated with the correct shutdown behaviour. @michaelpj could you do a release of the lsp package from HEAD? Then I can update the lsp version to that new release. I have adjusted the code based on the master HEAD of the lsp package.~~
Nope, the race condition is still happening between the reactor thread (doing the actual shutdown) and the main LSP thread doing the config callback.
Things seem to be nastier than I imagined; I'll do some further digging.
```
--> {
  "jsonrpc": "2.0",
  "method": "initialized",
  "params": {}
}
--> {
  "id": 0,
  "jsonrpc": "2.0",
  "method": "shutdown"
}
2024-04-08T14:20:03.274166Z | Info | LSP: received shutdown
2024-04-08T14:20:03.274196Z | Debug | Received shutdown message
--> {
  "id": 1,
  "jsonrpc": "2.0",
  "result": [
    {
      "cabalFormattingProvider": "cabal-gild",
      "checkParents": "CheckOnSave",
      "checkProject": true,
      "formattingProvider": "ormolu",
      "maxCompletions": 40,
      "plugin": {}
    }
  ]
}
<-- {
  "id": 0,
  "jsonrpc": "2.0",
  "result": null
}
--> {
  "jsonrpc": "2.0",
  "method": "exit"
}
2024-04-08T14:20:03.274515Z | Debug | LSP: set new config: {
  "cabalFormattingProvider": "cabal-gild",
  "checkParents": "CheckOnSave",
  "checkProject": true,
  "formattingProvider": "ormolu",
  "maxCompletions": 40,
  "plugin": { }
}
2024-04-08T14:20:03.274588Z | Debug | Configuration changed: Config {checkParents = CheckOnSave, checkProject = True, formattingProvider = "ormolu", cabalFormattingProvider = "cabal-gild", maxCompletions = 40, plugins = fromList []}
2024-04-08T14:20:03.276135Z | Info | Reactor thread stopped
Server does not exit in 3s, canceling the async task...
Finishing canceling (took 0.41ss)
```
It seems we are still trying to "set new config" after the shutdown.
~~We are not blocking the "action to be run before invoking the handler", which is why we still hang: https://github.com/haskell/lsp/blob/1e5940b4c85d53f01831bca487f3cd0a9466d3de/lsp/src/Language/LSP/Server/Processing.hs#L452~~
Blocking the action won't help either. 🤔
So this is what happens:
1. The server sends a `SMethod_WorkspaceConfiguration` request to the client (the callback has not run yet).
2. The client sends `shutdown` to the server; the server shuts down.
3. The response to `SMethod_WorkspaceConfiguration` is received by the server; the server calls the update callback on the main LSP thread and gets stuck somewhere while restarting the shake session.

This could be prevented if we ignored client responses after the server has shut down, in `processMessage`.
Right, I wondered about doing that (ignoring responses after shutdown). The spec doesn't say that we should, so I guess it's optional. I guess the reason it seems risky to me is that in principle a server might be waiting for a response to happen. That might be foolish, but it could happen. Suppose e.g.:
- The server tries to start a progress session, and requests a progress token, blocking some thread until it receives it
- The client sends shutdown
- The client sends the progress token back (but the server ignores it)
- The server hangs until the blocking in 1 is resolved some other way (e.g. a timeout)
Now, this is maybe bad behaviour from the server implementor, but I thought it seemed reasonable to still handle responses.
What I could do is have us not call the config update callback after shutdown :thinking:
But yeah, it would still be great if we could work out why shakeRestart is hanging. And I think your change to get rid of the unsafeInterleaveIO is good anyway. Maybe put that in a separate PR and we can try and get that bit done?
I am also considering adding the current thread ID to the HLS log. It should give us a more effective debugging experience on such race issues.
Something like this:
```
ThreadId 19 | 2024-04-11T03:17:33.383209Z | Info | Started LSP server in 0.01s
ThreadId 19 | 2024-04-11T03:17:33.452743Z | Debug | ghc --print-libdir
ThreadId 19 | 2024-04-11T03:17:33.453779Z | Debug | Setting initial dynflags...
ThreadId 19 | 2024-04-11T03:17:33.453819Z | Debug | shouldRunSubset: True
ThreadId 29 | 2024-04-11T03:17:33.453879Z | Debug | Initializing exports map from hiedb
ThreadId 29 | 2024-04-11T03:17:33.454034Z | Debug | Done initializing exports map from hiedb. Size: 0
```
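A sketch of the idea, assuming a plain String-based logging action rather than the actual HLS `Recorder` type:

```haskell
import Control.Concurrent (myThreadId)

-- wrap an existing logging action so every message carries the id of
-- the thread that emitted it
withThreadId :: (String -> IO ()) -> (String -> IO ())
withThreadId logIt msg = do
  tid <- myThreadId
  logIt (show tid ++ " | " ++ msg)
```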
Fixed on the lsp side in https://github.com/haskell/lsp/pull/567, and lsp upgraded in #4166.
I found another hang https://github.com/haskell/haskell-language-server/actions/runs/8924119343/job/24509825742?pr=4202
It hangs forever instead of for 3 seconds because it uses ghcide-test-utils' `testIde`; only hls-test-utils implements the 3-second maximum wait.
--- update: this should be the same hang as https://github.com/haskell/haskell-language-server/pull/2497
It seems https://github.com/haskell/haskell-language-server/pull/2504 does not fix the problem entirely; it might still happen under some rare race condition:
- we try to read the session var and successfully cancel the original session
- a rebuild happens right after that
- Shake still gets unhappy!

We should wait and take away the session var to make shake happy.
No: taking away the session var would block and throw an STM indefinite-wait error. Cancel session, take session, run session... a recursive loop hell.