pharo 10644 fixes process terminate

10644 fixes process terminate

Open isCzech opened this issue 2 years ago • 41 comments

After failing tests in #11308 (Zinc related, timed out, no idea why) I'm starting over: new Pharo11 image and merged the same changes as in #11308. The previous PR was based on an original PR #10644 sent from a very old image, so just making sure this is not an issue... (I have very limited github experience though). Thanks for your patience :)

May 30 '22 22:05 isCzech

still the same set of failing tests, don't know where to go form here, please advise

May 31 '22 07:05 isCzech

very strange. One idea is to check the build artefact and see if the tests fail interactively, too. I will try that..

May 31 '22 13:05 MarcusDenker

manually the tests pass; however, a few of them sometimes fail getting stuck on a delay semaphore, if you Alt+. and Abandon the debugger the next run will pass... Investigating...

ha, they fail in the current image too and behave similarly erratically; relieved ;)

May 31 '22 15:05 isCzech

Disable test execution environment to remove its own logic relying on ensure blocks:

ZnClient>>runCaseManaged
	"Here we are testing the test environment logic.
	So we should disable it for ourselves"
	
	^DefaultExecutionEnvironment beActiveDuring: [ self runCase]

It will reduce the amount of failures. I have the only failing test with same port error. I think the termination logic with special higher priority shows the problems here.

Jun 03 '22 22:06 dionisiydk

the tests that still fail don't seem to be related to this commit; they all pass when run in the image

Jun 04 '22 08:06 isCzech

Hi all, ran CI again and for some reason some of the previous tests cleared and one remained failing - no idea why, in the image it works ok. CI says: FLSerializer class(Object)>>doesNotUnderstand: #serializeToByteArray: but AFAICS in my image FLSerializer class has and knows #serializeToByteArray: so please advise... Is this not failing without this commit? How can I find out? Thanks!

Jun 05 '22 09:06 isCzech

ran CI again and for some reason some of the previous tests cleared and one remained failing - no idea why, in the image it works ok. CI says: FLSerializer class(Object)>>doesNotUnderstand: #serializeToByteArray: but AFAICS in my image FLSerializer class has and knows #serializeToByteArray: so please advise... Is this not failing without this commit? How can I find out? Thanks!

Possibly it is related to recent Roassar update. Does it fail in other PRs?

Jun 05 '22 12:06 dionisiydk

I do not know :) but in concurrent programming I have no idea what is negligible. :)

Jun 06 '22 12:06 Ducasse

In Playground, when trying to terminate p := [ [Semaphore new wait] ensure: [^2] ] fork we should prefer [p terminate] fork rather than the plain p terminate because the latter is run in the UI and thus the UI will wait on a semaphore until the termination of p is completed; however p will run into problems with the non-local return ^2 and will try to open a debugger in the UI - so as result the image will freeze, which is expected but scary. After Alt + . (Cmd + .) recovery the debugger opens Block Cannot Return as expected.

Jun 06 '22 14:06 isCzech

In Playground, when trying to terminate p := [ [Semaphore new wait] ensure: [^2] ] fork we should prefer [p terminate] fork rather than the plain p terminate

It's not much different from any other errors inside the #ensure block:

p := [ [Semaphore new wait] ensure: [1/0] ] fork.
p terminate.

Here it is ZeroDivide exception but in your example it was BlockCannotReturn (same as from DoIt"[ ^2 ] fork" ). Also halt inside ensure would lock the UI:

p := [ [Semaphore new wait] ensure: [self halt] ] fork.
p terminate

Notice that in current image we have a degradation which we must fix. Both my examples does not show the debugger at all. Errors or halt are eaten by the current #termination logic. With this PR the #terminate will lock. In old images before all changes to the termination (for exampe Pharo 7) we would have UnwindError forked in a separate process while the #terminate itself will be finished (and Halt was working as expected).

Jun 18 '22 13:06 dionisiydk

It's not much different from any other errors inside the #ensure block

Precisely... any interaction request with the UI from inside the unwind block during termination causes this UI unresponsiveness. It's just a consequence of running the termination from the Playground in the UI process... not a bug but one should be aware of that :)

Notice that in current image we have a degradation which we must fix.

Absolutely, that is the main deficiency of the current #terminate which was just 'phase 1' of my solution; baby steps... tried to get rid of the bugs in the previous ancient #terminate first and build on that.

I've implemented the same algorithm in Squeak and Cuis and so far I'm not aware of any issues.

Jun 18 '22 14:06 isCzech

So with all these changes the external termination become a simple trigger initiating a self-termination from the current execution point of a #process. Now a sender of #terminate waits the completion of a process and I think it is too strong criteria. It leads to the observed problems with the locked sender during the failed termination. I think we should relax this criteria to "sender waits for the completion of a termination". The failed termination should be considered as a completed one. It is always about unhandled error and it is passed to the user from the process itself. The sender of #terminate don't need to wait any decision here (or any other "unhandled fallback").

Thus I suggest to add another semaphore signal for the unhandled errors:

Process>>doTerminationFromAnotherProcess
....
   terminator := Semaphore new.
   trapContext := context bottomContext insertSender: (Context
							contextOn: context exceptionsToCaptureWhenStepping do: [:ex | 
                                                                        terminator signal. ex pass ]).
   trapContext insertSender: (Context contextEnsure: [terminator signal]).
   suspendedContext := [context unwindTo: nil. self endProcess] asContext.
...
```		
With this change any failed termination will show a debugger like if it would happen during a self-termination. #terminate will return to the sender once the failure is detected.
In future we can introduce special notification ProcessTerminationFailed to allow #terminate users to handle such cases.

Jun 21 '22 23:06 dionisiydk

Wow, I just wanted to try catching the UnhandledError at the bottom of the stack :D And yes, your suggestion WORKS with a small adjustment - I reversed the order at the bottom:

Process>>doTerminationFromAnotherProcess
....
   terminator := Semaphore new.
   context bottomContext insertSender: (Context contextEnsure: [terminator signal]).
   context bottomContext insertSender: 
      (Context contextOn: UnhandledError do: [:ex | terminator signal. ex pass ]).
   suspendedContext := [context unwindTo: nil. self endProcess] asContext.
...

However, one thing that bothers me: the reason UI becomes unresponsive is because of the way the UnhandledErrors are handled - i.e. inside the UI process; this generally has nothing to do with termination as such. So catching unhandled errors in #terminate is a bit of a trade-off (the great thing about this is it WORKS, however).

I wonder (theoretically only indeed): would it be a solution to restart the UI when an UnhandledError is handled, in the default action somehow? (Or even more theoretically to run the debugger as a separate process?)

Thanks for discussing this!

Jun 22 '22 07:06 isCzech

I reversed the order at the bottom

Oops, @dionisiydk, you're right about the order - #contextOn:do: comes first and #contextEnsure: at the very bottom It confused me in Squeak it works regardless of the order but in Pharo the order matters...

however, I'd still prefer to find a solution outside #terminate to keep the #terminate code independent of the implementation of dealing with unhandled exceptions (whether they are handled in the UI process or otherwise...)

Another example to explore the behavior: [self halt] ensure: [self error] close the debugger and see if the UI freezes...

Jun 22 '22 09:06 isCzech

Hi Denis, all, What would you think about terminating the debugged process in DebugSession>>#terminate by a forked helper process; like this:

DebugSession>>#terminate
	"Action that needs to be executed after the window containing this debug session is closed, 
	in order to terminate the right process."
	
	self interruptedProcess 
		ifNotNil: [ 
			"Assume the user closed the debugger. Simply kill the interrupted process."
			[self interruptedProcess terminate] fork.     "<---------- fork the termination and let the debugger close"
			Processor yield.
			"self interruptedProcess terminate.           <----------- instead of this current approach"
			self clear.
			Smalltalk installLowSpaceWatcher.  "restart low space handler" ]
		ifNil: [ 
			"Assume the interrupted process was resumed."
			"Kill the active process if the error was in the UI as there should be only one UI process."
			self isAboutUIProcess
				ifTrue: [Processor terminateActive] ]

This way you could keep Process>> #terminate clean, i.e. without dealing with the side effects of running the debugger and Playground examples inside the UI.

I'm not a debugger expert and would really appreciate any opinions about this approach. Many thanks!

Jun 22 '22 17:06 isCzech

However, one thing that bothers me: the reason UI becomes unresponsive is because of the way the UnhandledErrors are handled

Unresponsiveness in that case is a side effect of single UI process in the image. Until UI finishes the current action it is not able to process the next one - it can't open a debugger in our case.

But my point is not really about this locking issue. What should we expect from the following example:

process := [ [ Semaphore new wait ] ensure: [ 1 /0 ] ] forkAt: Processor activePriority + 2.
terminator := [ process terminate ] forkAt: Processor activePriority + 1.
result := terminator isTerminated " ==> true or false? "

With current logic (leading to UI lock) it returns false because the terminator process waits for a full completion of the victim process but the process is suspended on the debugger. Without any action from the user the process will wait forever. If the user will close the debugger it will initiate the termination again which at the end will signal the original semaphore and the #terminator will finish. With my proposal the #result will be true - the terminator simply does not care about failed termination. So the question is really about this semantics: should #terminate return (stop waiting) when the termination fails?

I am more inclined to follow the "ignore failed termination" idea (return when it fails). I don't see the value to wait for the UnhandledError to be sorted. It can be useful to detect the failed termination with a proper exception but the "infinite waiting" does not provide such a feature and therefore it looks useless to me. And we will simply avoid the lock issue if we will follow this idea.

Jun 22 '22 18:06 dionisiydk

Another example to explore the behavior: [self halt] ensure: [self error] close the debugger and see if the UI freezes...

yep. another example to follow my idea

Jun 22 '22 18:06 dionisiydk

What would you think about terminating the debugged process in DebugSession>>#terminate by a forked helper process

In my experience fixes like "let's simply fork it" does not work well. It always leads to more worse corner cases which is more difficult to debug. I think #terminate should be safe to use. And our simple examples show that it is too easy to go to a trouble now.

Jun 22 '22 18:06 dionisiydk

What would you think about terminating the debugged process in DebugSession>>#terminate by a forked helper process

In my experience fixes like "let's simply fork it" does not work well. It always leads to more worse corner cases which is more difficult to debug. I think #terminate should be safe to use. And our simple examples show that it is too easy to go to a trouble now.

Thank you, I don't have an opinion yet but I appreciate sharing your experience :)

Jun 22 '22 18:06 isCzech

With my proposal the #result will be true - the terminator simply does not care about failed termination. So the question is really about this semantics: should #terminate return (stop waiting) when the termination fails?

That summarizes the issue perfectly; again, due to my limited experience I don't have an opinion which way is preferable but I agree your proposal works perfectly and solves the issue; if this semantics is ok for you I'm more than happy to do it this way :)

The example presents it very clearly, thanks! Again, happy to go either way :)

Jun 22 '22 18:06 isCzech

That summarizes the issue perfectly; again, due to my limited experience I don't have an opinion which way is preferable but I agree your proposal works perfectly and solves the issue; if this semantics is ok for you I'm more than happy to do it this way :)

The example presents it very clearly, thanks! Again, happy to go either way :)

Great, deal :). This way we will also reduce the possible impact because the current and the old semantics in the image is working without the locking on a failure.

So please add a handle for UnhandledException (notice the base class). And on another iteration we will improve this logic and maybe again rethink it (who knows)

Jun 22 '22 19:06 dionisiydk

Deal :) One more note: This line must go: self isTerminating ifTrue: [ProcessAlreadyTerminating signal. ^self].

The point is in all examples above the process being terminated will actually be terminated twice (one termination is invoked by sending terminate and the second termination by closing the debugger; in case of [self halt] ensure: [self error] both terminations are invoked by closing the debugger). BUT the terminating flag would prevent the second (and any other subsequent) termination resulting in unterminated processes in the system.

Well, I don't actually see any reason for enforcing this check "if terminating then return immediately") (at least now; please correct me if I'm wrong). Multiple termination is perfectly legitimate and can/should be dealt with instead of blocking it.

Will update the code soon; thanks again!

Jun 22 '22 22:06 isCzech

Deal :) One more note: This line must go: self isTerminating ifTrue: [ProcessAlreadyTerminating signal. ^self].

This is interesting. It means that isTerminating flag can be removed at the end. I think it was introduced to prevent some race condition issues during simultaneous self-termination and external termination of the process. Maybe new version is safe now. But anyway as you noticed such a simple flag leads to other critical problems.

Jun 22 '22 22:06 dionisiydk

Hi Denis, please check this out: We're primarily trying to avoid freezing the UI in case of failed termination (termination with an error during unwind); in other cases, when the UI is not involved, we don't mind if the (non-UI) terminator process waits until the error during unwind is dealt with; the process may have an explicit handler for Unhandled exception so it'd probably be better to let the process deal with it and let the terminator process wait until it's done). So I've narrowed the "special care" on:do: context to on: UnhandledException do: "unblock UI":

doTerminationFromAnotherProcess
		...
		context bottomContext insertSender: 
			(Context contextOn: UnhandledException do: [:ex | 
				UIManager default uiProcess suspendingList == terminator ifTrue: [terminator signal]. ex pass]).
		context bottomContext insertSender: (Context contextEnsure: [terminator signal]).

In this case your example

process := [ [ Semaphore new wait ] ensure: [ 1/0 ] ] forkAt: Processor activePriority + 2.
terminator := [ process terminate ] forkAt: Processor activePriority + 1.
{terminator isTerminated. process isTerminated} " ==> true or false? "

would answer #(false false) before and #(true true) after closing the debugger.

I'll test it and push later if you won't find anything wrong with it. Thanks a lot!

Jun 23 '22 12:06 isCzech

Please note that terminating := true. remains in place for the time being because some code and tests throughout the core use #isTerminating instead of #isTerminated for some reason; I can't tell how much code outside the core does that too so I'd consider safer to keep it for now and possibly maybe deprecate later.

Jun 23 '22 15:06 isCzech

context bottomContext insertSender: (Context contextOn: UnhandledException do: [:ex | UIManager default uiProcess suspendingList == terminator ifTrue: [terminator signal]. ex pass]).

I don't like this extra logic. I think the semantics of #terminate synchronization should be the same independently of the terminator process. And generally the UIManager and the concept of uiProcess are too high level things to be used in such kernel internal place

Jun 23 '22 16:06 dionisiydk

the process may have an explicit handler for Unhandled exception so it'd probably be better to let the process deal with it and let the terminator process wait until it's done

We should either wait or not wait. Otherwise it can be a nightmare to debug why something works from workspace but does not work on a "server forking code" or other way.

Anyway we install handlers into the bottom context of the process. If somebody will decide to handle UnhandledException it will be done on a level above. Our trap is for real unhandled case where nobody caught the error.

Jun 23 '22 16:06 dionisiydk

[...] it can be a nightmare to debug why something works from workspace but does not work on a "server forking code" or other way. Our trap is for real unhandled case where nobody caught the error.

Hmm, that's true, thanks! :)

Jun 23 '22 16:06 isCzech

Hi @dionisiydk, all The UI unresponsiveness when terminating the debugger in this example [self halt] ensure: [self error] is just a mere consequence of a more general bug or deficiency: take a look at this example:

s := Semaphore new.
[1/0. s signal] fork.
s wait

There's no reason to expect the UI to freeze yet it freezes because when opening the debugger the system won't make sure a functional unsuspended and unblocked UI is available... I'd expect some sort of check in #spawnNewProcessIfThisIsUI: or somewhere similar but I have zero experience with the debugger code. My idea is just to resume a blocked UI; if you knew where to place a fix to make the above example work without freezing the UI then #terminate wouldn't need to deal with this issue at all! And generally this would mean a better user experience when running experiments from the UI :)

Any suggestion very much appreciated.

Jun 30 '22 22:06 isCzech

My idea is just to resume a blocked UI

Ok, it may be a bad idea... We may not want to resume a UI that is blocked for a reason - like waiting on a semaphore; maybe we just need a new UI that would open a debugger while letting the old UI finish the computation (after pressing Abandon or Proceed)... Thanks for any advice

Jun 30 '22 23:06 isCzech

pharo pharo copied to clipboard

10644 fixes process terminate

pharo
pharo copied to clipboard