zeppelin
zeppelin copied to clipboard
[ZEPPELIN-5981] Check binded interpreter inside Paragraph.jobAbort()
What is this PR for?
Inside an open zeppelin notebook, after running a paragraph by clicking the play icon button, canceling the running paragraph by clicking the pause icon button (or the Ctrl+Option+C shortcut) does not interrupt the running process and cancel the paragraph.
I've checked that CANCEL_PARAGRAPH message gets well passed through the websocket connection but server was somehow completely ignoring the message.
After some more investigation, I've found that the reason for that was the Paragraph's jobAbort() method, which does nothing due to the null interpreter member.
https://github.com/apache/zeppelin/blob/73e1aa37d403626d23138b3b346aec5ef2c12813/zeppelin-zengine/src/main/java/org/apache/zeppelin/notebook/Paragraph.java#L505
I've checked some other methods of Paragraph such as completion(), execute(), jobRun(), recover() and they were initializing interpreter before doing something by calling getBindedInterpreter(), so I've changed the jobAbort() method to do the same.
https://github.com/apache/zeppelin/blob/73e1aa37d403626d23138b3b346aec5ef2c12813/zeppelin-zengine/src/main/java/org/apache/zeppelin/notebook/Paragraph.java#L258
https://github.com/apache/zeppelin/blob/73e1aa37d403626d23138b3b346aec5ef2c12813/zeppelin-zengine/src/main/java/org/apache/zeppelin/notebook/Paragraph.java#L333
https://github.com/apache/zeppelin/blob/73e1aa37d403626d23138b3b346aec5ef2c12813/zeppelin-zengine/src/main/java/org/apache/zeppelin/notebook/Paragraph.java#L399
https://github.com/apache/zeppelin/blob/73e1aa37d403626d23138b3b346aec5ef2c12813/zeppelin-zengine/src/main/java/org/apache/zeppelin/notebook/Paragraph.java#L810
Other than that, I have some doubts on whether it's right thing to return true on failing what was requested.
But after all, I just sticked to the original behavior (returning true for every case) to prevent some unexpected side effects.
What type of PR is it?
Bug Fix
Todos
N/A
What is the Jira issue?
- https://issues.apache.org/jira/browse/ZEPPELIN-5981
How should this be tested?
- Run a paragraph that can be canceled, like a python interpreter paragraph with
for i in range(10): time.sleep(1) - Click the cancel button and check if it gets canceled .
Screenshots (if appropriate)
N/A
Questions:
- Does the license files need to update? No
- Is there breaking changes for older versions? No
- Does this needs documentation? No
@elbakramer Thank you for the contribution. BTW, could you please check the CI? All tests for core-modules failed. I hope you investigate and fix them before starting reviews.
@jongyoul I think the failures are due to CI actions taking too long, hopefully not by the code change. Could you please check if it's the CI issue first? Or is there any way to investigate the CI or just rerun by myself?
@elbakramer I don't think we have any critical issues in CI now but let me restart it. FYI, I checked the log briefly and it looks like hanging for a long time in some tests.
Hello, I re-ran the test but there are still some error with the same situation before. I checked the current CI and it didn't have a problem. Could you please check why the CI stops and hangs?
@jongyoul
I think now CI does not hang and all tests for core modules finish properly with the green check icons. But I'm still curious whether the errors and warnings shown in the Annotations block below should still be resolved. Could you please check if it's ok or not?
https://github.com/elbakramer/zeppelin/actions/runs/7013954933
What I've found about the hang is that, it seems like the getBindedInterpreter() method works like the get-or-create manner and it unintentionally creates a new interpreter instance upon jobAbort() if none exists. And that unexpected new interpreter makes all tests hang. My first guess about that method was like get-or-null manner but turned out to be wrong.
In order to prevent that creation, I've added some checks before actually calling the method:
- do nothing if the
Jobis aborted already - only call
getBindedInterpreter()if- the target for the abort is considered to exist already
- and current
this.interpretermember isnull(and should bind to the existing one in order to cancel that)
I couldn't think of a direct way to check the existence of the interpreter instance. So it just indirectly checks whether if the job is currently running or not, assuming that there would be an interpreter instance if the job is in a running state.
Thank you for checking and fixing it. I think your analysis looks good. Let's see how the CI goes. I ran it again.
Could you please rebase this PR to trigger the latest CI again? one of them still fails.