zeppelin icon indicating copy to clipboard operation
zeppelin copied to clipboard

[ZEPPELIN-5981] Check binded interpreter inside Paragraph.jobAbort()

Open elbakramer opened this issue 1 year ago • 7 comments

What is this PR for?

Inside an open zeppelin notebook, after running a paragraph by clicking the play icon button, canceling the running paragraph by clicking the pause icon button (or the Ctrl+Option+C shortcut) does not interrupt the running process and cancel the paragraph.

I've checked that CANCEL_PARAGRAPH message gets well passed through the websocket connection but server was somehow completely ignoring the message.

After some more investigation, I've found that the reason for that was the Paragraph's jobAbort() method, which does nothing due to the null interpreter member.

https://github.com/apache/zeppelin/blob/73e1aa37d403626d23138b3b346aec5ef2c12813/zeppelin-zengine/src/main/java/org/apache/zeppelin/notebook/Paragraph.java#L505

I've checked some other methods of Paragraph such as completion(), execute(), jobRun(), recover() and they were initializing interpreter before doing something by calling getBindedInterpreter(), so I've changed the jobAbort() method to do the same.

https://github.com/apache/zeppelin/blob/73e1aa37d403626d23138b3b346aec5ef2c12813/zeppelin-zengine/src/main/java/org/apache/zeppelin/notebook/Paragraph.java#L258

https://github.com/apache/zeppelin/blob/73e1aa37d403626d23138b3b346aec5ef2c12813/zeppelin-zengine/src/main/java/org/apache/zeppelin/notebook/Paragraph.java#L333

https://github.com/apache/zeppelin/blob/73e1aa37d403626d23138b3b346aec5ef2c12813/zeppelin-zengine/src/main/java/org/apache/zeppelin/notebook/Paragraph.java#L399

https://github.com/apache/zeppelin/blob/73e1aa37d403626d23138b3b346aec5ef2c12813/zeppelin-zengine/src/main/java/org/apache/zeppelin/notebook/Paragraph.java#L810

Other than that, I have some doubts on whether it's right thing to return true on failing what was requested.

But after all, I just sticked to the original behavior (returning true for every case) to prevent some unexpected side effects.

What type of PR is it?

Bug Fix

Todos

N/A

What is the Jira issue?

  • https://issues.apache.org/jira/browse/ZEPPELIN-5981

How should this be tested?

  • Run a paragraph that can be canceled, like a python interpreter paragraph with for i in range(10): time.sleep(1)
  • Click the cancel button and check if it gets canceled .

Screenshots (if appropriate)

N/A

Questions:

  • Does the license files need to update? No
  • Is there breaking changes for older versions? No
  • Does this needs documentation? No

elbakramer avatar Nov 17 '23 12:11 elbakramer

@elbakramer Thank you for the contribution. BTW, could you please check the CI? All tests for core-modules failed. I hope you investigate and fix them before starting reviews.

jongyoul avatar Nov 21 '23 04:11 jongyoul

@jongyoul I think the failures are due to CI actions taking too long, hopefully not by the code change. Could you please check if it's the CI issue first? Or is there any way to investigate the CI or just rerun by myself? 스크린샷 2023-11-21 오후 2 33 18

elbakramer avatar Nov 21 '23 05:11 elbakramer

@elbakramer I don't think we have any critical issues in CI now but let me restart it. FYI, I checked the log briefly and it looks like hanging for a long time in some tests.

jongyoul avatar Nov 21 '23 05:11 jongyoul

Hello, I re-ran the test but there are still some error with the same situation before. I checked the current CI and it didn't have a problem. Could you please check why the CI stops and hangs?

jongyoul avatar Nov 23 '23 10:11 jongyoul

@jongyoul

I think now CI does not hang and all tests for core modules finish properly with the green check icons. But I'm still curious whether the errors and warnings shown in the Annotations block below should still be resolved. Could you please check if it's ok or not?

https://github.com/elbakramer/zeppelin/actions/runs/7013954933

스크린샷 2023-12-07 오후 6 01 01

What I've found about the hang is that, it seems like the getBindedInterpreter() method works like the get-or-create manner and it unintentionally creates a new interpreter instance upon jobAbort() if none exists. And that unexpected new interpreter makes all tests hang. My first guess about that method was like get-or-null manner but turned out to be wrong.

In order to prevent that creation, I've added some checks before actually calling the method:

  • do nothing if the Job is aborted already
  • only call getBindedInterpreter() if
    • the target for the abort is considered to exist already
    • and current this.interpreter member is null (and should bind to the existing one in order to cancel that)

I couldn't think of a direct way to check the existence of the interpreter instance. So it just indirectly checks whether if the job is currently running or not, assuming that there would be an interpreter instance if the job is in a running state.

elbakramer avatar Dec 07 '23 09:12 elbakramer

Thank you for checking and fixing it. I think your analysis looks good. Let's see how the CI goes. I ran it again.

jongyoul avatar Dec 07 '23 09:12 jongyoul

Could you please rebase this PR to trigger the latest CI again? one of them still fails.

jongyoul avatar Feb 05 '24 12:02 jongyoul