Handling Graceful Shutdown in Spring Batch

Open reluxa opened this issue 1 year ago • 1 comments

Bug description I have an application that uses remote partitioned batch jobs, which are sent to the workers via JMS. I also have a ThreadPoolTaskExecutor configured on the worker side, so the chunks can be processed in parallel. I was testing the graceful shutdown behavior on the worker side.

One of the test cases checked what happens when the processing of a step on the remote side takes longer than the graceful shutdown period. The expected behavior is that once the grace period expires, the partition step terminates and the step state is set to STOPPED in the database.

In my case, the application just hangs: Spring is not able to fully close the application context in this scenario. It is stuck in an endless loop in RepeatTemplate.executeInternal(). This calls TaskExecutorRepeatTemplate.getNextResult(), which calls runnable.expect(), which in turn calls queue.expect(). Since Spring has already tried to interrupt everything, this call fails with an InterruptedException, which is then translated to a RepeatException.
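The translation step can be sketched in plain Java. This is a minimal sketch with hypothetical stand-in names (`LoopException`, `expect`), not the Spring Batch sources. It assumes the wrapping exception is created without a cause, which would be consistent with the null unwrapped throwable described further down in this report.

```java
import java.util.concurrent.Semaphore;

class InterruptTranslationSketch {

    /** Hypothetical stand-in for a framework exception; NOT the Spring Batch RepeatException. */
    static class LoopException extends RuntimeException {
        LoopException(String message) {
            super(message); // assumption: no cause is attached
        }
    }

    /** Waits for a permit and translates an interrupt into an unchecked exception. */
    static void expect(Semaphore permits) {
        try {
            permits.acquire();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // restore the interrupt flag
            throw new LoopException("interrupted while waiting for a permit");
        }
    }

    /** Returns true if an interrupted caller gets a LoopException whose cause is null. */
    static boolean failsWithoutCause() {
        Semaphore permits = new Semaphore(0);
        Thread.currentThread().interrupt(); // simulate the interrupt sent during shutdown
        try {
            expect(permits);
            return false;
        } catch (LoopException e) {
            return e.getCause() == null;
        } finally {
            Thread.interrupted(); // clear the flag so the caller is left untouched
        }
    }

    public static void main(String[] args) {
        System.out.println(failsWithoutCause());
    }
}
```

Because the interrupt flag is restored in the catch block, every later call to `expect` on the same thread fails the same way, so any caller that simply retries will spin.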

https://github.com/spring-projects/spring-batch/blob/e6c27273fa2b3713c6f2d472bf3de1b18f8e5eba/spring-batch-infrastructure/src/main/java/org/springframework/batch/repeat/support/RepeatTemplate.java#L204-L217

Here a couple of things can fail:

  • doHandle calls DefaultExceptionHandler https://github.com/spring-projects/spring-batch/blob/e6c27273fa2b3713c6f2d472bf3de1b18f8e5eba/spring-batch-infrastructure/src/main/java/org/springframework/batch/repeat/exception/DefaultExceptionHandler.java#L37-L39 which throws an NPE here. This can be avoided by configuring a custom ExceptionHandler, so that no NPE is thrown.

  • if DEBUG logging is enabled, an NPE can also be thrown here, since the unwrapped throwable is null https://github.com/spring-projects/spring-batch/blob/e6c27273fa2b3713c6f2d472bf3de1b18f8e5eba/spring-batch-infrastructure/src/main/java/org/springframework/batch/repeat/support/RepeatTemplate.java#L288-L290 This can also be avoided by turning off DEBUG.

  • and finally here:

https://github.com/spring-projects/spring-batch/blob/e6c27273fa2b3713c6f2d472bf3de1b18f8e5eba/spring-batch-infrastructure/src/main/java/org/springframework/batch/repeat/support/RepeatTemplate.java#L215-L217

I would expect running to be set to false here; however, this does not happen because the RepeatContext is still not complete.

  • Using reflection I was able to add a RepeatListener to the RepeatTemplate that calls context.setTerminateOnly() when the application is shutting down. This breaks the endless loop here, but afterwards AbstractStep again tries to rethrow null after extracting the cause from this RepeatException https://github.com/spring-projects/spring-batch/blob/e6c27273fa2b3713c6f2d472bf3de1b18f8e5eba/spring-batch-core/src/main/java/org/springframework/batch/core/step/AbstractStep.java#L232
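The idea behind the listener workaround can be illustrated with a toy model. This is only a sketch under stated assumptions: `Context`, `Listener`, and `run` are hypothetical stand-ins written for this example and are not the Spring Batch RepeatContext, RepeatListener, or RepeatTemplate. The loop retries failing work forever unless a listener marks the context as terminate-only.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

class TerminateOnlyLoopSketch {

    /** Hypothetical minimal context; it models only a terminate-only flag. */
    static class Context {
        private volatile boolean terminateOnly;
        void setTerminateOnly() { terminateOnly = true; }
        boolean isTerminateOnly() { return terminateOnly; }
    }

    /** Hypothetical listener hook, invoked before every iteration. */
    interface Listener {
        void before(Context context);
    }

    /** Runs the toy loop and returns how many iterations started (capped so the demo ends). */
    static int run(List<Listener> listeners, int maxIterations) {
        Context context = new Context();
        int iterations = 0;
        while (iterations < maxIterations) {
            iterations++;
            for (Listener listener : listeners) {
                listener.before(context);
            }
            if (context.isTerminateOnly()) {
                break; // apart from the demo cap, the flag is the only exit
            }
            try {
                // Stand-in for work that fails on every pass once the thread is interrupted.
                throw new IllegalStateException("interrupted while waiting");
            } catch (IllegalStateException swallowed) {
                // A lenient handler swallows the failure, so the loop simply retries.
            }
        }
        return iterations;
    }

    /** Iteration count when a listener reacts to a shutdown flag. */
    static int iterationsWithShutdownListener() {
        AtomicBoolean shuttingDown = new AtomicBoolean(true); // would be set by a shutdown hook
        Listener stopOnShutdown = context -> {
            if (shuttingDown.get()) {
                context.setTerminateOnly();
            }
        };
        return run(List.of(stopOnShutdown), 1_000);
    }

    public static void main(String[] args) {
        System.out.println(run(List.of(), 1_000));            // runs until the cap
        System.out.println(iterationsWithShutdownListener()); // stops on the first pass
    }
}
```

In this model the listener only ends the loop; it does not decide what the caller does with the failure afterwards, which mirrors the remaining problem in AbstractStep described above.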

Environment

  • openjdk version "17.0.7" 2023-04-18 LTS
  • Spring Batch 5.0.2
  • Spring Boot 3.1.1
  • PostgreSQL 15.3

Steps to reproduce See above

Expected behavior

  • after the graceful period Spring shall be able to forcefully close the ApplicationContext
  • no NPE or other exception is expected to be thrown.
  • the related step state shall be saved as STOPPED in the database.
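For comparison, the expected sequence resembles the usual two-phase shutdown of a plain java.util.concurrent.ExecutorService: wait for the grace period, then interrupt whatever is still running. The sketch below uses only the JDK; the class and method names are made up for this example, and it does not show how Spring or Spring Batch implement shutdown.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class TwoPhaseShutdownSketch {

    /** Returns true if all tasks finished within the grace period, false if they were interrupted. */
    static boolean shutdown(ExecutorService executor, long graceMillis) throws InterruptedException {
        executor.shutdown(); // phase 1: stop accepting new tasks, let running ones finish
        if (executor.awaitTermination(graceMillis, TimeUnit.MILLISECONDS)) {
            return true;
        }
        executor.shutdownNow(); // phase 2: interrupt the tasks that are still running
        executor.awaitTermination(graceMillis, TimeUnit.MILLISECONDS);
        return false;
    }

    /** Runs one long task and reports whether the forced shutdown interrupted it. */
    static boolean longTaskIsInterrupted() {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch interrupted = new CountDownLatch(1);
        executor.execute(() -> {
            started.countDown();
            try {
                Thread.sleep(60_000); // simulates a step that outlives the grace period
            } catch (InterruptedException e) {
                // This is the point where the report expects the step to end up as STOPPED.
                interrupted.countDown();
            }
        });
        try {
            started.await();
            boolean graceful = shutdown(executor, 100);
            return !graceful && interrupted.await(1, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("demo thread was interrupted", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(longTaskIsInterrupted());
    }
}
```

The important property is that the interrupted task returns promptly instead of retrying, so the second awaitTermination can complete and the process can exit.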

Minimal Complete Reproducible example TBD, I will try to create a minimalistic example for this. springbatchissue.zip Steps to reproduce:

  • unzip
  • execute ./gradlew jibDockerBuild to create a docker image
  • start the stack using docker-compose up
  • check the worker logs; immediately after the worker receives the first message, execute kill -15 1 to kill it

You will see that the app does not terminate after the graceful period ends. Execute kill -3 1 and you will see that it is hanging in an endless loop.

reluxa, Aug 10 '23 11:08