spring-cloud-dataflow icon indicating copy to clipboard operation
spring-cloud-dataflow copied to clipboard

Getting deadlock error while executing a task

Open akshaylad26 opened this issue 1 year ago • 16 comments

Description: We are getting intermittent deadlock issue while executing tasks. We observed that this usually happens when our execution cleanup job is running in parallel. Below is the stacktrace. We are unable to reproduce this issue in lower environments, but it is frequently happening in client environments. Can you please check? If any details required, please let me know, I will extract it.

Release versions: We are using 2.11.4 release version.

org.springframework.dao.CannotAcquireLockException: PreparedStatementCallback; SQL [INSERT INTO BOOT3_BATCH_JOB_EXECUTION(JOB_EXECUTION_ID, JOB_INSTANCE_ID, START_TIME, END_TIME, STATUS, EXIT_CODE, EXIT_MESSAGE, VERSION, CREATE_TIME, LAST_UPDATED) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?) ]; Deadlock found when trying to get lock; try restarting transaction at org.springframework.jdbc.support.SQLExceptionSubclassTranslator.doTranslate(SQLExceptionSubclassTranslator.java:78) at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:73) at org.springframework.jdbc.core.JdbcTemplate.translateException(JdbcTemplate.java:1572) at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:667) at org.springframework.jdbc.core.JdbcTemplate.update(JdbcTemplate.java:960) at org.springframework.jdbc.core.JdbcTemplate.update(JdbcTemplate.java:1015) at org.springframework.jdbc.core.JdbcTemplate.update(JdbcTemplate.java:1020) at org.springframework.batch.core.repository.dao.JdbcJobExecutionDao.saveJobExecution(JdbcJobExecutionDao.java:243) at org.springframework.batch.core.repository.support.SimpleJobRepository.createJobExecution(SimpleJobRepository.java:178) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:568) at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:343) at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:196) at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163) at org.springframework.transaction.interceptor.TransactionInterceptor$1.proceedWithInvocation(TransactionInterceptor.java:123) at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:391) at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:119) at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:184) at org.springframework.batch.core.repository.support.AbstractJobRepositoryFactoryBean.lambda$getObject$0(AbstractJobRepositoryFactoryBean.java:204) at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:184) at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:244) at jdk.proxy2/jdk.proxy2.$Proxy94.createJobExecution(Unknown Source) at org.springframework.batch.core.launch.support.SimpleJobLauncher.run(SimpleJobLauncher.java:145) at org.springframework.batch.core.launch.support.TaskExecutorJobLauncher.run(TaskExecutorJobLauncher.java:59) at org.springframework.cloud.task.batch.handler.TaskJobLauncherApplicationRunner.execute(TaskJobLauncherApplicationRunner.java:147) at org.springframework.boot.autoconfigure.batch.JobLauncherApplicationRunner.executeLocalJobs(JobLauncherApplicationRunner.java:195) at org.springframework.boot.autoconfigure.batch.JobLauncherApplicationRunner.launchJobFromProperties(JobLauncherApplicationRunner.java:175) at org.springframework.cloud.task.batch.handler.TaskJobLauncherApplicationRunner.run(TaskJobLauncherApplicationRunner.java:112) at org.springframework.boot.autoconfigure.batch.JobLauncherApplicationRunner.run(JobLauncherApplicationRunner.java:165) at org.springframework.boot.SpringApplication.lambda$callRunner$4(SpringApplication.java:771) at org.springframework.util.function.ThrowingConsumer$1.acceptWithException(ThrowingConsumer.java:83) at org.springframework.util.function.ThrowingConsumer.accept(ThrowingConsumer.java:60) at org.springframework.util.function.ThrowingConsumer$1.accept(ThrowingConsumer.java:88) at org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:783) at org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:771) at org.springframework.boot.SpringApplication.lambda$callRunners$3(SpringApplication.java:759) at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183) at java.base/java.util.stream.SortedOps$SizedRefSortingSink.end(SortedOps.java:357) at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:510) at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499) at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150) at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173) at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596) at org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:759) at org.springframework.boot.SpringApplication.run(SpringApplication.java:332) at org.springframework.boot.SpringApplication.run(SpringApplication.java:1327) at org.springframework.boot.SpringApplication.run(SpringApplication.java:1316) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:568) at org.springframework.boot.loader.MainMethodRunner.run(MainMethodRunner.java:49) at org.springframework.boot.loader.Launcher.launch(Launcher.java:95) at org.springframework.boot.loader.Launcher.launch(Launcher.java:58) at org.springframework.boot.loader.JarLauncher.main(JarLauncher.java:65) Caused by: com.mysql.cj.jdbc.exceptions.MySQLTransactionRollbackException: Deadlock found when trying to get lock; try restarting transaction at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:124) at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:122) at com.mysql.cj.jdbc.ClientPreparedStatement.executeInternal(ClientPreparedStatement.java:916) at com.mysql.cj.jdbc.ClientPreparedStatement.executeUpdateInternal(ClientPreparedStatement.java:1061) at com.mysql.cj.jdbc.ClientPreparedStatement.executeUpdateInternal(ClientPreparedStatement.java:1009) at com.mysql.cj.jdbc.ClientPreparedStatement.executeLargeUpdate(ClientPreparedStatement.java:1320) at com.mysql.cj.jdbc.ClientPreparedStatement.executeUpdate(ClientPreparedStatement.java:994) at com.zaxxer.hikari.pool.ProxyPreparedStatement.executeUpdate(ProxyPreparedStatement.java:61) at com.zaxxer.hikari.pool.HikariProxyPreparedStatement.executeUpdate(HikariProxyPreparedStatement.java) at org.springframework.jdbc.core.JdbcTemplate.lambda$update$2(JdbcTemplate.java:965) at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:648) ... 55 common frames omitted

akshaylad26 avatar Sep 11 '24 10:09 akshaylad26

@akshaylad26 Is this happening in a Batch Task application that is launched by Spring Cloud Data Flow or using a Schedule?

corneil avatar Sep 11 '24 11:09 corneil

@akshaylad26 Can you provide more information on your cleanup job?

corneil avatar Sep 11 '24 11:09 corneil

@akshaylad26 Please provide the version of MySql and the platform you are launching the applications (Kubernetes, local, Cloud Foundry). Thanks!

cppwfs avatar Sep 11 '24 11:09 cppwfs

We are launching using Kubernetes & MySQL version is 8.0.33 Both cleanup job & batch task application are scheduled operations. Cleanup job executes every 6 hours whereas task is scheduled every hour. In Cleanup job we get executions using /tasks/executions endpoint in a paginated way sorted by task execution id in ascending order. Then we use /task/executions/{list of executionIds}?action=REMOVE_DATA,CLEANUP&schemaTarget=boot3 endpoint for cleaning up older executions. We follow this way because normal cleanup API provided by SCDF doesn't clean executions without any start/end date.

Let me know if this helps.

akshaylad26 avatar Sep 11 '24 12:09 akshaylad26

What transaction isolation level are you using for your batch application(s)?

cppwfs avatar Sep 11 '24 14:09 cppwfs

We use default isolation level. We use hikari datasource & dont override it.

akshaylad26 avatar Sep 11 '24 14:09 akshaylad26

You may wish to consider setting the transaction isolation level of your apps to something like ISOLATION_REPEATABLE_READ as discussed in Spring Batch's Javadoc.

cppwfs avatar Sep 11 '24 14:09 cppwfs

sure will give it a try. Thanks!

akshaylad26 avatar Sep 11 '24 15:09 akshaylad26

Creating new transaction with name [org.springframework.cloud.dataflow.server.service.LauncherInitializationService.initialize]: PROPAGATION_REQUIRED,ISOLATION_DEFAULT

I see this line in SCDF server DEBUG logs as well. Does this also need to be updated to ISOLATION_REPEATABLE_READ?

akshaylad26 avatar Sep 12 '24 06:09 akshaylad26

Thank you for your update! For now SCDF needs to keep its current transaction isolation level.

cppwfs avatar Sep 12 '24 10:09 cppwfs

I just checked in our Database, its default isolation level itself is REPEATABLE_READ.

image

akshaylad26 avatar Sep 12 '24 12:09 akshaylad26

Understood. Not sure that necessarily affects the Batch apps in that it sets its own default.

cppwfs avatar Sep 13 '24 13:09 cppwfs

Hi @cppwfs ,

We tried adding transaction isolation level = ISOLATION_REPEATABLE_READ in our batch application. But today again got the same deadlock issue.

akshaylad26 avatar Sep 26 '24 08:09 akshaylad26

Can you share a small sample application that reproduces the issue? Thanks

cppwfs avatar Sep 30 '24 13:09 cppwfs

Hi @cppwfs ,

We were not able to reproduce issue. However, attaching innodb status file from mysql which shows details about deadlock. deadlock.txt

Hope it helps to investigate further.

akshaylad26 avatar Oct 15 '24 09:10 akshaylad26

DELETE FROM BOOT3_BATCH_JOB_INSTANCE WHERE NOT EXISTS ( SELECT JOB_INSTANCE_ID FROM BOOT3_BATCH_JOB_EXECUTION WHERE BOOT3_BATCH_JOB_INSTANCE.JOB_INSTANCE_ID = JOB_INSTANCE_ID)

I believe this query holds the lock of entire table rows which causes the problem.

akshaylad26 avatar Nov 14 '24 11:11 akshaylad26