
A batch Job started from a managed transaction sometimes leads to a failed execution and NullPointerException / FISH-1179

Open OndroMih opened this issue 4 years ago • 6 comments

Description


When a batch job is started from a managed transaction (e.g. from a stateless EJB), the job execution sometimes (not always) fails.

This happens with the default configuration, where the Batch engine uses the default H2 datasource. I haven't tested with other databases.

When the job fails, there's a NullPointerException in the server log. It turns out that this is not the original exception; it only masks the original PersistenceException, which is swallowed. The debugger shows that the original PersistenceException says:

Referential integrity constraint violation: "JOBEXEC_STEPEXEC_FK: PUBLIC.STEPEXECUTIONINSTANCEDATA FOREIGN KEY(JOBEXECID) REFERENCES PUBLIC.EXECUTIONINSTANCEDATA(JOBEXECID) (2)"; SQL statement: INSERT INTO STEPEXECUTIONINSTANCEDATA (jobexecid, batchstatus, exitstatus, stepname, readcount,writecount, commitcount, rollbackcount, readskipcount, processskipcount, filtercount, writeskipcount, starttime,endtime, persistentdata) VALUES(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?) [23506-196]

But when I check the data in the database, the table EXECUTIONINSTANCEDATA does contain the JOBEXECID that is being inserted into STEPEXECUTIONINSTANCEDATA. This probably means that it wasn't in the table at the time the INSERT into STEPEXECUTIONINSTANCEDATA was executed, and that it was only added later.

Expected Outcome

A timer defined on a Stateless EJB starts a Batch job with the following code:

BatchRuntime.getJobOperator().start("EventFilesProcessorJob", null);  // the EventFilesProcessorJob job is defined as a Batch job in an XML file at the expected location
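
Fleshed out, the scheduler bean looks roughly like this (a minimal sketch; the @Schedule settings and the method name are illustrative assumptions, only the JobScheduler bean name and the job name come from the reproducer):

import javax.batch.runtime.BatchRuntime;
import javax.ejb.Schedule;
import javax.ejb.Stateless;

@Stateless
public class JobScheduler {

    // Container-managed transactions (the default), so the timer method runs in a global transaction
    @Schedule(hour = "*", minute = "*", persistent = false)
    public void startJob() {
        // EventFilesProcessorJob.xml sits under META-INF/batch-jobs on the classpath
        BatchRuntime.getJobOperator().start("EventFilesProcessorJob", null);
    }
}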

The job runs successfully as defined in the XML, without any exceptions.

Current Outcome

Most of the time, the batch job runs successfully as expected. Sometimes (roughly 1 in 10 times), it fails with the following stacktrace:

nullpointer_stacktrace.txt

According to the source code, the exception is raised from a catch block that is already handling an exception. The original exception is swallowed, but the debugger shows that the original exception is the following:

persistenceexc_stacktrace.txt

Steps to reproduce (Only for bug reports)

Just build the attached application batch-issue.zip with mvn install and deploy to Payara Server.

Context (Optional)

This issue was originally reported on the Payara Forum for the CargoTracker project.

Environment

  • Payara Version: 5.2020.7 (but the same applies also to 5.2021.1)
  • Edition: Full
  • JDK Version: Zulu 11
  • Operating System: Linux

OndroMih avatar Feb 17 '21 22:02 OndroMih

The reason for the exception is that the JobOperator.start() method creates an instance of the job and persists it within the transaction that is created for running the timer method in the JobScheduler EJB. The JobOperator then runs the first batch step asynchronously in a new thread, where it tries to persist the checkpoint after that step. The new thread uses autocommit, so its SQL INSERT statement is applied immediately. The timer method sleeps for 1 second, so the first transaction only commits 1 second later, after the new thread has already tried to persist the checkpoint. That INSERT assumes the job data is already saved, but the first transaction, which saves the job, isn't committed yet. If the sleep instruction is removed, the job runs OK most of the time, but sometimes the background thread is faster and tries to autocommit before the first thread commits, which results in the same exception.
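
The race can be reproduced outside of Payara with two plain JDBC connections against an in-memory H2 database (a self-contained sketch with made-up table names, not Payara's actual job repository; the exact failure mode, FK violation vs. lock timeout, depends on the H2 version and settings):

import java.sql.Connection;
import java.sql.DriverManager;

public class ForeignKeyRaceDemo {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:h2:mem:demo;DB_CLOSE_DELAY=-1";
        try (Connection setup = DriverManager.getConnection(url)) {
            setup.createStatement().execute(
                "CREATE TABLE JOBDATA (JOBEXECID BIGINT PRIMARY KEY)");
            setup.createStatement().execute(
                "CREATE TABLE STEPDATA (JOBEXECID BIGINT REFERENCES JOBDATA(JOBEXECID))");
        }
        try (Connection timerTx = DriverManager.getConnection(url);
             Connection batchThread = DriverManager.getConnection(url)) {
            // like the timer method's JTA transaction: job row inserted but not committed
            timerTx.setAutoCommit(false);
            timerTx.createStatement().execute("INSERT INTO JOBDATA VALUES (1)");

            // like the batch step thread, which autocommits: the parent row is not
            // committed yet, so this INSERT fails (or blocks) with a referential
            // integrity violation similar to the one in the server log
            batchThread.createStatement().execute("INSERT INTO STEPDATA VALUES (1)");

            timerTx.commit(); // would only happen after the sleep, too late
        }
    }
}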

Everything works as expected if transaction management of the JobScheduler EJB is set to BEAN: @TransactionManagement(TransactionManagementType.BEAN). This causes the timer method to run without a transaction, so the JobOperator uses autocommit and commits immediately. This ensures that the job data is committed before the step is executed in a new thread, so the job data is already available when the step is persisted.
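
In other words, the workaround is just this annotation on the same bean (a sketch; the timer method is the same as in the description above):

import javax.ejb.Stateless;
import javax.ejb.TransactionManagement;
import javax.ejb.TransactionManagementType;

@Stateless
@TransactionManagement(TransactionManagementType.BEAN) // timer method now runs with no global transaction
public class JobScheduler {
    // ... same @Schedule timer method calling
    // BatchRuntime.getJobOperator().start("EventFilesProcessorJob", null)
}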

A possible fix is for JobOperator.start() to run all its SQL statements in a nested transaction, or to persist both the job and the first step on a new thread instead of persisting the job in the current thread and the step in a new thread. It's important to ensure that the transaction that writes the job data commits (or autocommits) before the step data is written.
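
For illustration, the nested-transaction idea could be implemented by suspending the caller's transaction around the job-repository writes, roughly like this (a hypothetical helper using only the standard javax.transaction.TransactionManager API, not existing JBatch or Payara code):

import javax.transaction.Transaction;
import javax.transaction.TransactionManager;

// Hypothetical helper: run the job-repository INSERTs outside the caller's global transaction
public class DetachedJobPersistence {

    public void persistOutsideCallerTx(TransactionManager tm, Runnable persistJobData) throws Exception {
        Transaction suspended = tm.suspend(); // detach the caller's transaction, if there is one
        try {
            // runs on a connection in autocommit mode, so the job row is committed immediately
            persistJobData.run();
        } finally {
            if (suspended != null) {
                tm.resume(suspended); // reattach the caller's transaction
            }
        }
    }
}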

OndroMih avatar Feb 17 '21 22:02 OndroMih

I think this should be raised at the Jakarta Batch spec project. Not sure this is a bug. If there's a global TX around the creation of a job, what is the developer expecting? This is similar to, say, a global TX around submitting a Jakarta Concurrency task or enqueuing a message onto a JMS queue.

I agree we could have some better error messages around this.

smillidge avatar Feb 18 '21 09:02 smillidge

Hi @smillidge , I'm pretty sure it's a bug but I agree that it should be raised at the upstream JBatch RI project, and ideally at the Jakarta Batch project to standardize the behavior. The Jakarta Batch spec doesn't say anything about running a job in a global transaction.

The reason why I believe it's a bug is that I would expect that, in a global transaction, the job would be scheduled according to one of the following scenarios:

  • The step execution would wait until the global transaction is committed. This is because it depends on that transaction being committed. If the global transaction rolls back, there's no reason to run the job.
  • Both job creation and step execution would be done asynchronously, disconnected from the global transaction. This would commit job creation before the step is started.
  • Job creation would be done in a nested transaction, which would commit before the step is started and wouldn't affect the global transaction. This is very similar to the previous option but might go against the spec.

Since the spec mandates that the job should reuse the existing global transaction (although this isn't clarified enough in the spec), I believe the best solution is to wait with the job execution until the global transaction is committed. This also makes the most sense to me, because:

  1. If a job is created in a transaction and the transaction rolls back, the job shouldn't be executed or should be rolled back. This solution simply wouldn't execute the job. Reverting a custom job is complicated, if not impossible, in a compliant way.
  2. JobOperator.start() should execute asynchronously, which implies that it's not connected to the transaction. So it's OK if it's executed after the transaction commits. There's no reason why the job should be started within the transaction, or why it should have any impact on whether the transaction commits or not.

Argument 2 applies to all options, but argument 1 applies only to the first option. That's why I believe the first option is semantically the most correct.
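
A rough sketch of how option 1 (deferring the start until the global transaction commits) could look, using a standard JTA Synchronization; the helper class and method names are hypothetical, not existing JBatch or Payara code:

import javax.transaction.Status;
import javax.transaction.Synchronization;
import javax.transaction.TransactionSynchronizationRegistry;

// Hypothetical: start the job only once the caller's transaction has committed
public class DeferredJobStart {

    public void startAfterCommit(TransactionSynchronizationRegistry registry, Runnable startJob) {
        if (registry.getTransactionStatus() == Status.STATUS_NO_TRANSACTION) {
            startJob.run(); // no global transaction, so start immediately as today
            return;
        }
        registry.registerInterposedSynchronization(new Synchronization() {
            @Override
            public void beforeCompletion() {
                // nothing to do before commit
            }

            @Override
            public void afterCompletion(int status) {
                if (status == Status.STATUS_COMMITTED) {
                    startJob.run(); // hand off to the batch thread pool only after commit
                }
                // on rollback the job is simply never started
            }
        });
    }
}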

I'll raise this with the JBatch RI and also the Jakarta Batch project to get their opinion.

OndroMih avatar Feb 22 '21 10:02 OndroMih

Reproduced and created FISH-1179 to keep track of this issue.

AlanRoth avatar Mar 10 '21 16:03 AlanRoth

It looks like OpenLiberty provides its own JobOperator that suspends the current transaction and commits the changes directly: https://github.com/OpenLiberty/open-liberty/blob/integration/dev/com.ibm.jbatch.container/src/com/ibm/jbatch/container/api/impl/JobOperatorImplSuspendTran.java

OndroMih avatar Feb 10 '22 16:02 OndroMih

The JBatch JobOperator will be enhanced in version 2.1.0 with this PR: https://github.com/WASdev/standards.jsr352.jbatch/pull/77. Although I haven't tested it yet (there's no way to test it with Payara 5 because it's not compatible), I'm pretty sure it fixes this issue, because it fixes a very similar issue in GlassFish and in the Jakarta Batch 2.1 TCK. Therefore this issue will most probably be fixed in Payara 6 once it integrates JBatch 2.1.0.

OndroMih avatar Feb 14 '22 23:02 OndroMih