central-backend

Creating or publishing a form draft is not atomic

Open matthew-white opened this issue 2 years ago • 4 comments

@lognaturel was working on a server with a large number of form versions. Quoting @lognaturel from Slack:

Creating a form draft takes a very long time and sometimes 504s. This is for any form on the server. Publishing a form draft generally 504s. Sometimes it succeeds but CSV attachments are not successfully added.

The behavior I’m experiencing is exactly what @matthew-white described at #381.

I do think publish should be atomic and it doesn’t look like it is. It’s a bad state to have tested a draft with an attachment and then to have it not be there when published.

Both draft creation and draft publish [are not atomic].

matthew-white avatar Nov 10 '22 06:11 matthew-white

@lognaturel, when you publish on this server, do you generally specify the version query parameter? Or do you modify the version string in the XForm yourself when uploading the form definition?

matthew-white avatar Nov 10 '22 06:11 matthew-white

The idea that these actions are not atomic is very surprising to me! I'm surprised because we generally process non-GET requests within a database transaction, which should make things atomic.

I'm guessing that at some point, Backend was working on concurrent, identical requests to create or publish the form draft. In general, even if there's a 504 response, Backend may continue working on the request (#407). That means that if the same request was sent after the 504, Backend could have been working on the new request even as it was still working on the earlier request. That made me wonder about transaction isolation. (Could the concurrent transactions have been insufficiently isolated? Was a database check effectively bypassed?) But after thinking about it a little, I'm not seeing how that would be the issue. Two concurrent requests to create a form draft will both try to UPDATE the same row of the forms table, setting draftDefId. But even under the read committed isolation level:

UPDATE, DELETE, SELECT FOR UPDATE, and SELECT FOR SHARE commands behave the same as SELECT in terms of searching for target rows: they will only find target rows that were committed as of the command start time. However, such a target row might have already been updated (or deleted or locked) by another concurrent transaction by the time it is found. In this case, the would-be updater will wait for the first updating transaction to commit or roll back (if it is still in progress).
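The queueing behavior the docs describe can be sketched with a deterministic toy model (this is plain JavaScript, not real Postgres and not Backend code; all names here are illustrative): two "transactions" are generators, and a round-robin scheduler blocks a transaction whose UPDATE needs a row lock another transaction still holds.

```javascript
// Toy model of read-committed row-lock queueing: a second UPDATE on the
// same row waits until the first transaction commits and releases the lock.
const row = { draftDefId: null, lockedBy: null };
const log = [];

function* txn(name, defId) {
  // UPDATE: block while another transaction holds the row lock.
  while (row.lockedBy !== null && row.lockedBy !== name) {
    log.push(`${name} waits`);
    yield;
  }
  row.lockedBy = name;
  row.draftDefId = defId;
  log.push(`${name} updates row`);
  yield; // transaction keeps running before committing
  row.lockedBy = null; // commit releases the lock
  log.push(`${name} commits`);
}

// Run A and B round-robin; A reaches its UPDATE first, so B queues behind it.
const txns = [txn('A', 'defA'), txn('B', 'defB')];
while (txns.length > 0) {
  const t = txns.shift();
  if (!t.next().done) txns.push(t);
}
console.log(log);
// → ['A updates row', 'B waits', 'A commits', 'B updates row', 'B commits']
```

B's UPDATE only proceeds after A commits, exactly the "would-be updater will wait" behavior quoted above; note that nothing here prevents both transactions from committing.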

matthew-white avatar Nov 10 '22 07:11 matthew-white

I think I can write a failing test for this now!

Two concurrent requests to create or publish a form draft for the same form will queue according to the logic I quoted above. Both will try to UPDATE the same row of the forms table, so whichever request is the first to try to UPDATE will proceed while the other request has to wait for that transaction to commit or roll back. Again, note that either or both requests may return with a 504 error, but that won't affect the requests' transactions, which will continue to be worked on.

I was previously thinking that this queueing ensures that only one of the requests' transactions can be committed. For example, two requests to publish a draft with the same version string should trip a database constraint or check. However, I'm realizing that in some cases, both transactions can be committed without tripping a database constraint or check. For example, if you send two concurrent requests to publish a draft, both can succeed as long as they specify different version query parameters.

Here's another example, which I think may be closely related to what @lognaturel was seeing: if you have an existing draft, then you can send concurrent requests to publish the draft and replace the draft, and both requests can be successful without tripping a database constraint or check. (The endpoint to replace the draft is the same as the endpoint to create a draft: the endpoint doesn't return an error if there is already a draft.) The issue is that replacing the draft has the effect of deleting the current draft along with its form attachments. If the publish request gets off the ground before the transaction to replace the draft commits, then the publish request will end up publishing a copy of the replaced draft — but by the time it tries to copy that draft's form attachments, the attachments will have been deleted by the transaction to replace the draft. (I think the publish request has to specify a version query parameter so that it creates a new form_def: otherwise the two transactions won't both commit.)

More specifically in the code, POST …/draft and POST …/draft/publish?version both call Forms.createVersion(). createVersion() first executes an UPDATE, then once that completes, it calls FormAttachments.createVersion(), which copies forward the form attachments returned by FormAttachments.getAllByFormDefId(). That can result in the following sequence:

  • A request to replace the draft is received. It UPDATEs the form.
  • Soon after the UPDATE, the request inserts new form fields. Let's assume this takes a long time.
  • While that insert is happening, a request to publish the draft is received. It reads the XML of the soon-to-be-replaced draft, then tries to UPDATE the form. At this point, it is forced to wait on the earlier transaction.
  • The insert finishes, and the rest of the request to replace the draft proceeds. The previous draft and its form attachments are deleted.
  • Now that the first transaction has committed, the request to publish the draft stops waiting. After its UPDATE, it moves on to querying form_attachments. It doesn't see any form attachments to copy (because they've been deleted), so it doesn't try to copy them.
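The sequence above can be reproduced in the same toy-model style (again, this is an illustrative sketch, not Backend's actual code; the `def1`/`def2`/`goodone.csv` names are made up): the "replace draft" transaction holds the row lock through its slow insert and deletes the old draft's attachments, while "publish" reads the old draft's id early but only copies whatever attachments remain once it finally acquires the lock.

```javascript
// Toy model of the race: publish copies attachments from a draft that the
// concurrent replace has already deleted by the time publish gets the lock.
const db = {
  form: { lockedBy: null, draftDefId: 'def1' },
  attachments: { def1: ['goodone.csv'] }, // attachments keyed by draft def id
  published: null,
};
const log = [];

function* replaceDraft() {
  db.form.lockedBy = 'replace'; // UPDATE: take the row lock, point at new draft
  const oldDef = db.form.draftDefId;
  db.form.draftDefId = 'def2';
  db.attachments.def2 = [...db.attachments[oldDef]]; // copy attachments forward
  log.push('replace: UPDATE form');
  yield; // slow form-field insert happens across these steps
  yield;
  delete db.attachments[oldDef]; // replacing deletes the old draft's attachments
  db.form.lockedBy = null; // commit
  log.push('replace: commit');
}

function* publishDraft() {
  const draftToPublish = 'def1'; // reads the soon-to-be-replaced draft first
  log.push('publish: read draft def1');
  while (db.form.lockedBy !== null) { log.push('publish: wait on lock'); yield; }
  db.form.lockedBy = 'publish'; // UPDATE form, after the replace commits
  // Copy forward whatever attachments def1 still has — none remain.
  db.published = { def: draftToPublish, attachments: db.attachments[draftToPublish] ?? [] };
  db.form.lockedBy = null; // commit
  log.push('publish: commit');
}

const txns = [replaceDraft(), publishDraft()];
while (txns.length > 0) {
  const t = txns.shift();
  if (!t.next().done) txns.push(t);
}
console.log(db.published); // → { def: 'def1', attachments: [] }
```

Both transactions commit without tripping a constraint, yet the published version has lost goodone.csv, matching what @lognaturel observed.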

To see this in a test, first cause Forms.createVersion() to take a while after the UPDATE. For example, wait for 5 seconds after inserting form fields. You will need to adjust the Mocha timeout (for example, specifying --timeout 0). Then this test should fail:

Test
it('should always copy attachments when publishing', testServiceFullTrx((service) =>
  service.login('alice', (asAlice) =>
    asAlice.post('/v1/projects/1/forms')
      .send(testData.forms.withAttachments)
      .set('Content-Type', 'application/xml')
      .expect(200)
      .then(() => asAlice.post('/v1/projects/1/forms/withAttachments/draft/attachments/goodone.csv')
        .send('this is goodone.csv')
        .expect(200))
      .then(() => asAlice.post('/v1/projects/1/forms/withAttachments/draft/publish')
        .expect(200))
      .then(() => asAlice.post('/v1/projects/1/forms/withAttachments/draft')
        .expect(200))
      .then(() => Promise.all([
        asAlice.post('/v1/projects/1/forms/withAttachments/draft')
          .expect(200),
        new Promise(resolve => { setTimeout(resolve, 2000); })
          .then(() => asAlice.post('/v1/projects/1/forms/withAttachments/draft/publish?version=v2')
            .expect(200))
      ]))
      .then(() => asAlice.get('/v1/projects/1/forms/withAttachments/attachments')
        .expect(200)
        .then(({ body }) => {
          const { exists } = body.find(({ name }) => name === 'goodone.csv');
          exists.should.be.true();
        })))));

I'm only seeing this issue happen when there are concurrent requests. If we fix #662, that should greatly reduce the likelihood of such concurrent requests. In general, Backend doesn't tend to do a lot to account for the possibility of concurrent modification of the same data. I think we could account for that possibility here if we wanted to (by locking things? by using the repeatable read transaction isolation level?). But I also think we could consider just fixing #662 and closing this issue as a "won't fix" for now.
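One way the "locking things" option could work, sketched in the same toy model (again illustrative, not a proposed patch): if publish acquires the form's row lock before reading anything about the draft — roughly the effect of a SELECT … FOR UPDATE — it can no longer publish a copy of a draft whose attachments a concurrent replace has deleted.

```javascript
// Toy model of the lock-first fix: publish takes the row lock before reading
// which draft to publish, so it always sees the post-replace state.
const db = {
  form: { lockedBy: null, draftDefId: 'def1' },
  attachments: { def1: ['goodone.csv'] },
  published: null,
};

function* replaceDraft() {
  db.form.lockedBy = 'replace'; // UPDATE: take the row lock
  const oldDef = db.form.draftDefId;
  db.form.draftDefId = 'def2';
  db.attachments.def2 = [...db.attachments[oldDef]]; // copy attachments forward
  yield; // slow form-field insert
  yield;
  delete db.attachments[oldDef]; // old draft's attachments deleted
  db.form.lockedBy = null; // commit
}

function* publishDraftLocked() {
  // Acquire the row lock FIRST, before reading the draft to publish.
  while (db.form.lockedBy !== null) yield;
  db.form.lockedBy = 'publish';
  const draftToPublish = db.form.draftDefId; // now sees def2, the live draft
  db.published = { def: draftToPublish, attachments: [...db.attachments[draftToPublish]] };
  db.form.lockedBy = null; // commit
}

const txns = [replaceDraft(), publishDraftLocked()];
while (txns.length > 0) {
  const t = txns.shift();
  if (!t.next().done) txns.push(t);
}
console.log(db.published); // → { def: 'def2', attachments: ['goodone.csv'] }
```

The publish now operates on the draft that actually survives, and its attachments are copied intact; repeatable read would instead make one of the two transactions fail with a serialization error, forcing a retry.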

matthew-white avatar Nov 11 '22 08:11 matthew-white

i've long been nervous about this.

we used to have a transaction context to understand whether we are in a write request or not. in that case, i thought to add a FOR UPDATE to selects, but it seemed like things were okay so i didn't. seems we need to do that after all.

issa-tseng avatar Nov 11 '22 17:11 issa-tseng