nodejs-bigtable icon indicating copy to clipboard operation
nodejs-bigtable copied to clipboard

fix: Update last row key in write function of user stream to avoid data loss and data duplication

Open danieljbruce opened this issue 6 months ago • 0 comments

Summary:

This PR offers a long term fix ensuring that when retryable errors occur that the user will not experience data loss nor will duplicate data be delivered to them. It replaces the patch that is currently in place that is preventing data duplication.

Background

In this PR a fix was applied to prevent data loss in the client for readRows calls. Data duplication was also mostly addressed by adjusting watermarks except for a special case that occurred from time to time in Node v14. The patch in this PR was then applied to ensure the user didn't receive duplicate data by throwing away duplicate data just before it reached the user. Throwing away the duplicate data solves the problem, but it isn't the best permanent solution because it involves two changes to solve just one fundamental problem. This current PR replaces the patch with a simpler change that will also solve the data duplication and data loss issues.

Root cause analysis

While investigating the issue it was found that duplicated data was delivered when a row made it to the write function of the user stream, but not to the transform function of the user stream when the retry request was being formed. It was being stored here in the stream waiting to be processed by transform. Since the transform function was where the last row key was being updated and this hadn't happened yet for the row, the row was re-requested. By pushing the row key update back to the write function we ensure the row doesn't get re-requested.

Changes

  • The old test for data duplication would only fail sometimes on Node v14. This PR adds a test that always fails on Node v14 when the patch isn't in place. This will make it easier to detect data duplication and data loss if it ever happens again
  • Moved the lastRowKey and rowsRead updates to the write function instead of the transform function. When observing the call stack we can see that the emit function of the row stream directly calls the write function in the user stream demonstrating that the write function is the earliest point in time the row is guaranteed to make it to the user and therefore is where the update should occur.
  • Removed the patch in the user stream transform function as it is no longer required there.

Alternatives considered

Instead of overriding write we could pass in a write function to the constructor, but we don't want to do that because it will override _write and stop calling _transform if the Transform is ready for reading. Other options in the Transform constructor don't allow us to achieve our goal of updating the last row key earlier. Therefore, overriding write is our best option.

danieljbruce avatar Jul 26 '24 21:07 danieljbruce