[firestore-bigquery-export] Duplicate records in raw changelog

Open tamireinhorn opened this issue 1 year ago • 10 comments

Describe your configuration

  • Extension name: firestore-bigquery-export
  • Extension version: 0.1.12
  • Configuration values (redact info where appropriate):
    • Collection path: users
    • Table id: users
    • Data set id: prod_firestore_export

Describe the problem

I recently noticed, via an incremental job that uses a merge statement with event_id as the unique key, that the users_raw_changelog table generated by the extension contains multiple records with exactly the same data and event_id. As far as I'm aware, this should never happen, as the event id should by definition be unique. A few event ids were even repeated 4-5 times, which is very weird. The most intriguing part is that these rows, although they carry timestamps from March-August 2022, or even 2021, only seem to have been duplicated recently: the error only showed up now, even though the most recent duplicate is from 28/8/22 and the incremental job runs every 2 days over the last 30 days.
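As a rough illustration of the symptom, here is a minimal Python sketch that counts repeated event ids and keeps one row per event_id before a merge. The row shape (dicts with `event_id`, `timestamp`, `data`) and the helper names `find_duplicate_event_ids` and `dedupe_by_event_id` are assumptions made for this example; they are not part of the extension.

```python
from collections import Counter


def find_duplicate_event_ids(rows):
    """Return {event_id: count} for every event_id that appears more than once."""
    counts = Counter(row["event_id"] for row in rows)
    return {event_id: n for event_id, n in counts.items() if n > 1}


def dedupe_by_event_id(rows):
    """Keep the first row seen for each event_id, preserving input order."""
    seen = set()
    unique = []
    for row in rows:
        if row["event_id"] not in seen:
            seen.add(row["event_id"])
            unique.append(row)
    return unique


# Hypothetical changelog rows: evt-1 was written twice with identical contents.
rows = [
    {"event_id": "evt-1", "timestamp": "2022-03-01T00:00:00Z", "data": '{"name": "a"}'},
    {"event_id": "evt-1", "timestamp": "2022-03-01T00:00:00Z", "data": '{"name": "a"}'},
    {"event_id": "evt-2", "timestamp": "2022-08-28T00:00:00Z", "data": '{"name": "b"}'},
]

duplicates = find_duplicate_event_ids(rows)
unique_rows = dedupe_by_event_id(rows)
```

Deduplicating the source side like this would let a merge keyed on event_id succeed, but it only works around the duplicates; it does not explain why they were written.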

tamireinhorn avatar Sep 14 '22 10:09 tamireinhorn

Hey @tamireinhorn, thank you for raising this issue. Can you please provide a screenshot of the duplicate rows?

yamankatby avatar Sep 14 '22 11:09 yamankatby

image

I even applied a farm fingerprint to the JSON data just to show that the rows have exactly the same contents. These are the rows that could fit into the screenshot.
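The same kind of check can be sketched outside BigQuery. The example below swaps in SHA-256 over canonicalized JSON rather than a farm fingerprint, so two payloads that differ only in key order or whitespace also hash the same. The function name `payload_fingerprint` and the sample payloads are assumptions for illustration.

```python
import hashlib
import json


def payload_fingerprint(data_json: str) -> str:
    """Hash a JSON payload after normalizing key order and whitespace."""
    canonical = json.dumps(json.loads(data_json), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical payloads: a and b hold the same data, c differs in one field.
a = '{"name": "Ada", "age": 36}'
b = '{"age": 36, "name": "Ada"}'
c = '{"name": "Ada", "age": 37}'

same_contents = payload_fingerprint(a) == payload_fingerprint(b)
different_contents = payload_fingerprint(a) != payload_fingerprint(c)
```

Matching fingerprints together with matching event ids suggest a row was written more than once, as opposed to two distinct changes that happen to look similar.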

tamireinhorn avatar Sep 14 '22 12:09 tamireinhorn

Hey @tamireinhorn, while investigating this issue I noticed that your changelog_raw table (the screenshot you provided) contains a column called "hashing" which is not a column added by the extension as far as I know. Is it something you added yourself?

yamankatby avatar Sep 26 '22 13:09 yamankatby

Yes, I added it myself, as I mentioned along with the screenshot. I applied a farm fingerprint to the JSON itself just to show that the rows are indeed duplicates.

tamireinhorn avatar Sep 26 '22 13:09 tamireinhorn

Hey @tamireinhorn, I have been trying to reproduce this issue for days but couldn't. Can you please provide any extra information you think could be helpful, such as your complete list of config params, your data structure, your history with this extension (for example, the steps that happened before this issue appeared), or anything else that could help reproduce the problem?

yamankatby avatar Sep 29 '22 10:09 yamankatby

I wonder if this scenario could be caused by multiple installations of the BQ extension?

Moving to blocked until we have more information or can reproduce the error.

dackers86 avatar Feb 28 '23 17:02 dackers86

This issue has gone stale, closing until further feedback is provided.

dackers86 avatar Nov 10 '23 10:11 dackers86

@tamireinhorn @dackers86 Hi, I am facing the duplicate issue. I exported the userProfile collection to BigQuery. When a document in the userProfile collection changes, multiple rows with the same doc id are added to BigQuery. Is there an option to control whether updates append new records or update existing ones in BigQuery?

Thanks in advance

timeisgolden avatar Dec 13 '23 16:12 timeisgolden

Re-opening to track the issue again.

pr-Mais avatar Dec 13 '23 17:12 pr-Mais

@timeisgolden Do you mean rows being added to the raw_changelog table? As far as I understand, the changelog should record every change that happens, so a single document can have multiple rows depending on how many times it has changed. It won't mirror Firestore; it will just record every change.
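To make the difference concrete, here is a minimal Python sketch of how a mirror-like "current state" could be derived from changelog rows by keeping only the most recent row per document. The row fields (`document_name`, `timestamp`, `operation`, `data`), the operation values, and the function name `latest_per_document` are assumptions for illustration, not a description of what the extension does internally.

```python
from datetime import datetime, timezone


def latest_per_document(changelog_rows):
    """Return the most recent row per document, skipping deleted documents."""
    latest = {}
    for row in changelog_rows:
        name = row["document_name"]
        current = latest.get(name)
        if current is None or row["timestamp"] > current["timestamp"]:
            latest[name] = row
    # A document whose most recent change is a delete no longer exists.
    return {name: row for name, row in latest.items() if row["operation"] != "DELETE"}


def ts(day):
    """Build a UTC timestamp in December 2023 for the sample rows."""
    return datetime(2023, 12, day, tzinfo=timezone.utc)


# Hypothetical changelog: doc u1 changed twice, doc u2 was created then deleted.
changelog = [
    {"document_name": "userProfile/u1", "timestamp": ts(1), "operation": "CREATE", "data": '{"n": 1}'},
    {"document_name": "userProfile/u1", "timestamp": ts(5), "operation": "UPDATE", "data": '{"n": 2}'},
    {"document_name": "userProfile/u2", "timestamp": ts(2), "operation": "CREATE", "data": '{"n": 9}'},
    {"document_name": "userProfile/u2", "timestamp": ts(3), "operation": "DELETE", "data": None},
]

current_state = latest_per_document(changelog)
```

Here the changelog holds four rows for two documents, while the derived state holds a single row for `userProfile/u1`. Several rows with the same doc id are therefore expected; rows sharing the same event_id, as in the original report, are a separate matter.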

pr-Mais avatar Dec 19 '23 10:12 pr-Mais