ecosyste.ms replication follow-on work (Meltano/Airbyte)

Open ravenac95 opened this issue 10 months ago • 0 comments

What is it?

In attempting to replicate the databases from ecosyste.ms, we found that the Airbyte connector (and its use with Meltano) had some issues:

  • When using Meltano with the Airbyte connector
    • The state wasn't properly stored (I believe this worked in a previously tested version, so this will need to be checked)
  • When using Airbyte without Meltano
    • We encountered some issues with the BigQuery destination when the data it was loading had sufficiently large JSON columns. The exact reason for the failure is currently unknown, but my theory is below:
      • Airbyte's BigQuery destination writes to GCS first as CSV files.
      • These CSV files contain, I believe, 4 columns: some Airbyte internal data and a final column that holds the JSON blob output by the Airbyte Postgres source.
      • This is then imported into BigQuery en masse using the built-in BigQuery + GCS import.
      • The JSON blob is then processed here
        • I believe it is this step that fails. If the data in the JSON blob is sufficiently large, the process seems to fail. Airbyte also isn't very robust here: in some cases it doesn't skip the failing data and try to succeed in other ways. That, or with that JSON included, the specific table that kept failing was always too large.
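
One way to test the theory above would be to scan one of the staged CSV files for rows whose JSON blob is unusually large. The following is only a minimal sketch under assumptions: it takes the JSON blob to be the last column of each row (as guessed above), and `find_oversized_blobs` and `max_bytes` are hypothetical names, not part of Airbyte or BigQuery. The threshold is a placeholder to tune, not a documented limit.

```python
import csv
import io

# The csv module rejects very large fields by default, so raise its
# per-field limit before reading rows that may carry large JSON blobs.
csv.field_size_limit(2**31 - 1)


def find_oversized_blobs(csv_file, max_bytes):
    """Return (row_number, blob_size_in_bytes) for each row whose last column
    is larger than max_bytes when encoded as UTF-8.

    Assumption: the JSON blob is the final column of every row.
    """
    oversized = []
    for row_number, row in enumerate(csv.reader(csv_file), start=1):
        if not row:
            continue  # skip blank lines
        blob_size = len(row[-1].encode("utf-8"))
        if blob_size > max_bytes:
            oversized.append((row_number, blob_size))
    return oversized


if __name__ == "__main__":
    # Build a small in-memory CSV with made-up data: three placeholder
    # columns followed by a JSON blob column.
    sample = io.StringIO()
    writer = csv.writer(sample)
    writer.writerow(["id-1", "2024-04-15", "", '{"name": "small"}'])
    writer.writerow(["id-2", "2024-04-15", "", '{"name": "' + "x" * 1000 + '"}'])
    sample.seek(0)

    # Hypothetical threshold of 500 bytes, chosen only for the demo.
    for row_number, size in find_oversized_blobs(sample, max_bytes=500):
        print(f"row {row_number}: JSON blob is {size} bytes")
```

If the rows flagged this way line up with the table that kept failing, that would support the large-JSON explanation; if not, the failure likely lies elsewhere in the load.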

ravenac95 · Apr 15 '24 17:04