
Automating Data Sync: Parquet File Export Integration for Passport Scores with Open Source Observer

Open erichfi opened this issue 1 year ago • 8 comments

User Story:

As a Passport Project Manager, I want to set up a system that exports our public index of all addresses and scores (in Parquet format) to the Open Source Observer (OSO) project's S3, so that we can automate the synchronization of Passport scores with the OSO system, enabling real-time data analysis and insights through daily scheduled exports.

Acceptance Criteria

GIVEN a Parquet file containing all addresses and scores at https://public.scorer.gitcoin.co/passport_scores/registry_score.parquet is updated, WHEN the file is exported to the OSO's S3 bucket daily, THEN the OSO system should reflect the latest data from the Passport scores index every day without manual intervention.

Product & Design Links:

Tech Details:

  • The export must support Parquet file format for data synchronization.
  • Ensure the export system can handle the dynamic nature of Passport scores, including the decay mechanism.
  • Consider the scalability of the solution to manage the continuous growth of Passport users and score updates.
  • Set up a daily scheduled export to the OSO's S3.

Open Questions:

  • [x] What specific data fields and metrics does the OSO require from the Passport scores for its analysis?
    • Required fields:
      • passport.address
      • last_score_timestamp
      • evidence.rawScore
      • evidence.threshold
      • stamp_scores
    • Not required as of now:
      • status
      • error
  • [x] Are there any specific security or privacy protocols we need to establish between Passport and the OSO project for the S3 exports?
    • No additional security or privacy protocols are required on OSO's side; the data will be available on BigQuery as a public dataset.
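The dotted field names above suggest the registry records are nested JSON. A sketch of flattening them into those column names with pandas (the sample record is fabricated for illustration, and we assume stamp_scores is kept as a serialized JSON string so it survives as a single column):

```python
import json
import pandas as pd

# A fabricated registry record, for illustration only.
record = {
    "passport": {"address": "0x123"},
    "last_score_timestamp": "2024-04-01T00:00:00Z",
    "evidence": {"rawScore": "25.417", "threshold": "20"},
    # kept as a JSON string so it flattens to a single column
    "stamp_scores": json.dumps({"Google": 1.0, "Ens": 2.2}),
    "status": "DONE",
}

# json_normalize turns nested keys into dotted column names,
# e.g. passport.address and evidence.rawScore.
df = pd.json_normalize([record], sep=".")
```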

Notes/Assumptions:

  • Assumption: The OSO project has the infrastructure to support daily data integration.
  • Note: Passport scores decay over time; the export system must account for this to ensure data accuracy.
  • Assumption: Both teams have agreed on the data usage policies and privacy considerations for sharing Passport scores with the OSO.

erichfi avatar Mar 28 '24 21:03 erichfi

Hi @erichfi responding below:

What specific data fields and metrics does the OSO require from the Passport scores for its analysis?

The fields we require are:

  • passport.address
  • last_score_timestamp
  • evidence.rawScore
  • evidence.threshold
  • stamp_scores

We aren't sure if we'll ever need the status or error fields.

How frequently does the OSO system need updates from the Passport scores index to maintain data relevance?

Every 24 hours would be consistent with our other SLAs.

Are there any specific security or privacy protocols we need to establish between Passport and the OSO project?

From our side, no. The data would be available on BigQuery as a public dataset.

Will we need to establish a direct integration or API access with the OSO for the Airbyte connector to function?

Tagging @ravenac95

Is Airbyte the best tool for this integration, or should we consider other options to ensure the most efficient and reliable data synchronization between Passport and the OSO?

Tagging @ravenac95

ccerv1 avatar Mar 29 '24 09:03 ccerv1

From what I can understand, we could use, for example, the S3 connector to connect to our current data dumps, which we store in S3: https://docs.airbyte.com/integrations/sources/s3. The problem, however, is that the dumps are in JSONL, which the connector does not support, so we should change the format (switching to Parquet probably makes sense).

@ccerv1 does this make sense?

nutrina avatar Apr 09 '24 07:04 nutrina

@nutrina @ccerv1

Will we need to establish a direct integration or API access with the OSO for the Airbyte connector to function?

If it is somehow possible, is there a way to set up a direct Airbyte connector to a data store (preferably Postgres, since we can do Change Data Capture), as opposed to S3 storage? That is definitely the clearest/easiest method of integration.

Is Airbyte the best tool for this integration, or should we consider other options to ensure the most efficient and reliable data synchronization between Passport and the OSO?

Airbyte is the best tool if we have a direct connection to a database. If the dumps are only going to be JSONL in S3, getting them directly into our GCS bucket is easy enough for us to load into our warehouse. In that scenario, it should be easy enough for you to have a write-only user on a path in a bucket we control. It sounds like the JSONL dump is updated in place as addresses and scores change. If this is the path we choose, it would be preferable for each dump to include a datetime suffix in the filename; we can then handle any cleaning/deleting of old files based on that suffix.
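The naming convention requested above could look like the following sketch (the prefix and retention count are hypothetical; a UTC timestamp suffix sorts lexicographically, so old files can be pruned by name alone):

```python
from datetime import datetime, timezone
from pathlib import Path

def dump_filename(prefix="registry_score", when=None):
    """Build a dump filename with a UTC datetime suffix,
    e.g. registry_score_20240410T160000.parquet."""
    when = when or datetime.now(timezone.utc)
    return f"{prefix}_{when.strftime('%Y%m%dT%H%M%S')}.parquet"

def prune_old_dumps(directory, prefix="registry_score", keep=7):
    """Delete all but the newest `keep` dumps; the timestamp suffix makes
    lexicographic order match chronological order."""
    files = sorted(Path(directory).glob(f"{prefix}_*.parquet"))
    for old in (files[:-keep] if keep else files):
        old.unlink()
```

Writing each export to a fresh timestamped file (rather than updating one file in place) also avoids readers seeing a half-written dump.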

ravenac95 avatar Apr 10 '24 16:04 ravenac95

@erichfi We had a good conversation with @nutrina @ccerv1 and @ravenac95 today. We aligned on the following:

  • We'll export the data in Parquet file format
  • Export to OSO's S3
  • Daily export is good to start

Can we do this as part of the infra sprint before May 10th?

@NadjibBenlaldj will likely need to pull an interim file so the OSO team has a static data set for their data scientists to start working with in the meantime.

Jkd-eth avatar Apr 16 '24 17:04 Jkd-eth

Waiting for credentials to upload data.

Some preliminary work is done (see the linked branch; we were attempting to test authentication for uploading data, but we need credentials for that). This code can be used to test the generated Parquet export:

import pyarrow.parquet as pq
import pandas as pd

# Path to your Parquet file
file_path = "registry_score.parquet"

# Load the Parquet file
table = pq.read_table(file_path)

# Convert to Pandas DataFrame
df = table.to_pandas()

# Print the DataFrame
print(df)

nutrina avatar Apr 23 '24 12:04 nutrina

Waiting for feedback on authentication mechanism

lucianHymer avatar May 02 '24 22:05 lucianHymer

@lucianHymer I was late in the game getting the creds sent over. I sent them via DM to @nutrina, but I can also send them to you if that helps.

ravenac95 avatar May 03 '24 07:05 ravenac95

Waiting for further clarification on how to authenticate.

nutrina avatar May 06 '24 07:05 nutrina