pdr-backend
Fix Issue1076 - Scope down queries, Fix drop, Improve DRY, Stabilize e2e ETL
Fixes #1076
bronze_predictions and other tables are starting + ending as we would expect them to at the end of the first run...
I had to...:
- [x] Fix drop logic #1079
...I have tested it across multiple runs and other variations, and it remained stable...
Changes proposed in this PR:
- [x] Removed slots and subscriptions queries from GQLDF
- [x] Removed bronze_slots query from ETL
- [x] Updated tests to verify expected tables
- [x] Run pipeline e2e
- [x] Verify that expected tables and results are being yielded
- [x] Try a variety of drops/updates/validate/describes, and verify that the pipeline is running reliably, without generating gaps or dupes
- [x] Fixed "temp_tables" util functionality inside ETL that wasn't correctly implemented. This bug would potentially keep temp tables hanging around, although they should be cleaning up on their own.
- [x] Implemented Calina's feedback to reduce DRY violations WRT GQLDF + ETL configuration
After playing around for a while and feeling confident, I ran the validation pipeline and noticed errors.
Because I was testing deleting CSV files from my hard drive and then rescuing the lake... I somehow managed to get duplicate records in the lake. Knowing everything should be rebuilt quickly from CSV data, I dropped everything in DuckDB and ran lake update raw + lake update etl to rebuild everything in a few seconds.
[Tangent - Improving describe vs. validate]
This is bad because I ran describe a dozen times... but didn't notice the problem until I ran validate... I really want to see the results of both in one place on the CLI, but can't.
[Issue - CSV Deletion / Raw Table Duplication - Checks in save_to_storage against DuckDB raw tables]
The reason the problem above happened is that GQLDF uses save_to_storage, which doesn't do any checks against the DuckDB table when doing the insert. This is because it assumes that the DuckDB table will be in the same state as the CSV... clean, empty.
So, it's currently the user's responsibility: if they delete the CSV, they also need to trim the DuckDB table (lake) with the drop commands that are currently there.
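To make that failure mode concrete, here's a minimal sketch (table and column names are hypothetical, not pdr-backend's actual schema) of what a blind append does when the same range gets re-fetched:

```python
import duckdb

# Hypothetical table/column names, for illustration only.
con = duckdb.connect()  # in-memory stand-in for the lake
con.execute("CREATE TABLE pdr_predictions (id VARCHAR, timestamp BIGINT)")

rows = [("p1", 1_000), ("p2", 2_000)]

# First fetch: rows land in the CSV and are appended to DuckDB.
con.executemany("INSERT INTO pdr_predictions VALUES (?, ?)", rows)

# The CSV gets deleted, the same range is re-fetched and appended again.
# Nothing checks what's already in the DuckDB table, so we get duplicates.
con.executemany("INSERT INTO pdr_predictions VALUES (?, ?)", rows)

print(con.execute("SELECT COUNT(*) FROM pdr_predictions").fetchone())  # (4,)
```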
And, let's compartmentalise the problem...
- The whole data pipeline works on the assumption that users will be iterating and dropping their database tables (DuckDB) to rebuild from scratch using the CSVs, with frequency.
- Right now, we do not want users to delete their CSV, or expose them to managing this.
- The whole data pipeline works on the assumption that they only have to fetch this data once. Read: low touch/velocity.
[CSV Management - Considerations / Potential Solutions]
- Keep track of the table min/max timestamp. Checks are made in GQLDF -> append_to_table -> insert_to_duckdb (check that the row to be inserted is less than the min or greater than the max). This way, we update the CSV but don't insert duplicate records into the lake. (See the sketch after this list.)
- We continue to extend CI commands in such a way, where you can also manage the CSV data easily.
- We continue to extend CI commands in such a way, where you can drop across all CSV/RAW/ETL easily.
- Others...
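A minimal sketch of the first idea above, assuming an insert helper that can see both the incoming rows and the table's current min/max timestamps (helper and table names are hypothetical):

```python
import duckdb

# Hypothetical helper: only insert rows that fall outside the table's
# current [min, max] timestamp range, so a re-fetch can't double-insert.
def insert_outside_existing_range(con, table, rows):
    lo, hi = con.execute(
        f"SELECT MIN(timestamp), MAX(timestamp) FROM {table}"
    ).fetchone()
    if lo is None:  # empty table: insert everything
        new_rows = rows
    else:
        new_rows = [r for r in rows if r[1] < lo or r[1] > hi]
    if new_rows:
        con.executemany(f"INSERT INTO {table} VALUES (?, ?)", new_rows)

con = duckdb.connect()
con.execute("CREATE TABLE pdr_predictions (id VARCHAR, timestamp BIGINT)")
insert_outside_existing_range(con, "pdr_predictions", [("p1", 1_000), ("p2", 2_000)])
insert_outside_existing_range(con, "pdr_predictions", [("p1", 1_000), ("p2", 2_000)])
print(con.execute("SELECT COUNT(*) FROM pdr_predictions").fetchone())  # still (2,)
```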
Even right now, if you do delete some of your CSVs, you can get your tables up and running again in just a few seconds.
Put it in a ticket and handle it later... It's super low on the priority list.
[Fix ETL - Move Temp Table to Live + Drop Temp Tables] There were some bugs/bad assumptions with how the "Temp Table" logic works inside etl.py... I believe I have fixed that now and updated the tests to match.
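For context, the intended temp-table flow is roughly the following (the _temp_ prefix and table names here are illustrative; the real naming lives in etl.py):

```python
import duckdb

con = duckdb.connect()  # in-memory for the sketch
con.execute("CREATE TABLE bronze_pdr_predictions (id VARCHAR, timestamp BIGINT)")
con.execute("CREATE TABLE _temp_bronze_pdr_predictions (id VARCHAR, timestamp BIGINT)")
con.execute("INSERT INTO _temp_bronze_pdr_predictions VALUES ('p1', 1000)")

# 1) Move the freshly built rows from the temp table into the live table,
#    so a failed ETL step never leaves the live bronze table half-written.
con.execute(
    "INSERT INTO bronze_pdr_predictions SELECT * FROM _temp_bronze_pdr_predictions"
)
# 2) Then drop the temp table so it doesn't hang around between runs.
con.execute("DROP TABLE IF EXISTS _temp_bronze_pdr_predictions")
```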
[Table Registry] Due to the new NamedTables, I think we can get rid of the TableRegistry.
I'm getting the following error when running the describe CLI command:
Found the issue and pushed a fix: the etl.raw_table_name property was being accessed, but it doesn't exist. I removed it from the validation function and replaced etl.raw_table_name + etl.bronze_table_names with etl.all_table_names. Also removed that from the ETL tests.
The CLI command description for describing the lake is missing the network parameter:
pdr lake describe PPSS_FILE --HTML -> pdr lake describe PPSS_FILE NETWORK --HTML
I ran the ETL a few times and the number of rows in the pdr_predictions table has always been different from the number of rows in the bronze_pdr_predictions table.
@KatunaNorbert
- Can you share the whole CLI summary table? I'd like to see st_ts and end_ts for both the raw + ETL tables.
- Did you try clearing your ETL and then rebuilding to see if the final numbers are the same?
- Did you try clearing your ETL + raw and then rebuilding to see if the final numbers are the same?
- In ppss.yaml, I'm wondering if you're using an st_ts of "1d ago", or a "relative date natural language term" like that, for the st_ts parameter.
I deleted the lake with all the tables and I'm using: start_d: 1d ago, end_d: now. I'll try to rerun and share the whole data.
@KatunaNorbert I'm pretty sure the reason for the problem is the st_ts of "1d ago", which configures a sliding window for the lake st_ts and end_ts params.
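For illustration, here's a pure-Python sketch (not the pdr-backend parser) of why a relative st_ts like "1d ago" produces a sliding window: it resolves to a different absolute start on every run, so the raw fetch and the ETL build cover slightly different ranges.

```python
import time
from datetime import datetime, timedelta, timezone

def resolve_st_ts(relative="1d ago"):
    # Naive stand-in for a natural-language date parser.
    assert relative == "1d ago"
    return int((datetime.now(timezone.utc) - timedelta(days=1)).timestamp() * 1000)

run1 = resolve_st_ts()  # e.g. when the raw fetch starts
time.sleep(1)
run2 = resolve_st_ts()  # e.g. when the ETL build starts a moment later

# The window has slid forward, so the two runs don't cover the same rows,
# and raw vs. bronze row counts can legitimately differ.
print(run2 - run1)  # roughly 1000 ms
```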
Yes, that's it. I retested with a fixed date and it's looking good.
> Found the issue and pushed a fix: the etl.raw_table_name property was being accessed, but it doesn't exist. I removed it from the validation function and replaced etl.raw_table_name + etl.bronze_table_names with etl.all_table_names. Also removed that from the ETL tests.
It's because raw_table_names is a property of gql_data_factory... Removing this is not the way to fix the test.
I have fixed it here. https://github.com/oceanprotocol/pdr-backend/pull/1077/commits/5493905edb661056040b4419ab9290dab851478d
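For anyone following the thread, the all_table_names idea is just a combined view over the raw and bronze table-name lists. A hypothetical sketch (attribute placement is simplified; as noted above, raw_table_names actually lives on gql_data_factory, and the real fix is in the linked commit):

```python
# Hypothetical sketch only -- names and placement are illustrative.
class ETL:
    def __init__(self, raw_table_names, bronze_table_names):
        self.raw_table_names = list(raw_table_names)
        self.bronze_table_names = list(bronze_table_names)

    @property
    def all_table_names(self):
        # Validation can iterate one combined list instead of two attributes.
        return self.raw_table_names + self.bronze_table_names

etl = ETL(["pdr_predictions"], ["bronze_pdr_predictions"])
assert etl.all_table_names == ["pdr_predictions", "bronze_pdr_predictions"]
```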
I have noticed that the subscription table was only half implemented in the fixture/coverage, and am removing it from the ETL fixture so that we can implement it properly when subscriptions are enabled.
https://github.com/oceanprotocol/pdr-backend/issues/1085
This should happen after duckdb PR first-candidate.
I'm simply removing subscriptions from the ETL fixture for now as part of fixing the tests.
lgtm, but I think we shouldn't remove the slots process from the GQL
We're removing everything non-critical from the core-flows and adding the complexity back in, bit-by-bit.