New "refresh" mode and "dev_mode"
Description
Still needs some cleanup and testing.
- `full_refresh` arg replaced with `dev_mode`
  - Deprecation warning when `full_refresh=True`
- Add `refresh: "full" | "replace"` arg to pipeline
  - `full` drops all tables, resource states, and source state keys
  - `replace` drops tables/state only for the resources selected to extract
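A minimal sketch of the deprecation shim described above. This is an illustrative stand-in, not dlt's actual implementation; the function body and return value are assumptions:

```python
import warnings

def pipeline(pipeline_name, full_refresh=None, dev_mode=False, refresh=None):
    """Illustrative stand-in for dlt.pipeline's argument handling."""
    if full_refresh is not None:
        # Old flag still works, but warns and maps onto the new dev_mode flag
        warnings.warn(
            "'full_refresh' is deprecated, use 'dev_mode' instead",
            DeprecationWarning,
        )
        dev_mode = full_refresh
    if refresh not in (None, "full", "replace"):
        raise ValueError(f"unknown refresh mode: {refresh!r}")
    return {"name": pipeline_name, "dev_mode": dev_mode, "refresh": refresh}
```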
The new refresh mode introduces a new file to the load package: `dropped_tables.json`, which is created during the extract step.
That was the simplest way I could think of to drop tables at the appropriate stages: the local schema and state are updated in extract, and destination tables get dropped in the load step.
Otherwise we'd have to start poking at destination storage before extract, which feels wrong.
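To make the mechanism concrete, here is a sketch of how `dropped_tables.json` could be written in extract and read back in load. The file layout and field names here are assumptions for illustration, not the actual load-package format:

```python
import json
from pathlib import Path

def write_dropped_tables(package_dir: str, tables: list) -> None:
    # Extract step: record which tables the drop command selected
    path = Path(package_dir) / "dropped_tables.json"
    path.write_text(json.dumps(tables))

def read_dropped_tables(package_dir: str) -> list:
    # Load step: read the list back so destination tables can be dropped
    path = Path(package_dir) / "dropped_tables.json"
    if not path.exists():
        return []  # nothing was dropped for this package
    return json.loads(path.read_text())
```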
Refresh works as follows:
- Before extract -> use the drop command to wipe all/selected resource state and tables (local state only; destination tables are not deleted, and normalize/load doesn't run as part of drop)
- Add `dropped_tables.json` to the load package with the tables found by the drop command
- In the load stage -> read `dropped_tables.json` and drop those tables from the destination schema (non-SQL destinations fall back to truncate on dataset init)
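The load-stage half of that flow could be sketched like this, using in-memory SQLite as a stand-in SQL destination (the real destination client API differs; this only illustrates the drop step):

```python
import sqlite3

def drop_tables_in_destination(conn: sqlite3.Connection, dropped: list) -> None:
    # Drop each table recorded in dropped_tables.json. Tables that never
    # made it to the destination are skipped via IF EXISTS.
    with conn:  # run all drops in one transaction
        for table in dropped:
            conn.execute(f'DROP TABLE IF EXISTS "{table}"')
```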
Related Issues
- Resolves #588
@steinitzu this looks pretty cool now. in the meantime we refactored extract code so please merge the current changes. sorry for that!
@steinitzu do you need a review here? PR is still draft
OK! now tests are almost in order! still to do:
1. I have a strong opinion to rename our modes to `drop_sources`, `drop_resources` and `drop_data`. WDYT? This corresponds more closely to what actually happens.
2. Is it much work to add `refresh` to the `run` and `extract` methods of `Pipeline`?
3. We need to improve docstrings and docs.
4. Let's remove all schemas for a given name from the destination. The old flag was not that bad IMO - it leaves all details to the implementer.
5. A simple test for local filesystem - it really makes sense when I look at how many users use that destination.
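A sketch of the proposed mode semantics as a simple lookup, just to pin down what each renamed mode would drop. This is my reading of the discussion, not dlt's internal representation, and the per-mode details are assumptions:

```python
# Assumed semantics of the three proposed refresh modes
REFRESH_MODES = {
    # drop all tables plus all resource and source state
    "drop_sources": {"tables": "all", "resource_state": "all", "source_state": "all"},
    # drop only tables/state for the resources selected to extract
    "drop_resources": {"tables": "selected", "resource_state": "selected", "source_state": "kept"},
    # keep schema, truncate data and wipe incremental resource state
    "drop_data": {"tables": "truncated", "resource_state": "selected", "source_state": "kept"},
}
```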
- Yes. I think that naming is much better. I changed it.
- Added this and tests for each. Nice to have this.
- I updated both a bit. Let me know if anything missing.
- Doing this again now, but I renamed the `replace_schema` argument of `drop_tables` to `delete_schema` and changed the behaviour. This is simpler for initializing the client: no need to do the dance with swapping out the client schema, we just drop tables and delete the schema history in one transaction, then "update stored schema" runs as normal afterwards.
- We don't support `drop_tables` in filesystem. Should we add it? I guess it would basically be the same as truncate (maybe delete the folders too?). But I made a test for `refresh="drop_data"`.
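The "drop tables and delete the schema history in one transaction" behaviour could look roughly like this against a SQLite stand-in. The `_dlt_version` column layout here is an assumption for illustration:

```python
import sqlite3

def drop_tables_and_schema(conn, tables, schema_name, delete_schema=True):
    # Single transaction: drop destination tables and, optionally, wipe the
    # stored schema history so "update stored schema" re-creates it afterwards.
    with conn:
        for t in tables:
            conn.execute(f'DROP TABLE IF EXISTS "{t}"')
        if delete_schema:
            conn.execute(
                "DELETE FROM _dlt_version WHERE schema_name = ?", (schema_name,)
            )
```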
@sh-rp thanks for review. the part where I asked you to look deeper is how we pass and write the package state and if the pipeline state is handled correctly. I do not see any updates on that. so I assume this is mostly right?
@steinitzu plenty of tests are failing. my take is that you should try to run common test locally so we have less turnover on CI (if you do not do that already)
@rudolfix got it. I don't always run them but should for sure before pushing. Usually I try to run the tests I think are relevant; in this case a lot more were relevant than I thought.
Common + load postgres/duck/local filesystem tests are passing locally now
The state stuff looks good to me too. The way it is loaded to the destination has not been changed here yet; that is for an upcoming PR (correct me if I'm wrong there @steinitzu). Only the new fields have been added, and the state is committed after extraction, so I don't see a problem, or any migration steps we need to take care of.