Allow Direct INSERT in Final Table
This would only apply to snapshot mode, or incremental mode with only update_key (no primary_key).
See https://github.com/slingdata-io/sling-cli/discussions/33#discussioncomment-7380683 for more details.
Changes needed in WriteToDb:
- Add environment variable to capture setting. Something like
SLING_ALLOW_DIRECT_INSERT=TRUE. - Add error handling, if mode is not
snapshotorincrementalmode with onlyupdate_key.- Add config method to determine if direct insert is allowed:
DirectInsertAllowed()
- Add config method to determine if direct insert is allowed:
- Ensure
PreSQLis executed before direct insert. Currently, is executed after successful insert into temp table - Clean up and reorg
WriteToDba bit for easier read
Note about current implementation:
The temporary table is to ensure data quality, such as column typing (sling profiles every value and determines the appropriate column type), check summing, as well as matching row count. Inserting into the final table directly, we cannot check quality at this level. Also, what if the operation is cancelled? So yes, sling handles failure gracefully and drops any temporary file/table if something goes wrong, not affecting the final table, and errors.
Hi, I'd like to work on that. Do you see any new dependencies with other open issues?
Hey @temminks, thanks for volunteering. As for your question, no, there aren't any pending dependencies that would block this.
This one is a bit complex, as it involves breaking into pieces the WriteToDB function. It's one of the more involved pieces of the Sling codebase, since there are specific sequences of transactions that occur in order (depending on the mode), such as:
- drop temp table if exists
- create temp table
- insert into temp table
- validate temp table
- validate columns names
- execute pre-sql
- create final table if not exists
- insert into final table
- execute post-sql
- drop temp table
For this new feature, we'd need to modify the sequence depending on whether SLING_ALLOW_DIRECT_INSERT is enabled, to just do:
- validate columns names
- execute pre-sql
- create final table if not exists
- insert into final table
- execute post-sql
So I figure it's a good opportunity to break up/clean up WriteToDB into a few small functions (as it is a bit of a mess imho). The small functions could be used/re-used in a neater way.
Do you have any thoughts on how you'd approach this?
Hi @flarco, I was looking into that function the other day as I was a bit surprise that I have to grant CREATE privileges on a schema to the user that write a table with Sling, instead of just using TEMPORARY (both in PostgreSQL). This is probably to make sure that the implementation works across databases where TEMPORARY is not a privilege.
I actually tried to play around with pre-sql to create a temporary table there and pass that to table_tmp but this did not work as Sling was not satisfied with the temporary table not having a schema.
One thing I'm not sure of right now is how you would read the environment variable: do you want to do this inside the function, just using os.Getenv() or should this be part of the Config struct?
As for an approach, the steps
- drop temp table if exists
- create temp table
- insert into temp table
- validate temp table
- AddCleanupTask
appear to be one logical unit: These might become smaller functions themselves but in the end WriteToDB should call a function createTempTable() if SLING_ALLOW_DIRECT_INSERT=TRUE. I'll have to see where it makes sense to extract smaller functions to make the code more readable or simply to be able to test a step.
The step validate columns names consists of adjustments of column types and adding columns. That adding part could be tricky: you add new columns based on the temporary table. We should be able to use the results from the data sampling, though, so this must not be part of createTempTable().
Executing pre-sql doesn't have to be changed, I think. This is just running the statement and does not depend on the temporary table.
I think I'll have to start playing around a bit before I see how this changes the actual part where you write to the target table.
+1 on this as we have a restriction wherein we cannot create tables in the target database. Table creation / modification is managed via CICD for deployment.
Would value a method to avoid creating a temp table and perform a direct insert/truncate insert.
I recently started using Sling for some one-off migrations and creating a temp table and then doing a INSERT INTO does increase the time it takes. Was moving around 1-2 billion rows. In some cases the initial load is faster than the final move from temp_table to final_table because of increased IO (read & write).
This tool works great otherwise, awesome work.
is anyone working on this? if not I will start this issue.
@yokofly thanks, not working on this atm. Feel free to work on it. The WriteToDB would need to carefully be broken into pieces. Let me know if you have any questions.
Added in #410
Would be possible to have direct insert configurable per stream in a replication file? Or let streams with unsupported modes default to temporary table implementation when using SLING_ALLOW_DIRECT_INSERT?
Second option sounds good. 👍
Will put a warning, mentioning it's falling to using temp table (for incremental/backfill with PK)