sling-cli icon indicating copy to clipboard operation
sling-cli copied to clipboard

Allow Direct INSERT in Final Table

Open flarco opened this issue 2 years ago • 7 comments

This would only apply to snapshot mode, or incremental mode with only update_key (no primary_key).

See https://github.com/slingdata-io/sling-cli/discussions/33#discussioncomment-7380683 for more details.

Changes needed in WriteToDb:

  • Add environment variable to capture setting. Something like SLING_ALLOW_DIRECT_INSERT=TRUE.
  • Add error handling, if mode is not snapshot or incremental mode with only update_key.
    • Add config method to determine if direct insert is allowed: DirectInsertAllowed()
  • Ensure PreSQL is executed before direct insert. Currently, is executed after successful insert into temp table
  • Clean up and reorg WriteToDb a bit for easier read

Note about current implementation:

The temporary table is to ensure data quality, such as column typing (sling profiles every value and determines the appropriate column type), check summing, as well as matching row count. Inserting into the final table directly, we cannot check quality at this level. Also, what if the operation is cancelled? So yes, sling handles failure gracefully and drops any temporary file/table if something goes wrong, not affecting the final table, and errors.

flarco avatar Oct 25 '23 12:10 flarco

Hi, I'd like to work on that. Do you see any new dependencies with other open issues?

temminks avatar Jan 30 '24 10:01 temminks

Hey @temminks, thanks for volunteering. As for your question, no, there aren't any pending dependencies that would block this.

This one is a bit complex, as it involves breaking into pieces the WriteToDB function. It's one of the more involved pieces of the Sling codebase, since there are specific sequences of transactions that occur in order (depending on the mode), such as:

  • drop temp table if exists
  • create temp table
  • insert into temp table
  • validate temp table
  • validate columns names
  • execute pre-sql
  • create final table if not exists
  • insert into final table
  • execute post-sql
  • drop temp table

For this new feature, we'd need to modify the sequence depending on whether SLING_ALLOW_DIRECT_INSERT is enabled, to just do:

  • validate columns names
  • execute pre-sql
  • create final table if not exists
  • insert into final table
  • execute post-sql

So I figure it's a good opportunity to break up/clean up WriteToDB into a few small functions (as it is a bit of a mess imho). The small functions could be used/re-used in a neater way.

Do you have any thoughts on how you'd approach this?

flarco avatar Jan 30 '24 21:01 flarco

Hi @flarco, I was looking into that function the other day as I was a bit surprise that I have to grant CREATE privileges on a schema to the user that write a table with Sling, instead of just using TEMPORARY (both in PostgreSQL). This is probably to make sure that the implementation works across databases where TEMPORARY is not a privilege.

I actually tried to play around with pre-sql to create a temporary table there and pass that to table_tmp but this did not work as Sling was not satisfied with the temporary table not having a schema.

One thing I'm not sure of right now is how you would read the environment variable: do you want to do this inside the function, just using os.Getenv() or should this be part of the Config struct?

As for an approach, the steps

  • drop temp table if exists
  • create temp table
  • insert into temp table
  • validate temp table
  • AddCleanupTask

appear to be one logical unit: These might become smaller functions themselves but in the end WriteToDB should call a function createTempTable() if SLING_ALLOW_DIRECT_INSERT=TRUE. I'll have to see where it makes sense to extract smaller functions to make the code more readable or simply to be able to test a step.

The step validate columns names consists of adjustments of column types and adding columns. That adding part could be tricky: you add new columns based on the temporary table. We should be able to use the results from the data sampling, though, so this must not be part of createTempTable().

Executing pre-sql doesn't have to be changed, I think. This is just running the statement and does not depend on the temporary table.

I think I'll have to start playing around a bit before I see how this changes the actual part where you write to the target table.

temminks avatar Feb 08 '24 12:02 temminks

+1 on this as we have a restriction wherein we cannot create tables in the target database. Table creation / modification is managed via CICD for deployment.

Would value a method to avoid creating a temp table and perform a direct insert/truncate insert.

gu-xie avatar Feb 22 '24 20:02 gu-xie

I recently started using Sling for some one-off migrations and creating a temp table and then doing a INSERT INTO does increase the time it takes. Was moving around 1-2 billion rows. In some cases the initial load is faster than the final move from temp_table to final_table because of increased IO (read & write).

This tool works great otherwise, awesome work.

ScreenShot 2024-08-15 at 13 37 39

deviarte avatar Aug 15 '24 17:08 deviarte

is anyone working on this? if not I will start this issue.

yokofly avatar Sep 29 '24 01:09 yokofly

@yokofly thanks, not working on this atm. Feel free to work on it. The WriteToDB would need to carefully be broken into pieces. Let me know if you have any questions.

flarco avatar Sep 29 '24 10:09 flarco

Added in #410

flarco avatar Oct 24 '24 02:10 flarco

Would be possible to have direct insert configurable per stream in a replication file? Or let streams with unsupported modes default to temporary table implementation when using SLING_ALLOW_DIRECT_INSERT?

guaripete-github avatar Oct 30 '24 19:10 guaripete-github

Second option sounds good. 👍 Will put a warning, mentioning it's falling to using temp table (for incremental/backfill with PK)

flarco avatar Oct 30 '24 21:10 flarco

Done

flarco avatar Oct 30 '24 21:10 flarco