target-redshift icon indicating copy to clipboard operation
target-redshift copied to clipboard

Allow adding TRUNCATECOLUMNS option to Redshift COPY

Open stefankeidel opened this issue 5 years ago • 13 comments
trafficstars

This adds a configuration parameter (defaulting to False) which triggers the TRUNCATECOLUMNS option in every Redshift COPY statement sent by the target.

The use case for us is the combination with tap-intercom where some of the content can exceed 64k, but the content for those few records/fields where that happens can be safely ignored. I couldn't find another way to truncate the content before sending to Redshift.

It might be useful to at some point add a more flexible way to include other options as well, but this should work for now.

stefankeidel avatar May 13 '20 10:05 stefankeidel

@stefankeidel this seems really straightforward and practical. I think it'd make sense to have a test which creates an (insanely) large record, uses this option, and then makes sure we read a truncated version out of Redshift since unlike some of our other configuration options, this is pretty straightforward to actually test for.

Curious if @awm33 has any opinions/thoughts here?

AlexanderMann avatar May 14 '20 12:05 AlexanderMann

Good idea! Added a test that does roughly that. Lmk if that works

stefankeidel avatar May 15 '20 09:05 stefankeidel

Test looks good. If we can get @awm33 to weigh in here, I think this is good to merge. Really nice work @stefankeidel!

AlexanderMann avatar May 15 '20 13:05 AlexanderMann

@AlexanderMann @stefankeidel I'm wondering if we should create a subobject for Redshift COPY options to group them?

@stefankeidel Did you try unselecting ("selected": false") the column/field in the tap catalog? Or do you actually need the column for analysis?

awm33 avatar May 16 '20 21:05 awm33

@awm33 that seems like a reasonable thing to do. I think a good enhancement for all of our config would be grouping all of the various things. Like, for psycopg2 we can make the connection object just a 1:1 mapping in a sub-object etc.

You're suggesting doing it here so folks don't have a bunch of work to do in the future?

AlexanderMann avatar May 17 '20 21:05 AlexanderMann

@stefankeidel Did you try unselecting ("selected": false") the column/field in the tap catalog? Or do you actually need the column for analysis?

Yeah, this is for tap-intercom. We don't care about every individual message being imported in full but would still like to have most of them :)

Regarding a subgrouping: Makes sense! Wdyt about something like this?

{
	"redshift_host": "...",
	"redshift_port": 5439,
	"redshift_database": "...",
	"redshift_username": "...",
	"redshift_password": "...",
	"redshift_schema": "test_schema",
	"default_column_length": 1000,
	"max_batch_rows": 16000000,
	"max_buffer_size": 30064771072,
	"target_s3": {
		"aws_access_key_id": "...",
		"aws_secret_access_key": "....",
		"bucket": "...",
		"key_prefix": "singer-io/"
	},
	"redshift_copy_options": {
		"truncate_columns": true
	}
}

stefankeidel avatar May 18 '20 05:05 stefankeidel

If we're going this route, I'd prefer nested values ie: redshift: { copy: ...

Also, I'm wondering if we want to simply make this an array of strings which get passed right through to the COPY statement, that way you can override anything without needing another PR.

AlexanderMann avatar May 18 '20 15:05 AlexanderMann

If we're going this route, I'd prefer nested values ie: redshift: { copy: ...

Hmm, we already have prefixed redshift_ values for the connection params. Would you want to move those in there as well and break existing configs? Either way it should be consistent I think.

Also, I'm wondering if we want to simply make this an array of strings which get passed right through to the COPY statement, that way you can override anything without needing another PR.

I like this idea! Not sure if we should do some verification or if we can just assume people that are using such an option know what they're doing?

stefankeidel avatar May 19 '20 05:05 stefankeidel

I implemented it using the prefix redshift_* for now, but I'm happy to change the config format obviously. I just thought this is most consistent with the existing redshift_host etc.

stefankeidel avatar May 19 '20 11:05 stefankeidel

@stefankeidel @AlexanderMann I kind of regret us prefixing everything with redshift :). Other targets have COPY commands (postgres and snowflake) and snowflake has a TRUNCATECOLUMNS options too.

I propose something that we can use with other targets as well, since they offer something similar:

{
	"redshift_host": "...",
	"redshift_port": 5439,
	"redshift_database": "...",
	"redshift_username": "...",
	"redshift_password": "...",
	"redshift_schema": "test_schema",
	"default_column_length": 1000,
	"max_batch_rows": 16000000,
	"max_buffer_size": 30064771072,
	"target_s3": {
		"aws_access_key_id": "...",
		"aws_secret_access_key": "....",
		"bucket": "...",
		"key_prefix": "singer-io/"
	},
	"copy_options": {
		"truncate_columns": true
	}
}

awm33 avatar May 19 '20 17:05 awm33

@awm33 I like that format, but what do you think about the copy-options-as-array-of-strings idea posted by @AlexanderMann above and implemented in this latest revision? Do we want keywords for every single option we want to support or just assume users know what they're doing?

stefankeidel avatar May 20 '20 06:05 stefankeidel

@awm33 on that note...should this really be something we put into target-postgres then and have the other targets inherit? Is this something we can assume is a SQLBase type configuration?

AlexanderMann avatar May 20 '20 12:05 AlexanderMann

Is this ready to be merged + released? Happen to be looking for exactly this config option =)

Tom-E avatar Jan 05 '21 20:01 Tom-E