target-redshift icon indicating copy to clipboard operation
target-redshift copied to clipboard

Add support for patternProperties

Open pcorbel opened this issue 5 years ago • 7 comments

Issue: If a tap have some fields not explicitly declared, but declared in a

"patternProperties": {
  ".+": {}
}

block, they won't be loaded into Redshift. While, when using another target (like target-csv), the fields are available.

How to reproduce:

  • data.jsonl.txt ( a file with a schema and an example record)

  • issues.csv.txt A CSV generated by the following command cat data.jsonl | target-csv

Version: Python 3.7.3 target-csv==0.3.0 target-redshift==0.0.7

Documentation: The link to the target-csv flatten function

pcorbel avatar Jun 24 '19 08:06 pcorbel

@pcorbel so I took a look at things in tap-csv and the way I think it actually works is that it simply takes the first record for a stream, and then uses that records keys for the headers and subsequent keys for all future records.

Since it's not doing any batching, it doesn't have any ability to add new keys as they come up.

So for instance, if you have:

{"patternProperties": {
  "a.+": {type: int}
}}

{a: 123}
{a_1: 456}
{a_2: 789}
...

You'll get:

a
123
NULL
NULL

All records will pass validation, but you'll still effectively lose the data in a_1 and a_2.

Is this the same as your experience and is this how you were expecting the support herein to work?

AlexanderMann avatar Jun 24 '19 13:06 AlexanderMann

@AlexanderMann With the taps I work with, all keys are always represented, so I'll always have

{a: 123, a_1: None, a_2: None}
{a: None, a_1: 456, a_2: None}
{a: None, a_1: None, a_2: 789}

and I think a lot of API/taps are working that way.

I think it would still be perfectible but it would be an enormous improvement to implement that like the CSV target.

pcorbel avatar Jun 24 '19 14:06 pcorbel

@AlexanderMann Hello Alexander, could you give me some pointers to where to begin to implement it target-csv way?

pcorbel avatar Jul 23 '19 14:07 pcorbel

Hey @pcorbel. So there's an issue over in https://github.com/datamill-co/target-postgres/issues/129 which deals with a similar issue you're describing here.

I'm thinking that your option of using target-csv and simply taking the fields off of the first record is a good alternative option.

Since this repo depends on target-postgres pretty heavily, whatever solution we come up with for target-redshift will most likely effect target-postgres.

(also, sorry about the tardy reply, vacation and then wrapping up a project with a client)

AlexanderMann avatar Jul 31 '19 15:07 AlexanderMann

Also, @pcorbel forgot to ask, but what taps are you using which are leveraging patternProperties?

AlexanderMann avatar Jul 31 '19 15:07 AlexanderMann

Also, @pcorbel forgot to ask, but what taps are you using which are leveraging patternProperties?

AlexanderMann avatar Jul 31 '19 16:07 AlexanderMann

@AlexanderMann I was using tap-jira and most of the fields in an issue was discarded because they were in a patternProperties field

pcorbel avatar Aug 05 '19 08:08 pcorbel