target-redshift
target-redshift copied to clipboard
Add support for patternProperties
Issue: If a tap have some fields not explicitly declared, but declared in a
"patternProperties": {
".+": {}
}
block, they won't be loaded into Redshift. While, when using another target (like target-csv), the fields are available.
How to reproduce:
-
data.jsonl.txt ( a file with a schema and an example record)
-
issues.csv.txt A CSV generated by the following command
cat data.jsonl | target-csv
Version: Python 3.7.3 target-csv==0.3.0 target-redshift==0.0.7
Documentation: The link to the target-csv flatten function
@pcorbel so I took a look at things in tap-csv
and the way I think it actually works is that it simply takes the first record for a stream, and then uses that records keys for the headers and subsequent keys for all future records.
Since it's not doing any batching, it doesn't have any ability to add new keys as they come up.
So for instance, if you have:
{"patternProperties": {
"a.+": {type: int}
}}
{a: 123}
{a_1: 456}
{a_2: 789}
...
You'll get:
a
123
NULL
NULL
All records will pass validation, but you'll still effectively lose the data in a_1
and a_2
.
Is this the same as your experience and is this how you were expecting the support herein to work?
@AlexanderMann With the taps I work with, all keys are always represented, so I'll always have
{a: 123, a_1: None, a_2: None}
{a: None, a_1: 456, a_2: None}
{a: None, a_1: None, a_2: 789}
and I think a lot of API/taps are working that way.
I think it would still be perfectible but it would be an enormous improvement to implement that like the CSV target.
@AlexanderMann Hello Alexander, could you give me some pointers to where to begin to implement it target-csv way?
Hey @pcorbel. So there's an issue over in https://github.com/datamill-co/target-postgres/issues/129 which deals with a similar issue you're describing here.
I'm thinking that your option of using target-csv
and simply taking the fields off of the first record is a good alternative option.
Since this repo depends on target-postgres
pretty heavily, whatever solution we come up with for target-redshift
will most likely effect target-postgres
.
(also, sorry about the tardy reply, vacation and then wrapping up a project with a client)
Also, @pcorbel forgot to ask, but what taps are you using which are leveraging patternProperties
?
Also, @pcorbel forgot to ask, but what taps are you using which are leveraging patternProperties
?
@AlexanderMann I was using tap-jira and most of the fields in an issue
was discarded because they were in a patternProperties
field