sdk
bug: Properties mismatch warning for child streams that do not declare parent context keys in their schemas
Singer SDK Version
0.45.4
Is this a regression?
- [ ] Yes
Python Version
NA
Bug scope
Taps (catalog, state, etc.)
Operating System
No response
Description
In tap-f1, the drivers stream is configured as a child stream of the seasons stream. The season context key is only referenced in the request to /{season}/drivers.json.
When running the tap, I observe this warning:
2025-04-02 01:53:15,799 | WARNING | tap-f1.drivers | Properties ('season',) were present in the 'drivers' stream but not found in catalog schema. Ignoring.
I could suppress this by adding a season property to the drivers stream schema, but I do not want it emitted at all, to prevent downstream consumption (e.g. to stop a database target from creating a season column in the drivers table).
Is this a valid warning here? Should these kinds of warnings be shown for child streams with parent properties present in context at all?
Link to Slack/Linen
No response
@ReubenFrankel it might be worth setting the state_partitioning_keys = [] attribute in that stream, so that those parent context fields are not added to the record if the stream's state does not really depend on any of the parent keys:
https://github.com/meltano/sdk/blob/868b5667cb1620e88efc561a74bc696d64e7f5b3/singer_sdk/streams/core.py#L1089-L1093
state_partitioning_keys = []
I haven't really dug into state partitioning - from https://sdk.meltano.com/en/v0.45.4/classes/singer_sdk.Stream.html#singer_sdk.Stream.state_partitioning_keys:
If an empty list is set ([]), state will be held in one bookmark per stream.
How does this relate to my issue? I'm happy to make the fix, but want to make sure I understand how it works and why the SDK shows a warning by default.
> How does this relate to my issue?
It does seem unrelated according to that, so it's probably a gap in the docs.
The reason context keys are added to the record is that they're also used to generate the state context. This is needed only for cases where you want a different bookmark for each partition, i.e. for each different parent record.
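To make the link between the state context and the warning concrete, here is a minimal pure-Python sketch. It is a simplified model of the behaviour described in this thread, not the SDK's actual code: the function names, the `driverId` field and the sample values are assumptions made for illustration.

```python
from __future__ import annotations


def state_partition_context(context, state_partitioning_keys):
    """Pick the part of the parent context that is used to partition state."""
    if context is None:
        return None
    if state_partitioning_keys is None:
        # Default: every parent context key partitions the state.
        return dict(context)
    return {k: v for k, v in context.items() if k in state_partitioning_keys}


def emit_record(record, context, state_partitioning_keys=None):
    """Return a copy of the record with the state partition keys merged in."""
    partition = state_partition_context(context, state_partitioning_keys)
    merged = dict(record)
    for key, value in (partition or {}).items():
        merged.setdefault(key, value)  # do not overwrite existing fields
    return merged


def unexpected_properties(record, schema_properties):
    """Properties present in the record but missing from the stream schema."""
    return tuple(k for k in record if k not in schema_properties)


context = {"season": "2025"}       # passed down from the parent stream
driver = {"driverId": "hamilton"}  # hypothetical child record
schema_properties = {"driverId"}   # 'season' is deliberately not declared

default_record = emit_record(driver, context)
unpartitioned_record = emit_record(driver, context, state_partitioning_keys=[])

print(unexpected_properties(default_record, schema_properties))        # ('season',)
print(unexpected_properties(unpartitioned_record, schema_properties))  # ()
```

In this model the default setting leaks `season` into the record, which is what triggers the mismatch warning, while an empty key list leaves the record untouched.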
Your state looks like this for the default settings:
{
  "bookmarks": {
    "seasons": {
      "replication_key": "season",
      "replication_key_value": "2025"
    },
    "drivers": {
      "partitions": [
        {
          "context": {
            "season": "2025"
          }
        }
      ]
    }
  }
}
but if you set start_date to 2024-01-01, the tap extracts the drivers data twice (once for each season) and the state looks like this:
{
  "bookmarks": {
    "seasons": {
      "replication_key": "season",
      "replication_key_value": "2025"
    },
    "drivers": {
      "partitions": [
        {
          "context": {
            "season": "2024"
          }
        },
        {
          "context": {
            "season": "2025"
          }
        }
      ]
    }
  }
}
This is done so that replication can pick up child streams on the next run (with full-table replication in this case), even if the parent stream emits no records.
Setting state_partitioning_keys = [] skips the warning and results in this state:
{
  "bookmarks": {
    "seasons": {
      "replication_key": "season",
      "replication_key_value": "2025"
    },
    "drivers": {}
  }
}
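The three state shapes shown in this comment can be reproduced with a small sketch. Again, this is only a model under assumptions: `drivers_bookmark` is a hypothetical helper, not an SDK function, and it only mirrors the bookmark layout from the examples here.

```python
from __future__ import annotations


def drivers_bookmark(parent_contexts, state_partitioning_keys=None):
    """Build the child stream's bookmark from the parent contexts it was synced with."""
    bookmark = {}
    for context in parent_contexts:
        if state_partitioning_keys is None:
            # Default: partition state on every parent context key.
            partition = dict(context)
        else:
            partition = {
                k: v for k, v in context.items() if k in state_partitioning_keys
            }
        if not partition:
            # No partition keys: state is held once for the whole stream.
            continue
        entry = {"context": partition}
        partitions = bookmark.setdefault("partitions", [])
        if entry not in partitions:
            partitions.append(entry)
    return bookmark


one_season = [{"season": "2025"}]
two_seasons = [{"season": "2024"}, {"season": "2025"}]

default_single = drivers_bookmark(one_season)
default_double = drivers_bookmark(two_seasons)
unpartitioned = drivers_bookmark(two_seasons, state_partitioning_keys=[])
```

With the default setting the bookmark grows by one partition per parent record, whereas the empty key list collapses it to a single stream-level entry.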
Does that make sense?
That said, if I were a consumer of this data, I'd still want the season data attached to a driver to answer, for example, a question like "How many seasons has this driver been in?".
Simply adding season to the drivers table is probably overkill and leads to a ton of duplicate data, since driver attributes would have no reason to change from season to season, so I'd probably use dbt after the load to create a many-to-many mapping table (e.g. driver_seasons) and dedup the driver data.
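As a rough illustration of that modelling idea (in plain Python rather than dbt), the sketch below splits season-tagged driver rows into a deduplicated driver list and a driver_seasons mapping. The `driverId` and `familyName` fields and the sample rows are assumptions made for the example.

```python
from __future__ import annotations


def split_driver_seasons(rows):
    """Split season-tagged driver rows into unique drivers and (driver, season) pairs."""
    drivers = {}
    driver_seasons = set()
    for row in rows:
        # Driver attributes are everything except the season tag.
        attrs = {k: v for k, v in row.items() if k != "season"}
        drivers[row["driverId"]] = attrs
        driver_seasons.add((row["driverId"], row["season"]))
    return list(drivers.values()), sorted(driver_seasons)


rows = [
    {"driverId": "hamilton", "familyName": "Hamilton", "season": "2024"},
    {"driverId": "hamilton", "familyName": "Hamilton", "season": "2025"},
    {"driverId": "antonelli", "familyName": "Antonelli", "season": "2025"},
]

drivers, driver_seasons = split_driver_seasons(rows)

# "How many seasons has this driver been in?"
seasons_for_hamilton = sum(1 for driver_id, _ in driver_seasons if driver_id == "hamilton")
```

The mapping table keeps the season relationship queryable without repeating the driver attributes for every season.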