Feature: Ability for taps to use batch messaging only on select streams
Feature scope
Taps (catalog, state, stream maps, etc.)
Description
This feature would allow more control over when batching is used.
I'll list a few approaches for discussion, but everyone please feel free to propose alternatives.
Explicit allow or disallow by stream name
One approach is simply to embed an exclude_streams or include_streams config under the existing batch_config setting.
{
  batch_config: {
    include_streams: ["stream_1", "stream_2"]
    // ..
  }
}
{
  batch_config: {
    exclude_streams: ["stream_1", "stream_2"]
    // ..
  }
}
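As a rough illustration of how a tap might resolve these two settings for a single stream, here is a minimal sketch. The function name stream_uses_batch is hypothetical and not part of the SDK. Treating include_streams and exclude_streams as mutually exclusive, and batching every stream when neither is set, are assumptions made for the example, not decisions made in this proposal.

```python
from typing import Any, Mapping, Optional


def stream_uses_batch(
    stream_name: str,
    batch_config: Optional[Mapping[str, Any]],
) -> bool:
    """Hypothetical helper: decide whether one stream should emit BATCH messages."""
    if not batch_config:
        # No batch_config at all: fall back to RECORD messages.
        return False

    include = batch_config.get("include_streams")
    exclude = batch_config.get("exclude_streams")

    # Assumption: setting both lists is ambiguous, so reject it outright.
    if include is not None and exclude is not None:
        raise ValueError("Set include_streams or exclude_streams, not both.")

    if include is not None:
        return stream_name in include
    if exclude is not None:
        return stream_name not in exclude

    # Assumption: with neither list present, batch_config applies to all streams.
    return True
```

With the first config above, stream_uses_batch("stream_1", ...) would return True and any unlisted stream would return False. With the second config the results are inverted.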
Use rowcount estimates if available
If rowcount estimates are available in the catalog, or if the tap is able to count records before the stream starts, perhaps the user could specify that batch should be used only when the number of records exceeds a threshold such as 10K or 100K.
Note:
- It's hard to say how the tap should act towards this input if it cannot determine a rowcount before the sync starts.
- For those that can, the better home for rowcount-based decisions would perhaps be the 'auto' setting discussed below, wherein the user saying 'auto' gives the tap permission to use its own discretion, including the use of rowcounts when available.
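To make the open question in the first note concrete, a threshold check could look something like the sketch below. The names batch_by_rowcount, min_rows, and fallback are hypothetical. The fallback argument is one possible answer to "what if the rowcount is unknown", offered here only for discussion.

```python
from typing import Optional


def batch_by_rowcount(
    estimated_rows: Optional[int],
    min_rows: int = 10_000,
    fallback: bool = False,
) -> bool:
    """Hypothetical helper: use BATCH only above a user-supplied row threshold."""
    if estimated_rows is None:
        # The tap could not determine a rowcount before the sync started.
        # Assumption: defer to an explicit fallback rather than guessing.
        return fallback
    return estimated_rows > min_rows
```

A tap that cannot count rows would always hit the fallback branch, which is part of why the threshold may fit better under 'auto' than as a standalone setting.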
Allow 'auto' behavior as determined by the developer
We could also introduce a behavior where the user can give the tap permission to select its best recommendation for whether to use RECORD messages or BATCH messages for each stream.
For example, a developer could make 'auto' behavior equal to 'false' for all parent-child streams, and default to 'true' for all streams that it knows to be proportionally higher volume.
{
  batch_config: {
    batching_enabled: "auto",
    exclude_streams: ["stream_1", "stream_2"],
    // ..
  }
}
In the long run, the 'auto' behavior probably gives the best overall performance and the best user experience. It's always going to be important that users can disable batch messages, since some targets won't support them. But sometimes only the tap will know what sync strategy is fastest, and the decision might be dynamic, based on a combination of runtime conditions (how many rows are available) and static metadata about the streams (this is inherently a "big" table, or this is a child stream).
In these cases, it's not critical that the tap make the 'best possible' decision, but only that it not make an 'obviously wrong' one. Meaning, the distinction of using batch when there are greater than 1000 records, versus 5000 records, is not as important as not using batch when there are only 3 records, and always using batch when there are 25 million.
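One way a developer might combine these signals is sketched below. Everything here is hypothetical: the StreamHints class, the resolve_batching function, and the order of precedence (user exclusions first, then an explicit true/false, then the tap's own heuristics) are assumptions for illustration, not an SDK API.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Union


@dataclass
class StreamHints:
    """Hypothetical static and runtime metadata a tap may know about a stream."""

    name: str
    is_child: bool = False
    known_high_volume: bool = False
    estimated_rows: Optional[int] = None  # None when the tap cannot count rows


def resolve_batching(
    hints: StreamHints,
    batching_enabled: Union[bool, str],
    exclude_streams: Sequence[str] = (),
    min_rows: int = 1_000,
) -> bool:
    """Return True if the stream should emit BATCH messages."""
    # The user's explicit exclusions always win.
    if hints.name in exclude_streams:
        return False
    # An explicit true/false from the user is honored as-is.
    if isinstance(batching_enabled, bool):
        return batching_enabled
    if batching_enabled != "auto":
        raise ValueError(f"Unsupported batching_enabled value: {batching_enabled!r}")
    # 'auto': the developer's discretion starts here.
    if hints.is_child:
        return False
    if hints.estimated_rows is not None:
        # The exact threshold matters less than avoiding obviously wrong choices.
        return hints.estimated_rows > min_rows
    # No rowcount available: fall back to static knowledge about the stream.
    return hints.known_high_volume
```

Under these assumptions a stream with 3 estimated rows stays on RECORD messages and one with 25 million rows uses BATCH, regardless of whether min_rows is 1000 or 5000.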