[Feature]: default `max_records` config for testing
Feature scope
Taps (catalog, state, stream maps, etc.)
Description
It would be really nice to have something like a max_records config that I can set for my taps in test/CI environments that limits the amount of records to sync (e.g. I want around 100 records). We usually recommend limiting the tap's start date to something like yesterday but that could still replicate an excessive amount of data in some cases. If you only need 100 records but syncing from yesterday returns 1M then its a bad pattern. Ideally I could just configure my tap to pull 100 record and once it knows its exceeded that then it just exits. I think the test feature does something similar, exiting after it sees its first successful record.
My example from Squared CI:
Set a DEFAULT_START_DATE for my extractor start dates in my CICD meltano environment to yesterday's date. For cloudwatch, google analytics, slack, etc. I'm pulling way more data than I need to do simple CI tests. It would allow me to save on time and cost.
@pnadolny13 - My only question here is if it should be 100 max records total or 100 max records per stream. I think there's a good case that per-stream is a better limiter, just to ensure your tests have representative data. Wdyt?
@aaronsteers yep youre right, per stream is really what I'm looking for. I want a limited amount of data in each stream.
@pnadolny13 - Thanks for confirming. Here's the internal mechanism I mentioned in office hours, which could be used to deliver this:
https://github.com/meltano/sdk/blob/12679ff066e8f917031151f32b7859021a8017e1/singer_sdk/streams/core.py#L87
@edgarrmondragon I think this may be the right mechanism for users specifying the _MAX_RECORDS_LIMIT, that we can also leverage in the new testing framework. This is only 1 of 2 parts of it though:
- Users specify a
max_records_limitinconfig.json, optionally at the stream level (as @aaronsteers suggests).This config sets theStream._MAX_RECORDS_LIMITunder the hood. Related issue on standardising per-stream config here. - Developers utilise
_MAX_RECORDS_LIMITas part of implementingStream.get_records(), to only fetch the specified number of records. For SQL this likely means usingLIMIT xas part of compiled sql, as we do in the defaultSQLStreamhere. For API's it likely means ending pagination early, once the specified number of records have been retrieved (cutting down on requests, which is handy for APIs with rate limits). Issue to implement the latter in the SDK here.
WDYT?
- This config sets the
Stream._MAX_RECORDS_LIMITunder the hood
@kgpayne I'm not a fan of changing this attribute at runtime, it feels clunky and is prone to error as you've seen (if you re-instantiate the streams, the value is reset to the class default).
I'd prefer at-runtime control in the form of a parameter in Tap.sync_all(max_records=None) that cascades through the streams. It could still be set from config:
tap.sync_all(max_records=tap.config.get("max_records"))
wdyt?
@edgarrmondragon this makes sense to me 👍
tap.sync_all(max_records=tap.config.get("max_records"))
By "This config sets the Stream._MAX_RECORDS_LIMIT under the hood" I just meant the config would result in a value being available on the stream as an attribute, for use in get_records() 🙂 I am totally open to both improving how _MAX_RECORDS_LIMIT is handled/applied and potentially renaming it or placing it behind a public get_record_limit() method to smooth the user experience in get_records().
I was looking for a feature like this.
We try to limit the ingestion in our CI pipeline, we currently do this by altering the start_date (Like Pat mentioned). However, this doesn't work for taps that are not incremental.
It would be great to be able to configurate this. Either by using the tap config or using an environment variable:
environments:
- name: ci
env:
MAX_RECORDS: 1000
@aaronsteers I think this is higher priority than first assumed, as it also blocks adopting the new standard tests framework with SQL taps 😬
Theres more discussion around alternative implementations and details of this in https://github.com/meltano/sdk/issues/1366.
Another user requested this: https://meltano.slack.com/archives/CMN8HELB0/p1685438218728059
I would also love something like this for our test CI pipelines. Any updates to this issue?
Is this one already done? https://github.com/meltano/sdk/blob/0d54ebed751b7db34935d31ca4b405a31b774f09/singer_sdk/streams/sql.py#L202
Made a seperate issue here for adding this configuration for users to control themselves here https://github.com/meltano/sdk/issues/2312