sdk icon indicating copy to clipboard operation
sdk copied to clipboard

[Feature]: default `max_records` config for testing

Open pnadolny13 opened this issue 2 years ago • 13 comments

Feature scope

Taps (catalog, state, stream maps, etc.)

Description

It would be really nice to have something like a max_records config that I can set for my taps in test/CI environments that limits the amount of records to sync (e.g. I want around 100 records). We usually recommend limiting the tap's start date to something like yesterday but that could still replicate an excessive amount of data in some cases. If you only need 100 records but syncing from yesterday returns 1M then its a bad pattern. Ideally I could just configure my tap to pull 100 record and once it knows its exceeded that then it just exits. I think the test feature does something similar, exiting after it sees its first successful record.

My example from Squared CI:

Set a DEFAULT_START_DATE for my extractor start dates in my CICD meltano environment to yesterday's date. For cloudwatch, google analytics, slack, etc. I'm pulling way more data than I need to do simple CI tests. It would allow me to save on time and cost.

pnadolny13 avatar Jan 18 '23 16:01 pnadolny13

@pnadolny13 - My only question here is if it should be 100 max records total or 100 max records per stream. I think there's a good case that per-stream is a better limiter, just to ensure your tests have representative data. Wdyt?

aaronsteers avatar Jan 18 '23 17:01 aaronsteers

@aaronsteers yep youre right, per stream is really what I'm looking for. I want a limited amount of data in each stream.

pnadolny13 avatar Jan 19 '23 14:01 pnadolny13

@pnadolny13 - Thanks for confirming. Here's the internal mechanism I mentioned in office hours, which could be used to deliver this:

https://github.com/meltano/sdk/blob/12679ff066e8f917031151f32b7859021a8017e1/singer_sdk/streams/core.py#L87

aaronsteers avatar Jan 21 '23 04:01 aaronsteers

@edgarrmondragon I think this may be the right mechanism for users specifying the _MAX_RECORDS_LIMIT, that we can also leverage in the new testing framework. This is only 1 of 2 parts of it though:

  1. Users specify a max_records_limit in config.json, optionally at the stream level (as @aaronsteers suggests).This config sets the Stream._MAX_RECORDS_LIMIT under the hood. Related issue on standardising per-stream config here.
  2. Developers utilise _MAX_RECORDS_LIMIT as part of implementing Stream.get_records(), to only fetch the specified number of records. For SQL this likely means using LIMIT x as part of compiled sql, as we do in the default SQLStream here. For API's it likely means ending pagination early, once the specified number of records have been retrieved (cutting down on requests, which is handy for APIs with rate limits). Issue to implement the latter in the SDK here.

WDYT?

kgpayne avatar Jan 25 '23 14:01 kgpayne

  1. This config sets the Stream._MAX_RECORDS_LIMIT under the hood

@kgpayne I'm not a fan of changing this attribute at runtime, it feels clunky and is prone to error as you've seen (if you re-instantiate the streams, the value is reset to the class default).

I'd prefer at-runtime control in the form of a parameter in Tap.sync_all(max_records=None) that cascades through the streams. It could still be set from config:

tap.sync_all(max_records=tap.config.get("max_records"))

wdyt?

edgarrmondragon avatar Jan 25 '23 17:01 edgarrmondragon

@edgarrmondragon this makes sense to me 👍

tap.sync_all(max_records=tap.config.get("max_records"))

By "This config sets the Stream._MAX_RECORDS_LIMIT under the hood" I just meant the config would result in a value being available on the stream as an attribute, for use in get_records() 🙂 I am totally open to both improving how _MAX_RECORDS_LIMIT is handled/applied and potentially renaming it or placing it behind a public get_record_limit() method to smooth the user experience in get_records().

kgpayne avatar Jan 25 '23 17:01 kgpayne

I was looking for a feature like this.

We try to limit the ingestion in our CI pipeline, we currently do this by altering the start_date (Like Pat mentioned). However, this doesn't work for taps that are not incremental.

It would be great to be able to configurate this. Either by using the tap config or using an environment variable:

environments:
- name: ci
  env:
    MAX_RECORDS: 1000

JulesHuisman avatar Jan 25 '23 21:01 JulesHuisman

@aaronsteers I think this is higher priority than first assumed, as it also blocks adopting the new standard tests framework with SQL taps 😬

kgpayne avatar Jan 26 '23 15:01 kgpayne

Theres more discussion around alternative implementations and details of this in https://github.com/meltano/sdk/issues/1366.

pnadolny13 avatar Feb 07 '23 14:02 pnadolny13

Another user requested this: https://meltano.slack.com/archives/CMN8HELB0/p1685438218728059

edgarrmondragon avatar May 30 '23 15:05 edgarrmondragon

I would also love something like this for our test CI pipelines. Any updates to this issue?

yaozhang09 avatar Aug 18 '23 19:08 yaozhang09

Is this one already done? https://github.com/meltano/sdk/blob/0d54ebed751b7db34935d31ca4b405a31b774f09/singer_sdk/streams/sql.py#L202

Made a seperate issue here for adding this configuration for users to control themselves here https://github.com/meltano/sdk/issues/2312

visch avatar Mar 13 '24 17:03 visch