low-code: Yield records from generators instead of keeping them in in-memory lists

Open girarda opened this issue 11 months ago • 2 comments

What

Improve memory usage by yielding records from generators instead of returning lists of objects

This PR addresses a part of https://github.com/airbytehq/airbyte-internal-issues/issues/6554

How

  1. Update the record selector, extractor, and filter interfaces to work on generators instead of lists of records
  2. Update the paginator interface to only use the number of records read and the last record instead of the full list of records read
  3. Update the simple retriever to tie everything together
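
The three steps above can be sketched roughly as follows. This is a minimal illustration and not the CDK's actual code: `extract_records`, `filter_records`, `OffsetPaginator`, `read_all`, and the fake `fetch_page` are hypothetical names, and counting records before filtering is an assumption made here to keep the offset arithmetic simple.

```python
from typing import Any, Callable, Iterable, Iterator, Mapping, Optional

Record = Mapping[str, Any]


def extract_records(body: Mapping[str, Any], field: str) -> Iterator[Record]:
    """Yield records one at a time instead of returning a list."""
    for record in body.get(field, []):
        yield record


def filter_records(
    records: Iterable[Record], predicate: Callable[[Record], bool]
) -> Iterator[Record]:
    """Lazily drop records that do not match the predicate."""
    for record in records:
        if predicate(record):
            yield record


class OffsetPaginator:
    """Hypothetical paginator: needs only the page size and the last record."""

    def __init__(self, page_size: int) -> None:
        self.page_size = page_size
        self.offset = 0

    def next_page_token(
        self, last_page_size: int, last_record: Optional[Record]
    ) -> Optional[int]:
        # A short page means there is nothing left to fetch.
        if last_page_size < self.page_size:
            return None
        self.offset += last_page_size
        return self.offset


def read_all(
    fetch_page: Callable[[int], Mapping[str, Any]],
    paginator: OffsetPaginator,
    predicate: Callable[[Record], bool],
) -> Iterator[Record]:
    """Tie extractor, filter, and paginator together without buffering a page."""
    token: Optional[int] = 0
    while token is not None:
        page_size = 0
        last_record: Optional[Record] = None

        def counted(records: Iterable[Record]) -> Iterator[Record]:
            # Track only the count and the last record, never the full list.
            nonlocal page_size, last_record
            for record in records:
                page_size += 1
                last_record = record
                yield record

        extracted = counted(extract_records(fetch_page(token), "data"))
        yield from filter_records(extracted, predicate)
        token = paginator.next_page_token(page_size, last_record)


# Fake API with five records, served two per page (illustration only).
ALL_RECORDS = [{"id": i} for i in range(5)]


def fetch_page(offset: int) -> Mapping[str, Any]:
    return {"data": ALL_RECORDS[offset:offset + 2]}


stream = read_all(fetch_page, OffsetPaginator(page_size=2), lambda r: r["id"] % 2 == 0)
records = list(stream)
```

Because every stage is a generator, each record flows through extraction and filtering before the next one is touched, and the paginator never sees more than a count and a single record.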

Reading order

  1. airbyte-cdk/python/airbyte_cdk/sources/declarative/extractors/record_extractor.py
  2. airbyte-cdk/python/airbyte_cdk/sources/declarative/extractors/http_selector.py
  3. airbyte-cdk/python/airbyte_cdk/sources/declarative/extractors/dpath_extractor.py
  4. airbyte-cdk/python/airbyte_cdk/sources/declarative/extractors/record_selector.py
  5. airbyte-cdk/python/airbyte_cdk/sources/declarative/extractors/record_filter.py
  6. airbyte-cdk/python/airbyte_cdk/sources/declarative/requesters/paginators/paginator.py
  7. airbyte-cdk/python/airbyte_cdk/sources/declarative/requesters/paginators/no_pagination.py
  8. airbyte-cdk/python/airbyte_cdk/sources/declarative/requesters/paginators/default_paginator.py
  9. airbyte-cdk/python/airbyte_cdk/sources/declarative/requesters/paginators/strategies/pagination_strategy.py
  10. airbyte-cdk/python/airbyte_cdk/sources/declarative/requesters/paginators/strategies/offset_increment.py
  11. airbyte-cdk/python/airbyte_cdk/sources/declarative/requesters/paginators/strategies/page_increment.py
  12. airbyte-cdk/python/airbyte_cdk/sources/declarative/requesters/paginators/strategies/cursor_pagination_strategy.py
  13. airbyte-cdk/python/airbyte_cdk/sources/declarative/requesters/paginators/strategies/stop_condition.py
  14. airbyte-cdk/python/airbyte_cdk/sources/declarative/retrievers/simple_retriever.py

girarda avatar Mar 22 '24 21:03 girarda

Confirmed that this change, combined with some changes on the Iterable side, helps reduce the connector's memory usage.

The large spikes show attempts with all the fixes, and the last, much lower one shows the memory usage with this change, with generators in the custom components, and with smaller time windows. [Screenshot 2024-05-09 at 9 31 16 PM]

The underlying issue is that Iterable returns gigantic responses (I've seen one of ~4GB). I think fixing this would require streaming the responses, which is out of scope for this PR.
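
For illustration only, streaming a response could look something like the sketch below. It assumes the body is newline-delimited JSON, which this thread does not confirm. `iter_ndjson_records` is a hypothetical helper, and an in-memory `io.BytesIO` stands in for the HTTP body.

```python
import io
import json
from typing import Any, BinaryIO, Iterator, Mapping


def iter_ndjson_records(stream: BinaryIO) -> Iterator[Mapping[str, Any]]:
    """Parse one JSON document per line without loading the whole body."""
    for raw_line in stream:
        line = raw_line.strip()
        if line:  # skip blank lines between documents
            yield json.loads(line)


# Stand-in for a large HTTP response body (assumption: NDJSON payload).
fake_body = io.BytesIO(b'{"id": 1}\n{"id": 2}\n\n{"id": 3}\n')
streamed_ids = [record["id"] for record in iter_ndjson_records(fake_body)]
```

With the `requests` library, passing `stream=True` and iterating over `Response.iter_lines()` plays a similar role for real HTTP responses.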

girarda avatar May 10 '24 04:05 girarda