cartography
Refactor AWS Inspector: reduce memory and add parallel ingestion
Summary
The AWS Inspector module has a memory issue. Pagination does not return until all findings for an account have been retrieved, and every finding is appended to a single list. Since an account can have millions of findings, this can exhaust the memory of the job running the sync. This PR refactors the module to:
- Accept a maximum number of pages in `aws_paginate`
- Let the caller of `aws_paginate` get the `nextToken` so it can resume results retrieval
- Break Inspector ingestion into batches of 100 pages (each page provides 100 results, hence 10,000 findings per batch)
- Add async processing to ingest multiple accounts in parallel
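The capped, resumable pagination described in the list above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `fetch_batch`, `iter_batches`, and `make_fake_api` are hypothetical names, and the fake API stands in for a real boto3 client so the example runs on its own.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional, Tuple

Page = Dict[str, Any]
FetchPage = Callable[[Optional[str]], Page]


def fetch_batch(
    fetch_page: FetchPage,
    object_name: str,
    max_pages: int,
    next_token: Optional[str] = None,
) -> Tuple[List[Dict], Optional[str]]:
    """Collect at most max_pages pages; return the items and a resume token."""
    items: List[Dict] = []
    for _ in range(max_pages):
        page = fetch_page(next_token)
        items.extend(page.get(object_name, []))
        next_token = page.get("nextToken")
        if next_token is None:  # the API has no more pages
            break
    return items, next_token


def iter_batches(
    fetch_page: FetchPage, object_name: str, max_pages: int
) -> Iterator[List[Dict]]:
    """Yield one bounded batch at a time so the caller can load it and drop it."""
    next_token: Optional[str] = None
    while True:
        items, next_token = fetch_batch(fetch_page, object_name, max_pages, next_token)
        if items:
            yield items
        if next_token is None:
            return


def make_fake_api(total: int, page_size: int) -> FetchPage:
    """Hypothetical stand-in for a paginated AWS call (assumption, not boto3)."""

    def fetch_page(token: Optional[str]) -> Page:
        start = int(token) if token else 0
        end = min(start + page_size, total)
        page: Page = {"findings": [{"id": n} for n in range(start, end)]}
        if end < total:
            page["nextToken"] = str(end)
        return page

    return fetch_page


# 25 findings, 3 per page, 2 pages per batch
sizes = [len(b) for b in iter_batches(make_fake_api(25, 3), "findings", max_pages=2)]
print(sizes)  # [6, 6, 6, 6, 1]
```

Because each batch is yielded and then released, peak memory depends on `max_pages` times the page size rather than on the total number of findings in the account.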
With these changes, a run processed ~110,000 findings with a peak of ~600 MiB:
INFO:cartography.intel.aws.inspector:Getting a batch of findings for account XXXXXXX in region us-west-2
INFO:cartography.util:fetching page number 100
WARNING:cartography.util:Reached max batch size of 100 pages
Filename: /Users/heryxpc/src/cartography/cartography/util.py
Line # Mem usage Increment Occurrences Line Contents
=============================================================
211 455.4 MiB 455.4 MiB 1 @profile
212 def aws_paginate(
213 client: boto3.client,
214 method_name: str,
215 object_name: str,
216 max_pages: int = DEFAULT_MAX_PAGES,
217 **kwargs: Any,
218 ) -> tuple[List[Dict], Optional[str]]:
219 """
220 Helper method for boilerplate boto3 pagination
221 The **kwargs will be forwarded to the paginator
222 """
223 455.4 MiB 0.0 MiB 1 paginator = client.get_paginator(method_name)
224 455.4 MiB 0.0 MiB 1 items = []
225 455.4 MiB 0.0 MiB 1 i = 0
226 455.4 MiB 0.0 MiB 1 next_token = None
227 588.7 MiB 133.3 MiB 100 for i, page in enumerate(paginator.paginate(**kwargs), start=1):
228 588.7 MiB 0.0 MiB 100 if i % 100 == 0:
229 588.7 MiB 0.0 MiB 1 logger.info(f"fetching page number {i}")
230 588.7 MiB 0.0 MiB 100 if object_name in page:
231 588.7 MiB 0.0 MiB 100 items.extend(page[object_name])
232 588.7 MiB 0.0 MiB 100 next_token = page.get("nextToken")
233 else:
234 logger.warning(
235 f"""aws_paginate: Key "{object_name}" is not present, check if this is a typo.
236 If not, then the AWS datatype somehow does not have this key.""",
237 )
238 588.7 MiB 0.0 MiB 100 if i >= max_pages:
239 588.7 MiB 0.0 MiB 1 logger.warning(f"Reached max batch size of {max_pages} pages")
240 588.7 MiB 0.0 MiB 1 break
241 588.7 MiB 0.0 MiB 1 return items, next_token
INFO:cartography.intel.aws.inspector:Loading 10000 findings from account 277829364062
INFO:cartography.intel.aws.inspector:Loading 254 packages
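The trace above covers a single batch for one account. Running several accounts in parallel could look something like the sketch below. It assumes a blocking per-account function and uses only the standard library's `asyncio`; `sync_account` and `sync_accounts` are hypothetical names, and the placeholder body does no real ingestion.

```python
import asyncio
from typing import Dict, List


def sync_account(account_id: str) -> int:
    """Hypothetical blocking per-account sync; returns a fake findings count."""
    return len(account_id)  # placeholder for fetching and loading batches


async def sync_accounts(
    account_ids: List[str], max_concurrency: int = 2
) -> Dict[str, int]:
    """Run the blocking syncs in worker threads, a few accounts at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(account_id: str) -> int:
        async with semaphore:
            # asyncio.to_thread requires Python 3.9+
            return await asyncio.to_thread(sync_account, account_id)

    # gather returns results in the same order as the input awaitables
    counts = await asyncio.gather(*(run_one(a) for a in account_ids))
    return dict(zip(account_ids, counts))


results = asyncio.run(sync_accounts(["111", "2222", "33333"]))
print(results)
```

Capping concurrency matters here because each in-flight account may hold up to one full batch of findings in memory, so the cap also bounds peak usage.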
Related issues or links
- https://github.com/cartography-cncf/cartography/issues/1025
- https://github.com/cartography-cncf/cartography/issues/988
Checklist
Provide proof that this works (this makes reviews move faster). Please perform one or more of the following:
- [x] Update/add unit or integration tests.
- [x] Include a screenshot showing what the graph looked like before and after your changes.
- [x] Include console log trace showing what happened before and after your changes.
If you are changing a node or relationship:
If you are implementing a new intel module:
- [ ] Use the NodeSchema data model.