Refactor AWS Inspector: reduce memory and add parallel ingestion

Open heryxpc opened this issue 6 months ago • 0 comments

Summary

The AWS Inspector module has a memory issue. Pagination does not return until all findings for an account have been retrieved, and all of those objects are added to a single list. Since there can be millions of findings, this can consume all the memory of the job running the sync. This PR refactors the module to:

Provide a maximum number of pages to aws_paginate.
Allow the caller of aws_paginate to get the nextToken so it can resume retrieving results.
Break Inspector ingestion into batches of 100 pages (each page provides 100 results, hence 10,000 findings per batch).
Add async processing to ingest multiple accounts in parallel.
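
As a rough illustration of the batching idea (not the code in this PR), a caller could drive a bounded fetch function in a loop and hand each batch on before requesting the next. The names `iter_batches`, `FetchBatch`, and `make_fake_fetcher` are hypothetical, and the fake fetcher stands in for a call like aws_paginate that returns `(items, next_token)`; how the real module passes the token back to AWS is not shown here.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional, Tuple

Finding = Dict[str, Any]
# Hypothetical signature: takes a resume token (or None) and returns one
# bounded batch plus the token for the next batch (None when exhausted).
FetchBatch = Callable[[Optional[str]], Tuple[List[Finding], Optional[str]]]


def iter_batches(fetch_batch: FetchBatch) -> Iterator[List[Finding]]:
    """Yield bounded batches one at a time instead of building one big list."""
    next_token: Optional[str] = None
    while True:
        items, next_token = fetch_batch(next_token)
        if items:
            yield items
        if not next_token:
            break


def make_fake_fetcher(total: int, batch_size: int) -> FetchBatch:
    """Stand-in for a paginated AWS call; the token is just the next offset."""
    def fetch(token: Optional[str]) -> Tuple[List[Finding], Optional[str]]:
        start = int(token) if token else 0
        end = min(start + batch_size, total)
        items = [{"findingArn": f"finding-{n}"} for n in range(start, end)]
        return items, (str(end) if end < total else None)
    return fetch


# 25 fake findings fetched in batches of at most 10.
batch_sizes = [len(batch) for batch in iter_batches(make_fake_fetcher(25, 10))]
```

Because each batch is yielded before the next one is fetched, the caller can load it into the graph and drop it, so peak memory depends on the batch size and not on the total number of findings.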

With these changes, a run processed ~110,000 findings using a maximum of ~600 MiB:

INFO:cartography.intel.aws.inspector:Getting a batch of findings for account XXXXXXX in region us-west-2
INFO:cartography.util:fetching page number 100
WARNING:cartography.util:Reached max batch size of 100 pages
Filename: /Users/heryxpc/src/cartography/cartography/util.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
   211    455.4 MiB    455.4 MiB           1   @profile
   212                                         def aws_paginate(
   213                                             client: boto3.client,
   214                                             method_name: str,
   215                                             object_name: str,
   216                                             max_pages: int = DEFAULT_MAX_PAGES,
   217                                             **kwargs: Any,
   218                                         ) -> tuple[List[Dict], Optional[str]]:
   219                                             """
   220                                             Helper method for boilerplate boto3 pagination
   221                                             The **kwargs will be forwarded to the paginator
   222                                             """
   223    455.4 MiB      0.0 MiB           1       paginator = client.get_paginator(method_name)
   224    455.4 MiB      0.0 MiB           1       items = []
   225    455.4 MiB      0.0 MiB           1       i = 0
   226    455.4 MiB      0.0 MiB           1       next_token = None
   227    588.7 MiB    133.3 MiB         100       for i, page in enumerate(paginator.paginate(**kwargs), start=1):
   228    588.7 MiB      0.0 MiB         100           if i % 100 == 0:
   229    588.7 MiB      0.0 MiB           1               logger.info(f"fetching page number {i}")
   230    588.7 MiB      0.0 MiB         100           if object_name in page:
   231    588.7 MiB      0.0 MiB         100               items.extend(page[object_name])
   232    588.7 MiB      0.0 MiB         100               next_token = page.get("nextToken")
   233                                                 else:
   234                                                     logger.warning(
   235                                                         f"""aws_paginate: Key "{object_name}" is not present, check if this is a typo.
   236                                         If not, then the AWS datatype somehow does not have this key.""",
   237                                                     )
   238    588.7 MiB      0.0 MiB         100           if i >= max_pages:
   239    588.7 MiB      0.0 MiB           1               logger.warning(f"Reached max batch size of {max_pages} pages")
   240    588.7 MiB      0.0 MiB           1               break
   241    588.7 MiB      0.0 MiB           1       return items, next_token


INFO:cartography.intel.aws.inspector:Loading 10000 findings from account 277829364062
INFO:cartography.intel.aws.inspector:Loading 254 packages
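
The parallel ingestion of multiple accounts mentioned in the summary might look roughly like the sketch below. It assumes the per-account sync is a blocking function (boto3 calls are synchronous), so it is run in worker threads via `asyncio.to_thread` with a semaphore limiting concurrency. `sync_account`, `sync_accounts`, and `max_concurrency` are hypothetical names for illustration; the PR's actual mechanism is not shown in this description.

```python
import asyncio
from typing import Callable, Dict, Iterable, Tuple


def sync_account(account_id: str) -> int:
    """Hypothetical blocking per-account sync; returns a findings count.

    Placeholder body: a real version would fetch and load batches of findings.
    """
    return len(account_id)


async def sync_accounts(
    account_ids: Iterable[str],
    sync_one: Callable[[str], int],
    max_concurrency: int = 4,
) -> Dict[str, int]:
    """Run sync_one for every account, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(account_id: str) -> Tuple[str, int]:
        async with semaphore:
            # Run the blocking call in a thread so the event loop stays free.
            count = await asyncio.to_thread(sync_one, account_id)
            return account_id, count

    results = await asyncio.gather(*(worker(a) for a in account_ids))
    return dict(results)


# Account IDs here are made-up placeholders.
counts = asyncio.run(sync_accounts(["acct-a", "acct-bb"], sync_account, 2))
```

Bounding concurrency matters here because each in-flight account holds its own batch of findings, so peak memory grows with the number of accounts processed at once.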

Checklist

Provide proof that this works (this makes reviews move faster). Please perform one or more of the following:

[x] Update/add unit or integration tests.
[x] Include a screenshot showing what the graph looked like before and after your changes.
[x] Include console log trace showing what happened before and after your changes.

If you are changing a node or relationship:

[ ] Update the schema and readme.

If you are implementing a new intel module:

[ ] Use the NodeSchema data model.

Jun 26 '25 05:06 heryxpc
