[Feature] Introduce a Cartography Exception hierarchy for better error handling

Open jychp opened this issue 1 month ago • 1 comments

Summary

Cartography should expose a dedicated exception hierarchy to make error handling simpler and more deterministic when the project is used as a library. The objective is not to modify the "fail loudly" behavior, but to give callers a clean way to detect and capture Cartography-specific failures.

Motivation

Consumers embedding Cartography currently have to catch generic Python exceptions or provider-specific errors. This makes integrations fragile, ties callers to internal details, and makes it harder to identify configuration mistakes.

A structured exception hierarchy solves this by enabling clear distinctions between:

Generic Cartography failures (CartographyException)
Database-related errors (CartographyDBException)
Intel module failures (CartographyIntelException)
Authentication failures inside modules (CartographyAuthException)
Misconfiguration inside intel modules (CartographyConfigException)

For CLI usage, errors will still be surfaced as standard error messages. The exception structure primarily benefits library consumers.

Proposed Solution

Introduce the following classes:

class CartographyException(Exception):
    pass

class CartographyDBException(CartographyException):
    pass

class CartographyIntelException(CartographyException):
    pass

class CartographyAuthException(CartographyIntelException):
    pass

class CartographyConfigException(CartographyIntelException):
    pass

Initial implementation steps:

Wrap the main execution flow with a broad try/except.
Re-raise errors using the appropriate Cartography exception types.
Focus first on mapping existing DB and intel-level exceptions.
Gradually refine exception usage inside modules to categorize errors more accurately over time.
CLI behavior remains unchanged: exceptions will still appear as user-friendly error messages.

Alternatives Considered

Keep the current generic exception pattern: rejected because it prevents robust library-level error handling.
Refactor every module upfront to use precise exception types: too large and intrusive for a first iteration.

Nov 23 '25 23:11 jychp

Yeah, I think error handling is a problem, though I'd like to zoom out on what specific things we would like to accomplish with better error handling:

In my mind, the main concern with error handling is with data quality. Specifically: we don't want the sync to keep continuing on to run a cleanup job that deletes all nodes in the graph when it shouldn't because this totally breaks automation that relies on the graph. Other than that, if the sync crashes loudly for any other reason, I'm a lot less concerned.

If it is a transient error, we should retry up to a limit. If it is a permanent error like if the credential is out of date, we should fail loudly to stop the sync from going on to prevent the data quality issue I mentioned above.

Then there are cases where users want the sync to keep going even if the cred doesn't have permissions: e.g. red teamers wanting to see how much their creds can audit.

We should document this but there is a lot of nuance here and it is a decently big project because we would have to catch specific exceptions for the multitude of providers and bucket them into all of these cases above. I'm open to discussion here but given the other priorities of the project - more coverage, more rules - I don't see error handling here as an urgent priority. It's been working fine while we continue this patchy, whack-a-mole approach.

Nov 24 '25 17:11 achantavy