PyAirbyte Feat: CatalogManager refactor, decoupling sql processors, caches, and internal sql backend tables

trafficstars

This PR does a lot of things...

CatalogManager refactor

The role of catalog manager is refactored into the new classes, under the role of two "backends", two "providers", and a "writer" (for state).

StateBackend - Responsible for serializing and deserializing state in the internal SQL tables.
CatalogBackend - Responsible for serializing and deserializing catalog info in the internal SQL tables.
StateProvider - Provides state inputs to anything that needs state - namely to Source objects, which use the state to know where to start their sync.
CatalogProvider - Provides catalog schema info to any method that needs access to catalog metadata.
StateWriter - There are two implementations:
- StdOutStateWriter Performs the role as a destination would - which is to simply print the state message to STDOUT.
- SqlStateWriter - Writes state messages to the internal SQL table, which is how a Cache should behave.

With these class in mind, the "Backend" classes generally are able to create "provider" classes when they are the source of the state or catalog. Also, much of the parsing logic for how to handle catalogs was able to be moved out of the processor and cache classes, and into the CatalogProvider class - for instance, getting the primary keys of a stream or getting a stream's properties.

Refactor CatalogManager to isolate StateManager functions in a new class.
Create abstract base classes of both of the above so we can decouple them from their SQL-based serialization implementations.
Create simple versions of these two classes that simply accept input from upstream.
Allow creation of SQLProcessor objects with State and Catalog artifacts provided explicitly, rather than being read/written to internal SQL tables.
Move classes that we want to move to the CDK into a new _future_cdk module. These would be internal classes anyway, but the goal is to move these abstract implementations upstream so destinations can depend upon them.

Decoupling `SQLProcessor` classes from `SQLBackend` and `Cache` classes.

We want to be able to use SQLProcessor classes in destinations like the Cortext destination, and so this PR decouples SQLProcessor from Cache and Backend classes (previously the CatalogManager). Now, instead of passing a backend to the SQLProcessor, we simply pass a CatalogProvider and StateWriter. When a state provider is not explicitly created, the SQLProcessor class will just create its own StdOutStateWriter class, and will behave like a destination. (See the Cortext sample script for an example that doesn't require a cache.)

One last change was that, since we don't want SQLProcessor to depend on Cache class, we needed a different way to provide the inputs that previously were mapped to an embedded cache property in the SQLProcessor classes. Now, we decouple the functionality of configuration into a new set of SQLConfig classes - these are basically the just the user inputs that would be provided via the Cache, but without the behavioral traits of a full Cache class. SQLConfig objects know their properties and they know how to create SQLAlchemy connections (as well as 3rd-party vendor connections, when applicable), but they are lightweight and can be handed down to the SQLProcessor classes. They can also be imported or implemented by destinations, without the need to create a full cache class.

For backwards compatibility and ease-of-use, the Cache classes inherit from the respective SQLConfig classes, so they accept the same config inputs in their constructor as they did previous to this PR (no breaking change) and they can pass themselves to the SQLProcessor classes as a valid subclass of SQLConfig. Typing prevents the processor from performing cache-related functions on the SQLConfig instances, even though the object may also be a Cache. This ensures that we don't end up with any circular dependencies between the role of the Cache and the SQLProcessor, and it also ensures that destination connectors that use the SQLProcessor class can send the lighter-weight SQLConfig object to the SQLProcessor initializer, and they won't need to create a full Cache object.

Eligible for CDK Re-Use

RecordProcessor
SQLProcessor
CatalogProvider
StateProvider
StateWriter
StdOutStateWriter

May 08 '24 01:05 aaronsteers

Finished taking a pass at the code. The refactoring is easy to follow and makes sense to me. It is def much more intuitive and cleaner to have a separate statemanager and catalogmanager. This would make the cortex processor implementation much cleaner!!

May 08 '24 18:05 bindipankhudi

Tests are passing except one new issue with Windows paths. That will be fixed shortly, but it doesn't change anything in the core code.

May 14 '24 20:05 aaronsteers

/poetry-lock

poetry lock job started... Check job output.

✅ poetry lock applied successfully.

May 15 '24 18:05 aaronsteers

/poetry-lock

poetry lock job started... Check job output.

✅ poetry lock applied successfully.

May 15 '24 19:05 aaronsteers

PyAirbyte PyAirbyte copied to clipboard

Feat: CatalogManager refactor, decoupling sql processors, caches, and internal sql backend tables

CatalogManager refactor

Decoupling SQLProcessor classes from SQLBackend and Cache classes.

Eligible for CDK Re-Use

PyAirbyte
PyAirbyte copied to clipboard

Decoupling `SQLProcessor` classes from `SQLBackend` and `Cache` classes.