PyAirbyte
PyAirbyte copied to clipboard
Feat: CatalogManager refactor, decoupling sql processors, caches, and internal sql backend tables
This PR does a lot of things...
CatalogManager refactor
The role of catalog manager is refactored into the new classes, under the role of two "backends", two "providers", and a "writer" (for state).
-
StateBackend
- Responsible for serializing and deserializing state in the internal SQL tables. -
CatalogBackend
- Responsible for serializing and deserializing catalog info in the internal SQL tables. -
StateProvider
- Provides state inputs to anything that needs state - namely toSource
objects, which use the state to know where to start their sync. -
CatalogProvider
- Provides catalog schema info to any method that needs access to catalog metadata. -
StateWriter
- There are two implementations:-
StdOutStateWriter
Performs the role as a destination would - which is to simply print the state message toSTDOUT
. -
SqlStateWriter
- Writes state messages to the internal SQL table, which is how a Cache should behave.
-
With these class in mind, the "Backend" classes generally are able to create "provider" classes when they are the source of the state or catalog. Also, much of the parsing logic for how to handle catalogs was able to be moved out of the processor and cache classes, and into the CatalogProvider
class - for instance, getting the primary keys of a stream or getting a stream's properties.
- Refactor CatalogManager to isolate StateManager functions in a new class.
- Create abstract base classes of both of the above so we can decouple them from their SQL-based serialization implementations.
- Create simple versions of these two classes that simply accept input from upstream.
- Allow creation of SQLProcessor objects with State and Catalog artifacts provided explicitly, rather than being read/written to internal SQL tables.
- Move classes that we want to move to the CDK into a new
_future_cdk
module. These would be internal classes anyway, but the goal is to move these abstract implementations upstream so destinations can depend upon them.
Decoupling SQLProcessor
classes from SQLBackend
and Cache
classes.
We want to be able to use SQLProcessor
classes in destinations like the Cortext destination, and so this PR decouples SQLProcessor
from Cache
and Backend
classes (previously the CatalogManager
). Now, instead of passing a backend to the SQLProcessor
, we simply pass a CatalogProvider
and StateWriter
. When a state provider is not explicitly created, the SQLProcessor
class will just create its own StdOutStateWriter
class, and will behave like a destination. (See the Cortext sample script for an example that doesn't require a cache.)
One last change was that, since we don't want SQLProcessor
to depend on Cache
class, we needed a different way to provide the inputs that previously were mapped to an embedded cache
property in the SQLProcessor
classes. Now, we decouple the functionality of configuration into a new set of SQLConfig
classes - these are basically the just the user inputs that would be provided via the Cache
, but without the behavioral traits of a full Cache
class. SQLConfig
objects know their properties and they know how to create SQLAlchemy
connections (as well as 3rd-party vendor connections, when applicable), but they are lightweight and can be handed down to the SQLProcessor classes. They can also be imported or implemented by destinations, without the need to create a full cache class.
For backwards compatibility and ease-of-use, the Cache
classes inherit from the respective SQLConfig
classes, so they accept the same config inputs in their constructor as they did previous to this PR (no breaking change) and they can pass themselves to the SQLProcessor
classes as a valid subclass of SQLConfig
. Typing prevents the processor from performing cache-related functions on the SQLConfig
instances, even though the object may also be a Cache
. This ensures that we don't end up with any circular dependencies between the role of the Cache
and the SQLProcessor
, and it also ensures that destination connectors that use the SQLProcessor
class can send the lighter-weight SQLConfig
object to the SQLProcessor
initializer, and they won't need to create a full Cache
object.
Eligible for CDK Re-Use
-
RecordProcessor
-
SQLProcessor
-
CatalogProvider
-
StateProvider
-
StateWriter
-
StdOutStateWriter
Finished taking a pass at the code. The refactoring is easy to follow and makes sense to me. It is def much more intuitive and cleaner to have a separate statemanager and catalogmanager. This would make the cortex processor implementation much cleaner!!
Tests are passing except one new issue with Windows paths. That will be fixed shortly, but it doesn't change anything in the core code.