spark-cassandra-connector
SPARKC-577, round two
Description
How did the Spark Cassandra Connector Work or Not Work Before this Patch
The connector was using internal serializable representations of keyspace/table metadata because the corresponding Java driver classes weren't serializable. This changed in Java driver v4.6.0.
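For illustration, a minimal sketch of that old pattern (the names here are hypothetical; the real connector classes carry far more state): copy what's needed out of the non-serializable driver object into a representation Spark can ship to executors.

```scala
import com.datastax.oss.driver.api.core.metadata.schema.ColumnMetadata

// Hypothetical, simplified mirror of the old pattern: extract the needed
// fields from the (pre-4.6.0, non-serializable) driver object into a rep
// that can be serialized into Spark closures. Case classes are Serializable.
case class SerializableColumnRep(name: String, cqlType: String)

def toSerializableRep(column: ColumnMetadata): SerializableColumnRep =
  SerializableColumnRep(
    name = column.getName.asInternal,
    cqlType = column.getType.asCql(true, true) // includeFrozen, pretty
  )
```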
General Design of the patch
Existing connector metadata types were converted to traits, with the existing impls becoming a "default" implementation, and new trait impls were added based on Java driver metadata types. Worth noting that the existing representations served two distinct functions:
1. Metadata retrieval/access
2. As a definition/descriptor for future table creation
Schema loading functions return an impl based on Java driver metadata types so (1) is the default for most operations. (2) is primarily used in the ColumnMapper code.
Also worth noting: the impl for (1) is largely a direct cut-through to Java driver types and methods, but in some cases that wasn't adequate to implement the trait. As an example, the connector metadata types stored some information (such as clustering column info) at the column level, while the Java metadata types represent it at the table level. To work around this, the new impls (based on Java driver metadata types) act as providers of Java metadata types for their own level and every level above it, up to the keyspace. So DriverColumnDef can provide column metadata for the column it represents as well as table and keyspace metadata for the enclosing structures. Distinct traits identify which classes can provide which Java metadata type(s).
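As a rough illustration of that provider idea (the trait names below are hypothetical; only DriverColumnDef appears in the patch):

```scala
import com.datastax.oss.driver.api.core.metadata.schema.{
  ColumnMetadata, KeyspaceMetadata, TableMetadata
}

// Hypothetical provider traits: each level of the new impls can hand back
// the Java driver metadata for its own level and for every enclosing level.
trait KeyspaceMetadataProvider { def keyspaceMetadata: KeyspaceMetadata }
trait TableMetadataProvider extends KeyspaceMetadataProvider { def tableMetadata: TableMetadata }
trait ColumnMetadataProvider extends TableMetadataProvider { def columnMetadata: ColumnMetadata }

// A DriverColumnDef-style wrapper can then answer table-level questions
// (e.g. clustering info) even though it represents a single column.
class DriverColumnDef(
    val columnMetadata: ColumnMetadata,
    val tableMetadata: TableMetadata,
    val keyspaceMetadata: KeyspaceMetadata
) extends ColumnMetadataProvider {

  // Clustering info lives on TableMetadata in the driver, not on the column,
  // so answering this requires walking up to the table level.
  def isClusteringColumn: Boolean =
    tableMetadata.getClusteringColumns.containsKey(columnMetadata)
}
```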
Fixes: SPARKC-577
How Has This Been Tested?
Still a WIP; this hasn't been tested meaningfully yet.
Checklist:
- [X] I have a ticket in the OSS JIRA
- [ ] I have performed a self-review of my own code
- [ ] Locally all tests pass (make sure tests fail without your patch)
@absurdfarce I'm not sure if this is the way to go. I thought that we want to use Driver types and get rid of SCC types completely. The less code to maintain, the better. Do you think it would be possible to refactor SCC so that it relies solely on Driver types?
There's a bit of a complication there, @jtgrabowski, although I'm not completely sure it's fatal.
The old Table/ColumnDef impl served two use cases in the code base. It represented data retrieved from C* metadata but it also could be created to define tables/columns which should be created by SCC. My original take on this ticket tried to separate those concerns... but it snowballed quickly. That led me to the approach used here of pulling these impls into traits and creating parallel impls.
I suppose we might be able to get there by leaving the Table/ColumnDef impls alone and just changing the blahFromCassandra() calls to return Java driver types, leaving everything else as-is. I believe we have at least one case where TableDef.copy() is used to tweak something we get from C*, so in that case we'll have to construct a TableDef based on the underlying Java driver type... but that's probably doable.
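For instance, a hedged sketch of what building a (much-simplified) TableDef from driver metadata might look like; the real TableDef has many more fields:

```scala
import scala.jdk.CollectionConverters._
import com.datastax.oss.driver.api.core.metadata.schema.TableMetadata

// Hypothetical, heavily simplified stand-in for the real TableDef.
// copy() comes for free on the case class, which is what the
// tweak-and-reuse call sites rely on.
case class TableDef(keyspaceName: String, tableName: String, columnNames: Seq[String])

// Construct the connector-side definition from the driver's metadata type.
def tableDefFrom(meta: TableMetadata): TableDef =
  TableDef(
    keyspaceName = meta.getKeyspace.asInternal,
    tableName    = meta.getName.asInternal,
    columnNames  = meta.getColumns.keySet.asScala.toSeq.map(_.asInternal)
  )
```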
There's a second (smaller) problem as well: ColumnDef contains info that is maintained in ColumnMetadata and some info from TableMetadata. So callers interested in that functionality may need to be changed to get a column and then get the corresponding table. Unfortunately the Java driver doesn't allow that traversal directly; they'll have to get the table name and look it up as a separate op. That's entirely doable, but it does represent a bit of an increase in complexity.
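To make the extra hop concrete, a sketch against the driver 4.x API (the helper name is hypothetical): ColumnMetadata only exposes its parent keyspace/table as identifiers, so reaching TableMetadata takes a separate lookup.

```scala
import java.util.Optional
import com.datastax.oss.driver.api.core.CqlSession
import com.datastax.oss.driver.api.core.metadata.schema.{ColumnMetadata, TableMetadata}

// Given a column, find its table's metadata via a separate keyspace lookup,
// since the driver provides no direct column -> table traversal.
def tableOfColumn(session: CqlSession, column: ColumnMetadata): Option[TableMetadata] = {
  def toOption[T](o: Optional[T]): Option[T] = if (o.isPresent) Some(o.get) else None
  for {
    ks  <- toOption(session.getMetadata.getKeyspace(column.getKeyspace))
    tbl <- toOption(ks.getTable(column.getParent))
  } yield tbl
}
```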
Anyways, if you're okay with the approach suggested above I can try this again and see how feasible that is to implement.
I see. It's far more complicated than we all expected :) What would we gain with the current approach versus the described one? I initially thought we were going for less code to maintain, but now I'm not sure if this whole thing is worth the effort. wdyt?
Yeah, it's definitely more complicated than we expected. :)
It seems to me that the following things are true:
- We won't be able to remove Column/TableDef from the current code base
  - We need these classes (or something very much like them) to describe structures to be created by ColumnMapper
  - There is value in decoupling this function of describing structures to be created from analyzing keyspace/table/column metadata
    - This certainly doesn't require parallel trait impls (like what I've used in this PR)
- In other contexts I've argued for removing intermediaries and exposing the Java driver API more directly... and I see no reason to argue for something different here
  - A further point: IIRC Russ dealt directly with Java driver metadata already in his DSv2 impl
- There are two potential issues that occur to me when considering such an impl
  - There are occasions when we do a TableDef.copy() based on retrieved metadata, so in the new world we'd have to support the ability to create a TableDef from Java driver metadata (discussed already on this PR)
  - There are a few spots in code where metadata is expected to conform to FieldDef/StructDef traits, so we might have to wrap Java driver metadata in these cases (see the sketch after this list)
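On that last point, the wrapping could be as thin as an adapter. A minimal sketch with simplified trait shapes, not the real SCC definitions (the actual FieldDef carries more members such as column refs and type converters):

```scala
import com.datastax.oss.driver.api.core.metadata.schema.ColumnMetadata

// Much-simplified stand-in for the connector's FieldDef trait.
trait FieldDef {
  def columnName: String
}

// Hypothetical adapter satisfying FieldDef by delegating to driver metadata.
case class DriverFieldDef(column: ColumnMetadata) extends FieldDef {
  override def columnName: String = column.getName.asInternal
}
```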
Given all of these factors I think there's still a decent argument for trying the following:
- Changing the *FromCassandra() methods to return Java driver metadata types
- Preserving Table/ColumnDef for use as a means to define table creation in ColumnMapper
Doing so gives us the decoupling between table creation and metadata, and moves us further toward the goal of interacting with Java driver types more directly in the code base.
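For concreteness, a hedged sketch of what one such signature change might look like (the object and method names may not match the code base exactly):

```scala
import com.datastax.oss.driver.api.core.CqlSession
import com.datastax.oss.driver.api.core.metadata.schema.TableMetadata

object Schema {
  // before: returned the connector's own TableDef
  // def tableFromCassandra(session: CqlSession, keyspaceName: String, tableName: String): TableDef

  // after: return the Java driver's metadata type directly
  def tableFromCassandra(session: CqlSession, keyspaceName: String, tableName: String): TableMetadata =
    session.getMetadata
      .getKeyspace(keyspaceName)
      .flatMap(ks => ks.getTable(tableName))
      .orElseThrow(() => new NoSuchElementException(s"$keyspaceName.$tableName not found"))
}
```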
wdyt @jtgrabowski?
Let's give it a try if you think it's feasible. I traced a couple of *FromCassandra invocations and it looks like it would require a bit of work to convert them.