[REFACTOR] type annotation -> CocoIndex type encoding logic in Python SDK should return strong-typed schema class
Some Background
Regarding data types / schemas, there are multiple forms (a rough illustration follows the list):
1. Python native type annotations, e.g. `int`, `dict[str, Any]`, or a specific dataclass. They're used directly in user code as type hints.
2. `AnalyzedTypeInfo`: basically a more structured representation of 1. Used by our Python SDK internally only.
3. Strong-typed schema representation in CocoIndex's type system. These classes mirror the engine's data schema representation. They're exposed to some third-party APIs, e.g. custom targets (custom target connectors can inspect the schema of the data exported to them), and also custom functions / sources in the future.
4. Generic-typed JSON-equivalent values, in types such as `dict[str, Any]` (for a JSON object), `list[dict[str, Any]]` (for a JSON array), `str` (for a JSON string), etc. They can be passed directly from/to the engine in Rust.
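For concreteness, here is a rough, illustrative sketch of the four forms for a plain `str` field. The class names follow this thread (`AnalyzedTypeInfo`, `AnalyzedBasicType`, `BasicValueType`); the exact field names and the JSON shape in form 4 are assumptions, not the SDK's actual API.

```python
from typing import Any

# 1. Python native type annotation, as written in user code:
annotation = str

# 2. SDK-internal structured view of the annotation, roughly:
#    AnalyzedTypeInfo(variant=AnalyzedBasicType(kind="Str"), ...)

# 3. Strong-typed schema mirroring the engine's data schema, roughly:
#    BasicValueType(kind="Str")

# 4. Generic JSON-equivalent value passed to/from the Rust engine,
#    e.g. something like:
engine_repr: dict[str, Any] = {"kind": "Str"}
```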
Task
We have logic to convert Python's native type annotation to engine type. Currently we're doing 1->2->4 (code), because 3 was just introduced recently.
We want to:
- Change the logic of 2->4 to 2->3, i.e. convert `AnalyzedTypeInfo` to the strong-typed schema representation first (see the sketch after this list). This will make our code easier to read and maintain (3 is easier to build than 4, and can leverage mypy type checks, etc.).
- After getting 3, existing callers can simply call the `encode()` method to get 4, so we don't have to expose convenience methods that directly return 4 in the `typing` package.
- Tests in `test_typing.py` should be updated accordingly, to check the output of 3 instead of 4 (3 is more structured than 4, and easier to check).
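A minimal sketch of what the 2 -> 3 -> 4 flow could look like, using names that appear in this thread (`AnalyzedTypeInfo`, `AnalyzedBasicType`, `BasicValueType`). The import path, the exact constructor signatures, and the handling of non-basic variants are assumptions, not the final design.

```python
# Assumed import path; the actual module layout may differ.
from cocoindex.typing import (
    AnalyzedTypeInfo,
    AnalyzedBasicType,
    BasicValueType,
)


def _encode_type(type_info: AnalyzedTypeInfo) -> BasicValueType:
    """Convert AnalyzedTypeInfo (form 2) into the strong-typed schema (form 3)."""
    variant = type_info.variant
    if isinstance(variant, AnalyzedBasicType):
        # Return the strong-typed schema class instead of a raw dict.
        return BasicValueType(kind=variant.kind)
    # Other variants (struct, table, union, ...) would be handled similarly.
    raise ValueError(f"Unsupported type annotation: {type_info.core_type}")


# Callers that still need the generic JSON-equivalent form (form 4)
# call encode() on the returned schema object, e.g.:
# engine_repr = _encode_type(some_type_info).encode()
```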
Can I work on this issue @georgeh0?
@Shivansh-22866 Sure! Assigned! Welcome to contribute!
@georgeh0 would this also be a step toward allowing "registering" custom datatypes? I have classes that are not fully under my control and not compatible with CocoIndex (e.g. Azure DocAI AnalyzeResult which is neither a DataClass nor a Pydantic model, or LlamaIndex ImageNode which is a Pydantic model but with a complex field that CocoIndex does not understand). Currently I'm converting them back and forth between a CocoIndex-compatible representation and the format that the tools need, but this is rather cumbersome and does not scale well. It would be nice if I could provide an instance of some class which gives CocoIndex the required field information so that it can use the custom types like the predefined ones.
@nightscape Thanks a lot for sharing the pain points and thoughts! Yes, we're also aware of the limitations of the type bindings - mostly because we need our data to be serializable, so currently we use a disciplined type system and not every Python object can be used directly (we have constraints similar to a database's).
This feature request doesn't help much with that (it makes the type representation in the CocoIndex engine more easily accessible from Python, but doesn't let the CocoIndex engine handle more types with more flexibility).
I created a separate issue #1144 to address this problem. Will have more discussions over there.
Unassigned the issue since there's no more activity.
Can I work on this issue?
@Aterg264 thanks for taking this! Assigned to you.
I plan to solve this issue in five steps, one for each `ValueType` of the strong-typed schema.
When I try to use `BasicValueType`, I run into several errors when returning a `BasicValueType` instance from the `_encode_type` function.
To make it work, I temporarily solved it by calling `.encode()`, but I'm not sure if this is the intended solution.
Here’s my current implementation:
```python
def _encode_type(type_info: AnalyzedTypeInfo) -> dict[str, Any]:
    variant = type_info.variant

    if isinstance(variant, AnalyzedAnyType):
        raise ValueError("Specific type annotation is expected")

    if isinstance(variant, AnalyzedUnknownType):
        raise ValueError(f"Unsupported type annotation: {type_info.core_type}")

    if isinstance(variant, AnalyzedBasicType):
        return BasicValueType(kind=variant.kind).encode()
```
Could you please guide me on the correct approach for handling `BasicValueType` here? I want to make sure I'm aligning with the intended design of this function.