
[REFACTOR] type annotation -> CocoIndex type encoding logic in Python SDK should return strong-typed schema class

Open georgeh0 opened this issue 2 months ago • 9 comments

Some Background

Regarding data types / schemas, there are multiple forms:

  1. Python native type annotations, e.g. int, dict[str, Any], or a specific dataclass. These are used directly in user code as type hints.
  2. AnalyzedTypeInfo: a more structured representation of 1, used internally by our Python SDK only.
  3. Strong-typed schema classes in CocoIndex's type system. These classes mirror the engine's data schema representation. They're exposed through some third-party APIs, e.g. custom targets (custom target connectors can inspect the schema of the data exported to them), and in the future custom functions / sources.
  4. Generic-typed JSON-equivalent values, in types such as dict[str, Any] (for a JSON object), list[dict[str, Any]] (for a JSON array), str (for a JSON string), etc. These can be passed directly from/to the engine in Rust.
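To make the four forms concrete, here is a minimal self-contained sketch. The class and field names below are simplified stand-ins for illustration only; the real SDK classes may differ:

```python
from dataclasses import dataclass
from typing import Any

# 1. A Python native type annotation, as written by a user:
annotation = list[str]

# 2. AnalyzedTypeInfo: a structured internal representation (sketch):
@dataclass
class AnalyzedTypeInfo:
    variant: Any          # e.g. a basic-type variant with a kind
    core_type: Any        # the original annotation
    nullable: bool = False

# 3. A strong-typed schema class mirroring the engine's type system (sketch):
@dataclass
class BasicValueType:
    kind: str             # e.g. "Str", "Int64"

    def encode(self) -> dict[str, Any]:
        # 4. The generic JSON-equivalent value passed from/to the Rust engine:
        return {"kind": self.kind}

assert BasicValueType(kind="Str").encode() == {"kind": "Str"}
```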

Task

We have logic to convert Python's native type annotations to engine types. Currently we do 1->2->4 (code), because 3 was only introduced recently.

We want to:

  • Change the 2->4 logic to 2->3, i.e. convert AnalyzedTypeInfo to the strong-typed schema representation first. This will make our code easier to read and maintain (3 is easier to build than 4, and can leverage mypy type checks, etc.).
  • Once we have 3, existing callers can simply call the encode() method to get 4, so we don't have to expose convenience methods that directly return 4 in the typing package.
  • Tests in test_typing.py should be updated accordingly to check the output of 3 instead of 4 (3 is more structured than 4, and easier to check).
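As one illustration of the test change, an assertion that today compares against the JSON-equivalent dict (form 4) could instead compare strong-typed schema instances directly. This is a sketch with a stand-in class, not the actual test_typing.py code:

```python
from dataclasses import dataclass

# Minimal stand-in for the strong-typed schema class so the sketch
# is self-contained; the real class lives in the typing package.
@dataclass
class BasicValueType:
    kind: str

def check_str_type() -> None:
    result = BasicValueType(kind="Str")  # pretend output of 1 -> 2 -> 3
    # Before: assert encoded == {"kind": "Str"}  (comparing form 4 dicts)
    # After: dataclass equality compares field by field, and mypy can
    # verify the expected type statically.
    assert result == BasicValueType(kind="Str")

check_str_type()
```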

georgeh0 avatar Oct 01 '25 20:10 georgeh0

Can I work on this issue @georgeh0?

Shivansh-22866 avatar Oct 01 '25 20:10 Shivansh-22866

@Shivansh-22866 Sure! Assigned! Welcome to contribute!

georgeh0 avatar Oct 01 '25 20:10 georgeh0

@georgeh0 would this also be a step toward allowing "registering" custom datatypes? I have classes that are not fully under my control and not compatible with CocoIndex (e.g. Azure DocAI AnalyzeResult which is neither a DataClass nor a Pydantic model, or LlamaIndex ImageNode which is a Pydantic model but with a complex field that CocoIndex does not understand). Currently I'm converting them back and forth between a CocoIndex-compatible representation and the format that the tools need, but this is rather cumbersome and does not scale well. It would be nice if I could provide an instance of some class which gives CocoIndex the required field information so that it can use the custom types like the predefined ones.

nightscape avatar Oct 02 '25 19:10 nightscape


@nightscape Thanks a lot for sharing the pain points and thoughts! Yes, we're also aware of the limitations of the type bindings - mostly because we need our data to be serializable, so currently we use a disciplined type system, and not all Python objects can be used directly (we have constraints similar to a database's).

This feature request doesn't help much there (it makes type representations in the CocoIndex engine more easily accessible from Python, but doesn't let the engine handle more types with more flexibility).

I created a separate issue #1144 to address this problem. We'll have more discussion over there.

georgeh0 avatar Oct 05 '25 06:10 georgeh0

Unassigned the issue since there's been no more activity.

georgeh0 avatar Oct 17 '25 06:10 georgeh0

Can I work on this issue?

Aterg264 avatar Oct 21 '25 10:10 Aterg264

@Aterg264 thanks for taking this! Assigned to you.

georgeh0 avatar Oct 22 '25 02:10 georgeh0

I plan to solve this issue in five steps — one for each ValueType of the strong-type schema.

Aterg264 avatar Oct 29 '25 01:10 Aterg264

When I try to use BasicValueType, I get several errors when returning a BasicValueType instance from the _encode_type function.

To make it work, I temporarily solved it by calling .encode(), but I'm not sure if this is the intended solution.

Here’s my current implementation:

```python
def _encode_type(type_info: AnalyzedTypeInfo) -> dict[str, Any]:
    variant = type_info.variant

    if isinstance(variant, AnalyzedAnyType):
        raise ValueError("Specific type annotation is expected")

    if isinstance(variant, AnalyzedUnknownType):
        raise ValueError(f"Unsupported type annotation: {type_info.core_type}")

    if isinstance(variant, AnalyzedBasicType):
        return BasicValueType(kind=variant.kind).encode()
```

Could you please guide me on the correct approach for handling BasicValueType here? I want to make sure I’m aligning with the intended design of this function.

Aterg264 avatar Nov 03 '25 21:11 Aterg264