neural-search [FEATURE] Refactor on data validation and extraction from customer's documents in several processors

Open zane-neo opened this issue 10 months ago • 7 comments

Is your feature request related to a problem?

Background

Currently in neural search, we validate and extract data from the user's index documents and send it to the model for inference. We use a recursive approach to check the data structure against the user's configuration. For example, below is a user's configuration for text_embedding. The important part is field_map, which maps each original key to its target key (the embedding field key).

{
    "text_embedding": {
        "model_id": "WYjkv4MBHcWxVq8Jtc8U",
        "field_map": {
            "title": "title_knn",
            "favor_list": "favor_list_knn",
            "favorites": {
                "game": "game_knn",
                "movie": "movie_knn"
            }
        }
    }
}

The configuration above assumes the user's original document has the structure below:

{
    "title": "content of title", // raw string
    "favor_list": ["content of each element", "content of each element", ...], // list of string
    "favorites": {
        "game": "game content", // map type with leaf string type
        "movie": "movie content"
    }
}

We support four value shapes: a raw string, a map with string leaves, a list of strings, and a list of maps with string leaves.
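As an illustration of the four supported shapes, here is a minimal sketch of a recursive shape check. The class and method names (FieldValueShapeCheck, isSupported) are hypothetical and are not taken from the neural-search code base. The sketch also assumes that nested maps are allowed as long as every leaf is a string, and that an empty list is acceptable.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not the neural-search implementation.
final class FieldValueShapeCheck {
    private FieldValueShapeCheck() {}

    /** Returns true for a string, a map with string leaves, a list of strings, or a list of such maps. */
    static boolean isSupported(Object value) {
        if (value instanceof String) {
            return true;
        }
        if (value instanceof Map) {
            return hasOnlyStringLeaves((Map<?, ?>) value);
        }
        if (value instanceof List) {
            for (Object element : (List<?>) value) {
                boolean ok = element instanceof String
                    || (element instanceof Map && hasOnlyStringLeaves((Map<?, ?>) element));
                if (!ok) {
                    return false; // e.g. a list of lists or a list of numbers
                }
            }
            return true; // assumption: an empty list passes the shape check
        }
        return false; // null, numbers, and other types are rejected
    }

    // Assumption: maps may nest, as long as every leaf is a string.
    private static boolean hasOnlyStringLeaves(Map<?, ?> map) {
        for (Object leaf : map.values()) {
            boolean ok = leaf instanceof String
                || (leaf instanceof Map && hasOnlyStringLeaves((Map<?, ?>) leaf));
            if (!ok) {
                return false;
            }
        }
        return true;
    }
}
```

With this sketch, the three fields of the example document above (title, favor_list, favorites) would all pass the check, while a number or a list of lists would not.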

Problem statement

Several processors use the same configuration to validate and extract field content: InferenceProcessor, TextImageEmbeddingProcessor, TextChunkingProcessor, etc. This causes duplicate code among these classes. A more critical issue is that whenever we implement a new feature for this validation and extraction, we have to duplicate it as well. For example, https://github.com/opensearch-project/neural-search/issues/110 requests dot support in the user's configuration, and we would need to implement it in multiple places. This is a code smell, and it is time to refactor our code.

Proposal

The proposal is to extract this code into a common place. With a reasonable design, we can reduce code duplication, and all future enhancements will go in a single place.

Note that not every piece can be made common, since different processors have different logic. For example, the buildxxxKeyAndOriginalValue methods of TextImageEmbeddingProcessor and InferenceProcessor support different types, so this part is not easy to reuse. In this case we will have to declare abstract methods and let each processor provide its own implementation, which means adding an abstract class that the processors extend.
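A rough sketch of what such an abstract class could look like is shown below. All names here (FieldMapProcessorSketch, convertValue, and the two subclasses) are hypothetical and do not correspond to existing neural-search classes. For brevity the sketch flattens the result into a single map keyed by target field, and it skips the list-of-maps case.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical base class: the field_map traversal is shared, the type handling is not.
abstract class FieldMapProcessorSketch {
    private final Map<String, Object> fieldMap;

    protected FieldMapProcessorSketch(Map<String, Object> fieldMap) {
        this.fieldMap = fieldMap;
    }

    /** Shared step: walk field_map and collect target key -> converted original value. */
    public final Map<String, Object> extract(Map<String, Object> document) {
        Map<String, Object> result = new LinkedHashMap<>();
        collect(fieldMap, document, result);
        return result;
    }

    @SuppressWarnings("unchecked")
    private void collect(Map<String, Object> config, Map<String, Object> source, Map<String, Object> result) {
        for (Map.Entry<String, Object> entry : config.entrySet()) {
            Object value = source.get(entry.getKey());
            if (value == null) {
                continue; // field not present in this document
            }
            Object target = entry.getValue();
            if (target instanceof Map && value instanceof Map) {
                // Nested configuration: recurse into the matching nested object.
                collect((Map<String, Object>) target, (Map<String, Object>) value, result);
            } else if (target instanceof String) {
                result.put((String) target, convertValue(value));
            }
        }
    }

    /** Processor-specific step: each subclass decides which value types it accepts. */
    protected abstract Object convertValue(Object originalValue);
}

// Hypothetical processor that accepts only plain strings.
final class StringOnlyProcessorSketch extends FieldMapProcessorSketch {
    StringOnlyProcessorSketch(Map<String, Object> fieldMap) {
        super(fieldMap);
    }

    @Override
    protected Object convertValue(Object originalValue) {
        if (originalValue instanceof String) {
            return originalValue;
        }
        throw new IllegalArgumentException("only string values are supported");
    }
}

// Hypothetical processor that accepts strings and lists.
final class StringOrListProcessorSketch extends FieldMapProcessorSketch {
    StringOrListProcessorSketch(Map<String, Object> fieldMap) {
        super(fieldMap);
    }

    @Override
    protected Object convertValue(Object originalValue) {
        if (originalValue instanceof String || originalValue instanceof List) {
            return originalValue;
        }
        throw new IllegalArgumentException("only string or list values are supported");
    }
}
```

The point of the design is that a feature such as dot support in field_map keys would be implemented once, in the shared traversal, while each processor keeps control over the types it accepts.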

Some code is almost the same. For example, the validateEmbeddingFieldsValue/validateFieldsValue methods of TextImageEmbeddingProcessor, InferenceProcessor and TextChunkingProcessor are almost identical, with only minor differences. In this case we can extract the code into a common method in a common class, adjust it slightly to cover those differences, and reuse it through composition to reduce code duplication.
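The composition approach could look roughly like the sketch below. The issue does not specify what the minor differences between the existing methods are, so the two constructor parameters here (maxDepth and allowBlank) are purely illustrative assumptions, and all class names are hypothetical.

```java
import java.util.List;
import java.util.Map;

// Hypothetical shared validator; the two options stand in for the unspecified "minor differences".
final class SharedFieldValidatorSketch {
    private final int maxDepth;       // assumption: processors may allow different nesting depths
    private final boolean allowBlank; // assumption: processors may differ on blank strings

    SharedFieldValidatorSketch(int maxDepth, boolean allowBlank) {
        this.maxDepth = maxDepth;
        this.allowBlank = allowBlank;
    }

    /** Throws IllegalArgumentException if the value violates the configured rules. */
    void validate(String fieldName, Object value) {
        validateAtDepth(fieldName, value, 1);
    }

    private void validateAtDepth(String fieldName, Object value, int depth) {
        if (depth > maxDepth) {
            throw new IllegalArgumentException("field [" + fieldName + "] exceeds max depth " + maxDepth);
        }
        if (value instanceof String) {
            if (!allowBlank && ((String) value).trim().isEmpty()) {
                throw new IllegalArgumentException("field [" + fieldName + "] is blank");
            }
        } else if (value instanceof Map) {
            for (Map.Entry<?, ?> entry : ((Map<?, ?>) value).entrySet()) {
                validateAtDepth(fieldName + "." + entry.getKey(), entry.getValue(), depth + 1);
            }
        } else if (value instanceof List) {
            for (Object element : (List<?>) value) {
                validateAtDepth(fieldName, element, depth + 1);
            }
        } else {
            throw new IllegalArgumentException("field [" + fieldName + "] has an unsupported type");
        }
    }
}

// Hypothetical processor that holds a validator instead of duplicating the logic.
final class ComposedProcessorSketch {
    private final SharedFieldValidatorSketch validator = new SharedFieldValidatorSketch(3, false);

    void process(Map<String, Object> document) {
        document.forEach(validator::validate);
    }
}
```

Each processor would construct the validator with its own settings, so the validation logic lives in one class and the per-processor differences are reduced to configuration.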

Apr 02 '24 02:04 zane-neo
