[BUG] _bulk update request failing when using text chunking processor pipeline

Open janmederly opened this issue 8 months ago • 15 comments

Describe the bug

When performing a _bulk update request while using the text chunking processor, I get {"took":0,"ingest_took":1,"errors":true,"items":[{"index":{"_index":null,"_id":null,"status":500,"error":{"type":"null_pointer_exception","reason":"Cannot invoke \"Object.toString()\" because the return value of \"java.util.Map.get(Object)\" is null"}}}]}. There is no error when I do not use the text chunking processor or when I use the regular update API.

Example request:

curl -H "Content-Type: application/json" -X POST "https://localhost:9200/_bulk" -u "admin:xxxxx" --insecure -d '
{ "update": { "_id": "test", "_index": "docs-chunks"} }
{"doc": {"text": "testing testing"}, "doc_as_upsert": true}
'

Example response:

{"took":0,"ingest_took":1,"errors":true,"items":[{"index":{"_index":null,"_id":null,"status":500,"error":{"type":"null_pointer_exception","reason":"Cannot invoke \"Object.toString()\" because the return value of \"java.util.Map.get(Object)\" is null"}}}]}
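For anyone scripting the reproduction, here is a minimal Python sketch of the same request body and of pulling the failure out of the response. The helper names `build_bulk_update_body` and `failed_items` are made up for illustration; nothing here talks to a cluster, and the response is the one pasted above.

```python
import json


def build_bulk_update_body(index, doc_id, doc):
    """Return the NDJSON payload for one bulk update with doc_as_upsert."""
    action = {"update": {"_id": doc_id, "_index": index}}
    source = {"doc": doc, "doc_as_upsert": True}
    # The _bulk API expects one JSON object per line and a trailing newline.
    return json.dumps(action) + "\n" + json.dumps(source) + "\n"


def failed_items(bulk_response):
    """Collect (action, status, error type, reason) for every failed item."""
    failures = []
    for item in bulk_response.get("items", []):
        for action, result in item.items():
            error = result.get("error")
            if error:
                failures.append(
                    (action, result.get("status"), error.get("type"), error.get("reason"))
                )
    return failures


# Same index, id and document as in the curl example.
body = build_bulk_update_body("docs-chunks", "test", {"text": "testing testing"})

# Response copied verbatim from this report (raw string keeps the JSON escapes).
response = json.loads(
    r'''{"took":0,"ingest_took":1,"errors":true,"items":[{"index":{"_index":null,"_id":null,"status":500,"error":{"type":"null_pointer_exception","reason":"Cannot invoke \"Object.toString()\" because the return value of \"java.util.Map.get(Object)\" is null"}}}]}'''
)

for action, status, error_type, reason in failed_items(response):
    print(action, status, error_type, reason)
```

Note that the failed item comes back under "index" with null _index and _id, although the request action was "update".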

Related component

Indexing

To Reproduce

  1. Deploy a text model
  2. Create a text chunking pipeline
  3. Create an index with the text chunking pipeline as its default pipeline
  4. Send a bulk update request
  5. The error appears

Expected behavior

Successfully update OpenSearch documents.

Additional Details

Plugins

[opensearch@opensearch-cluster-master-0 ~]$ bin/opensearch-plugin list
opensearch-alerting
opensearch-anomaly-detection
opensearch-asynchronous-search
opensearch-cross-cluster-replication
opensearch-custom-codecs
opensearch-flow-framework
opensearch-geospatial
opensearch-index-management
opensearch-job-scheduler
opensearch-knn
opensearch-ml
opensearch-neural-search
opensearch-notifications
opensearch-notifications-core
opensearch-observability
opensearch-performance-analyzer
opensearch-reports-scheduler
opensearch-security
opensearch-security-analytics
opensearch-skills
opensearch-sql

Host/Environment (please complete the following information):

  • OS: Amazon Linux
  • Version: 2023
  • Helm environment v2.20.0, running on k8s cluster v1.28.3
  • OpenSearch v2.14.0

Additional context

ML model used: https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1

Text chunking pipeline:

{
  "description": "A text chunking and embedding ingest pipeline",
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 350,
            "overlap_rate": 0.2,
            "tokenizer": "standard"
          }
        },
        "field_map": {
          "text": "passage_chunk"
        }
      }
    },
    {
      "text_embedding": {
        "model_id": "ueVVfo4Bvd-X9jaivNwl",
        "field_map": {
          "passage_chunk": "passage_embedding"
        }
      }
    }
  ]
}

Index settings and mappings:

{
  "settings": {
    "index": {
      "number_of_shards": 2,
      "number_of_replicas": 2,
      "knn": true,
      "default_pipeline": "text-chunking-embedding-ingest-pipeline",
      "analyze": {
        "max_token_count": 1000000
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "passage_embedding": {
        "type": "nested",
        "properties": {
          "knn": {
            "type": "knn_vector",
            "dimension": 384
          }
        }
      }
    }
  }
}
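Steps 2 to 4 of the reproduction could be lined up roughly as below. This is only a sketch: it builds the three requests as (method, path, body) tuples and leaves sending them to whatever HTTP client you use. The `reproduction_requests` helper is hypothetical, the pipeline name is taken from the `default_pipeline` setting above, and the `model_id` is the one from my cluster, so replace it with the ID of your own deployed model.

```python
import json

PIPELINE_NAME = "text-chunking-embedding-ingest-pipeline"
INDEX_NAME = "docs-chunks"
MODEL_ID = "ueVVfo4Bvd-X9jaivNwl"  # reporter's model ID; replace with your own

pipeline = {
    "description": "A text chunking and embedding ingest pipeline",
    "processors": [
        {
            "text_chunking": {
                "algorithm": {
                    "fixed_token_length": {
                        "token_limit": 350,
                        "overlap_rate": 0.2,
                        "tokenizer": "standard",
                    }
                },
                "field_map": {"text": "passage_chunk"},
            }
        },
        {
            "text_embedding": {
                "model_id": MODEL_ID,
                "field_map": {"passage_chunk": "passage_embedding"},
            }
        },
    ],
}

index_body = {
    "settings": {
        "index": {
            "number_of_shards": 2,
            "number_of_replicas": 2,
            "knn": True,
            "default_pipeline": PIPELINE_NAME,
            "analyze": {"max_token_count": 1000000},
        }
    },
    "mappings": {
        "properties": {
            "text": {
                "type": "text",
                "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
            },
            "passage_embedding": {
                "type": "nested",
                "properties": {"knn": {"type": "knn_vector", "dimension": 384}},
            },
        }
    },
}


def reproduction_requests():
    """Return the requests for steps 2-4 in the order they should be sent."""
    bulk_lines = [
        {"update": {"_id": "test", "_index": INDEX_NAME}},
        {"doc": {"text": "testing testing"}, "doc_as_upsert": True},
    ]
    # NDJSON: one JSON object per line, terminated by a newline.
    bulk_body = "".join(json.dumps(line) + "\n" for line in bulk_lines)
    return [
        ("PUT", f"/_ingest/pipeline/{PIPELINE_NAME}", json.dumps(pipeline)),
        ("PUT", f"/{INDEX_NAME}", json.dumps(index_body)),
        ("POST", "/_bulk", bulk_body),
    ]


for method, path, _ in reproduction_requests():
    print(method, path)
```

The third request is the one that returns the null_pointer_exception in my environment.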

janmederly, Jun 19 '24 13:06