ensure_content_index_mappings doesn't ensure mappings

Open sakurai-youhei opened this issue 1 year ago • 7 comments

Bug Description

ensure_content_index_mappings effectively does nothing, because the connector setup already sets some mappings on the index at creation time, so the function finds existing mappings and skips creating its own. As a result, id and other fields rely on dynamic mapping, which may cause mapping problems.

For example, with network_drive.py, dynamic mapping will probably type id as a (signed) long, yet the connector can index documents whose id is an unsigned long value, which results in a document_parsing_exception in Elasticsearch:

[FMWK][04:24:31][ERROR] 
  [Connector id: XH_9rY8BHCAv43wWFIhC, index name: test, Sync job id: C63-rY8B_7UHF08UH-5L] 
  operation index failed, {'type': 'document_parsing_exception', 'reason': "[1:108] failed to parse field [id] of type [long] in document with id '12368218145543858784'. 
  Preview of field's value: '12368218145543858784'", 'caused_by': {'type': 'x_content_parse_exception', 'reason': '[1:128] Numeric value (12368218145543858784) out of range of long (-9223372036854775808 - 9223372036854775807)\n at ...
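
The range problem in the log above can be checked with a few lines of Python. This is only an illustrative sketch: the helper names are made up for this example and are not part of the connectors code base. The signed bounds are the ones quoted in the error message.

```python
# Bounds of a signed 64-bit `long`, as quoted in the error message above.
LONG_MIN = -(2**63)      # -9223372036854775808
LONG_MAX = 2**63 - 1     #  9223372036854775807
# Upper bound of an unsigned 64-bit integer.
UNSIGNED_LONG_MAX = 2**64 - 1


def fits_signed_long(value: int) -> bool:
    """Return True if the value is representable as a signed 64-bit long."""
    return LONG_MIN <= value <= LONG_MAX


def fits_unsigned_long(value: int) -> bool:
    """Return True if the value is representable as an unsigned 64-bit long."""
    return 0 <= value <= UNSIGNED_LONG_MAX


doc_id = 12368218145543858784  # the id from the log above
print(fits_signed_long(doc_id))    # False
print(fits_unsigned_long(doc_id))  # True
```

The id is a valid unsigned 64-bit value but exceeds the signed maximum, so a field that dynamic mapping typed as long rejects it.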

Reproducer

  1. Spin up Elastic stack 8.13.4.
  2. Go to Kibana > Search > Content > Connectors.
  3. Create a new connector with Network drive.
  4. Click Create and attach an index named XXX.
  5. Click Convert connector on the right.
  6. Click Generate API key.
  7. Click Edit configuration, fill in the settings, and click Save configuration.
  8. Run elastic-ingest -c /path/to/config.yaml --debug with the config given on the UI.
  9. Click Sync > Full Content.
  • elastic-ingest will output Index test-ensure-content-index-mappings already has mappings, skipping mappings creation.
  • elastic-ingest will very likely also output a document_parsing_exception like the one above.

Mappings before the sync

{
  "mappings": {
    "dynamic": "true",
    "dynamic_templates": [
      {
        "all_text_fields": {
          "match_mapping_type": "string",
          "mapping": {
            "analyzer": "iq_text_base",
            "fields": {
              "delimiter": {
                "analyzer": "iq_text_delimiter",
                "type": "text",
                "index_options": "freqs"
              },
              "joined": {
                "search_analyzer": "q_text_bigram",
                "analyzer": "i_text_bigram",
                "type": "text",
                "index_options": "freqs"
              },
              "prefix": {
                "search_analyzer": "q_prefix",
                "analyzer": "i_prefix",
                "type": "text",
                "index_options": "docs"
              },
              "enum": {
                "ignore_above": 2048,
                "type": "keyword"
              },
              "stem": {
                "analyzer": "iq_text_stem",
                "type": "text"
              }
            }
          }
        }
      }
    ]
  }
}

Mappings after the sync

{
  "mappings": {
    "dynamic": "true",
    "dynamic_templates": [
      {
        "all_text_fields": {
          "match_mapping_type": "string",
          "mapping": {
            "analyzer": "iq_text_base",
            "fields": {
              "delimiter": {
                "analyzer": "iq_text_delimiter",
                "type": "text",
                "index_options": "freqs"
              },
              "joined": {
                "search_analyzer": "q_text_bigram",
                "analyzer": "i_text_bigram",
                "type": "text",
                "index_options": "freqs"
              },
              "prefix": {
                "search_analyzer": "q_prefix",
                "analyzer": "i_prefix",
                "type": "text",
                "index_options": "docs"
              },
              "enum": {
                "ignore_above": 2048,
                "type": "keyword"
              },
              "stem": {
                "analyzer": "iq_text_stem",
                "type": "text"
              }
            }
          }
        }
      }
    ],
    "properties": {
      "_timestamp": {
        "type": "date"
      },
      "created_at": {
        "type": "date"
      },
      "id": {
        "type": "long"
      },
      "path": {
        "type": "text",
        "fields": {
          "delimiter": {
            "type": "text",
            "index_options": "freqs",
            "analyzer": "iq_text_delimiter"
          },
          "enum": {
            "type": "keyword",
            "ignore_above": 2048
          },
          "joined": {
            "type": "text",
            "index_options": "freqs",
            "analyzer": "i_text_bigram",
            "search_analyzer": "q_text_bigram"
          },
          "prefix": {
            "type": "text",
            "index_options": "docs",
            "analyzer": "i_prefix",
            "search_analyzer": "q_prefix"
          },
          "stem": {
            "type": "text",
            "analyzer": "iq_text_stem"
          }
        },
        "analyzer": "iq_text_base"
      },
      "size": {
        "type": "long"
      },
      "title": {
        "type": "text",
        "fields": {
          "delimiter": {
            "type": "text",
            "index_options": "freqs",
            "analyzer": "iq_text_delimiter"
          },
          "enum": {
            "type": "keyword",
            "ignore_above": 2048
          },
          "joined": {
            "type": "text",
            "index_options": "freqs",
            "analyzer": "i_text_bigram",
            "search_analyzer": "q_text_bigram"
          },
          "prefix": {
            "type": "text",
            "index_options": "docs",
            "analyzer": "i_prefix",
            "search_analyzer": "q_prefix"
          },
          "stem": {
            "type": "text",
            "analyzer": "iq_text_stem"
          }
        },
        "analyzer": "iq_text_base"
      },
      "type": {
        "type": "text",
        "fields": {
          "delimiter": {
            "type": "text",
            "index_options": "freqs",
            "analyzer": "iq_text_delimiter"
          },
          "enum": {
            "type": "keyword",
            "ignore_above": 2048
          },
          "joined": {
            "type": "text",
            "index_options": "freqs",
            "analyzer": "i_text_bigram",
            "search_analyzer": "q_text_bigram"
          },
          "prefix": {
            "type": "text",
            "index_options": "docs",
            "analyzer": "i_prefix",
            "search_analyzer": "q_prefix"
          },
          "stem": {
            "type": "text",
            "analyzer": "iq_text_stem"
          }
        },
        "analyzer": "iq_text_base"
      }
    }
  }
}

Expected behavior

ensure_content_index_mappings should add missing mappings.
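
A minimal sketch of what "add missing mappings" could mean, written as a pure function over mapping dictionaries. It is not the actual connectors implementation: missing_properties and EXPECTED_PROPERTIES are hypothetical names, and the expected field types (in particular id as keyword) are assumptions made for illustration only.

```python
from typing import Any, Dict

# Hypothetical set of explicit mappings the framework would want to guarantee.
# The field types here are assumptions, not taken from the connectors source.
EXPECTED_PROPERTIES: Dict[str, Any] = {
    "id": {"type": "keyword"},
    "_timestamp": {"type": "date"},
}


def missing_properties(
    existing_mappings: Dict[str, Any], expected: Dict[str, Any]
) -> Dict[str, Any]:
    """Return the expected properties that the index does not define yet."""
    current = existing_mappings.get("properties", {})
    return {name: spec for name, spec in expected.items() if name not in current}


# Shape of the mappings before the sync: dynamic templates only, no properties
# (the template list is shortened to an empty list for brevity).
before_sync = {"dynamic": "true", "dynamic_templates": []}

# The index "already has mappings" (the dict is non-empty) ...
print(bool(before_sync))  # True
# ... yet every expected property is still missing.
print(sorted(missing_properties(before_sync, EXPECTED_PROPERTIES)))  # ['_timestamp', 'id']
```

The point of the sketch is that checking whether the index has any mappings at all is not enough. Comparing property by property would let the function add only the fields that are absent and leave existing ones untouched.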

Workaround

Recreate the index without using the Connectors Configuration page, then re-run the sync.

Environment

  • OS: Elastic Cloud + Windows 11 10.0.22621 N/A Build 22621
  • Browser: Chrome Version 125.0.6422.77 (Official Build) (64-bit)
  • Version: docker.elastic.co/enterprise-search/elastic-connectors:8.13.4.0

Additional context

I will open a PR to address this issue.

sakurai-youhei, May 25 '24 09:05