
UTF-8 Encoding Error in text-load Pipeline Prevents Document Indexing


Environment

  • TrustGraph version: 1.4.22
  • Platform: macOS (Docker Desktop)
  • Docker images:
    • trustgraph/trustgraph-flow:1.4.22
    • trustgraph/trustgraph-mcp:1.4.22
    • trustgraph/workbench-ui:1.2.6
  • Deployment: Docker Compose

Problem Summary

Documents uploaded via the text-load endpoint are accepted (HTTP 200 OK) but fail to index in the knowledge graph due to UTF-8 encoding errors in the backend processing pipeline. Knowledge graph queries return "no information found" even after successful upload.

Error Messages

The following errors appear in the API response and container logs:

"string argument should contain only ASCII characters"
"'utf-8' codec can't decode byte 0xb7 in position 1: invalid start byte"
"'utf-8' codec can't decode byte 0xc7 in position 1: invalid continuation byte"
"'utf-8' codec can't decode byte 0x82 in position 3: invalid start byte"

Steps to Reproduce

  1. Prepare a markdown document containing common Unicode characters:

    • Mathematical symbols: ×, ±, ≥, ≤, ²
    • Accented letters: é, è, ç, à
    • Typography: —, –, “, ”, •
  2. Upload via the text-load endpoint (a scripted equivalent appears after this list):

    curl -X POST http://localhost:8088/api/v1/flow/default/service/text-load \
      -H "Content-Type: application/json" \
      -d '{
        "text": "Cancer survival: 2.74× higher hazard ratio (95% CI 2.41–3.12)",
        "metadata": {
          "title": "Research Paper",
          "authors": "Hansford et al.",
          "year": 2024
        },
        "user": "trustgraph",
        "collection": "default"
      }'
    
  3. Observe:

    • Upload returns HTTP 200 (accepted)
    • Response contains {"error": "'utf-8' codec can't decode byte..."}
    • Document appears uploaded but processing fails
  4. Verify failure:

    # Query returns no results
    python trustgraph-cli-rest.py query "What is the hazard ratio?"
    # Response: "I cannot answer your question" (no indexed data)
    
  5. Check logs:

    docker logs trustgraph-chunker-1 2>&1 | tail -20
    docker logs trustgraph-kg-extract-definitions-1 2>&1 | tail -20
    

    Shows UTF-8 codec errors during processing.
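
For convenience, here is the same reproduction as a short Python script (a sketch; it assumes the same local gateway and text-load path used in step 2):

# Sends the JSON body as raw UTF-8 bytes (ensure_ascii=False) to match
# what curl does with the inline payload above.
import json
import requests

payload = {
    "text": "Cancer survival: 2.74× higher hazard ratio (95% CI 2.41–3.12)",
    "metadata": {"title": "Research Paper", "authors": "Hansford et al.", "year": 2024},
    "user": "trustgraph",
    "collection": "default",
}
resp = requests.post(
    "http://localhost:8088/api/v1/flow/default/service/text-load",
    data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    timeout=30,
)
print(resp.status_code)  # 200 even though downstream processing fails
print(resp.text)         # may contain {"error": "'utf-8' codec can't decode byte..."}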

Expected Behavior

  • Documents containing valid UTF-8 characters should be processed successfully
  • UTF-8 is the standard encoding for modern text processing
  • Backend pipeline should handle Unicode characters (math symbols, accented letters, etc.)
  • Documents should be chunked, indexed, and queryable

Actual Behavior

  • Documents upload successfully (HTTP 200)
  • Backend processing fails with encoding errors
  • No entities extracted to knowledge graph
  • Queries return "no information found"
  • Error appears in chunker and entity extraction services

Impact

High - Blocks Real-World Use Cases:

  1. Academic papers - Cannot process research papers containing:

    • Statistical notation (×, ±, ≥, ≤, ², ³)
    • Mathematical symbols (∑, ∫, √, ∆)
    • Proper citations and formatting
  2. Multilingual content - Cannot process documents with:

    • French, Spanish, German text (accented letters)
    • Author names with diacritics
    • International publications
  3. High-quality OCR outputs - Cannot ingest from:

    • Mistral OCR API (preserves Unicode)
    • Adobe Acrobat DC
    • Google Cloud Vision
    • Other modern OCR systems
  4. Professional documents - Cannot handle:

    • Proper typography (em-dashes, smart quotes)
    • Formatted text from modern word processors
    • Copy-pasted content from web sources

Workaround Attempted

Applied comprehensive ASCII normalization with 95+ character mappings:

UNICODE_REPLACEMENTS = {
    '×': 'x', '±': '+/-', '≥': '>=', '≤': '<=',
    '—': '-', '–': '-', '’': "'", '“': '"',
    'é': 'e', 'è': 'e', 'ç': 'c', 'à': 'a',
    '²': '2', '³': '3', '¹': '1',
    # ... 85 more mappings
}

# Applied to both text content and metadata
text = normalize_to_ascii(text)

Result: UTF-8 codec errors persist even after normalization, suggesting the issue is in TrustGraph's internal processing, not the input data.
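
For reference, here is a sketch of the normalize_to_ascii helper referenced above (a hypothetical reconstruction; the full script with all 95+ mappings is available on request):

# Hypothetical reconstruction of the workaround helper: apply the explicit
# mappings first, then strip anything still outside ASCII via NFKD
# decomposition.
import unicodedata

def normalize_to_ascii(text: str) -> str:
    for src, dst in UNICODE_REPLACEMENTS.items():
        text = text.replace(src, dst)
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", errors="ignore").decode("ascii")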

Technical Analysis

Where the Error Occurs

  1. Upload endpoint: ✅ Works (accepts documents)
  2. text-load service: ✅ Accepts payload
  3. Chunker service: ❌ Encoding error during chunking
  4. Entity extraction: ❌ Cannot process chunks with Unicode
  5. Knowledge graph storage: ❌ No entities stored

Affected Services

From Docker logs:

  • trustgraph-chunker-1 - "string argument should contain only ASCII characters"
  • trustgraph-kg-extract-definitions-1 - UTF-8 codec errors
  • trustgraph-text-completion-1 - Encoding errors in responses

Python Encoding Context

The errors point to a mismatch between the encoding used to produce the bytes and the encoding used to decode them:

# Error: "'utf-8' codec can't decode byte 0xb7"
# 0xb7 = · (middle dot) and 0xc7 = Ç are single bytes in Latin-1
# In UTF-8 the same characters are two-byte sequences (0xc2 0xb7, 0xc3 0x87),
# so a lone 0xb7/0xc7 means the bytes were produced with a single-byte
# encoding and then decoded as UTF-8

Hypothesis: somewhere in the pipeline, text is encoded under a Latin-1/ASCII (locale-default) assumption and later decoded as UTF-8, which is where the codec errors surface.
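
This class of error is easy to reproduce in isolation (a sketch, not TrustGraph code): bytes produced with a single-byte encoding fail when decoded as UTF-8.

# '×' (U+00D7) and '·' (U+00B7) are single bytes in Latin-1 but two-byte
# sequences in UTF-8, so decoding Latin-1 output as UTF-8 raises the same
# class of error seen in the container logs.
s = "2.74× · é"
latin1_bytes = s.encode("latin-1")   # b'2.74\xd7 \xb7 \xe9'
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0xd7 in position 4: invalid continuation byte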

Suggested Fix

1. Ensure UTF-8 Throughout Pipeline

# In the chunker/entity extraction services (Python 3):
# sys.setdefaultencoding() no longer exists, so force UTF-8 at the
# interpreter level instead (e.g. PYTHONUTF8=1 in the environment),
# and make the locale explicit so locale-dependent defaults cannot
# fall back to ASCII:
import locale
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')  # requires the locale to exist in the image

2. Docker Container Locale

# In Dockerfile or docker-compose.yaml
ENV LANG=en_US.UTF-8
ENV LC_ALL=en_US.UTF-8
ENV PYTHONIOENCODING=utf-8

3. Explicit Encoding in String Operations

# When decoding bytes (file contents, message payloads), name the codec
# explicitly instead of relying on the locale default:
text = raw_bytes.decode('utf-8')
# When opening files, pass the encoding explicitly:
with open(path, encoding='utf-8') as f:
    text = f.read()
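
A defensive variant at the service boundaries (a sketch; the helper name is illustrative) would degrade gracefully instead of aborting the whole document:

# Hypothetical helper: accept bytes or str at a service boundary and always
# return str, substituting rather than crashing on undecodable bytes.
def to_text(payload):
    if isinstance(payload, bytes):
        try:
            return payload.decode("utf-8")
        except UnicodeDecodeError:
            return payload.decode("utf-8", errors="replace")
    return payload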

4. Pulsar Message Encoding

Ensure Pulsar messages are sent/received with UTF-8:

# When sending to Pulsar topics
message = text.encode('utf-8')
# When receiving
text = message.decode('utf-8')
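
For illustration, a minimal sketch with the pulsar-client Python package, assuming raw byte payloads (the topic and subscription names below are placeholders, not TrustGraph's actual ones):

import pulsar

client = pulsar.Client("pulsar://pulsar:6650")

# Producer side: encode explicitly before handing bytes to Pulsar.
producer = client.create_producer("persistent://public/default/example-chunks")
producer.send("2.74× higher hazard ratio".encode("utf-8"))

# Consumer side: decode with an explicit codec, not a locale default.
consumer = client.subscribe("persistent://public/default/example-chunks",
                            subscription_name="example-sub")
msg = consumer.receive()
text = msg.data().decode("utf-8")
consumer.acknowledge(msg)
client.close()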

Reproduction Materials

Minimal Test Document

# Test Document with Unicode

## Mathematical Notation
- Multiplication: 2 × 3 = 6
- Plus-minus: value ± 0.5
- Greater than or equal: x ≥ 10
- Squared: area = r²

## Typography
- Em-dash: This—important—detail
- En-dash: pages 10–15
- Smart quotes: “Hello”

## Accented Letters
- French: café, naïve, résumé
- Spanish: niño, señor
- German: Müller, über

Expected Characters in Academic Papers

From analysis of 3 research papers (Hansford 2024, Stirling 2021, Xu 2017):

  • 127 instances of × (multiplication)
  • 89 instances of ± (plus-minus)
  • 245 instances of — (em-dash)
  • 156 instances of accented letters (é, è, à, ç)
  • 67 instances of superscripts (², ³)

Environment Details

# Docker versions
Docker version 24.0.6
Docker Compose version v2.23.0

# Container locale check
$ docker exec trustgraph-chunker-1 env | grep -E 'LANG|LC_'
# (No output - locale variables not set)

# Python encoding check
$ docker exec trustgraph-chunker-1 python3 -c "import sys; print(sys.getdefaultencoding())"
utf-8  # str default is correct, but locale-dependent I/O defaults may still differ (see the diagnostic below)
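
A fuller check (sketch) shows the encodings Python actually uses for file and stream I/O, which are locale-dependent and can differ from the str default above:

# Hypothetical diagnostic; copy into an affected container and run with python3.
import locale
import sys

print("str default (always utf-8 on Python 3):", sys.getdefaultencoding())
print("locale-preferred (used by open(), etc.):", locale.getpreferredencoding())
print("stdout:", sys.stdout.encoding)
print("filesystem:", sys.getfilesystemencoding())
# With LANG/LC_ALL unset, the locale-preferred encoding may fall back to
# ASCII in minimal images, so any code relying on default encodings can
# fail even though sys.getdefaultencoding() reports utf-8.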

Related Observations

  1. JSON responses work fine - The API can return Unicode characters in JSON responses without errors
  2. Metadata fields affected - Even metadata strings (title, authors) cause errors if they contain Unicode
  3. Error position varies - The "byte position" in errors suggests encoding detection happens mid-string
  4. Pulsar topic affected - Errors visible in both Pulsar message processing and REST API

Proposed Priority

Medium-High

This blocks:

  • Integration with modern OCR systems
  • Academic/scientific content processing
  • International/multilingual use cases
  • Professional document workflows

Additional Information

Test files available: I can provide:

  • Sample markdown documents that trigger the error
  • ASCII normalization code attempted
  • Full Docker logs showing error propagation
  • Knowledge graph queries demonstrating missing data

Willing to test fixes: Happy to test any patches or updated Docker images.


Reported by: Barrie Ellis (CANDID 2 PhD researcher)
Date: 2025-11-06
Contact: Available via GitHub for follow-up questions
