
UTF-8 Encoding Error in text-load Pipeline Prevents Document Indexing


Environment

  • TrustGraph version: 1.4.22
  • Platform: macOS (Docker Desktop)
  • Docker images:
    • trustgraph/trustgraph-flow:1.4.22
    • trustgraph/trustgraph-mcp:1.4.22
    • trustgraph/workbench-ui:1.2.6
  • Deployment: Docker Compose

Problem Summary

Documents uploaded via the text-load endpoint are accepted (HTTP 200 OK) but fail to index in the knowledge graph due to UTF-8 encoding errors in the backend processing pipeline. Knowledge graph queries return "no information found" even after successful upload.

Error Messages

The following errors appear in the API response and container logs:

"string argument should contain only ASCII characters"
"'utf-8' codec can't decode byte 0xb7 in position 1: invalid start byte"
"'utf-8' codec can't decode byte 0xc7 in position 1: invalid continuation byte"
"'utf-8' codec can't decode byte 0x82 in position 3: invalid start byte"

Steps to Reproduce

  1. Prepare a markdown document containing common Unicode characters:

    • Mathematical symbols: ×, ±, ≥, ≤, ²
    • Accented letters: é, è, ç, à
    • Typography: —, –, “, ”, •
  2. Upload via the text-load endpoint (a scripted equivalent appears after this list):

    curl -X POST http://localhost:8088/api/v1/flow/default/service/text-load \
      -H "Content-Type: application/json" \
      -d '{
        "text": "Cancer survival: 2.74× higher hazard ratio (95% CI 2.41–3.12)",
        "metadata": {
          "title": "Research Paper",
          "authors": "Hansford et al.",
          "year": 2024
        },
        "user": "trustgraph",
        "collection": "default"
      }'
    
  3. Observe:

    • Upload returns HTTP 200 (accepted)
    • Response contains {"error": "'utf-8' codec can't decode byte..."}
    • Document appears uploaded but processing fails
  4. Verify failure:

    # Query returns no results
    python trustgraph-cli-rest.py query "What is the hazard ratio?"
    # Response: "I cannot answer your question" (no indexed data)
    
  5. Check logs:

    docker logs trustgraph-chunker-1 2>&1 | tail -20
    docker logs trustgraph-kg-extract-definitions-1 2>&1 | tail -20
    

    Shows UTF-8 codec errors during processing.
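
For convenience, here is the same reproduction as a short Python script (a sketch; it assumes the same local gateway and text-load path used in step 2):

# Sends the JSON body as raw UTF-8 bytes (ensure_ascii=False) to match
# what curl does with the inline payload above.
import json
import requests

payload = {
    "text": "Cancer survival: 2.74× higher hazard ratio (95% CI 2.41–3.12)",
    "metadata": {"title": "Research Paper", "authors": "Hansford et al.", "year": 2024},
    "user": "trustgraph",
    "collection": "default",
}
resp = requests.post(
    "http://localhost:8088/api/v1/flow/default/service/text-load",
    data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    timeout=30,
)
print(resp.status_code)  # 200 even though downstream processing fails
print(resp.text)         # may contain {"error": "'utf-8' codec can't decode byte..."}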

Expected Behavior

  • Documents containing valid UTF-8 characters should be processed successfully
  • UTF-8 is the standard encoding for modern text processing
  • Backend pipeline should handle Unicode characters (math symbols, accented letters, etc.)
  • Documents should be chunked, indexed, and queryable

Actual Behavior

  • Documents upload successfully (HTTP 200)
  • Backend processing fails with encoding errors
  • No entities extracted to knowledge graph
  • Queries return "no information found"
  • Error appears in chunker and entity extraction services

Impact

High - Blocks Real-World Use Cases:

  1. Academic papers - Cannot process research papers containing:

    • Statistical notation (×, ±, ≥, ≤, ², ³)
    • Mathematical symbols (∑, ∫, √, ∆)
    • Proper citations and formatting
  2. Multilingual content - Cannot process documents with:

    • French, Spanish, German text (accented letters)
    • Author names with diacritics
    • International publications
  3. High-quality OCR outputs - Cannot ingest from:

    • Mistral OCR API (preserves Unicode)
    • Adobe Acrobat DC
    • Google Cloud Vision
    • Other modern OCR systems
  4. Professional documents - Cannot handle:

    • Proper typography (em-dashes, smart quotes)
    • Formatted text from modern word processors
    • Copy-pasted content from web sources

Workaround Attempted

Applied comprehensive ASCII normalization with 95+ character mappings:

UNICODE_REPLACEMENTS = {
    '×': 'x', '±': '+/-', '≥': '>=', '≤': '<=',
    '—': '-', '–': '-', '’': "'", '“': '"',
    'é': 'e', 'è': 'e', 'ç': 'c', 'à': 'a',
    '²': '2', '³': '3', '¹': '1',
    # ... 85 more mappings
}

# Applied to both text content and metadata
text = normalize_to_ascii(text)

Result: UTF-8 codec errors persist even after normalization, suggesting the issue is in TrustGraph's internal processing, not the input data.
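
For reference, here is a sketch of the normalize_to_ascii helper referenced above (a hypothetical reconstruction; the full script with all 95+ mappings is available on request):

# Hypothetical reconstruction of the workaround helper: apply the explicit
# mappings first, then strip anything still outside ASCII via NFKD
# decomposition.
import unicodedata

def normalize_to_ascii(text: str) -> str:
    for src, dst in UNICODE_REPLACEMENTS.items():
        text = text.replace(src, dst)
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", errors="ignore").decode("ascii")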

Technical Analysis

Where the Error Occurs

  1. Upload endpoint: ✅ Works (accepts documents)
  2. text-load service: ✅ Accepts payload
  3. Chunker service: ❌ Encoding error during chunking
  4. Entity extraction: ❌ Cannot process chunks with Unicode
  5. Knowledge graph storage: ❌ No entities stored

Affected Services

From Docker logs:

  • trustgraph-chunker-1 - "string argument should contain only ASCII characters"
  • trustgraph-kg-extract-definitions-1 - UTF-8 codec errors
  • trustgraph-text-completion-1 - Encoding errors in responses

Python Encoding Context

The errors point to a mismatch between the encoding used to produce the bytes and the encoding used to decode them:

# Error: "'utf-8' codec can't decode byte 0xb7"
# 0xb7 = · (middle dot) and 0xc7 = Ç are single bytes in Latin-1
# In UTF-8 the same characters are two-byte sequences (0xc2 0xb7, 0xc3 0x87),
# so a lone 0xb7/0xc7 means the bytes were produced with a single-byte
# encoding and then decoded as UTF-8

Hypothesis: somewhere in the pipeline, text is encoded under a Latin-1/ASCII (locale-default) assumption and later decoded as UTF-8, which is where the codec errors surface.
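
This class of error is easy to reproduce in isolation (a sketch, not TrustGraph code): bytes produced with a single-byte encoding fail when decoded as UTF-8.

# '×' (U+00D7) and '·' (U+00B7) are single bytes in Latin-1 but two-byte
# sequences in UTF-8, so decoding Latin-1 output as UTF-8 raises the same
# class of error seen in the container logs.
s = "2.74× · é"
latin1_bytes = s.encode("latin-1")   # b'2.74\xd7 \xb7 \xe9'
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0xd7 in position 4: invalid continuation byte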

Suggested Fix

1. Ensure UTF-8 Throughout Pipeline

# In the chunker/entity extraction services (Python 3):
# sys.setdefaultencoding() no longer exists, so force UTF-8 at the
# interpreter level instead (e.g. PYTHONUTF8=1 in the environment),
# and make the locale explicit so locale-dependent defaults cannot
# fall back to ASCII:
import locale
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')  # requires the locale to exist in the image

2. Docker Container Locale

# In Dockerfile or docker-compose.yaml
ENV LANG=en_US.UTF-8
ENV LC_ALL=en_US.UTF-8
ENV PYTHONIOENCODING=utf-8

3. Explicit Encoding in String Operations

# When decoding bytes (file contents, message payloads), name the codec
# explicitly instead of relying on the locale default:
text = raw_bytes.decode('utf-8')
# When opening files, pass the encoding explicitly:
with open(path, encoding='utf-8') as f:
    text = f.read()
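
A defensive variant at the service boundaries (a sketch; the helper name is illustrative) would degrade gracefully instead of aborting the whole document:

# Hypothetical helper: accept bytes or str at a service boundary and always
# return str, substituting rather than crashing on undecodable bytes.
def to_text(payload):
    if isinstance(payload, bytes):
        try:
            return payload.decode("utf-8")
        except UnicodeDecodeError:
            return payload.decode("utf-8", errors="replace")
    return payload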

4. Pulsar Message Encoding

Ensure Pulsar messages are sent/received with UTF-8:

# When sending to Pulsar topics
message = text.encode('utf-8')
# When receiving
text = message.decode('utf-8')
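
For illustration, a minimal sketch with the pulsar-client Python package, assuming raw byte payloads (the topic and subscription names below are placeholders, not TrustGraph's actual ones):

import pulsar

client = pulsar.Client("pulsar://pulsar:6650")

# Producer side: encode explicitly before handing bytes to Pulsar.
producer = client.create_producer("persistent://public/default/example-chunks")
producer.send("2.74× higher hazard ratio".encode("utf-8"))

# Consumer side: decode with an explicit codec, not a locale default.
consumer = client.subscribe("persistent://public/default/example-chunks",
                            subscription_name="example-sub")
msg = consumer.receive()
text = msg.data().decode("utf-8")
consumer.acknowledge(msg)
client.close()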

Reproduction Materials

Minimal Test Document

# Test Document with Unicode

## Mathematical Notation
- Multiplication: 2 × 3 = 6
- Plus-minus: value ± 0.5
- Greater than or equal: x ≥ 10
- Squared: area = r²

## Typography
- Em-dash: This—important—detail
- En-dash: pages 10–15
- Smart quotes: “Hello”

## Accented Letters
- French: café, naïve, résumé
- Spanish: niño, señor
- German: Müller, über

Expected Characters in Academic Papers

From analysis of 3 research papers (Hansford 2024, Stirling 2021, Xu 2017):

  • 127 instances of × (multiplication)
  • 89 instances of ± (plus-minus)
  • 245 instances of — (em-dash)
  • 156 instances of accented letters (é, è, à, ç)
  • 67 instances of superscripts (², ³)

Environment Details

# Docker versions
Docker version 24.0.6
Docker Compose version v2.23.0

# Container locale check
$ docker exec trustgraph-chunker-1 env | grep -E 'LANG|LC_'
# (No output - locale variables not set)

# Python encoding check
$ docker exec trustgraph-chunker-1 python3 -c "import sys; print(sys.getdefaultencoding())"
utf-8  # str default is correct, but locale-dependent I/O defaults may still differ (see the diagnostic below)
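
A fuller check (sketch) shows the encodings Python actually uses for file and stream I/O, which are locale-dependent and can differ from the str default above:

# Hypothetical diagnostic; copy into an affected container and run with python3.
import locale
import sys

print("str default (always utf-8 on Python 3):", sys.getdefaultencoding())
print("locale-preferred (used by open(), etc.):", locale.getpreferredencoding())
print("stdout:", sys.stdout.encoding)
print("filesystem:", sys.getfilesystemencoding())
# With LANG/LC_ALL unset, the locale-preferred encoding may fall back to
# ASCII in minimal images, so any code relying on default encodings can
# fail even though sys.getdefaultencoding() reports utf-8.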

Related Observations

  1. JSON responses work fine - The API can return Unicode characters in JSON responses without errors
  2. Metadata fields affected - Even metadata strings (title, authors) cause errors if they contain Unicode
  3. Error position varies - The "byte position" in errors suggests encoding detection happens mid-string
  4. Pulsar topic affected - Errors visible in both Pulsar message processing and REST API

Proposed Priority

Medium-High

This blocks:

  • Integration with modern OCR systems
  • Academic/scientific content processing
  • International/multilingual use cases
  • Professional document workflows

Additional Information

Test files available: I can provide:

  • Sample markdown documents that trigger the error
  • ASCII normalization code attempted
  • Full Docker logs showing error propagation
  • Knowledge graph queries demonstrating missing data

Willing to test fixes: Happy to test any patches or updated Docker images.


Reported by: Barrie Ellis (CANDID 2 PhD researcher)
Date: 2025-11-06
Contact: Available via GitHub for follow-up questions
