
When the key name of the parameter is Chinese, the template cannot be resolved

Open StrangeYear opened this issue 3 years ago • 1 comments

Checklist

  • [x] I have searched the issue list
  • [ ] I have tested my example against Shopify Liquid. (This isn't necessary if the actual behavior is a panic, or an error for which IsTemplateError returns false.)

Expected Behavior

    bindings := map[string]interface{}{
        "app": "test_app",
        "描述": "content",
    }
    template := `应用:{{app}} 描述:{{ 描述 }}`
    expected := `应用:test_app 描述:content`

Actual Behavior

syntax error in "描述" in {{ 描述 }}

Detailed Description

Possible Solution

StrangeYear avatar Feb 17 '22 07:02 StrangeYear

I'm not even sure a key name can have a slash '/' in it..

SleepyBrett avatar May 03 '24 00:05 SleepyBrett

Thank you for the bug report! I've finally had time to look into it.

Shopify Liquid does not support Chinese variable names either. Shopify/liquid#31 closes it as "won't fix". (It's actually got a richer history. It was fixed, and then the fix was reverted due to performance impacts.)

However, I'm not comfortable with that as a justification for this library not to support them. I'll look into what it would take to fix this.

The experience of Shopify/liquid#31 warns that this work should be accompanied by benchmarks.

osteele avatar Aug 30 '25 02:08 osteele

Here's an analysis and what it would take to fix this:

Current State

This Implementation

  • The Ragel lexer in expressions/scanner.rl uses ASCII-only character classes
  • Pattern: identifier = (alpha | '_') . (alnum | '_' | '-')* '?'?
  • alpha and alnum in Ragel only match ASCII characters
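
To make the ASCII limitation concrete, here is a rough Go approximation of that identifier rule as a regular expression. The regex and the `asciiIdentifier` name are illustrative assumptions, not code from this library; the real scanner is generated by Ragel.

```go
package main

import (
	"fmt"
	"regexp"
)

// asciiIdentifier approximates the Ragel rule
//
//	identifier = (alpha | '_') . (alnum | '_' | '-')* '?'?
//
// using ASCII-only character classes. Hypothetical helper, for illustration only.
var asciiIdentifier = regexp.MustCompile(`^[A-Za-z_][A-Za-z0-9_-]*\??$`)

func main() {
	// The two ASCII names match; the Chinese key from the report does not.
	for _, name := range []string{"app", "is_empty?", "描述"} {
		fmt.Printf("%s -> %v\n", name, asciiIdentifier.MatchString(name))
	}
}
```

Every byte of `描述` lies outside the ASCII classes, so the scanner never recognizes it as an identifier, which is consistent with the syntax error reported above.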

Shopify Liquid

  • Also doesn't officially support Unicode in variable names
  • Uses similar regex patterns without Unicode support: VariableSegment = /[\w\-]/
  • The community has requested this feature (Shopify/liquid#31); a fix was merged and later reverted because of its performance impact

What It Would Take to Fix

The Core Issue

The Ragel lexer uses ASCII-only character classes that need to be extended for Unicode support.

Implementation Approaches

Option 1: Ragel Unicode Support (Complex)

  1. Generate Unicode character classes using Ragel's unicode2ragel.rb script
  2. Create unicode.rl file with Unicode character class definitions (ualpha, ualnum)
  3. Modify scanner.rl to include and use Unicode classes:
    • Replace alpha with ualpha
    • Replace alnum with ualnum
  4. Regenerate scanner.go using go generate
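
As a reference for what ualpha and ualnum would need to accept, the same rule can be written with the Unicode classes in Go's regexp package. This is a sketch for comparison or for use as a test oracle, not the Ragel definitions themselves. The name `unicodeIdentifier` is made up, and choosing `\p{L}` and `\p{Nd}` is an assumption that lines up with `unicode.IsLetter` and `unicode.IsDigit`.

```go
package main

import (
	"fmt"
	"regexp"
)

// unicodeIdentifier is a hypothetical Unicode-aware counterpart of the
// identifier rule: a letter or underscore, then letters, decimal digits,
// underscores, or hyphens, with an optional trailing '?'.
var unicodeIdentifier = regexp.MustCompile(`^[\p{L}_][\p{L}\p{Nd}_-]*\??$`)

func main() {
	for _, name := range []string{"app", "描述", "описание", "1abc", "a b"} {
		fmt.Printf("%s -> %v\n", name, unicodeIdentifier.MatchString(name))
	}
}
```

A regex like this is too slow to sit in the lexer's hot path, but it gives an independent definition to check the regenerated scanner against.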

Option 2: Post-Processing with Go's Unicode Support (Simpler)

  1. Keep Ragel for basic tokenization but allow wider character sets
  2. Add Unicode validation in the Go code after Ragel processing
  3. Use Go's unicode.IsLetter() and unicode.IsDigit() to validate identifiers
  4. Modify the identifier pattern to accept any non-ASCII bytes, then validate

Recommended Solution: Hybrid Approach

  1. Modify scanner.rl to accept broader character ranges:

    identifier = (alpha | '_' | 0x80..0xFF) . (alnum | '_' | '-' | 0x80..0xFF)*  '?'?
    

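    This admits every byte of a multi-byte UTF-8 sequence (lead bytes 0xC2..0xF4 and continuation bytes 0x80..0xBF), along with byte values that never occur in valid UTF-8, so the Go-side validation in the next step is still required.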

  2. Add Go validation in the scanner's Identifier action:

    // Validate the identifier with Go's unicode package.
    if !isValidUnicodeIdentifier(lex.token()) {
        // Pseudocode: report a syntax error through the scanner's existing error path.
        return error
    }
    
  3. Implement validation function:

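    // Requires the "strings" and "unicode" packages.
    func isValidUnicodeIdentifier(s string) bool {
        // The grammar permits a single optional trailing '?', nowhere else.
        runes := []rune(strings.TrimSuffix(s, "?"))
        if len(runes) == 0 {
            return false
        }
        // First character must be a letter or underscore.
        // Invalid UTF-8 decodes to U+FFFD, which is not a letter, so it is rejected.
        if !unicode.IsLetter(runes[0]) && runes[0] != '_' {
            return false
        }
        // The rest can be letters, digits, underscores, or hyphens.
        for _, r := range runes[1:] {
            if !unicode.IsLetter(r) && !unicode.IsDigit(r) && r != '_' && r != '-' {
                return false
            }
        }
        return true
    }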
    

Files to Modify

  1. expressions/scanner.rl - Update identifier pattern
  2. expressions/scanner.go - Will be regenerated
  3. Add Unicode validation logic
  4. Update tests to include Unicode test cases

Testing Requirements

  • Add test cases with Chinese, Japanese, Arabic, Cyrillic characters
  • Ensure backward compatibility with existing ASCII identifiers
  • Test edge cases like combining characters, emoji
  • Verify performance impact of Unicode validation

Considerations

  • Performance impact of Unicode validation
  • Backward compatibility
  • May need to handle normalization (NFC vs NFD)
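
For the performance question, a rough micro-benchmark can compare an ASCII-only check with a rune-based one before touching the scanner. Both functions below are hypothetical stand-ins written for this sketch (they are not the library's code), and the input list is arbitrary. Timings depend on the machine, so no expected numbers are shown.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
	"unicode"
)

// isASCIIIdentifier is a byte-level baseline mirroring the current ASCII-only rule.
func isASCIIIdentifier(s string) bool {
	s = strings.TrimSuffix(s, "?")
	if s == "" {
		return false
	}
	for i := 0; i < len(s); i++ {
		c := s[i]
		isAlpha := (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_'
		isRest := (c >= '0' && c <= '9') || c == '-'
		if !isAlpha && (i == 0 || !isRest) {
			return false
		}
	}
	return true
}

// isUnicodeIdentifier applies the same rule with unicode.IsLetter and
// unicode.IsDigit, ranging over the string to avoid allocating a rune slice.
func isUnicodeIdentifier(s string) bool {
	s = strings.TrimSuffix(s, "?")
	if s == "" {
		return false
	}
	for i, r := range s {
		if unicode.IsLetter(r) || r == '_' {
			continue
		}
		if i > 0 && (unicode.IsDigit(r) || r == '-') {
			continue
		}
		return false
	}
	return true
}

// sink keeps the compiler from discarding the benchmarked calls.
var sink bool

func main() {
	inputs := []string{"app", "product_title", "描述", "описание"}
	run := func(f func(string) bool) testing.BenchmarkResult {
		return testing.Benchmark(func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				for _, s := range inputs {
					sink = f(s)
				}
			}
		})
	}
	fmt.Println("ascii:  ", run(isASCIIIdentifier))
	fmt.Println("unicode:", run(isUnicodeIdentifier))
}
```

A benchmark of the regenerated scanner on real templates would be the more meaningful measurement; this only bounds the cost of the validation step itself.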

osteele avatar Aug 30 '25 02:08 osteele