
Escape sequences in string arguments - unicode and others

Open Eli-Siegel1 opened this issue 1 year ago • 2 comments

Describe the bug

Handling of escape sequences in string arguments has some bugs and odd behaviors:

  1. Parsing a unicode escape preceded by an escaped backslash: "\\uXXXX" is treated like "\uXXXX", which makes it impossible to pass strings containing the literal pattern \uXXXX.
  2. This affects every escape sequence, not only unicode: strings inside triple quotes (block strings) are parsed the same way as ordinary quoted strings, although block strings should leave escape sequences untouched (possibly Ruby-related?).

Versions

graphql version: 2.0.19
rails (or other framework): 6.0.4.7
other applicable versions (graphql-batch, etc.): none

GraphQL schema

In both of the examples below I used the following simple resolver. Note that the result shown is the response rendered to the client, so escape characters are added there. I also printed the string passed to the resolver character by character to make sure I wasn't making a mistake:

class Resolvers::GetString < GraphQL::Schema::Resolver
  argument :string, String, "the name to return"
  type String, null: false

  def resolve(**arguments)
    puts "***"
    puts "string.length: #{arguments[:string].length}"
    puts "string value: \"#{arguments[:string]}\""
    # printing the chars one by one
    i = 0
    arguments[:string].each_char do |c|
      puts "string[#{i}] = \"#{c}\""
      i += 1
    end
    puts "***"
    arguments[:string]
  end
end

class Types::QueryType < Types::BaseObject
  field :get_string, resolver: Resolvers::GetString, description: "Just return the given string"
end


class ApplicationSchema < GraphQL::Schema
  query Types::QueryType
end

GraphQL query


Example for bug 1:

query bug1 {
  example1: get_string(string: "\u0064") # correct: should return "d"
  example2: get_string(string: "\\u0064") # incorrect: should return "\\u0064"
  # example3: get_string(string: "\u006") # correct: validation error
  example4: get_string(string: "\\u006") # correct: should return "\\u006"
}

the log:

***
string.length: 1
string value: "d"
string[0] = "d"
***
***
string.length: 1
string value: "d"
string[0] = "d"
***
***
string.length: 5
string value: "\u006"
string[0] = "\"
string[1] = "u"
string[2] = "0"
string[3] = "0"
string[4] = "6"
***

the result:

{
  "data": {
    "example1": "d",
    "example2": "d",
    "example4": "\\u006"
  }
}

Example for bug 2:

query bug2 {
  # example1: get_string(string: """\a""") # incorrect (fails): should be "\\a"
  # example2: get_string(string: """\u006""") # incorrect (fails): should be "\\u006"
  example3: get_string(string: """\n""") # incorrect: should be "\\n"
  example4: get_string(string: """\u0064""") # incorrect: should be "\\u0064"
}

the log:

***
string.length: 1
string value: "
"
string[0] = "
"
***
***
string.length: 1
string value: "d"
string[0] = "d"
***

the result:

{
  "data": {
    "example3": "\n",
    "example4": "d"
  }
}

Steps to reproduce

Run the queries above against the schema and resolver shown in the GraphQL schema section.

Expected behavior

parsing string arguments:

  1. "\\u0064" => "\u0064"
  2. """\u0064""" => "\u0064"

Actual behavior

parsing string arguments:

  1. "\\u0064" => "d"
  2. """\u0064""" => "d"
Exception backtrace: n/a

Additional context

I saw an old PR changing the behavior of \\u back and forth; it states the change was reverted because it broke another parser's test suite: https://github.com/rmosolgo/graphql-ruby/pull/372. That PR mentions a test suite in graphql-js (the linked file has since moved). Note: in those tests, the string literals as written are first parsed by the host language (JS). So, for example, the source literal '\u{1234}' reaches the lexer under test as a single character (U+1234), while '\\u{1234}' reaches it as the characters \, u, {, etc.
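The same layering can be demonstrated in plain Ruby, which applies \u escapes in double-quoted literals just as JS does. This snippet is only about Ruby string literals, not GraphQL-Ruby itself:

```ruby
# A double-quoted Ruby literal applies \u escapes at the language level,
# before any other parser ever sees the string:
language_level = "\u0064"
raise unless language_level == "d" && language_level.length == 1

# Doubling the backslash defers the escape: the resulting runtime string
# is the six characters \ u 0 0 6 4, which is what a GraphQL lexer
# under test would actually receive.
deferred = "\\u0064"
raise unless deferred.length == 6
raise unless deferred == '\u0064' # single quotes: no \u processing
```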

The tests there behave according to the expected behavior I described above. Examples:

  1. In this test the lexer is given the characters "unicode \u1234\u5678\u90AB\uCDEF" (single backslashes) and it returns the characters unicode ሴ噸邫췯:
    expect(lexOne('"unicode \\u1234\\u5678\\u90AB\\uCDEF"')).to.contain({
      kind: TokenKind.STRING,
      start: 0,
      end: 34,
      value: 'unicode \u1234\u5678\u90AB\uCDEF',
    });

This works the same in GraphQL-Ruby.

  2. In this test the lexer is given the characters "escaped \n\r\b\t\f" and returns escaped followed by a newline, carriage return, and so on:
    expect(lexOne('"escaped \\n\\r\\b\\t\\f"')).to.contain({
      kind: TokenKind.STRING,
      start: 0,
      end: 20,
      value: 'escaped \n\r\b\t\f',
    });

This works the same in GraphQL-Ruby.

  3. In this test the lexer is given the characters """unescaped \n\r\b\t\f\u1234""" (a block string) and returns the characters unescaped \n\r\b\t\f\u1234 unchanged:
    expect(lexOne('"""unescaped \\n\\r\\b\\t\\f\\u1234"""')).to.contain({
      kind: TokenKind.BLOCK_STRING,
      start: 0,
      end: 32,
      line: 1,
      column: 1,
      value: 'unescaped \\n\\r\\b\\t\\f\\u1234',
    });

In GraphQL-Ruby the escape sequences here are converted to special characters, the same as in test 2; that is the bug.

  4. To check the escaping of a backslash before a unicode escape, I cloned the repo and added a local test. In it the lexer is given the characters "escaped unicode \\u{1234}\\u{5678}\\u{90AB}\\u{CDE" (doubled backslashes) and the result is the characters escaped unicode \u{1234}\u{5678}\u{90AB}\u{CDE. The unicode is not parsed, and the escaping of the \ before the u works.
    expect(
      lexOne('"escaped unicode \\\\u{1234}\\\\u{5678}\\\\u{90AB}\\\\u{CDE"'),
    ).to.contain({
      kind: TokenKind.STRING,
      start: 0,
      end: 52,
      value: 'escaped unicode \\u{1234}\\u{5678}\\u{90AB}\\u{CDE',
    });

In GraphQL-Ruby the first three unicode escapes are parsed and the last one is not; the escaping of the \ before the u does not work.

Eli-Siegel1 · Dec 04 '23 21:12

Hey! Thanks for the detailed writeup. Yes, as you noticed, there have been many attempts to match the spec on this 😅

I've created a replication script for this bug:

Replicate unicode escape bugs

require "bundler/inline"

gemfile do
  gem "graphql", "2.1.7"
end

class MySchema < GraphQL::Schema
  class QueryType < GraphQL::Schema::Object
    field :get_string, String do
      argument :string, String
    end

    def get_string(string:)
      puts "***"
      puts "string.length: #{string.length}"
      puts "string value: \"#{string}\""
      string.each_char.each_with_index do |c, i|
        puts "string[#{i}] = \"#{c}\""
      end
      puts "***"
      string
    end
  end

  query(QueryType)
end

query_str = File.read("./example1.graphql")
puts query_str
pp MySchema.execute(query_str).to_h

query_str = File.read("./example2.graphql")
puts query_str
pp MySchema.execute(query_str).to_h
# example1.graphql 
{
  example1: getString(string: "\u0064") # correct: should return "d"
  example2: getString(string: "\\u0064") # incorrect: should return "\\u0064"
  # example3: getString(string: "\u006") # correct: validation error
  example4: getString(string: "\\u006") # correct: should return "\\u006"
}
# example2.graphql 
query bug2 {
  # example1: getString(string: """\a""") # incorrect (fails): should be "\\a"
  # example2: getString(string: """\u006""") # incorrect (fails): should be "\\u006"
  example3: getString(string: """\n""") # incorrect: should be "\\n"
  example4: getString(string: """\u0064""") # incorrect: should be "\\u0064"
}

It seems like it boils down to two things:

  • In block strings, GraphQL-Ruby is interpreting \... as escapes, but it shouldn't.
  • In single-quoted strings, GraphQL-Ruby is interpreting \\u just like \u, but it shouldn't -- it should be a literal \u instead.
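Per the GraphQL spec, escapes apply only in ordinary quoted strings, and an escaped backslash must be consumed before any \u lookahead. The sketch below illustrates that rule in plain Ruby; unescape_quoted is a hypothetical helper, not GraphQL-Ruby's actual lexer code:

```ruby
# Hypothetical single-pass escape processor for the body of a
# double-quoted GraphQL string. Block strings would bypass this entirely.
ESCAPES = {
  '"' => '"', "\\" => "\\", "/" => "/",
  "b" => "\b", "f" => "\f", "n" => "\n", "r" => "\r", "t" => "\t"
}.freeze

def unescape_quoted(body)
  # One scan, so each backslash is consumed exactly once: in \\u0064
  # the first \ escapes the second, and "u0064" is left alone.
  body.gsub(/\\(u[0-9A-Fa-f]{4}|.)/) do
    seq = Regexp.last_match(1)
    if seq.start_with?("u")
      [seq[1..-1].to_i(16)].pack("U") # \uXXXX -> the code point
    else
      ESCAPES.fetch(seq) { raise "invalid escape: \\#{seq}" }
    end
  end
end

unescape_quoted('\u0064')    # => "d"
unescape_quoted('\\\\u0064') # => the six characters \u0064
```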

I've got an open PR to rewrite the lexer (#4718) and I'll dive in on that when I'm done there!

rmosolgo · Dec 08 '23 16:12

I've got a fix in the works over at #4824. The most interesting part of the bug was that these two .gsub! calls resulted in double-replacement of escaped characters:

https://github.com/rmosolgo/graphql-ruby/blob/18f3deda9944febcbd97a84a4943362f68c8a373/lib/graphql/language/lexer.rb#L302-L303

They had to be merged into a single gsub! call.
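The double-replacement is easy to reproduce in isolation. The patterns below are simplified stand-ins for illustration, not the actual expressions from lexer.rb:

```ruby
# Buggy two-pass version: pass 1 turns \\u0064 into \u0064, and pass 2
# then treats the freshly produced backslash as a real unicode escape.
def two_pass(str)
  s = str.dup
  s.gsub!(/\\\\/) { "\\" }                                   # collapse \\
  s.gsub!(/\\u([0-9A-Fa-f]{4})/) { [$1.to_i(16)].pack("U") } # apply \uXXXX
  s
end

# Fixed single-pass version: one alternation consumes each backslash
# exactly once, so an escaped backslash can never feed a later pattern.
def one_pass(str)
  str.gsub(/\\\\|\\u([0-9A-Fa-f]{4})/) do |m|
    m == "\\\\" ? "\\" : [$1.to_i(16)].pack("U")
  end
end

input = '\\\\u0064'       # the seven characters \ \ u 0 0 6 4
two_pass(input) # => "d"  (wrong: the escape is applied twice)
one_pass(input) # => the six characters \u0064 (right: backslash kept)
```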

rmosolgo · Feb 05 '24 23:02