prism icon indicating copy to clipboard operation
prism copied to clipboard

Infinite loop/memory exhaustion parsing a rexexp named capture with a malformed unicode escape

Open stevenjohnstone opened this issue 1 month ago • 5 comments

Example input foo.txt:

/(?<\u{21!3}>foo)/ =~ "foo"

causes

./bin/parse foo.txt

to get stuck in pm_named_capture_escape_unicode using CPU and allocating memory on each loop.

Similar results for entering the code into irb etc.

I've patched locally with

diff --git a/src/prism.c b/src/prism.c
--- a/src/prism.c
+++ b/src/prism.c
@@ -21163,6 +21163,9 @@ pm_named_capture_escape_unicode(pm_parser_t *parser, pm_buffer_t *unescaped, con
         }

         size_t length = pm_strspn_hexadecimal_digit(cursor, end - cursor);
+        if (length == 0) {
+            break;
+        }
         uint32_t value = escape_unicode(parser, cursor, length);

and the immediate problem is fixed. Stopped short of a pull request because there may be a diagnostic error that's required in this case?

stevenjohnstone avatar Nov 18 '25 12:11 stevenjohnstone

Tested on 16e5efbd07e2acc8d00c43e82f0c090873d9d7d6 btw

stevenjohnstone avatar Nov 18 '25 12:11 stevenjohnstone

Thanks for the report! It seems like this should emit PM_ERR_ESCAPE_INVALID_UNICODE_LIST. You can check with comparing against the old parser:

$ ruby --parser=parse.y -c test.rb
ruby: test.rb:1: invalid Unicode list: /(?<\u{21!3}>foo)/ (SyntaxError)

Do you want to try a PR for this?

Earlopain avatar Nov 18 '25 13:11 Earlopain

Do you want to try a PR for this?

@Earlopain I gave it a spin with

diff --git a/src/prism.c b/src/prism.c
index a87932f1..0f7da79e 100644
--- a/src/prism.c
+++ b/src/prism.c
@@ -21153,6 +21153,7 @@ pm_named_capture_escape_unicode(pm_parser_t *parser, pm_buffer_t *unescaped, con
     }

     cursor++;
+    const uint8_t *value_start = cursor;
     for (;;) {
         while (cursor < end && *cursor == ' ') cursor++;

@@ -21163,6 +21164,12 @@ pm_named_capture_escape_unicode(pm_parser_t *parser, pm_buffer_t *unescaped, con
         }

         size_t length = pm_strspn_hexadecimal_digit(cursor, end - cursor);
+        if (length == 0) {
+            cursor++;
+            fprintf(stderr, "parser->start = %p, parser->end = %p, cursor = %p, value_start = %p\n", (void *)parser->start, (void *)parser->end, (void*)cursor, (void*)value_start);
+            PM_PARSER_ERR_FORMAT(parser, value_start, cursor, PM_ERR_ESCAPE_INVALID_UNICODE_LIST, (int) (cursor - value_start), value_start);
+            break;
+        }
         uint32_t value = escape_unicode(parser, cursor, length);

         (void) pm_buffer_append_unicode_codepoint(unescaped, value);
@@ -21877,6 +21884,7 @@ parse_expression_infix(pm_parser_t *parser, pm_node_t *node, pm_binding_power_t
                 // If we have a regular expression node, then we can just parse
                 // the named captures directly off the unescaped string.
                 const pm_string_t *content = &((pm_regular_expression_node_t *) node)->unescaped;
+                fprintf(stderr, "content->source = %p\n", (void *)content->source);
                 result = parse_regular_expression_named_captures(parser, content, call, PM_NODE_FLAG_P(node, PM_REGULAR_EXPRESSION_FLAGS_EXTENDED));

I see output like

content->source = 0xb0cf74010
parser->start = 0xb0cdbc6a0, parser->end = 0xb0cdbc6bc, cursor = 0xb0cf74019, value_start = 0xb0cf74016
Errors:
[#<Prism::ParseError @type=:escape_invalid_unicode_list @message="invalid Unicode list: 21!" @location=#<Prism::Location @start_offset=1800566 @length=3 start_line=2> @level=:syntax>]

AST:
@ ProgramNode (location: (1,0)-(1,26))
├── flags: ∅
├── locals: []
└── statements:
    @ StatementsNode (location: (1,0)-(1,26))
    ├── flags: ∅
    └── body: (length: 1)
        └── @ CallNode (location: (1,0)-(1,26))
            ├── flags: newline
            ├── receiver:
            │   @ RegularExpressionNode (location: (1,0)-(1,17))
            │   ├── flags: static_literal, forced_us_ascii_encoding
            │   ├── opening_loc: (1,0)-(1,1) = "/"
            │   ├── content_loc: (1,1)-(1,16) = "(?<\\u{21!3}>oo)"
            │   ├── closing_loc: (1,16)-(1,17) = "/"
            │   └── unescaped: "(?<\\u{21!3}>oo)"
            ├── call_operator_loc: ∅
            ├── name: :=~
            ├── message_loc: (1,18)-(1,20) = "=~"
            ├── opening_loc: ∅
            ├── arguments:
            │   @ ArgumentsNode (location: (1,21)-(1,26))
            │   ├── flags: ∅
            │   └── arguments: (length: 1)
            │       └── @ StringNode (location: (1,21)-(1,26))
            │           ├── flags: ∅
            │           ├── opening_loc: (1,21)-(1,22) = "\""
            │           ├── content_loc: (1,22)-(1,25) = "foo"
            │           ├── closing_loc: (1,25)-(1,26) = "\""
            │           └── unescaped: "foo"
            ├── closing_loc: ∅
            ├── equal_loc: ∅
            └── block: ∅

The location offsets are wrong because the cursor pointer doesn't point between parser.start and parser.end I think? Looks like it's pointing to a copy of the source?

Is there a way to relate cursor back to its location in the source?

stevenjohnstone avatar Nov 18 '25 14:11 stevenjohnstone

I wonder if that is the right place to apply the fix actually. Consider a file with just /(?<\u{21!x}>foo)/ as the content:

$ ruby -c test.rb
Syntax OK

$ ruby test.rb
test.rb: 
test.rb:1: invalid Unicode list: /(?<\u{21!x}>foo)/ (SyntaxError)

$ ruby --parser=parse.y -c test.rb
ruby: test.rb:1: invalid Unicode list: /(?<\u{21!x}>foo)/ (SyntaxError)

$ ruby --parser=parse.y test.rb
test.rb: 
test.rb:1: invalid Unicode list: /(?<\u{21!x}>foo)/ (SyntaxError)

Prism fails to reject this at parse time, only at runtime is it noticing that something is wrong. I would guess if that is made into a proper parse-time error, then this issue would be fixed (your defensive fix still makes sense to me though). The codepath in your patch is not taken for this code snippet.

And yeah, the regexp is not getting parsed by the main parser, so the cursor does not line up with the original source.

Earlopain avatar Nov 18 '25 15:11 Earlopain

Actually, I think your fix is acceptable without it becoming a parse error. Prism does not report any errors for invalid escape sequences in a regex.

So that is more of a different preexisting issue you don't have to bother with. Just no infinite loop for now seems fine to me. Sorry about the confusion.

Earlopain avatar Nov 18 '25 16:11 Earlopain