nokogiri icon indicating copy to clipboard operation
nokogiri copied to clipboard

`Nokogiri_error_raise` as a libxml2 handler considered harmful

Open npac opened this issue 7 years ago • 7 comments

Memory leak is observed while searching in a document with undefined namespace prefix.

# Nokogiri (1.7.0.1)
    ---
    warnings: []
    nokogiri: 1.7.0.1
    ruby:
      version: 2.3.1
      platform: x86_64-linux
      description: ruby 2.3.1p112 (2016-04-26 revision 54768) [x86_64-linux]
      engine: ruby
    libxml:
      binding: extension
      source: packaged
      libxml2_path: "/opt/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/nokogiri-1.7.0.1/ports/x86_64-pc-linux-gnu/libxml2/2.9.4"
      libxslt_path: "/opt/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/nokogiri-1.7.0.1/ports/x86_64-pc-linux-gnu/libxslt/1.1.29"
      libxml2_patches: []
      libxslt_patches: []
      compiled: 2.9.4
      loaded: 2.9.4

Reproduced with:

require 'nokogiri'
while true do 
  begin
    #parse simple xml. no namespace definition
    doc = Nokogiri::XML('<ns1:Root></ns1:Root>')
    # below code raises exception and leaks memory  
    body = doc.search('//ns1:Root').first 
    puts body.to_xml 
  rescue => e
   puts e
  end 
end 

npac avatar Mar 01 '17 21:03 npac

Thanks for opening this issue! I'll take a look.

On Mar 1, 2017 4:12 PM, "pcapcanari" [email protected] wrote:

Memory leak is observed while searching in a document with undefined namespace prefix.

Nokogiri (1.7.0.1)

---
warnings: []
nokogiri: 1.7.0.1
ruby:
  version: 2.3.1
  platform: x86_64-linux
  description: ruby 2.3.1p112 (2016-04-26 revision 54768) [x86_64-linux]
  engine: ruby
libxml:
  binding: extension
  source: packaged
  libxml2_path: "/opt/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/nokogiri-1.7.0.1/ports/x86_64-pc-linux-gnu/libxml2/2.9.4"
  libxslt_path: "/opt/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/nokogiri-1.7.0.1/ports/x86_64-pc-linux-gnu/libxslt/1.1.29"
  libxml2_patches: []
  libxslt_patches: []
  compiled: 2.9.4
  loaded: 2.9.4

Reproduced with:

require 'nokogiri'while true do begin #parse simple xml. no namespace definition doc = Nokogiri::XML('ns1:Root</ns1:Root>') # below code raises exception and leaks memory body = doc.search('//ns1:Root').first puts body.to_xml rescue => e puts e end end

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/sparklemotion/nokogiri/issues/1610, or mute the thread https://github.com/notifications/unsubscribe-auth/AAAgD2jPxXM6kfIeaRJbE38Ji1lrOTwJks5rhd9SgaJpZM4MQNWE .

flavorjones avatar Mar 02 '17 00:03 flavorjones

I've reproduced this behavior. Digging in.

flavorjones avatar Mar 06 '17 00:03 flavorjones

OK, this leak appears to come from using Nokogiri_error_raise to raise an exception during evaluation of the xpath query, which under the hood does a C longjmp and prevents libxml2 from cleaning up memory that had been allocated before the error was encountered.

Nokogiri_error_raise is used in a handful of places, and so now I'm worried that we may be leaking memory under other circumstances as well. I think I'd like to take a look at overhauling our current error-handling pattern to avoid raising exceptions while libxml2 is in the middle of something.

flavorjones avatar Mar 06 '17 00:03 flavorjones

I've audited the C code for non-trivial raises, use of handlers, and general patterns of error handling:

places where Nokogiri_error_raise is used

ext/nokogiri/xml_sax_push_parser.c

38-  if (xmlParseChunk(ctx, chunk, size, Qtrue == _last_chunk ? 1 : 0)) {
39-    if (!(ctx->options & XML_PARSE_RECOVER)) {
40-      xmlErrorPtr e = xmlCtxtGetLastError(ctx);
41:      Nokogiri_error_raise(NULL, e);
42-    }
43-  }
44-

characterization:

  • call: xmlParseChunk
  • handlers: none
  • check: function return value
  • raise: xmlErrorPtr from xmlCtxtGetLastError
  • errors: none

ext/nokogiri/xml_relax_ng.c

81-  if(NULL == schema) {
82-    xmlErrorPtr error = xmlGetLastError();
83-    if(error)
84:      Nokogiri_error_raise(NULL, error);
85-    else
86-      rb_raise(rb_eRuntimeError, "Could not parse document");
87-

characterization:

  • call: xmlRelaxNGParse
  • handlers:
    • xmlSetStructuredErrorFunc: Nokogiri_error_array_pusher
    • xmlRelaxNGSetParserStructuredErrors: Nokogiri_error_array_pusher
  • check: return value
  • raise:
    • xmlGetLastError
    • string
  • errors: on document
133-  if(NULL == schema) {
134-    xmlErrorPtr error = xmlGetLastError();
135-    if(error)
136:      Nokogiri_error_raise(NULL, error);
137-    else
138-      rb_raise(rb_eRuntimeError, "Could not parse document");
139-

characterization:

  • call: xmlRelaxNGParse
  • handlers:
    • xmlSetStructuredErrorFunc: Nokogiri_error_array_pusher
    • xmlRelaxNGSetParserStructuredErrors: Nokogiri_error_array_pusher
  • check: return value
  • raise:
    • xmlGetLastError
    • string
  • errors: on document

ext/nokogiri/html_sax_push_parser.c

23-  if(htmlParseChunk(ctx, chunk, size, Qtrue == _last_chunk ? 1 : 0)) {
24-    if (!(ctx->options & XML_PARSE_RECOVER)) {
25-      xmlErrorPtr e = xmlCtxtGetLastError(ctx);
26:      Nokogiri_error_raise(NULL, e);
27-    }
28-  }
29-

characterization:

  • call: htmlParseChunk
  • handlers: none
  • check: return value
  • raise: xmlCtxtGetLastError
  • errors: none

ext/nokogiri/xml_xpath_context.c

209-  }
210-
211-  xmlResetLastError();
212:  xmlSetStructuredErrorFunc(NULL, Nokogiri_error_raise);
213-
214-  /* For some reason, xmlXPathEvalExpression will blow up with a generic error */
215-  /* when there is a non existent function. */

characterization:

  • call: xmlXPathEvalExpression
  • handlers:
    • xmlSetStructuredErrorFunc: Nokogiri_error_raise # TODO bug!
    • xmlSetGenericErrorFunc
  • check: handler # TODO bug!
  • raise: handler

ext/nokogiri/xml_schema.c

120-  if(NULL == schema) {
121-    xmlErrorPtr error = xmlGetLastError();
122-    if(error)
123:      Nokogiri_error_raise(NULL, error);
124-    else
125-      rb_raise(rb_eRuntimeError, "Could not parse document");
126-

characterization:

  • call: xmlSchemaParse
  • handlers:
    • xmlSetStructuredErrorFunc: Nokogiri_error_array_pusher
    • xmlSchemaSetParserStructuredErrors: Nokogiri_error_array_pusher
  • check: return value
  • raise:
    • xmlGetLastError
    • string
  • errors: on document
173-  if(NULL == schema) {
174-    xmlErrorPtr error = xmlGetLastError();
175-    if(error)
176:      Nokogiri_error_raise(NULL, error);
177-    else
178-      rb_raise(rb_eRuntimeError, "Could not parse document");
179-

characterization:

  • call: xmlSchemaParse
  • handlers:
    • xmlSetStructuredErrorFunc: Nokogiri_error_array_pusher
    • xmlSchemaSetParserStructuredErrors: Nokogiri_error_array_pusher
  • check: return value
  • raise:
    • xmlGetLastError
    • string
  • errors: on document

places where rb_exc_raise is used

ext/nokogiri/xml_reader.c

458-
459-  error = xmlGetLastError();
460-  if(error)
461:    rb_exc_raise(Nokogiri_wrap_xml_syntax_error(error));
462-  else
463-    rb_raise(rb_eRuntimeError, "Error pulling: %d", ret);
464-

characterization:

  • call: xmlTextReaderRead
  • handlers:
    • xmlSetStructuredErrorFunc: Nokogiri_error_array_pusher
  • check: return value
  • raise:
    • xmlGetLastError
    • string
  • errors: none # TODO this may be a bug?

ext/nokogiri/xml_node.c

1401-
1402-    error = xmlGetLastError();
1403-    if(error)
1404:      rb_exc_raise(Nokogiri_wrap_xml_syntax_error(error));
1405-    else
1406-      rb_raise(rb_eRuntimeError, "Could not perform xinclude substitution");
1407-  }

characterization:

  • call: xmlXIncludeProcessTreeFlags
  • handlers:
    • xmlSetStructuredErrorFunc: Nokogiri_error_array_pusher
  • check: return value
  • raise:
    • xmlGetLastError
    • string
  • errors: none # TODO this may be a bug?

ext/nokogiri/xslt_stylesheet.c

80-    if (!ss) {
81-	xmlFreeDoc(xml_cpy);
82-	exception = rb_exc_new3(rb_eRuntimeError, errstr);
83:	rb_exc_raise(exception);
84-    }
85-
86-    return Nokogiri_wrap_xslt_stylesheet(ss);

characterization:

  • call: xsltParseStylesheetDoc
  • handlers:
    • xsltSetGenericErrorFunc: xslt_generic_error_handler
  • check: return value
  • raise: handler error string
  • errors: none
177-
178-    if (parse_error_occurred) {
179-      exception = rb_exc_new3(rb_eRuntimeError, errstr);
180:      rb_exc_raise(exception);
181-    }
182-
183-    return Nokogiri_wrap_xml_document((VALUE)0, result) ;

characterization:

  • call: xsltApplyStylesheet
  • handlers:
    • xsltSetGenericErrorFunc: xslt_generic_error_handler
    • xmlSetGenericErrorFunc: swallow_superfluous_xml_errors
  • check: handler error string # TODO bug? might be leaking return value
  • raise: handler error string
  • errors: none

ext/nokogiri/xml_xpath_context.c

221-
222-  if(xpath == NULL) {
223-    xmlErrorPtr error = xmlGetLastError();
224:    rb_exc_raise(Nokogiri_wrap_xml_syntax_error(error));
225-  }
226-
227-  assert(ctx->doc);

characterization:

  • call: xmlXPathEvalExpression
  • handlers:
    • xmlSetStructuredErrorFunc: Nokogiri_error_raise
    • xmlSetGenericErrorFunc: xpath_generic_exception_handler
  • check: return value
  • raise: xmlGetLastError
  • errors: none

ext/nokogiri/xml_document.c

254-
255-    error = xmlGetLastError();
256-    if(error)
257:      rb_exc_raise(Nokogiri_wrap_xml_syntax_error(error));
258-    else
259-      rb_raise(rb_eRuntimeError, "Could not parse document");
260-

characterization:

  • call: xmlReadIO
  • handlers:
    • xmlSetStructuredErrorFunc: Nokogiri_error_array_pusher
  • check: return value
  • raise:
    • xmlGetLastError
    • string
  • errors: on document
298-
299-    error = xmlGetLastError();
300-    if(error)
301:      rb_exc_raise(Nokogiri_wrap_xml_syntax_error(error));
302-    else
303-      rb_raise(rb_eRuntimeError, "Could not parse document");
304-

characterization:

  • call: xmlReadMemory
  • handlers:
    • xmlSetStructuredErrorFunc: Nokogiri_error_array_pusher
  • check: return value
  • raise:
    • xmlGetLastError
    • string
  • errors: on document
446-  if(NULL == ptr) {
447-    xmlErrorPtr error = xmlGetLastError();
448-    if(error)
449:      rb_exc_raise(Nokogiri_wrap_xml_syntax_error(error));
450-    else
451-      rb_raise(rb_eRuntimeError, "Could not create entity");
452-

characterization:

  • call: xmlAddDocEntity
  • handlers: none # TODO bug?
  • check: return value
  • raise:
    • xmlGetLastError
    • string
  • errors: none

ext/nokogiri/html_document.c

66-    VALUE encoding_found = rb_funcall(io, id_encoding_found, 0);
67-    if (!NIL_P(encoding_found)) {
68-      xmlFreeDoc(doc);
69:      rb_exc_raise(encoding_found);
70-    }
71-  }
72-

characterization:

  • call: htmlReadIO
  • handlers:
    • xmlSetStructuredErrorFunc: Nokogiri_error_array_pusher
  • check: return value rb_respond_to(io, id_encoding_found) # TODO i have no idea what this is doing
  • raise: encoding_found
  • errors: none
77-
78-    error = xmlGetLastError();
79-    if(error)
80:      rb_exc_raise(Nokogiri_wrap_xml_syntax_error(error));
81-    else
82-      rb_raise(rb_eRuntimeError, "Could not parse document");
83-

characterization:

  • call: htmlReadIO
  • handlers:
    • xmlSetStructuredErrorFunc: Nokogiri_error_array_pusher
  • check: return value
  • raise:
    • xmlGetLastError
    • string
  • errors: on document
123-
124-    error = xmlGetLastError();
125-    if(error)
126:      rb_exc_raise(Nokogiri_wrap_xml_syntax_error(error));
127-    else
128-      rb_raise(rb_eRuntimeError, "Could not parse document");
129-

characterization:

  • call: htmlReadMemory
  • handlers:
    • xmlSetStructuredErrorFunc: Nokogiri_error_array_pusher
  • check: return value
  • raise:
    • xmlGetLastError
    • string
  • errors: on document

places where rb_raise is used

ext/nokogiri/xml_reader.c

547-
548-  if(reader == NULL) {
549-    xmlFreeTextReader(reader);
550:    rb_raise(rb_eRuntimeError, "couldn't create a parser");
551-  }
552-
553-  rb_reader = Data_Wrap_Struct(klass, NULL, dealloc, reader);

characterization:

  • call: xmlReaderForMemory
  • handlers: none
  • check: return value
  • raise: string
  • errors: none
592-
593-  if(reader == NULL) {
594-    xmlFreeTextReader(reader);
595:    rb_raise(rb_eRuntimeError, "couldn't create a parser");
596-  }
597-
598-  rb_reader = Data_Wrap_Struct(klass, NULL, dealloc, reader);

characterization:

  • call: xmlReaderForIO
  • handlers: none
  • check: return value
  • raise: string
  • errors: none

ext/nokogiri/xml_node.c

1491-    switch (error) {
1492-      case XML_ERR_INTERNAL_ERROR:
1493-      case XML_ERR_NO_MEMORY:
1494:	rb_raise(rb_eRuntimeError, "error parsing fragment (%d)", error);
1495-	break;
1496-      default:
1497-	break;

characterization:

  • call: xmlXIncludeProcessTreeFlags
  • handlers:
    • xmlSetStructuredErrorFunc: Nokogiri_error_array_pusher
  • check: return value
  • raise:
    • xmlGetLastError
    • string
  • errors: none

ext/nokogiri/xml_sax_push_parser.c

69-          filename
70-        );
71-  if (ctx == NULL) {
72:    rb_raise(rb_eRuntimeError, "Could not create a parser context");
73-  }
74-
75-  ctx->userData = NOKOGIRI_SAX_TUPLE_NEW(ctx, self);

characterization:

  • call: xmlCreatePushParserCtxt
  • handlers: none
  • check: return value
  • raise: string
  • errors: none
93-  Data_Get_Struct(self, xmlParserCtxt, ctx);
94-
95-  if (xmlCtxtUseOptions(ctx, (int)NUM2INT(options)) != 0) {
96:    rb_raise(rb_eRuntimeError, "Cannot set XML parser context options");
97-  }
98-
99-  return Qnil;

characterization:

  • call: xmlCtxtUseOptions
  • handlers: none
  • check: return value
  • raise: string
  • errors: none

flavorjones avatar Mar 06 '17 02:03 flavorjones

I'll further note that this is the script I used to generate the samplings of code I filtered through above (in case I need to re-do the analysis after so much time):

#!/usr/bin/env ruby

target = "ext/nokogiri"
command = "ack -C3 --group --nocolor --ignore-file=match:xml_syntax_error*"

expressions = %w[
  Nokogiri_error_raise
  rb_exc_raise
  rb_raise
]

expressions.each do |expression|
  puts "## places where `#{expression}` is used"
  puts

  cmd = "#{command} #{expression} #{target}"
  output = `#{cmd}`

  output.split("\n").each do |line|
    if line =~ /\.[ch]$/
      puts "### #{line}"
      puts
      puts "```C"
    elsif line.empty?
      puts "```"
      puts
    elsif line =~ /^--$/
      puts "```"
      puts 
      puts "```C"
    else
      puts line
    end
  end
  puts "```"
  puts
end

flavorjones avatar Dec 29 '18 06:12 flavorjones

Note that TruffleRuby would also prefer if we didn't raise from a native code callback, see https://github.com/sparklemotion/nokogiri/issues/1882

flavorjones avatar Oct 12 '20 18:10 flavorjones

Recent relevant post from @peterzhu2118 with explanatory notes: https://blog.peterzhu.ca/ruby-c-ext-part-8/

Note also some exploratory work I've started at #2096

flavorjones avatar Apr 10 '22 18:04 flavorjones