html: add option to set MaxBuf in Parse

Open Jarcis-cy opened this issue 8 months ago • 4 comments

I encountered an issue when using html.Parse that triggers the following call chain: html.Parse -> ParseWithOptions -> p.parse() -> p.tokenizer.Next() -> readByte(). In the readByte() function, there's a logic block:

if z.maxBuf > 0 && z.raw.end-z.raw.start >= z.maxBuf {
    z.err = ErrBufferExceeded
    return 0
}

This check only takes effect when maxBuf is set. However, html.Parse gives no access to the tokenizer, so Tokenizer.SetMaxBuf cannot be called, and there is no exported ParseOption that sets it through ParseWithOptions either. As a result, when parsing a very large HTML document, such as this page: http://vod.culture.ihns.cas.cn, memory usage can increase significantly.

To solve this problem, I wrote a function using reflection:

// Requires imports: "reflect", "unsafe", and "golang.org/x/net/html".
func ParseOptionSetMaxBuf(maxBuf int) html.ParseOption {
	// Build a func(*parser) value; *parser is the unexported parameter type of html.ParseOption.
	funcValue := reflect.MakeFunc(
		reflect.FuncOf([]reflect.Type{reflect.TypeOf((*html.ParseOption)(nil)).Elem().In(0)}, nil, false),
		func(args []reflect.Value) (results []reflect.Value) {
			parserValue := args[0].Elem()

			// Reach the unexported tokenizer field via unsafe, since reflect alone cannot read it.
			tokenizerField := parserValue.FieldByName("tokenizer")
			tokenizerPtr := reflect.NewAt(tokenizerField.Type(), unsafe.Pointer(tokenizerField.UnsafeAddr())).Elem().Interface()

			if tokenizer, ok := tokenizerPtr.(interface {
				SetMaxBuf(int)
			}); ok {
				tokenizer.SetMaxBuf(maxBuf)
			}

			return nil
		},
	)
	var option html.ParseOption
	reflect.ValueOf(&option).Elem().Set(funcValue)
	return option
}

And then used it as follows:

html.ParseWithOptions(bytes.NewReader(data), util.ParseOptionSetMaxBuf(len(data)*3))

In my testing, setting maxBuf to at least 1.04 times the body length was enough for parsing to complete normally.

Therefore, would it be feasible to introduce a function similar to ParseOptionEnableScripting that allows users to set MaxBuf?

Environment:

  • Go version: 1.21
  • OS: Tested on Ubuntu 22.04 and Windows 11

Jarcis-cy · Jun 20 '24 08:06