
Issues regarding the new Arena API

Open vincent-163 opened this issue 6 years ago • 3 comments

Some parts of the new Arena API that I find hard to use:

  • Object only provides an O(n) implementation of Get/Set, which is fine in most cases, but it makes it impractical to build very large objects. For faster Set it might make sense to add another method called Append that skips the check for equal keys and appends to kvs directly. For Get I see two options: create a map[string]*Value on the first Get and use the map on subsequent Get calls; or add a ToMap convenience function that converts the Object to a map[string]*Value for faster gets.
  • The only way to create an Array is to call Arena.NewArray() and then SetArrayItem for each index. This forces the user to track the array length themselves and prevents using the more idiomatic append(). The API also offers no way to shrink an array or to take a slice from another []*Value directly. Consider adding a []*Value parameter to NewArray (a breaking API change), or adding a Replace method that replaces everything in the array.
  • Commit https://github.com/valyala/fastjson/commit/c188f766b4b45af57d230cc9d0291130a8ae796b might cause memory leaks. For example, when a very large array is reused as an object, the array is kept in memory. In fact, this was the behavior before the commit as well. Consider emptying out everything, including slice pointers.

BTW, this might not be related, but I don't think optimizations like https://github.com/valyala/fastjson/commit/fb293dfd043e3f46ebab89b83b157c2ee376728c do anything, since there is only one memory read anyway and the CPU is clever enough to execute the checks in parallel.

vincent-163 avatar Jan 21 '19 14:01 vincent-163

Agreed on all the remarks.

Commit c188f76 might cause memory leaks. For example, when a very large array is reused as an object, the array is kept in memory. In fact, this was the behavior before the commit as well. Consider emptying out everything, including slice pointers.

I'm aware of the possible memory leak, but couldn't find a good solution yet. Straightforward zeroing of all the pointers on Reset slows down the Parser on JSONs with many items (>10K). Probably a separate ReleaseRefs method should be added to Parser for zeroing all the pointers.

BTW, any object pooling has a possible memory leak. For instance, if the Parser was used for parsing a huge JSON and then re-used for parsing small JSONs, it will waste the memory allocated during the huge JSON parsing. Distinct ParserPools may be used for parsing variously-sized JSON types in order to minimize the possibility of such a memory leak.

I don't think optimizations like fb293df is doing anything since there is only one memory read anyway, and CPU is clever enough to execute them in parallel.

Mostly agree, but the CPU cannot execute these checks in parallel because of a Go language guarantee: && short-circuits, so the conditions must be evaluated sequentially. Otherwise something like the following would break:

type A struct{ n int }

func foo(a *A) bool {
    // This would panic on a == nil if the conditions were evaluated in parallel.
    return a != nil && a.n > 0
}

valyala avatar Jan 21 '19 20:01 valyala

I'm aware of the possible memory leak, but couldn't find a good solution yet. Straightforward zeroing of all the pointers on Reset slows down the Parser on JSONs with many items (>10K). Probably a separate ReleaseRefs method should be added to Parser for zeroing all the pointers.

I'm afraid the current Parser benchmarks are actually flawed, which makes them look faster than they would be in practice. The same Parser is used for parsing the exact same JSON, which makes it much more likely to reuse arrays than usual. Consider the following JSON: {"a":"hello", "b": <a large array>}. When the same parser is then used for {"b": <a large array>, "a": "hello"}, the array cannot be reused, causing memory leaks and slowdowns. Such reordering might frequently happen in practice (especially given Go's randomization of map iteration order), but that's not covered in the benchmarks. You should remove benchPool from the benchmarks, since it is deceptive.

With my humble algorithmic knowledge, I did find a solution, though, involving only a logarithmic number of allocations, but it's tedious to implement and few other JSON parsers use such techniques. You create parser-wide kvPool and item slices to store all key-value pairs and array elements in the JSON. You maintain two temporary stacks during parsing, called kvStack and itemStack, for objects and arrays respectively. When parsing an object, the key-value pairs are pushed to kvStack, and after } is seen, you pop the items from kvStack, append them to the parser-wide kvPool, and take the corresponding slice of kvPool as the object's kv. A similar technique applies to arrays. Note that this prevents appending to kv or item in place; you need to create another array to do that.

vincent-163 avatar Jan 22 '19 00:01 vincent-163

This sounds interesting... I need some time to evaluate the proposed approach in the PR.

valyala avatar Jan 24 '19 01:01 valyala