[FFI][RUNTIME] Introduce runtime boxed types for int/float/bool
Prior to this commit, int, float, and bool arguments from Python were converted to IntImm, FloatImm, and Bool. These are subtypes of PrimExpr, and should only be used at compile-time. By automatically applying this conversion as part of the FFI, these types are required to be present whenever a primitive is converted to a tvm::ObjectRef.
This can become especially fragile for an end-user when storing objects into a TVM container. Because TVM containers require all contents to be ObjectRef subclasses, an automatic conversion may be applied on storing into a container, resulting in an unexpected type being retrieved from the container. For example, this currently occurs in Relax when extracting a R.Prim from a R.Tuple.
This commit introduces a Box<T> type for storage of boxed primitives at runtime, distinct from the IR types. This was based on discussion on https://github.com/apache/tvm/pull/15983, which would have further cemented the IntImm type in libtvm_runtime.so, rather than starting the process of separating it.
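As a rough illustration of the idea (simplified stand-in types, not the actual TVM implementation), a boxed primitive is just a reference-counted runtime object wrapping the value, with no dependency on any IR node:

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical, minimal sketch of the runtime boxed-type idea. The names
// Object/ObjectRef/Box mirror the PR's vocabulary, but the real TVM classes
// use an intrusive reference-counting scheme, not std::shared_ptr.
struct Object {
  virtual ~Object() = default;
};

struct ObjectRef {
  std::shared_ptr<Object> data;
};

// The boxed payload lives in a runtime object node...
template <typename T>
struct BoxNode : Object {
  T value;
  explicit BoxNode(T v) : value(v) {}
};

// ...and Box<T> is the ObjectRef handle to it, usable anywhere a plain
// ObjectRef is expected (e.g. inside a runtime container).
template <typename T>
struct Box : ObjectRef {
  explicit Box(T v) { data = std::make_shared<BoxNode<T>>(v); }
  T value() const { return static_cast<const BoxNode<T>*>(data.get())->value; }
};
```

The key property is that `Box<T>` carries only the value, so it can live in `libtvm_runtime.so` without pulling in compile-time IR types such as `IntImm`.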
- Primitive arguments provided to a PackedFunc that requires an `ObjectRef` will be converted to the corresponding boxed type. (e.g. Passing a Python `int` to a C++ function accepting `ObjectRef` produces a `Box<int64_t>`.)
- Boxed primitives provided to a PackedFunc that requires an unboxed primitive will be converted to the corresponding primitive.
- PackedFunc return values of `ObjectRef` are converted to the corresponding primitive, if present. (e.g. If a `tuple_getitem` with static return type `ObjectRef` returns a `Box<int64_t>`, it will be unwrapped to a Python `int`.)
Together, these three rules provide backwards compatibility for existing PackedFunc definitions, while avoiding exposing the user to any container-induced type conversions between primitive types and ObjectRef.
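The three rules can be sketched with a toy stand-in for the FFI value type (all names here are illustrative, not the real TVM classes):

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <variant>

// "BoxedInt" plays the role of Box<int64_t>; "FFIValue" plays the role of
// the tagged FFI argument/return slot. This is a sketch of the conversion
// rules only, not the actual TVMArgValue/TVMRetValue machinery.
struct BoxedInt {
  std::shared_ptr<int64_t> data;
  explicit BoxedInt(int64_t v) : data(std::make_shared<int64_t>(v)) {}
};

using FFIValue = std::variant<int64_t, BoxedInt>;

// Rule 1: a primitive handed to a parameter typed as an object gets boxed.
BoxedInt AsObject(const FFIValue& v) {
  if (std::holds_alternative<int64_t>(v)) return BoxedInt(std::get<int64_t>(v));
  return std::get<BoxedInt>(v);
}

// Rules 2/3: a boxed value handed to (or returned into) a primitive slot
// gets unboxed back to the plain integer.
int64_t AsPrimitive(const FFIValue& v) {
  if (std::holds_alternative<BoxedInt>(v)) return *std::get<BoxedInt>(v).data;
  return std::get<int64_t>(v);
}
```

Because the conversion is driven by the receiving signature, existing callers see no change: primitives stay primitives unless an `ObjectRef` is explicitly required.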
cc @junrushao
Just highlighting some of the main high-level comments:
- naming: we can probably go with `runtime::Int`, `runtime::Float`, `runtime::Bool`, as they are more understandable terms
- we should remove `tvm::Integer` and `tvm::Bool` after migrating their usage to boxed values (in a later PR)
- convention: It is cleaner to have a convention where `runtime::Int` and `runtime::Float` always get unboxed when passing into TVMArg and TVMRet.
  - The codegen layer that expects int/float then does not need to deal with objects.
  - That does mean the FFI layer needs to explicitly unbox when setting boxed values into them (seems the current impl is already doing so for TVMRet; need to double-check if that is the case for argument setting as well).
  - That means we do not need to do `TryFromBox` when we convert arguments.
Thanks for the PR, love the direction we are going with dedicated boxed types.
Thank you!
The changes w.r.t. PackedFunc would benefit from more pairs of eyes to review, as they are central to most FFI routines.
Definitely agreed, and the more eyes the better. Given that this PR has the potential to break any use of the PackedFunc interface, the surface area is very broad.
naming: we can probably go with runtime::Int, runtime::Float, runtime::Bool
I like this readability improvement, and will update accordingly
convention: It is cleaner to have a convention where runtime::Int and runtime::Float always get unboxed when passing into TVMArg and TVMRet.
Agreed. Assuming I have it implemented correctly, this is handled in the TVMArgsSetter::SetObject and TVMRetValue::operator=(TObjectRef) implementations, so a boxed primitive is always unboxed when storing in the FFI type.
The conversions that occur when reading from a TVMArgsSetter or TVMRetValue are to convert into some form of ObjectRef that is constructible from the primitive, based on the signature of the type receiving the argument.
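As a toy illustration of that convention (hypothetical names and a deliberately tiny value slot, not the actual `TVMRetValue` implementation), assigning a boxed value into the FFI slot unboxes it immediately, so downstream code that expects a primitive never sees an object:

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <variant>

// "RuntimeInt" stands in for runtime::Int / Box<int64_t>.
struct RuntimeInt {
  std::shared_ptr<int64_t> data;
  explicit RuntimeInt(int64_t v) : data(std::make_shared<int64_t>(v)) {}
};

// "RetValue" stands in for the FFI return slot. The convention: boxed
// primitives are unboxed on assignment, so the slot only ever holds the
// plain integer representation.
struct RetValue {
  std::variant<std::monostate, int64_t> slot;

  RetValue& operator=(int64_t v) {
    slot = v;
    return *this;
  }
  // Assigning a boxed int stores the unboxed value, per the convention.
  RetValue& operator=(const RuntimeInt& v) {
    slot = *v.data;
    return *this;
  }

  bool holds_plain_int() const { return std::holds_alternative<int64_t>(slot); }
  int64_t as_int() const { return std::get<int64_t>(slot); }
};
```

With this convention, re-boxing happens only on the read side, and only when the receiving signature explicitly asks for an `ObjectRef`.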
After slowly working through all the unit tests and transforms that needed to be updated for this change, this PR is now ready for review! I've squashed the incremental commits on the PR branch that went into this PR, since the (very) long history of this dev branch wasn't that useful. (This should also prevent a recurrence of #16549, by having a shorter PR branch.)
@tvm-bot rerun
This is merged, thanks @Lunderberg !
We just found a perf regression introduced by this PR: during LLM decode, function-calling overhead (before the first kernel launch) goes up to 1.4 ms. The likely cause is increased PackedFunc calling overhead in general, probably from some of the automatic conversion logic introduced, which increased the overhead per PackedFunc call.
Such overhead can impact the perf of downstream projects that optimize for low latency. As a temporary measure, I am going to revert this PR first.
The changes in the PR (boxed types and bool) are valuable. Given that the smart auto-conversion might introduce extra overhead, and the regression we see, I think it is helpful to split the PR into two pieces: one that introduces the boxed types/bool, and another that introduces a possibly more conservative version of the automatic conversion.
Thanks @Lunderberg for the great effort, and sorry for the extra trouble. As PackedFunc calls are at the center of our runtime execution, reducing their execution time has become an important topic, so we need to be more careful balancing the runtime logic.
Thank you, and I'm looking into where the performance overhead comes from. Unfortunately, I'm not sure if the de-coupling of the boxed types from the automatic conversions is possible.
1. Since the FFI is used during runtime, the default FFI conversions must not use IR types. Therefore, a function that accepts `ObjectRef` and is passed a Python integer has the argument converted to `runtime::Int` instead of the previous `IntImm`.
2. A Python list must be converted to a `tvm::Array` to be passed through the FFI. Since `Array` holds `ObjectRef`, this requires converting any primitive values into `ObjectRef` types. With (1), this would produce an array of `runtime::Int` rather than the previous array of `IntImm`.
3. For a function that accepts `Array<IntImm>`, existing callers in Python could pass the array `[1,2,3]`, and this usage should not be broken. With the new default conversion for `[1,2,3]`, this would raise an error.
Breaking any one of these three steps would avoid the error, but each comes with downsides:
- Break step (3): Automatic conversion from `Array<runtime::Int>` to `Array<IntImm>`. This is what was implemented in the PR, but results in a two-step conversion.
- Break step (2): Change `Array` to hold a `TVMRetValue` instead of `ObjectRef`. This would avoid the repeated nested conversion, but would significantly change the C++ side of the API.
- Break step (1): An explicit change to the Python code to convert to `IntImm` before calling the FFI. This would work in some cases where the desired C++ type is known (e.g. the shape of a TIR buffer), but wouldn't work in others (e.g. the target attributes, which are stored as `ObjectRef`, and whose desired type isn't known until they are used).
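The two-step conversion cost in option (1) can be sketched as follows (simplified stand-in types, using `std::vector` in place of `tvm::Array`; neither type here is the real TVM class):

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Stand-in for runtime::Int / Box<int64_t>.
struct RuntimeInt {
  std::shared_ptr<int64_t> data;
  explicit RuntimeInt(int64_t v) : data(std::make_shared<int64_t>(v)) {}
};

// Stand-in for the IR-side IntImm node.
struct IntImm {
  int64_t value;
  explicit IntImm(int64_t v) : value(v) {}
};

// Step 1: the default FFI conversion boxes each Python integer into a
// runtime object, producing the analog of Array<runtime::Int>.
std::vector<RuntimeInt> BoxAll(const std::vector<int64_t>& xs) {
  std::vector<RuntimeInt> out;
  for (int64_t x : xs) out.emplace_back(x);
  return out;
}

// Step 2: a callee typed as Array<IntImm> then triggers a second
// element-wise conversion over the whole array.
std::vector<IntImm> ToIntImm(const std::vector<RuntimeInt>& xs) {
  std::vector<IntImm> out;
  for (const RuntimeInt& x : xs) out.emplace_back(*x.data);
  return out;
}
```

Each element is touched twice (box, then rebuild as an IR node), which is the "two-step conversion" downside noted for option (1).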
I'm looking into the origin of the performance difference, since it may just result from an increased number of FFI calls, and (maybe) could be resolved without removing the automatic conversions.
Maybe it is not too bad to do manual conversion to the array of IR types in the Python wrapper, then leave the FFI part lightweight (per your suggested option 3). The `ObjectRef`-stored target attributes are mostly intended as boxed ints rather than `IntImm`, so we should be fine.
First results: it looks like the implicit conversions aren't necessarily causing the overhead. I benchmarked constructing a TIR buffer, with the shape/strides provided either as Python integers or as TIR IntImm.
On main (after the FFI changes), the runtime is about the same, regardless of how the shape/strides are provided.
On 5a67a00bcb (last commit before the FFI changes), there's about a 40% overhead when providing python integers ([1,2,3]) as compared to providing TIR IntImm ([T.int32(1), T.int32(2), T.int32(3)]).
This is rather surprising, because I expected the overhead to be present after the FFI changes, but this benchmark suggests that there's less overhead.
Next up, seeing how the FFI changes impact Relax R.call_tir operators.
Hmm, the same thing is occurring for this test case. I made a Relax module that computes x = subtract_one(add_one(x)), for a large number of iterations. (10/100/1000 iterations in the benchmarks shown below.)
(Current main, after the FFI changes)
(Before the FFI changes)
Since the effect isn't showing up in any smaller benchmarks, the next step for this investigation will probably need to be a full model execution.
Here is an example, the number of weights are large enough to see the difference (1.2ms vs 6.2ms) https://gist.github.com/vinx13/ea7a8c785d8d0d5ae5318e8ace085db2
Thank you for the example, @vinx13, and I can reproduce the difference with and without the FFI changes. My numbers roughly agree with yours (1.1 ms vs 5.2 ms), so I can start narrowing down on the cause.
FOUND IT!
The root cause was in the type validation, but it ended up being pretty easily solvable. When converting from the untyped ArrayNode* to the typed Array<T>, we validate that each ObjectRef in the array has the correct type T. However, when T is ObjectRef itself, every element in the untyped ArrayNode* already satisfies the check, so the per-element validation is unnecessary.
In the test case, this performance overhead largely occurred during the calls to "vm.builtin.tuple_getitem". For each call, the validation function iterated over the entire Array<ObjectRef> argument. This overhead can be removed by adding a compile-time check on the type parameter. I've updated the conversions for both Array<T> and Map<T,U> to skip type checks that will always succeed.
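The kind of compile-time short-circuit described here can be sketched like this (simplified stand-in types and a `dynamic_cast`-based check; the actual TVM templates use their own object type hierarchy and checks):

```cpp
#include <cstdint>
#include <type_traits>
#include <vector>

// Stand-ins for TVM's object hierarchy.
struct ObjectRef {
  virtual ~ObjectRef() = default;
};
struct IntImmRef : ObjectRef {
  int64_t value = 0;
};

// Counts how many per-element checks were actually performed, so the
// short-circuit is observable.
int validation_steps = 0;

// Validate that every element of an untyped array has type T. When T is
// ObjectRef itself, the check trivially holds for every element, so it is
// skipped entirely at compile time.
template <typename T>
bool ValidateArray(const std::vector<ObjectRef*>& arr) {
  if constexpr (std::is_same_v<T, ObjectRef>) {
    // Every element is already an ObjectRef: nothing to check.
    return true;
  } else {
    for (ObjectRef* obj : arr) {
      ++validation_steps;
      if (dynamic_cast<T*>(obj) == nullptr) return false;
    }
    return true;
  }
}
```

For calls like "vm.builtin.tuple_getitem" that take an `Array<ObjectRef>`, this turns an O(n) scan per call into a no-op.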
With this addition, the FFI update went from being 5x slower than baseline (increase from 1 ms to 5 ms) to being 30% faster than baseline (decrease from 1 ms to 700 us). I'll make a PR shortly that re-applies the FFI changes, with the additional fix.