
[FFI][RUNTIME] Introduce runtime boxed types for int/float/bool

Open Lunderberg opened this issue 1 year ago • 4 comments

Prior to this commit, int, float, and bool arguments from Python were converted to IntImm, FloatImm, and Bool. These are subtypes of PrimExpr, and should only be used at compile-time. By automatically applying this conversion as part of the FFI, these types are required to be present whenever a primitive is converted to a tvm::ObjectRef.

This can become especially fragile for an end-user when storing objects into a TVM container. Because TVM containers require all contents to be ObjectRef subclasses, an automatic conversion may be applied on storing into a container, resulting in an unexpected type being retrieved from the container. For example, this currently occurs in Relax when extracting a R.Prim from a R.Tuple.

This commit introduces a Box<T> type for storage of boxed primitives at runtime, distinct from the IR types. This was based on discussion on https://github.com/apache/tvm/pull/15983, which would have further cemented the IntImm type in libtvm_runtime.so, rather than starting the process of separating it.
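
For illustration, here is a minimal sketch of the boxed-primitive idea. The names below (sketch::Object, BoxNode, BoxInt) are illustrative stand-ins rather than the actual classes added by this PR, which build on TVM's Object/ObjectRef machinery:

```cpp
#include <cstdint>

namespace sketch {

// Stand-in for tvm::runtime::Object (the real class carries reference
// counting and type indices).
struct Object {
  virtual ~Object() = default;
};

// A runtime box around a plain primitive.  Unlike IntImm/FloatImm it
// carries no IR metadata (no DataType, no Span), so it can live purely
// in libtvm_runtime.so.
template <typename T>
struct BoxNode : Object {
  T value;
  explicit BoxNode(T value) : value(value) {}
};

using BoxInt = BoxNode<int64_t>;   // what a Python int would box into
using BoxFloat = BoxNode<double>;  // what a Python float would box into
using BoxBool = BoxNode<bool>;     // what a Python bool would box into

}  // namespace sketch
```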

  • Primitive arguments provided to a PackedFunc that requires an ObjectRef will be converted to the corresponding boxed type. (e.g. Passing a Python int to a C++ function accepting ObjectRef produces a Box<int64_t>.)

  • Boxed primitives provided to a PackedFunc that requires an unboxed primitive will be converted to the corresponding primitive.

  • PackedFunc return values of ObjectRef are converted to the corresponding primitive, if present. (e.g. If a tuple_getitem with static return type ObjectRef returns a Box<int64_t>, it will be unwrapped to a Python int.)

Together, these three rules provide backwards compatibility for existing PackedFunc definitions, while avoiding exposing the user to any container-induced type conversions between primitive types and ObjectRef.
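
As a rough sketch of how the three rules fit together (hypothetical AsObject/AsInt helpers, not the real PackedFunc machinery):

```cpp
#include <cstdint>
#include <memory>
#include <stdexcept>

namespace sketch {

struct Object {
  virtual ~Object() = default;
};

template <typename T>
struct BoxNode : Object {
  T value;
  explicit BoxNode(T value) : value(value) {}
};

// Rule 1: a primitive passed where an ObjectRef is required gets boxed.
inline std::shared_ptr<Object> AsObject(int64_t value) {
  return std::make_shared<BoxNode<int64_t>>(value);
}

// Rules 2 and 3: an object supplied where a primitive is expected gets
// unboxed, provided it is the matching boxed type.
inline int64_t AsInt(const std::shared_ptr<Object>& obj) {
  if (auto* box = dynamic_cast<BoxNode<int64_t>*>(obj.get())) {
    return box->value;
  }
  throw std::runtime_error("expected a boxed int64_t");
}

}  // namespace sketch
```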

Lunderberg avatar Nov 29 '23 18:11 Lunderberg

cc @junrushao

tqchen avatar Nov 29 '23 19:11 tqchen

Just highlighting some of the main high-level comments:

  • naming: we can probably go with runtime::Int, runtime::Float, runtime::Bool as they are more understandable terms
    • we should remove tvm::Integer and tvm::Bool after migrating their usage to boxed values (in a later PR)
  • convention: It is cleaner to have a convention where runtime::Int and runtime::Float always get unboxed when passing into TVMArg and TVMRet.
    • The codegen layer that expects int/float then does not need to deal with objects.
    • That does mean the FFI layer needs to explicitly unbox when setting boxed values into them (it seems the current impl is already doing so for TVMRet; need to double-check whether that is the case for argument setting as well)
    • That means we do not need to do TryFromBox when we convert arguments

tqchen avatar Nov 29 '23 19:11 tqchen

Thanks for the PR, love the direction we are going with dedicated boxed types.

Thank you!

The changes wrt PackedFunc should benefit from more pairs of eyes to review, as they are central to most FFI routines

Definitely agreed, and the more eyes the better. Given that this PR has the potential to break any use of the PackedFunc interface, the surface area is very broad.

naming: we can probably go with runtime::Int, runtime::Float, runtime::Bool

I like this readability improvement, and will update accordingly

convention: It is cleaner to have a convention where runtime::Int and runtime::Float always get unboxed when passing into TVMArg and TVMRet.

Agreed. Assuming I have it implemented correctly, this is handled in the TVMArgsSetter::SetObject and TVMRetValue::operator=(TObjectRef) implementations, so a boxed primitive is always unboxed when storing in the FFI type.

The conversions that occur when reading from a TVMArgsSetter or TVMRetValue are to convert into some form of ObjectRef that is constructible from the primitive, based on the signature of the type receiving the argument.
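
Roughly, the unbox-on-set convention looks like this (toy Slot/SetObject stand-ins, not the actual TVMArgsSetter/TVMRetValue code):

```cpp
#include <cstdint>
#include <memory>
#include <variant>

namespace sketch {

struct Object {
  virtual ~Object() = default;
};

template <typename T>
struct BoxNode : Object {
  T value;
  explicit BoxNode(T value) : value(value) {}
};

// A slot is either a plain primitive or an object handle, roughly the
// tagged union held by the FFI argument/return value types.
using Slot = std::variant<int64_t, double, std::shared_ptr<Object>>;

// When an object is written into the slot, boxed primitives are
// unwrapped, so consumers (e.g. codegen) only ever see primitive tags.
inline Slot SetObject(std::shared_ptr<Object> obj) {
  if (auto* i = dynamic_cast<BoxNode<int64_t>*>(obj.get())) {
    return Slot{i->value};
  }
  if (auto* f = dynamic_cast<BoxNode<double>*>(obj.get())) {
    return Slot{f->value};
  }
  return Slot{std::move(obj)};  // anything else stays an object handle
}

}  // namespace sketch
```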

Lunderberg avatar Nov 29 '23 19:11 Lunderberg

After slowly working through all the unit tests and transforms that needed to be updated for this change, this PR is now ready for review! I've squashed the incremental commits on the PR branch that went into this PR, since the (very) long history of this dev branch wasn't that useful. (This should also prevent a recurrence of #16549, by having a shorter PR branch.)

Lunderberg avatar Jun 14 '24 19:06 Lunderberg

@tvm-bot rerun

Lunderberg avatar Jul 18 '24 21:07 Lunderberg

This is merged, thanks @Lunderberg !

tqchen avatar Aug 05 '24 13:08 tqchen

We just found a perf regression introduced by this PR: specifically, during LLM decode, the function calling overhead (before the first kernel launch) goes up to 1.4 ms. The likely cause is increased PackedFunc calling overhead in general, probably from the automatic conversion logic introduced here, which adds overhead to every PackedFunc call.

Such overhead can impact downstream efforts to optimize for low latency. As a temporary measure, I am going to revert this PR first.

The changes in the PR (boxed types and bool) are valuable. Given that the smart auto-conversion might introduce extra overhead, and the regression we see, I think it is helpful to split the PR into two pieces: one that introduces the boxed types/bool, and another that introduces a possibly more conservative version of the automatic conversion.

Thanks @Lunderberg for the great effort, and sorry for the extra trouble. As PackedFunc calls are at the center of our runtime execution, reducing their execution time has become an important topic, so we need to be more careful balancing the runtime logic.

tqchen avatar Aug 07 '24 11:08 tqchen

Thank you, and I'm looking into where the performance overhead comes from. Unfortunately, I'm not sure whether decoupling the boxed types from the automatic conversions is possible.

  1. Since the FFI is used at runtime, the default FFI conversions must not use IR types. Therefore, a function that accepts ObjectRef and is passed a Python integer has the argument converted to runtime::Int instead of the previous IntImm.
  2. A Python list must be converted to a tvm::Array to be passed through the FFI. Since Array holds ObjectRef, this requires converting any primitive values into ObjectRef types. With (1), this would produce an array of runtime::Int rather than the previous array of IntImm.
  3. For a function that accepts Array<IntImm>, existing callers in Python could pass the array [1,2,3], and this usage should not be broken. With the new default conversion for [1,2,3], this would raise an error.

Breaking any one of these three steps would avoid the error, but each comes with downsides:

  • Break step (3): Automatic conversion from Array<runtime::Int> to Array<IntImm>. This is what was implemented in the PR, but results in a two-step conversion (see the sketch after this list).
  • Break step (2): Change Array to hold a TVMRetValue instead of ObjectRef. This would avoid the repeated nested conversion, since the C++ API could hold primitive values directly, but would significantly change the C++ side of the API.
  • Break step (1): An explicit change to the Python code to convert to IntImm before calling the FFI. This would work in some cases where the desired C++ type is known (e.g. the shape of a TIR buffer), but wouldn't work in others (e.g. the target attributes, which are stored as ObjectRef, and whose desired type isn't known until they are used).
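
To make the trade-off of the "break step (3)" option concrete, it amounts to something like the following two-step conversion, shown with simplified stand-in types (BoxNode, IntImmNode) rather than the real TVM classes:

```cpp
#include <cstdint>
#include <memory>
#include <vector>

namespace sketch {

struct Object {
  virtual ~Object() = default;
};

template <typename T>
struct BoxNode : Object {
  T value;
  explicit BoxNode(T value) : value(value) {}
};

// Stand-in for the IR-side IntImm (a compile-time integer expression).
struct IntImmNode : Object {
  int64_t value;
  explicit IntImmNode(int64_t value) : value(value) {}
};

// Steps (1) and (2) of the FFI produce an array of boxed runtime ints;
// a second pass then rebuilds it as the array of IR ints that the
// callee's Array<IntImm> signature asks for.
inline std::vector<std::shared_ptr<Object>> ToIntImmArray(
    const std::vector<std::shared_ptr<Object>>& boxed) {
  std::vector<std::shared_ptr<Object>> out;
  out.reserve(boxed.size());
  for (const auto& elem : boxed) {
    if (auto* box = dynamic_cast<BoxNode<int64_t>*>(elem.get())) {
      out.push_back(std::make_shared<IntImmNode>(box->value));
    } else {
      out.push_back(elem);  // already an IR object; keep as-is
    }
  }
  return out;
}

}  // namespace sketch
```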

Lunderberg avatar Aug 07 '24 13:08 Lunderberg

Looking into the origin of the performance difference, since it may just result from an increased number of FFI calls, and (maybe) could be resolved without removing the automatic conversions.

Lunderberg avatar Aug 07 '24 13:08 Lunderberg

Maybe it is not too bad to do a manual conversion to the array of IR types in the Python wrapper and leave the FFI part lightweight (per your suggested option 3). The ObjectRef-stored target attributes are mostly intended as boxed ints rather than IntImm, so we should be fine.

tqchen avatar Aug 07 '24 14:08 tqchen

First results: it looks like the implicit conversions aren't necessarily causing the overhead. I benchmarked constructing a TIR buffer, with the shape/strides provided either as Python integers or as TIR IntImm.

On main (after the FFI changes), the runtime is about the same, regardless of how the shape/strides are provided. [Benchmark image]

On 5a67a00bcb (last commit before the FFI changes), there's about a 40% overhead when providing Python integers ([1,2,3]) as compared to providing TIR IntImm ([T.int32(1), T.int32(2), T.int32(3)]).

[Benchmark image]

This is rather surprising, because I expected the overhead to be present after the FFI changes, but this benchmark suggests that there's less overhead.

Next up, seeing how the FFI changes impact Relax R.call_tir operators.

Lunderberg avatar Aug 07 '24 17:08 Lunderberg

Hmm, the same thing is occurring for this test case. I made a Relax module that computes x = subtract_one(add_one(x)), for a large number of iterations. (10/100/1000 iterations in the benchmarks shown below.)

[Benchmark image: current main, after the FFI changes]

[Benchmark image: before the FFI changes]

Since the effect isn't showing up in any of the smaller benchmarks, the next step in this investigation will probably need to be a full model execution.

Lunderberg avatar Aug 07 '24 17:08 Lunderberg

Here is an example; the number of weights is large enough to see the difference (1.2 ms vs 6.2 ms): https://gist.github.com/vinx13/ea7a8c785d8d0d5ae5318e8ace085db2

vinx13 avatar Aug 07 '24 23:08 vinx13

Thank you for the example, @vinx13, and I can reproduce the difference with and without the FFI changes. My numbers roughly agree with yours (1.1 ms vs 5.2 ms), so I can start narrowing down on the cause.

Lunderberg avatar Aug 08 '24 14:08 Lunderberg

FOUND IT!

The root cause was in the type validation, but ended up being pretty easily solvable. When converting from the untyped ArrayNode* to the typed Array<T>, we validate that each ObjectRef in the array has the correct type T. However, when T is ObjectRef itself, each element in the untyped ArrayNode* is already some type of ObjectRef, so the check is unnecessary.

In the test case, this performance overhead largely occurred during the calls to "vm.builtin.tuple_getitem". For each call, the validation function iterated over the entire Array<ObjectRef> argument. This overhead can be removed by checking the type parameter at compile time. I've updated the conversions for both Array<T> and Map<T,U> to skip type checks that will always succeed.
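
Schematically, the fix amounts to something like this (simplified stand-in types, not the actual Array<T>/Map<T,U> conversion templates):

```cpp
#include <memory>
#include <type_traits>
#include <vector>

namespace sketch {

struct Object {
  virtual ~Object() = default;
};
using ObjectPtr = std::shared_ptr<Object>;

// Per-element check: does this element have dynamic type T?
template <typename T>
bool IsInstance(const ObjectPtr& obj) {
  return dynamic_cast<const T*>(obj.get()) != nullptr;
}

// Converting an untyped array to a typed Array<T> previously walked
// every element.  When T is the base object type itself, the check can
// never fail, so it is skipped entirely at compile time and the
// (possibly huge) array is never touched.
template <typename T>
bool AllElementsAre(const std::vector<ObjectPtr>& arr) {
  if constexpr (std::is_same_v<T, Object>) {
    return true;
  } else {
    for (const auto& elem : arr) {
      if (!IsInstance<T>(elem)) return false;
    }
    return true;
  }
}

}  // namespace sketch
```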

With this addition, the FFI update went from being 5x slower than baseline (an increase from 1 ms to 5 ms) to being 30% faster than baseline (a decrease from 1 ms to 700 us). I'll make a PR shortly that re-applies the FFI changes, with the additional fix.

Lunderberg avatar Aug 08 '24 19:08 Lunderberg