Polish and officially release quantized types

Open strongoier opened this issue 2 years ago • 42 comments

Quantized types are an experimental feature introduced in the QuanTaichi paper. With this feature, users can significantly reduce the memory usage of their Taichi programs. The feature can also enable acceleration of atomic operations on mobile phones.

However, the feature has been neither officially announced nor extensively maintained. As Taichi has come to its 1.0 version, I think it is time to polish the feature and make it available to users. My plan is to refine the API and implementation so that it can fit into current Taichi better, be more user-friendly, and become deployable with Taichi AOT. I would like to write an RFC for it.

strongoier avatar Apr 26 '22 01:04 strongoier

Before writing a formal RFC, I would like to briefly summarize some previous discussions on this topic. I think there are still some issues to be solved, and I hope to continue the discussion here.

Background

A quantized type normally has no native support. Therefore, you need to specify a parent primitive type (e.g. a 32-bit int) and describe how you would like to pack a group of quantized types (e.g. a 15-bit int and a 17-bit int) inside.

In Taichi, this is done by introducing two SNode types, bit_struct and bit_array. Example usages:

i4 = ti.quant.int(bits=4)
u28 = ti.quant.int(bits=28, signed=False)

p = ti.field(dtype=i4)
q = ti.field(dtype=u28)
ti.root.dense(ti.i, 4).bit_struct(num_bits=32).place(p, q)

r = ti.field(dtype=i4)
ti.root.dense(ti.i, 4).bit_array(ti.i, 8, num_bits=32).place(r)
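
To make the packing idea concrete, here is a minimal plain-Python sketch of storing a signed 4-bit value and an unsigned 28-bit value in one 32-bit word. It does not use Taichi and is not how Taichi implements bit_struct; the helper names pack/unpack and the choice of putting p in the low bits are assumptions made only for illustration.

```python
from typing import Tuple

# Plain-Python illustration of bit packing; not Taichi's implementation.
I4_BITS = 4    # signed 4-bit integer, range [-8, 7]
U28_BITS = 28  # unsigned 28-bit integer, range [0, 2**28 - 1]


def pack(p: int, q: int) -> int:
    """Pack signed 4-bit p and unsigned 28-bit q into one 32-bit word."""
    assert -(1 << (I4_BITS - 1)) <= p < (1 << (I4_BITS - 1))
    assert 0 <= q < (1 << U28_BITS)
    p_bits = p & ((1 << I4_BITS) - 1)  # keep the two's-complement low 4 bits
    return p_bits | (q << I4_BITS)     # assumed layout: p in the low bits


def unpack(word: int) -> Tuple[int, int]:
    """Recover (p, q) from a packed 32-bit word."""
    p_bits = word & ((1 << I4_BITS) - 1)
    # Sign-extend the 4-bit field back to a Python int.
    p = p_bits - (1 << I4_BITS) if p_bits & (1 << (I4_BITS - 1)) else p_bits
    q = (word >> I4_BITS) & ((1 << U28_BITS) - 1)
    return p, q


word = pack(-3, 123456)
```

Every read or write of p or q goes through the shared 32-bit word, which is why the parent primitive type has to be specified up front.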

What are the problems of current APIs?

  1. bit_struct and bit_array are not consistent with other SNode types. A normal SNode specifies two things: how to split the axes, and how cells are stored in the container. Meanwhile, a normal SNode places no limitations on the components of its cells. However, bit_struct has nothing to do with axes, and both bit_struct and bit_array may only have place SNodes as components of their cells, with a limit on the total number of bits of those components. This makes the APIs inconsistent.
  2. Users cannot use quantized types outside the SNode system. This is especially problematic when it comes to deployment, because ndarrays, which are first-class citizens in Taichi AOT, cannot work with quantized types.

What are our current thoughts on solving the problems?

As bit_array is deeply coupled with the SNode system (it does handle axis splitting) but is not used that often, we prefer to keep it unchanged. Our main focus is on bit_struct.

Potential change 1: add type ti.types.bit_struct

ti.types.bit_struct is similar to ti.types.struct, with the following differences:

  • The whole ti.types.bit_struct is stored with a primitive type.
  • Members of ti.types.bit_struct must be quantized types.
  • The memory layout of members is clearly defined.

Example usage:

s_ty = ti.types.bit_struct(32, {'a': i4, 'b': u28})
s = ti.field(dtype=s_ty)
ti.root.dense(ti.i, 4).place(s)
s[I].a, s[I].b  # access

s_arr = ti.ndarray(dtype=s_ty, shape=4)
s_arr[I].a, s_arr[I].b  # access

Pros:

  • It can be supported for both SNodes and ndarrays, therefore problem 2 is solved.
  • It gets rid of problem 1 about bit_struct SNode.

Cons:

  • ti.types.bit_struct focuses on storage, so its members may not be a logical group. This can result in hard-to-read user programs.
  • When used in SNodes, users can no longer change storage layout without modifying computation code. This sacrifices one important advantage of the SNode system.

Potential change 2: add helper function bit_struct_wrapper()

bit_struct_wrapper() is introduced to replace the bit_struct SNode. Example usage:

p = ti.field(dtype=i4)
q = ti.field(dtype=u28)
ti.root.dense(ti.i, 4).place(bit_struct_wrapper(32, [p, q]))

It aims at solving problem 1 without sacrificing anything. However, it does nothing for problem 2 because it is not compatible with ndarrays.

Considering that none of these proposed changes is perfect, shall we apply none, one, or both of them? Or do you have other ideas? @k-ye @ailzhang @yuanming-hu

strongoier avatar Apr 27 '22 15:04 strongoier

I really like 1, because it makes the type system neat :-) However, considering that changing to bit_struct_wrapper should be easier, and that 1 and 2 are not mutually exclusive, I think it's reasonable to go with 2 first. As for ndarray quant types, at the bare minimum we can support storing just fixed-point scalar numbers first, then quantized vector types, then quantized struct types.

k-ye avatar Apr 27 '22 16:04 k-ye

+1 on implementing #2 as a start! Btw I feel like based on the deployment need, not modifying computation code might not be as hard a requirement as we thought. IMHO if it's a s/old/new it shouldn't be a huge problem for people who want to maximize performance. (Or is it more complicated than that? :P) For ndarray + quant, is it correct understanding that supporting fixed-point scalar numbers can already solve our problem of floating point atomics?

ailzhang avatar Apr 28 '22 01:04 ailzhang

not modifying computation code might not be as hard a requirement as we thought

I agree with that. The main point here is, when we introduce a new language construct, especially as fundamental as a type, we should let it make sense in most cases, instead of being a deployment-only thing.

For ndarray + quant, is it correct understanding that supporting fixed-point scalar number can already solve our problem of floating point atomics?

IMO yes. @k-ye

strongoier avatar Apr 29 '22 06:04 strongoier

After an offline discussion with @ailzhang, we reached the following consensus:

  1. Ndarray is designed mainly for deployment purposes, with two unique advantages: avoiding memory copies and avoiding recompilation. It is important that it can be interpreted by user programs and common third-party frameworks in a trivial way. Therefore, supporting complex data storage mechanisms with ndarrays doesn't make much sense. For cases where such complex storage is really needed, users should refer to the SNode AOT solution.
  2. That said, we still want to solve the problem that floating point atomics on mobile phones are too slow, without the SNode AOT solution. The proposal here is to add fixed32 and fixed64 types, and let users convert floats to and from them:
f_ty = ti.types.fixed32(scale=100.0, signed=False)
arr = ti.ndarray(float, 10)

@ti.kernel
def foo(a: ti.types.ndarray()):
    for i in a:
        x = ti.cast(a[i], f_ty)
        ...  # calculations on x
        a[i] = ti.cast(x, float)

foo(arr)
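
For context on why a fixed-point detour can help with atomics: once values are integers, accumulation is plain integer addition, which is exact and independent of the order in which contributions arrive. The plain-Python sketch below only illustrates that conversion. It does not use the proposed ti.types.fixed32; SCALE, to_fixed, and to_float are hypothetical names, and reading the scale as "value per integer step" is an assumption rather than the proposal's definition.

```python
# Plain-Python illustration of fixed-point accumulation; not Taichi code.
SCALE = 1.0 / 1024.0  # hypothetical: one integer step represents 1/1024
INT32_MIN, INT32_MAX = -(1 << 31), (1 << 31) - 1


def to_fixed(x: float) -> int:
    """Quantize a float to a 32-bit fixed-point integer (round to nearest)."""
    q = round(x / SCALE)
    return max(INT32_MIN, min(INT32_MAX, q))  # clamp instead of overflowing


def to_float(q: int) -> float:
    """Convert a fixed-point integer back to a float."""
    return q * SCALE


# Accumulate in the integer domain, where each addition could be an integer atomic.
contributions = [0.5, 0.25, 1.125, -0.375]
acc = 0
for c in contributions:
    acc += to_fixed(c)

total = to_float(acc)
```

The trade-off is the usual one for fixed point: the representable range and precision are fixed by the scale, so values outside the range are clamped in this sketch.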

WDYT @k-ye If this solution looks good, we can finalize this as an individual feature request and then re-consider other aspects of quant APIs with fewer restrictions.

strongoier avatar Apr 29 '22 09:04 strongoier

To me it seems like fixed32 is just a small wrapper around the quant API? While I agree that

it can be interpreted by user programs and common third-party frameworks in a trivial way.

it's also not too hard to convert custom quant types into primitive types.

One thing I've been thinking about: if we make quant vectors workable on mobile, how much larger a scale can we reach for simulation? Note that graphics APIs are already offering f16 vectors, e.g. Metal has half4, so this is something to consider.

So yeah, I guess we can agree that Ndarray doesn't need to support fancy bit_struct. But I think it's reasonable to consider quantized scalars and vectors/matrices.

k-ye avatar Apr 29 '22 10:04 k-ye

@k-ye Yup that wrapper is mainly used to solve the floating point atomics problem we've seen.

Is it correct understanding that to achieve much larger scale simulation on mobile, we can try adding e.g. half4 as primitive type which applies to both field and ndarray?

ailzhang avatar Apr 29 '22 11:04 ailzhang

Is it correct understanding that to achieve much larger scale simulation on mobile, we can try adding e.g. half4 as primitive type which applies to both field and ndarray?

Yep. Additionally, this could also help with vec4 loading optimization (cc @turbo0628 @qiao-bo )

k-ye avatar Apr 29 '22 13:04 k-ye

To me it seems like fixed32 is just a small wrapper around the quant API?

This indeed requires us to support quantized scalars. Our current APIs cannot be used outside SNodes. However, when a quant type is used as an individual scalar, a number of bits other than 32/64 doesn't make sense. As there are already f32/f64, the only meaningful types to provide are fixed32/64.

Additionally, this could also help with vec4 loading optimization

I don't quite get the point here. Using native half4/vec4 in codegen instead of current ad-hoc expansion will certainly be an optimization strategy for our ti.types.vector(4, dtype=f16/f32). How does it relate to our quantized types?

strongoier avatar May 07 '22 10:05 strongoier

After yet another discussion with @k-ye @ailzhang @jim19930609, I have formed a mental picture of future plans and would like to share it here.

Task A: Refine current APIs of quantized types and make them available again

Although current APIs work only in the SNode system, they are still useful and we hope to expose them in a cleaner way.

Subtask A.1: Determine public APIs of quantized type definitions

Previously, we had two groups of APIs, type_factory and quant. The latter is built on top of the former, and is used in the QuanTaichi paper. However, in some real use cases the former is adopted. Having both places an unnecessary learning burden on users.

We would like to only keep quant as it is closer to users, and make it available at ti.types.quant for consistency with other types. type_factory will be removed, and its methods will be made private under ti.types.quant.

To sum up, we will have ti.types.quant.int/fixed/float/_custom_int/_custom_float. All current usages need to be updated.

Subtask A.2: Solve the inconsistency problem of bit_struct SNode

This corresponds to problem 1 and potential change 2 mentioned above. I plan to add an API ti.bit_struct_wrapper(number_of_bits, list_of_fields, with_shared_exponent) to solve the inconsistency problem and also make place() clean. This requires refactoring our SNode system implementation a bit as we are getting rid of the bit_struct SNode.

Task B: Add new all-purpose and deployable APIs of quantized types

For deployment purposes, where performance is valued the most, it is worth providing some new APIs (users have to write things in a new way). The new APIs should work both in the SNode system and for Ndarrays.

Subtask B.1: Allow unrestricted usage of quantized types as dtype

Currently, quantized types ti.types.quant.int/float/fixed can serve as dtype of fields, with the condition that they are placed as a bit_struct or bit_array. We hope to allow direct usage of them as dtype with no limitations, so that they can also be used in Ndarrays and thus be easily deployable. Note that in this case, we need to pad a quantized type to a primitive type with the minimum number of bits for storage purposes.

You may wonder what the use case is, considering that no memory can be saved. In fact, the above support is mainly targeting acceleration of atomic operations on mobile phones, by replacing float32 with 32-bit fixed point numbers. Meanwhile, it enables experimenting with different precisions and provides a basis for subsequent tasks.

Subtask B.2: Add a quantized vector type

To enable the main advantage of quantized types, saving memory, we hope to add a quantized vector type ti.types.quant.vector(n, dtype), where dtype must be one of ti.types.quant.int/float/fixed. The whole type will be padded to a primitive type with the minimum number of bits that can hold n values of dtype. This targets common cases like packing two or three components of some physical quantities together.
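
As a rough illustration of the padding rule, the sketch below picks the smallest primitive width that can hold n components of a given bit width. It is plain Python, not a Taichi API; the function name min_physical_bits and the candidate widths 8/16/32/64 are assumptions for the example.

```python
# Hypothetical helper illustrating "pad to the smallest primitive type".
PRIMITIVE_BITS = (8, 16, 32, 64)  # assumed candidate storage widths


def min_physical_bits(n: int, component_bits: int) -> int:
    """Return the smallest primitive width (in bits) holding n components."""
    total = n * component_bits
    for width in PRIMITIVE_BITS:
        if total <= width:
            return width
    raise ValueError(
        f"{n} x {component_bits} bits = {total} bits does not fit in 64 bits"
    )


# Three 10-bit components need 30 bits, so they are padded to 32.
width_3x10 = min_physical_bits(3, 10)
# Three 21-bit components need 63 bits, so they are padded to 64.
width_3x21 = min_physical_bits(3, 21)
```

Under this rule, three 10-bit components would take 32 bits instead of the 96 bits needed for three f32 values.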

Subtask B.3 (optional): Add a quantized struct type

Similar to Subtask B.2, we can add a quantized struct type ti.types.quant.struct, which was previously mentioned as ti.types.bit_struct. This can be an optional task when a real need arises.

Task C: Add documentation and examples for quantized types

After this step we can have an official announcement of the rebirth of quantized types!

strongoier avatar May 10 '22 09:05 strongoier

minor nit: for subtask b.2, I wonder if ti.types.vector(n, dtype) where ti.types.quant.int/float/fixed are added to the whitelist of dtype makes it simpler for users?

ailzhang avatar May 10 '22 09:05 ailzhang

Thanks for writing this up! Overall it looks like a great roadmap. I have a few questions here:

Could you provide an overview of the quant API?


Subtask A.1

I plan to add an API ti.bit_struct_wrapper(number_of_bits, list_of_fields, with_shared_exponent) to solve the inconsistency problem

I wonder if with_shared_exponent is only meaningful for vector/matrix types?

Subtask A.2

type_factory will remain as an internal API at ti.types.quantized_types.type_factory

nit: I feel like we don't have to have both ti.types.quant and ti.types.quantized_types. Maybe just ti.types.quant.type_factory?

k-ye avatar May 11 '22 08:05 k-ye

minor nit: for subtask b.2, I wonder if ti.types.vector(n, dtype) where ti.types.quant.int/float/fixed are added to the whitelist of dtype makes it simpler for users?

ti.types.vector and ti.types.quant.vector are different in many ways. ti.types.quant.vector is actually stored as a primitive type, has limitations on number of bits, and can accept quant-only configurations like with_shared_exponent.

I wonder if with_shared_exponent is only meaningful for vector/matrix types?

For struct it can make sense as well..

nit: I feel like we don't have to have both ti.types.quant and ti.types.quantized_types. Maybe just ti.types.quant.type_factory?

ti.types.quant is the actual API we want to expose. As type_factory is hidden, we have to visit the whole module path ti.types.quantized_types for internal usage.

strongoier avatar May 11 '22 09:05 strongoier

we have to visit the whole module path ti.types.quantized_types for internal usage.

I think there are different ways to handle this: use __all__ to control the public symbols, use quant._type_factory, etc.

k-ye avatar May 11 '22 11:05 k-ye

I think there are different ways to handle this: use __all__ to control the public symbols, use quant._type_factory, etc.

Ah yes. I was stuck on the assumption that we could not break up the two same-level classes, quant and type_factory. However, this is now a chance to refine things more aggressively.

Now I have a new design: we get rid of the legacy "type_factory" and directly provide the following APIs - ti.types.quant.int/fixed/float/_custom_int/_custom_float. WDYT @k-ye

BTW which one seems better, quant.int or quant_int?

strongoier avatar May 11 '22 13:05 strongoier

Cool! I prefer quant.int more, as they can be scoped in the same namespace quant. WDYT? (cc @ailzhang @jim19930609 )

k-ye avatar May 11 '22 13:05 k-ye

+1 on quant.int!

ailzhang avatar May 11 '22 13:05 ailzhang

I wonder if with_shared_exponent is only meaningful for vector/matrix types?

For struct it can make sense as well..

I feel like in real use cases shared exponents are typically used only in vectors. Do you have an example where you need that in a struct? :-) @strongoier

Another question of mine: if I'd split the 64 bits into x: fixed21, y: fixed22, z: fixed21, can it be expressed as a quantized vector3? See also the RGB565 format in OpenGL etc.: https://www.khronos.org/opengl/wiki/Image_Format

yuanming-hu avatar May 11 '22 13:05 yuanming-hu

I feel like in real use cases shared exponents are typically used only in vectors. Do you have an example where you need that in a struct? :-)

Not really. My point here is just that we don't have to throw an error if those fields are not grouped as a vector.

Another question of mine: if I'd split the 64 bits into x: fixed21, y: fixed22, z: fixed21, can it be expressed as a quantized vector3?

In fact we hope that elements of a vector have the same type. A quantized struct is needed for this purpose.

strongoier avatar May 11 '22 13:05 strongoier

In fact we hope that elements of a vector have the same type. A quantized struct is needed for this purpose.

I see. Thanks for the clarification!

I feel like the user may want to access the components via [] - for example color = (fixed5, fixed6, fixed5) and the user writes luminance = a[0] + a[1] + a[2]. Do we plan to support that? :-)

yuanming-hu avatar May 11 '22 13:05 yuanming-hu

I feel like the user may want to access the components via [] - for example color = (fixed5, fixed6, fixed5) and the user writes luminance = a[0] + a[1] + a[2]. Do we plan to support that? :-)

Yep. It is fine to support that as syntax sugar.

strongoier avatar May 11 '22 13:05 strongoier

Yep. It is fine to support that as syntax sugar.

I'm thinking about this: for ti.types.quant.vector(n, dtype), can dtype be a list of quantized types? For example, we may want to allow something like rgb565 = ti.types.quant.vector(3, [fixed5, fixed6, fixed5]) :-) Then it's not simply a syntax sugar, but a real vector type. (Are we worrying about dynamic indexing here?)

yuanming-hu avatar May 11 '22 13:05 yuanming-hu

I'm thinking about this: for ti.types.quant.vector(n, dtype), can dtype be a list of quantized types? For example, we may want to allow something like rgb565 = ti.types.quant.vector(3, [fixed5, fixed6, fixed5]) :-) Then it's not simply a syntax sugar, but a real vector type. (Are we worrying about dynamic indexing here?)

I understand your point here. TBH this touches some underlying design philosophy of Taichi, which I get a bit confused about from time to time.

As far as I understand, in earlier Taichi a vector is a pure math concept. It promises math operations, but nothing about storage. Because of this, it has great flexibility, allowing components to be non-contiguous, and to have different types. Also because of this, it cannot be directly mapped to native vector types, and cannot support dynamic indexing perfectly.

As time goes by, different voices arise in the community. Many users consider vectors as containers of contiguous same-typed values. As a result, many recent or planned efforts go in this direction - dynamic indexing, native types, etc.

However, these two directions are inherently conflicting - giving more support to one of them means giving less support to the other. To avoid going back and forth on design choices, IMHO we need to have a consistent and clear underlying principle. Then we can easily determine whether a quantized vector can have components with different types.

BTW I have another question: why do we have a struct type in the presence of a vector type which can have components with different types?

strongoier avatar May 11 '22 15:05 strongoier

To avoid getting design choices back and forth, IMHO we need to have a consistent and clear underlying principle.

I also agree on this. We have spent some great amount of time debating on this, and concluded that vector/matrix should behave just like how most users would expect: They are containers holding homogeneous elements, dynamically-indexable, and providing linalg methods. Most of the time, using a Taichi vector/matrix should feel no different from using a GLM/GLSL one. It simplifies the user experience, the API design and the implementation.

If it comes to a point where a non-trivial amount of usage for heterogeneous vectors shows up, 1) from a storage point of view, this could supposedly be implemented via quant structs; and 2) we should consider how to offer a proxy/adaptor to help users convert between this quant struct and vectors (in the mathematical sense). WDYT?

k-ye avatar May 11 '22 15:05 k-ye

However, these two directions are inherently conflicting

Sorry about the confusion. I don't think the two directions are conflicting actually - let me write down a bit more details.

As far as I understand, in earlier Taichi a vector is a pure math concept. It promises math operations, but nothing about storage. Because of this, it has great flexibility, allowing components to be non-contiguous, and to have different types. Also because of this, it cannot be directly mapped to native vector types, and cannot support dynamic indexing perfectly.

I feel like you are mixing global (field) and local vectors. Local vectors are indeed a purely mathematical concept, and they say nothing about storage/data layout. In fact, they are always stored on the stack/register file. Local vectors can easily support dynamic indexing.

Global "vectors" are used to specify storage/quantization. For most of the computation, you convert global vectors to local vectors - the conversion involves loading/storing, as well as decoding/encoding for quantized types.

Then we can easily determine whether a quantized vector can have components with different types.

Perhaps the point is that the components can have different (quantized) storage types, but they must share the same compute type? This ensures that when loading them you get a formal float32x3/float64x3 etc.

BTW I have another question: why do we have a struct type in the presence of a vector type which can have components with different types?

You still need struct since you may have quantized int and quantized float in the same quant struct :-) It's more about "compute_type" in the QuanTaichi paper.

yuanming-hu avatar May 11 '22 15:05 yuanming-hu

I also agree on this. We have spent some great amount of time debating on this, and concluded that vector/matrix should behave just like how most users would expect: They are containers holding homogeneous elements, dynamically-indexable, and providing linalg methods. Most of the time, using a Taichi vector/matrix should feel no different from using a GLM/GLSL one. It simplifies the user experience, the API design and the implementation.

I totally agree with this. In fact, in the future, we should simply reuse the GLSL vector/matrix operators in the codegen :-)

The rgb565 type should be decoded into a vec3 (float32x3, if the compute_type for the three quantized components is float32) for computation, similar to the imageLoad function in GLSL.
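
To illustrate the storage-versus-compute split, here is a minimal plain-Python sketch that decodes a 16-bit RGB565 word into three floats and encodes it back. It is not Taichi code; the function names are hypothetical, and it assumes the common layout (red in the top 5 bits, green in the middle 6, blue in the low 5) with an unsigned-normalized mapping to [0, 1].

```python
from typing import Tuple


def _clamp01(x: float) -> float:
    """Clamp a value to the [0, 1] range."""
    return min(max(x, 0.0), 1.0)


def encode_rgb565(r: float, g: float, b: float) -> int:
    """Quantize three floats in [0, 1] into one 16-bit storage word."""
    r5 = round(_clamp01(r) * 31)
    g6 = round(_clamp01(g) * 63)
    b5 = round(_clamp01(b) * 31)
    return (r5 << 11) | (g6 << 5) | b5


def decode_rgb565(word: int) -> Tuple[float, float, float]:
    """Decode a 16-bit storage word into a homogeneous float triple."""
    r5 = (word >> 11) & 0x1F
    g6 = (word >> 5) & 0x3F
    b5 = word & 0x1F
    return (r5 / 31, g6 / 63, b5 / 31)


# Computation happens on the decoded values, which all share one compute type.
a = decode_rgb565(0xF800)  # pure red in this layout
channel_sum = a[0] + a[1] + a[2]
```

The three stored components have different bit widths, but after decoding they form an ordinary homogeneous triple that can be indexed with [] and used in arithmetic.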

yuanming-hu avatar May 11 '22 15:05 yuanming-hu

If it comes to a point where a non-trivial amount of usage for heterogeneous vectors shows up, 1) from a storage point of view, this could supposedly be implemented via quant structs; and 2) we should consider how to offer a proxy/adaptor to help users convert between this quant struct and vectors (in the mathematical sense). WDYT?

Just to clarify: I don't think we should support a "heterogeneous-vector" that contains both float and int, or float32 and float64. That's against most users' common practice and is against our recent attempt to support dynamic indexing and native types. I can't come up with a typical use case where you need a vector composed of both int and float.

But I do feel like we should allow different components of a homogeneous vector to be stored as different quantized types, since such usage is common in graphics (e.g., the RGB565 format). The price you have to pay, though, is that when you load/store such vectors, you always have to load/store them as a whole, instead of loading a single component. I believe paying such a price is no big deal in practice :-)

(Sorry about joining this discussion late. Most of the thread makes a lot of sense to me. The only point on which I hold a different opinion is that "we should consider allowing different components of a homogeneous vector to be stored as different quantized types")

yuanming-hu avatar May 11 '22 16:05 yuanming-hu

The rgb565 type should be decoded into a vec3

Yep. To support my argument, rgb565 is-a quantized struct, rather than a quantized vector. And +1 that it will be decoded/converted to a regular vec3.

I believe what @strongoier meant in "these two directions are inherently conflicting... we need to have a consistent and clear underlying principle." is also this point... It is conflicting in the sense that rgb565 itself is only a storage type, and shouldn't be used for computing directly. Before participating in any kind of computation, it will need to first go through this decoding stage into a mathematically-legit vector. I think this principle is where we don't have a consensus yet, i.e., vectors should be treated in the purely mathematical way, and should not take much responsibility in fancy storage patterns. To make the quantized type work like a vector, Taichi or the users will need to convert them first.

k-ye avatar May 11 '22 16:05 k-ye

It is conflicting in the sense that rgb565 itself is only a storage type, and shouldn't be used for computing directly.

Exactly. (The only exception is when you want to perform rgb565 + rgb565 using u16 operator +. I assume that is a rare use case.)

You need to either associate a compute_type (e.g., ti.types.vec3) with rgb565, or explicitly let the user do rgb565_array[I, j, k].decode(ti.types.vec3).

I think this principle is where we don't have a consensus yet, i.e., vectors should be treated in the purely mathematical way, and should not take much responsibility in fancy storage patterns. To make the quantized type work like a vector, Taichi or the users will need to convert them first.

I agree with this. Perhaps ti.types.vector is for both computation & storage (since you need AOS/SOA/...), and ti.types.quant.vector is only for (AOS) storage? I can't easily come up with a use case where ti.types.quant.vector needs SOA so I assume it's AOS only.

Yep. To support my argument, rgb565 is-a quantized struct, rather than a quantized vector. And +1 that it will be decoded/converted to a regular vec3.

What confuses me here: if this holds true, isn't ti.types.quant.vector a special case of ti.types.quant.struct? And it sounds like we will need two code paths for ti.types.quant.struct with homogeneous and inhomogeneous components, the former automatically/optionally converted into a vector but the latter always staying a struct.

yuanming-hu avatar May 11 '22 16:05 yuanming-hu

Let me do a quick summary (which contains some personal ideas, though).

  1. Local vectors are always homogeneous, and should be translated to native types if possible, which can naturally provide dynamic indexing support.
  2. ti.types.quant.vector should be loaded into a local vector before doing any calculation. The result should also be stored as a whole.
  3. ti.types.quant.vector can accept its components to have different quant types (e.g. dtype=[fixed5, fixed6, fixed5]), but they should have the same compute type. This will be checked upon type definition.
  4. ti.types.quant.vector can also take an optional physical_type if users don't want the automatically inferred one. This is mainly useful for ti.types.quant.matrix, which may not fit into one primitive type. In this case, users may want to manually specify if they want 32 bits or 64 bits as a storage unit.
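
Points 3 and 4 could look roughly like the following at type-definition time. This is a plain-Python sketch under stated assumptions, not the planned implementation: QuantType, QuantVector, and quant_vector are hypothetical stand-ins, compute types are modelled as strings, and an explicitly given physical width must hold the whole vector (the multi-unit matrix case is left out).

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple, Union


@dataclass(frozen=True)
class QuantType:
    """Hypothetical stand-in for a quantized scalar type."""
    bits: int
    compute: str  # compute type name, e.g. "f32" or "i32"


@dataclass(frozen=True)
class QuantVector:
    """Hypothetical stand-in for a quantized vector type."""
    components: Tuple[QuantType, ...]
    physical_bits: int


def quant_vector(
    n: int,
    dtype: Union[QuantType, Sequence[QuantType]],
    physical_bits: Optional[int] = None,
) -> QuantVector:
    """Validate a quantized vector definition and choose its storage width."""
    components = tuple(dtype) if isinstance(dtype, (list, tuple)) else (dtype,) * n
    if len(components) != n:
        raise ValueError(f"expected {n} component types, got {len(components)}")
    # Point 3: storage types may differ, but the compute type must be shared.
    if len({c.compute for c in components}) != 1:
        raise TypeError("all components must share the same compute type")
    total = sum(c.bits for c in components)
    if physical_bits is None:
        # Point 4, default case: infer the smallest width that fits.
        physical_bits = next((w for w in (8, 16, 32, 64) if total <= w), None)
        if physical_bits is None:
            raise ValueError(f"{total} bits do not fit in one primitive type")
    elif total > physical_bits:
        raise ValueError(f"{total} bits do not fit in {physical_bits} bits")
    return QuantVector(components, physical_bits)


fixed5 = QuantType(bits=5, compute="f32")
fixed6 = QuantType(bits=6, compute="f32")
rgb565 = quant_vector(3, [fixed5, fixed6, fixed5])
```

Checking the compute type when the type is defined means a mismatch surfaces immediately rather than at the first load inside a kernel.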

strongoier avatar May 12 '22 07:05 strongoier