libCEED
Mixed precision functionality
This issue is for discussions related to adding mixed precision capabilities to libCEED, including:
- Which user-facing functions should have mixed precision capability?
- How should the libCEED interface support this (what will the user need to do)?
- What combinations should be available (e.g. mix-and-match input and output precisions, ability to decouple input/output precision from precision of computations, ...)
- Potential for backend specializations, and how this will fit with the interface changes
An idea from @jedbrown was to handle precision conversions through the `CeedVector` objects, which could be modified to store multiple pointers for different precisions, along with a tag to indicate which precision that vector's data currently uses.
I think qfunction variation is the hardest part. Do we need to support different arguments to the qfunction having different precisions or can they be homogeneous (all arguments in matching precision)? Let's defer this part for later.
As a first step, I would add functions like `CeedVectorGetArray_f32(CeedVector, CeedMemType, float **)` and have the implementation convert as needed. We can use C11 `_Generic` so C11 users don't need to specify these types, and C++ overloading to do the same for C++ users. We can have a macro `CEED_SCALAR_F32` that users can define before including `ceed.h` so that C99 callers get a suitable `CeedScalar` and appropriate specialization. (Perhaps better is to provide `#include <ceed/ceed_f32.h>` that sets the macro and includes `ceed/ceed.h`.) This would be for callers who want to choose precision at compile time.
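To make the conversion behavior concrete, here is a minimal standalone sketch of the idea (not the real libCEED API; `MirrorVec` and `MirrorVecGetArray_f32` are made-up names): a precision-specific getter that lazily mirrors the authoritative double storage into float on first request.

```c
#include <stdlib.h>

/* Toy stand-in for a CeedVector whose authoritative data is double but
   which can hand out a float view on request. Illustrative names only. */
typedef struct {
  size_t length;
  double *data_f64; /* authoritative storage */
  float  *data_f32; /* lazily allocated mirror */
} MirrorVec;

/* Analogous to the proposed CeedVectorGetArray_f32: convert on demand. */
static int MirrorVecGetArray_f32(MirrorVec *vec, float **array) {
  if (!vec->data_f32) {
    vec->data_f32 = malloc(vec->length * sizeof(float));
    if (!vec->data_f32) return 1;
    for (size_t i = 0; i < vec->length; i++)
      vec->data_f32[i] = (float)vec->data_f64[i]; /* narrowing conversion */
  }
  *array = vec->data_f32;
  return 0;
}
```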
We then add a `CeedElemRestrictionSetScalarType(CeedElemRestriction, CeedScalarType evec_type)` that converts if necessary (when provided with an L-vector). This will require visiting some backends for the cross-type packing, but I don't think it will be a lot of code.
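For illustration, the cross-type packing amounts to a gather plus a cast. A standalone sketch (hypothetical function, not the actual backend code) of restricting a double L-vector into a float E-vector via an offset map:

```c
#include <stddef.h>

/* Gather from a double L-vector into a float E-vector using an offset
   map of num_elem*elem_size entries. Illustrative only; the real
   CeedElemRestriction backends are organized differently. */
static void restrict_f64_to_f32(size_t num_elem, size_t elem_size,
                                const size_t *offsets,
                                const double *l_vec, float *e_vec) {
  for (size_t e = 0; e < num_elem; e++)
    for (size_t i = 0; i < elem_size; i++)
      e_vec[e * elem_size + i] = (float)l_vec[offsets[e * elem_size + i]];
}
```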
I kinda think `CeedBasis` public interfaces (setting matrices and doing linear algebra) could just stay double precision all the time, and I find it more compelling to possibly add `__float128` than to support direct specification in single precision (they can be converted as needed behind the scenes).
We can have `CeedVectorGetArrayUntyped(CeedVector, CeedMemType, CeedScalarType, void *)` to make life easier for dynamic callers who prefer to cast somewhere else.
> I think qfunction variation is the hardest part. Do we need to support different arguments to the qfunction having different precisions or can they be homogeneous (all arguments in matching precision)? Let's defer this part for later.
Just to make sure I'm clear: are you referring to the user-provided per-point QFunction, the libCEED QFunction interface, or both?
Regarding `CeedBasis`, when users can tolerate lower precision for their application(s), they could potentially get some nice speedup from single precision `CeedBasisApply` in the non-tensor case. Specifically, for the MAGMA backend, with the GEMM/batch GEMM computations, the ability to use SGEMM instead of DGEMM could be useful. I don't know if this is considered too narrow of a use case (non-tensor basis and no need for double-precision accuracy in the basis computations)... Also, I guess `CeedBasisApply` is different than the basis creation functions since it doesn't take any `CeedScalar` arguments, only `CeedVector`s. If a `CeedVector` knows about its type, a lot of the "overloading" could be handled by the backends inside the backend-specific implementations. I guess this goes back to the question of whether we want to be able to specify computation precision, or have it deduced from the types of the `CeedVector`s? I'd argue there would likely be interest in being able to do computations in double while input/output are single, but also to be able to do everything in single.

Edit: was the intention that `_f[something]` would be added to routines to specify computation precision, but not necessarily input/output?
`CeedBasis` has several interfaces that take `CeedScalar *`, and I think there's no harm in making those all `double` (perhaps with complex variants at some point). It can be converted internally. We have several right now:

- Outer solver/Krylov vector storage (PETSc, MFEM, etc.), which gets placed into a `CeedVector` as an L-vector
- E-vector representation (ephemeral)
- `CeedBasisApply`
- Q-vectors processed by user-provided QFunctions
We could make CeedElemRestriction have no opinion about types, in which case we would convert as needed. Then L-vectors, CeedBasis, and CeedQFunction would have a designated precision. The precision of E-vectors would be inferred to match the compute precision of CeedBasis (converting as needed to/from the L-vector precision).
A simplification would be to make `CeedBasis` precision always match the QFunction precision.
To make this incremental, I think it will be simpler to start with the L-vectors and get that merged, then work inward.
I'm not sure I understand what your list is intended to represent. Cases where we use `CeedScalar *` directly as a type? `CeedBasisApply` only takes `CeedVector`s for input/output. E-vectors are also represented as `CeedVector`s. So I must be missing something.
I like the idea of the element restriction being able to infer the output (or input, in transpose mode) type, but I'm not sure how it would work in practice. The `CeedElemRestriction` doesn't know about the `CeedBasis` or `CeedQFunction`, so it can't know which precision they use unless we tell it. I think your initial suggestion of a way to set the precision of the element restriction may be necessary to keep the current separation between/independence of CeedOperator "pieces" (element restriction, basis, qfunction).
- The list included internal representations (E-vectors), and I intended it to identify each place where precision can be varied in `G^T B^T D B G`. I think it's a significant simplification in the code if we can make the precision of `B` match the precision of the Q-function `D` -- it still decouples L-vector precision from compute precision. Note that each (active or passive, input or output) field provided to a `CeedOperator` could have a different precision. This does restrict algorithmic space somewhat, but I don't know how serious it is. On CPUs, I think it's not a big deal because E-vectors only ever exist in cache. It may be a lot more significant for GPU kernels that need to store complete E-vectors.
- Backends can always fuse `CeedElemRestriction` into whatever else they do. The CPU backends, for example, never have an actual E-vector, just a bit of workspace for one element (`/serial`) or a batch (`/blocked`). That said, we have two vectors, so if those vectors have preferred precision set, then all the necessary information is there.
```c
int CeedElemRestrictionApply(CeedElemRestriction rstr, CeedTransposeMode t_mode,
                             CeedVector u, CeedVector ru, CeedRequest *request);
```
The `CeedOperator` backend gets to choose how to represent the E-vector data, so even if it just calls `CeedElemRestrictionApply`, that function knows what precisions to work with.
Right, I guess this raises another important question: how much do we want to limit the multiprecision functionality to things that happen inside a `CeedOperatorApply`? I guess if `CeedVector` provides a function for converting to other precisions, that would cover any cases where a user (or library) wants to do a stand-alone `CeedElemRestrictionApply` for something and have the output in a different precision -- should such a situation arise. Or we could specify that you should set the precision of the input/output vectors ahead of time, as suggested.
Though even if we are only considering the case where the Operator is setting the preferred element restriction output precision based on all the information it has, the function that lets you set the element restriction precision might still make more sense than relying on the input/output `CeedVector`s having their precisions set. I'm thinking this would be more consistent across backends, including those that never set up E-vectors as `CeedVector` objects, like cuda-gen -- which does still get information from the element restrictions while building the code string, despite not creating E-vectors (or using `CeedElemRestrictionApply`).
I figured `CeedVectorGetArray()` and the like would be able to mirror data to any requested precision, but `CeedElemRestrictionApply` would inspect the available precisions of the input vector and the preferred precision of the output vector and apply without needing to convert the input.

I'd think cuda-gen could still propagate the desired (ephemeral) E-vector precisions from the Basis. One thing I'd like to facilitate is having a residual in double precision and Jacobian action in single, or the Jacobian in double and a preconditioner in single.
> I'd think cuda-gen could still propagate the desired (ephemeral) E-vector precisions from the Basis.
Yes, it could, I just thought it might be preferable to have more consistency across the CUDA/HIP backends, such that they could all get/set this information in the same way. But cuda(hip)-gen is so different from the other backends already that perhaps that's not really a selling point.
Do you think it'll be hard to handle consistently if CeedElemRestriction is precision-agnostic (uses its arguments rather than set as state)?
My concern is that when we wire up several different operators, we'd like for the data (integers) of a `CeedElemRestriction` to be shared between different precisions. That's easier if it's just the same object, rather than needing to reference count the backing storage.
> Do you think it'll be hard to handle consistently if CeedElemRestriction is precision-agnostic (uses its arguments rather than set as state)?
I've been thinking about this, and I haven't yet come up with any problems (not that that means there couldn't be any, of course). I was initially concerned about `CeedElemRestrictionCreateVector`, since it calls `CeedVectorCreate`, so we couldn't set the precision ahead of time. But since `CeedVectorCreate` doesn't allocate any memory, we could just have a function for setting the precision of a `CeedVector` which is meant to be called after `CeedVectorCreate`. In fact, maybe this is what you always had in mind, since it would avoid breaking the current `CeedVectorCreate` interface by adding a precision argument?
Yeah, I don't see harm in a separate `CeedVectorSetPrecision()`. I also envision being able to convert one way and the other even after memory has been allocated or there are values present (perhaps with a `CeedVectorFlush()` to free any internal arrays that were mirroring to a different precision).
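A standalone sketch of what such a precision-setting function might do under the hood, converting any values already present; all names here are hypothetical stand-ins, not the libCEED API:

```c
#include <stdlib.h>

/* Toy model: a vector carries a precision tag, and changing the tag
   converts data that has already been allocated. Illustrative only. */
typedef enum { PREC_F32, PREC_F64 } ToyPrecision;

typedef struct {
  size_t length;
  ToyPrecision precision;
  void *data; /* length entries in the current precision */
} TaggedVec;

static void tagged_vec_set_precision(TaggedVec *vec, ToyPrecision new_prec) {
  if (vec->precision == new_prec || !vec->data) { vec->precision = new_prec; return; }
  if (new_prec == PREC_F32) { /* demote double -> float */
    float *out = malloc(vec->length * sizeof(float));
    const double *in = vec->data;
    for (size_t i = 0; i < vec->length; i++) out[i] = (float)in[i];
    free(vec->data); vec->data = out;
  } else { /* promote float -> double */
    double *out = malloc(vec->length * sizeof(double));
    const float *in = vec->data;
    for (size_t i = 0; i < vec->length; i++) out[i] = (double)in[i];
    free(vec->data); vec->data = out;
  }
  vec->precision = new_prec;
}
```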
Okay, I'm trying to get a clear picture here of what a "`CeedVector`-centric" approach to multiprecision could look like, based on discussions so far.
Some potential "general principles":
- `CeedVector`s store information about their current type. The default precision set by `CeedVectorCreate` could remain `CeedScalar` (probably double), but it can be changed by calling `CeedVectorSetPrecision`.
- The `Ceed[Get/Set/Take/Restore]Array` functions remain as-is (handling data of type `CeedScalar`), as well as `CeedSetValue`.
- `Ceed[Get/Set/Take/Restore]Array_f32` functions are added to specifically handle raw `float *` data. (Perhaps also `_f64`, in the case that someone wants to set `CeedScalar` to single but use double in specific cases?) These would mostly be intended for users wanting to directly interact with `CeedVector`s in different precisions.
- All of the `Get/Set/Take` functions perform type conversions if the precision of the vector and the data do not match.
- Functions like `CeedVectorNorm`, `CeedVectorScale`, etc. will use different implementations based on the current precision of the vector.
Internally in the library, in backends, etc., we of course use these `Get/Set` functions. These are often in places where we would like to be able to handle different types, e.g. within an `Operator`, where we have set an E- or Q-vector to use some precision determined by its `Basis` or `QFunction` objects. To keep the code as general as possible, could we use the suggested `Ceed[Get/Set/Take]ArrayUntyped` in many (it might end up being most?) places internal to the library, rather than any of the specific `_f[prec]` versions or the default version (using `CeedScalar`)? Or would we likely cause some horrible bug-prone mess? Of course, a more general issue is that if we let E-vectors have different precisions, that could be tricky for many backends, which use an array of type `CeedScalar` for the E-data (and the `Vector[Get/Set]` functions are used to access this data directly). I think this is probably the trickiest issue to handle...
For the "sub"-operators, we could potentially treat them differently:

- Element restrictions deduce which implementation of `Apply` to use based on input/output types. To change their behavior, you must set the precision of the input/output `CeedVector`s.
- Basis objects and QFunctions can have their own precision set, declaring that you want their computations to be performed in a specific precision, no matter the input/output types. Incorrect input/output types could result in an error or an automatic conversion. (Is this too dangerous?)
- `CeedOperator`s will use the precision information of their basis and qfunction objects to set the precision for any internal/temporary vectors (E- or Q-vectors), which will determine the behavior of their element restrictions.
Does this follow more or less what you were thinking?
Yeah, that's about right. I think `CeedVectorGetArray` (and relatives) can default to `CeedVectorGetArray_f64` (or whatever matches `CeedScalar`) and we can use C11 `_Generic` to make it automatically match types for anyone using a C11 compiler (and similar with templates for C++), so most callers would not need to know about the actual specializations.

I think the implementation becomes much simpler if Basis and QFunction always see homogeneous types. This saves needing to convert within those compute kernels and limits the number of variants that need to be written/maintained/compiled.
> Yeah, that's about right. I think CeedVectorGetArray (and relatives) can default to CeedVectorGetArray_f64 (or whatever matches CeedScalar) and we can use C11 _Generic to make it automatically match types for anyone using a C11 compiler (and similar with templates for C++), so most callers would not need to know about the actual specializations.
I'm not sure I follow. Wouldn't this mean we have to have three interfaces for `CeedVector`: one for C++, one for C11, and another for C99 (and perhaps that C99 users won't have multiprecision available to them)? How would this work with `GetArray` usage inside other library routines, like `CeedOperatorApply`?
> I think the implementation becomes much simpler if Basis and QFunction always see homogeneous types. This saves needing to convert within those compute kernels and limits the number of variants that need to be written/maintained/compiled.
By homogeneous types, do you mean just that input/output vectors are all the same type, or that input/output vectors also match the precision of the computation, as well?
For vectors, I mean that we can make `CeedVectorGetArray` automatically use the precision matching the provided pointer when the caller is C11 or C++, but that C99 callers will either need to call the precision-specific functions like `CeedVectorGetArray_f32` or declare the default precision for the compilation unit.

By homogeneous types, I mean that the input and output vectors would all have the same precision. A QFunction is free to convert internally (promoting or demoting) as it likes, though the cost of doing so may be nontrivial.
But we'd still have to have separate interfaces, correct? And a way for the user to specify at compile time whether to compile the C11 or C++ interface? And I'm still not sure how that would work within the library itself (other library functions that call `Get/SetArray`).
We'd need to implement both the `_f32` and `_f64` variants (also so they can be bound by other languages), but C11 `_Generic` works something like

```c
#define CeedVectorGetArray(ceed, mtype, x) _Generic((x), \
  double **: CeedVectorGetArray_f64, \
  float **: CeedVectorGetArray_f32 \
)(ceed, mtype, x)
```

and we'd provide inline overloads for C++. Whether to provide these within `#include <ceed/ceed.h>` can depend on `__cplusplus` and `__STDC__`.

It's true that we can't call these within the library (so long as we don't want a hard dependency on C11), but I think it's okay because we can be type-agnostic except for the numerical kernels, which need to specialize anyway.
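For anyone unfamiliar with `_Generic`, the dispatch pattern can be exercised in a self-contained way; the stub getters below are placeholders (not libCEED functions) that only record which overload was selected:

```c
/* Minimal demonstration of _Generic dispatch on the pointer type.
   The stubs stand in for precision-specific getters. Requires C11. */
static int last_called; /* 32 or 64, set by whichever stub runs */

static int GetArrayDemo_f64(int vec, double **x) { (void)vec; (void)x; last_called = 64; return 0; }
static int GetArrayDemo_f32(int vec, float **x)  { (void)vec; (void)x; last_called = 32; return 0; }

/* The controlling expression's type picks the function at compile time. */
#define GetArrayDemo(vec, x) _Generic((x), \
  double **: GetArrayDemo_f64, \
  float **: GetArrayDemo_f32)(vec, x)
```

Callers just pass a `double **` or `float **` and get the matching specialization with no explicit suffix.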
Oh! I think "inline" was the crucial thing I was missing for C++. Otherwise, I couldn't figure out how we could use `__cplusplus` if we want to consider the type of compiler being called for the code using the library rather than while compiling the library itself.
This will make things more complicated, but I've been thinking a lot more about where the precision conversion should take place and whether or not we should limit the input/output types of the basis to match -- strictly from the GPU point of view (this is all different from CPU, as mentioned above). This all mostly applies to the non-fused backends.
First, some ground rules: we'll assume we're memory bound, since we want to optimize performance for the tensor case. Thus, our main goal is reducing the high precision read/write actions to main memory, which will matter more than the lower-precision computation in terms of speedup.
We will also consider two main cases of mixed-precision computation:
- The "high-low case", where most computation and data propagation is happening in the higher precision, as well as the ultimate input/output L-vectors, but one or two sub-operators have a lower computation precision.
- The "low-high case", where the vectors are propagated between kernels in the lower precision, but computation happens in a higher precision, to gain the additional memory bandwidth but not lose all of the accuracy of a full-low-precision computation.
For clarity, we'll always use double and float to represent the high and low precisions.
Let's consider the two types of input vectors to a `CeedOperator`: active and passive.
- Passive inputs. In the common use case of a passive input which has an identity restriction and no basis action (e.g., qdata produced by the same backend or with the same stride as the current backend), one conversion at the beginning (through a `CeedVector` function which performs proper copying and/or casting, depending on whether we are keeping the data in both precisions or only one), and storage of the data in the new precision to be used with every application of the operator afterward, doesn't seem like a bad choice.
However, this won't necessarily be the best choice in at least two cases: one, when the input data can change between applications of the operator. Then, with every application, we pay the cost to read from double, convert to float, write to float. When it's time to use the data -- likely in the QFunction -- we read again in float. Whereas, if we didn't officially convert the "whole" vector, but had the QFunction kernel itself convert local data to float, then we only pay one double read cost at the time of QFunction application. (Again, this only applies if we can't just use the vector we already converted to float, because the data changed.)
The second case is the "low-high" case, where the input to the QFunction will be float, but we want to convert it locally to double for the QFunction computation. If the `CeedVector` is already in single precision, then we don't want to convert to double "up front" when applying the operator, but rather read from float and convert inside the QFunction kernel.

This is assuming that we are still allowed to have different computation and input/output precisions for the QFunction, even if we specify that all inputs and outputs should be the same precision.
- Active inputs. Here, the best guideline for memory movement optimization seems to be to have the kernels handle conversion "locally," especially since in general, the vectors being propagated between kernels are internal only (E-Vectors and Q-Vectors). Furthermore, the conversion should be handled at the output of the previous kernel.
Let's consider the case where everything will be in double precision, except the QFunction. The output of the basis kernel for the active input will be an input for the QFunction, so it needs to be in float. If we apply the basis kernel as usual, call a conversion kernel/function (through the libCEED API), then call the QFunction kernel, we have a write in double + read in double/write in float + read in float. If the QFunction handles the conversion locally into workspace data, we still have write in double + read in float. If, however, the internal Q-vector created for the output of this input's E-vector already has its data allocated for float, and the basis kernel performs the write to float at its last step (converting from local double workspace that was used in the computation), then we only have write in float + read in float. Of course, if we want the basis and QFunction to be done in single precision, then we'd want the write to float happening on the output of the element restriction.
Also consider the "low-high" case: here, we never want double precision `CeedVector`s to exist internally, to avoid all read/writes to main memory in double precision. Instead, each kernel can read in float, use a local double workspace, then convert back and write in float.
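A standalone sketch of that kernel-local pattern, using an axpy as a stand-in for a real basis or QFunction kernel (illustrative only, not backend code): memory traffic stays in float while the arithmetic is done in double.

```c
#include <stddef.h>

/* "Low-high" pattern: data moves through memory in float, but each
   kernel promotes to a double workspace for the arithmetic and
   demotes again on the write back. */
static void axpy_lowhigh(size_t n, float alpha, const float *x, float *y) {
  for (size_t i = 0; i < n; i++) {
    double xi = (double)x[i]; /* promote reads into registers */
    double yi = (double)y[i];
    yi += (double)alpha * xi; /* compute in double */
    y[i] = (float)yi;         /* demote on the write back */
  }
}
```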
This setup comes with several drawbacks, like the proliferation of kernels (compute precision + input precision + output precision) and needing extra local workspace memory. It also means a lot of code changes in every backend. In the MAGMA backend, we have tensor basis kernels with a lot of template parameters, so we're building a lot of kernels already... However, I'm not sure how we can see much gain on GPUs without letting the kernels handle the conversion locally, which requires doing it in the kernels themselves, unless I am missing something.
All of this raises the question, I suppose, on how much this functionality matters for the non-fused GPU backends, since it's all considered in terms of the tensor basis case, where one would generally want to use cuda-gen, if possible. This could also be an argument for limiting the scope of what's allowed in terms of mix and match, as @jedbrown mentioned previously (e.g., basis and QFunction computation precision must match), which gets us back to the case of the element restriction handling the conversion (though it would still need to be inside the kernel). I do think that the "low-high" case might be especially worth pursuing on GPUs, however. For example, in cuda-gen, due to being memory bound, if we could read/write L-vectors in single precision (requiring the user to provide single precision vectors) but do everything else in double, can we achieve similar speedup as single but with less error?
Thanks for these detailed thoughts. My opinion is that conversion should happen in restriction and that when a vector (active or passive) does not match the compute precision, the restriction (or transpose) should be explicit even if it's otherwise the identity. We could have an interface to make a `CeedOperator` coerce its passive inputs so that you can fill it up using whatever precision is convenient and convert so that two copies are not persisted. I consider this a minor optimization.
I think low-high is most important for speed with simple qfunctions. For very expensive qfunctions, there may be benefit from lower precision (but numerical stability is usually not well thought out for very expensive qfunctions). Also, global vectors of `double` may be necessary in some cases to maintain orthogonality in Krylov methods, though there is some interesting work (including Kasia and Steve's) to improve accuracy with a reduced precision Krylov basis.
If we were to decouple Basis and QFunction precision as a libCEED feature, we'd want to be able to do it automatically, not by forcing the conversion into the user's QFunction. But for the specific experiments I'd like to run to test/demonstrate benefits of stable numerics in QFunctions, I'm happy to manage the code that converts inside the QFunction. (That is, QFunction author is entirely responsible for conversions, and libCEED doesn't need to do anything.)
I think the ability to convert in Restriction is tractable without code/template explosions and would open up significant research and production options. So I would accept that scope and only consider further extensions if/when a compelling case is made for them.
One potential issue with always doing the conversion in the restriction is that minimized memory movement will depend on the mode: for high-low, it's better to convert in the element restriction (write to float, then read from float for basis), but for low-high, it'd be better to wait until the basis application and do a local conversion (write to float, read from float + convert to double in registers vs. write to double, read from double).
That said, I have gone ahead and created two branches where I am playing around with conversion in the restriction for cuda-gen -- not in a way we'd want to merge, but to compare performance. The two versions are low-high and high-low, with everything after the conversion in the restriction happening in the same precision (and with basis data stored in the computation precision, i.e., the "second" precision). I think the exact location of the conversion to higher precision matters less for cuda-gen than the non-fused CUDA backends due to how the element restrictions work (reading into registers, not doing entire elem restriction as a separate kernel). The one snag here is that technically, not quite all computation happens in the "second" precision, due to the atomicAdd calls in the transpose element restriction needing to match the output (L-vector) precision. The addition/reduction of E-vector into L-vector isn't much computation compared to the basis or QFunction, but it is still a different situation than the input element restrictions, which just convert the data.
Good points. I'd say the conversion is logically part of the restriction, though an implementation could choose to associate/fuse it differently. I think before we put too much thought into optimizing this pattern, we'll want some real-world tests. On Gauss-Lobatto nodes to Gauss quadrature points, the condition number of the basis interpolation matrix is less than 3 and the basis derivative matrix is around 10. So this part gives up about one digit of accuracy to perform in single precision versus double. I ultimately think we may consider methods that use fixed-point 32-bit integers for some of the intermediate representations, but I think it's premature until we have use cases showing where the extra bits are useful. And I think associating with restriction is entirely reasonable for these tests and performance with cuda-gen won't care.
After some informal discussions on Slack, I have the following proposal for mixed-precision "rules" for `CeedVector`s:
- For backends with mixed-precision usage, `CeedVector`s' backend `data` structs will no longer contain `CeedScalar *` arrays, but rather structs which can hold pointers to arrays in different precisions.
- A suggestion for what this struct might be from @jedbrown was:

```c
struct CeedScalarArray {
  size_t length;
  double *double_data;
  float *float_data;
  /* maybe other precisions if we add them in the future */
};
```

- The `CeedVector` struct itself will have a new member for its preferred precision. This member will be a `CeedScalarType`.
- When a vector is accessed read-only and a precision other than the preferred precision is indicated, a new data array will be allocated in the `CeedScalarArray` if necessary, and the data from the array corresponding to the preferred precision will be used to copy/convert to the requested precision. (Theoretically, then, if we allowed more precisions in the future, a second read-only access in a third precision would still copy from whatever the preferred precision array is.) We may or may not need a way to keep track of which data arrays are currently allocated inside the struct or the vector.
- When a vector is accessed for read-write access, the `CeedScalarArray` will "collapse" down to only the requested access precision. Any other arrays that may be allocated will be deallocated. The vector's preferred precision will be updated to match the requested precision. (Would we want two versions of this function, such that the default version will copy/convert from the preferred precision into the new array, but another version of the function will just allocate new memory to save time when we are just going to overwrite the data anyway?)
However, I realize this will cause problems, perhaps, with memory management. Right now we have a clear distinction in the backend data structs between arrays that have been allocated by libCEED vs given as a pointer, to make sure and deallocate as necessary when the vector is destroyed. A memory management system like the one above could lead to a `CeedScalarArray` containing a pointer to double data owned elsewhere and a pointer to float data that was allocated by libCEED, for example. So I guess we could keep the `array` and `array_allocated` setup in the backend data, but where data in `array` could be used to create data in `array_allocated` upon read access request in a different precision? Could there be any problems there that I'm missing?
I think this is a good model. Yes, the `array_allocated` would be per-precision. I don't think eager deallocation is necessary since I think it usually won't change the peak memory use.
Rough code for what I was thinking of.

```c
typedef enum {
  CEED_PRECISION_HALF = 0,
  CEED_PRECISION_SINGLE = 1,
  CEED_PRECISION_DOUBLE = 2,
  CEED_PRECISION_DOUBLE_DOUBLE = 3,
} CeedPrecisionType;

...

char *data_ptrs[4];

...

double *double_ptr = (double *)data_ptrs[CEED_PRECISION_DOUBLE];

// Set two pointers of user arrays to use
CeedVectorSetArray(v, CEED_MEM_HOST, CEED_PRECISION_HALF, CEED_USE_POINTER, &h_ptr);
CeedVectorSetArray(v, CEED_MEM_HOST, CEED_PRECISION_DOUBLE, CEED_USE_POINTER, &d_ptr);

// Copy to half precision
// Would contain data from d_ptr converted to half
CeedVectorTakeArray(v, CEED_MEM_HOST, CEED_PRECISION_HALF, &h_ptr);
```
Did you want to create a separate `CeedPrecisionType` when we already have `CeedScalarType`? Do you think they should be separate?

Would a `data_ptrs` array take the place of each of `array`/`array_allocated` (or `array`/`array_borrowed`/`array_owned` as proposed in #853) in the backend's private data for the Vector?
I was thinking something like (inside backend data)

```c
CeedScalarArray array;
CeedScalarArray array_borrowed;
CeedScalarArray array_owned;
```

so the `double_ptr` would be `array.double_data`.
One reason I thought it would be better to stick with something like the `CeedScalarArray` object is that we could use it cleanly in, e.g., `edata` arrays inside Operators (in case not all E-vectors have to have the same precision, or just so we don't have to have separate pointers for every precision the E-vectors may use, even if they're all the same). However, we could probably work around this if something like the `data_ptrs` setup is preferred.

(And maybe we should pick a different name for `CeedScalarArray` -- like `CeedData` or something more generic?)
Oh, I was just scratching down some ideas. I wouldn't replicate two enums for basically the same thing, I just didn't remember the enum names offhand.
I don't think we would want to pass around an object of different precision pointers divorced from their validity/staleness information, right? There is a lot of state management logic, and I'd think we would want to hide that inside of the CeedVector so that logic doesn't have to leak into other parts of the interface.
Do we want to adopt the convention that all QFunction data will be in the same precision? That would simplify our internal needs.
I think it's reasonable for QFunctions to use homogeneous precision because
- conversion isn't free, so not great to have occur in an inner loop that needs to vectorize
- conversion is clutter in user code and implicit promotion rules may be surprising and not match between languages
- it'll be error-prone to ensure safety between the source of a QFunction and creating the object
We should probably expect the user to explicitly set the precision of the context data?