zig
zig copied to clipboard
Add new builtin: @typeId
Add a new builtin called @typeId
:
@typeId(comptime T: type) usize
This builtin returns a unique integer for each type passed, and will return the same integer for the same type.
The return value must not be consistent inbetween builds, so a second build might return completly different numbers for the same types.
An alternative variant might return u32
or u64
to have a stable interface between different platforms.
Use cases
- Type erasure
- Additional programmatic type safety
- Variadic in-memory serialization
Prior art:
User-land implementation
The following version is runtime only, as we can't perform intFromPtr at compiletime:
fn typeId(comptime T: type) TypeId {
const Tag = struct {
var name: u8 = @typeName(T)[0]; // must depend on the type somehow!
inline fn id() TypeId {
return @enumFromInt(@intFromPtr(&name));
}
};
return Tag.id();
}
One small request: it would be really nice if this returned u32
or another smaller integer type, instead of usize
It could also take inspiration from the @intFromEnum
behavior and return an integer denoted as anytype
which may have the smallest integer type possible. This leaves room for shrinking the return type in the future because I'm not sure how/if this would affect incremental story.
There is actually an at-comptime solution for typeId
that works today but it is absolutely horrible:
fn typeId(comptime T: type) u32 {
return @intFromError(@field(anyerror, @typeName(T)));
}
Bad status quo solutions have helped back changes such as this one before so just wanted to share. :)
@sno2 Using RLS here is a bit tricky. There are two options:
- The same integer value is returned regardless of the type. In this case, that integer must fit in some minimum size (e.g.
u32
), and so there is no difference to just returning that size integer! - The integer value differs based on the result type. I presume this is what you intend. The issue here is that it's a bit of a footgun; if you accidentally pass the result type as
u32
and upcast tou64
, or vice versa, you might accidentally get a different ID to a different part of your codebase! This would probably lead to quite tricky bugs.
If this proposal is accepted, I definitely think the returned integer should have a fixed size (probably 32 bits). 32 bits is a sweet spot: 16 is too few to represent the number of types which might exist in a large application, but 64 (or usize) is very excessive (in fact, the canonical compiler implementation won't let you have more than 2^32 distinct types today, and I am beyond certain nobody will ever hit this limitation).
@sno2 Using RLS here is a bit tricky. There are two options:
- The same integer value is returned regardless of the type. In this case, that integer must fit in some minimum size (e.g.
u32
), and so there is no difference to just returning that size integer!
I was more thinking of the compiler counting how many types we have defined and log2 that into an integer type as the TypeId
integer type. Although, this is now seeming like a hard task with many different edge cases depending on when you call @typeId
so I think using u32
as you said is the best option.
Also, ~~possibly useless~~ side note but Rust's type id uses a u128 but I wasn't able to find any reasoning or investigations into shrinking it anywhere.
32 bits is a sweet spot: 16 is too few to represent the number of types which might exist in a large application, [...]
I can't think of a use case I would ever use this builtin for (it's a bit at odds with my fundamental design philosophy), but for everyone here who seems to have use cases: Do you expect to use it for all/most types that appear anywhere (including intermediately) in your entire program? Think of every
- integer type, signed and unsigned,
- enum, union
- pointer with and without mutability, for every alignment
- arrays with different lengths
- optional
That multiplies into a really big number.
I would expect many use cases actually only use this builtin when serializing a rather small set of types (and maybe their fields' types, recursively) over particular interfaces.
Therefore it might be less work for the compiler to only assign ids to types that have been passed to @typeId
.
Doing that, maybe a u8
would be enough for some use cases, and we can guarantee only code which uses this feature "pays" for it (assuming this isn't something we already get for basically free via the intern pool - which it might be).
(Then again, maybe this is more of an ergonomics feature than a performance-oriented one? Plus, deduplicating ids in userland is also possible, even if it poses some of the same global-type-list challenges this proposal fundamentally tries to move into the compiler.)
Could someone give a solid use-case for this, I have nothing in my head. And never came across situation I need this even remotely.
The one reason I've wanted it in the past is for safety on *anyopaque
. Assuming @typeName
is guaranteed to be unique, it can be used instead.
const AnyPtr = struct {
type_id: usize, // alternatively, [*:0]u8
ptr: *anyopaque,
pub fn from(item: AnyPtr, comptime T: type, ptr: *T) AnyPtr {
return .{
.type_id = @typeId(T), // alternatively @typeName(T).ptr
.ptr = @ptrCast(@alignCast(ptr)),
};
}
pub fn readAs(item: AnyPtr, comptime T: type) *T {
if(item.type_id != @typeId(T)) unreachable; // alternatively `item.type_id != @typeName(T).ptr`
return @ptrCast(@alignCast(item.ptr));
}
};
Could someone give a solid use-case for this, I have nothing in my head. And never came across situation I need this even remotely.
Basically type checking (see linked any-pointer project) when doing type erasure, then see #19859 where you need to store a user-defined type in a non-generic datastructure (think HashMap(TypeId, *anyopaque)
, you can implement RTTI with, ...
As @MasterQ32 indicated in the original issue, his any-pointer
is a great example of an exact usecase, but another explicit example of where @typeId
would also be useful is in an ECS. The hack I shared above was actually created while attempting to solve an enum
identification system with @slimsag where values could be detached and reattached from their respective enum
types to identify components and events and store their identities easily.
Of course, all of these issues can be solved with userspace hacks but:
- The runtime tricks are of course runtime-only, which didn't solve the ECS problem above
- The comptime tricks are all awful hacks
- The runtime tricks are also hacky, albeit less awful
To not use any hacks while obtaining unique type identifiers, you can do something like this, but:
- It would require passing it to all code (perhaps this is acceptable considering Zig's dislike of global state)
- It requires tying all code that requires shared type ID together, including dependencies/dependents, which is a huge pain if you care about extensibility (and I'm not sure if it's even possible as you can't store this comptime-only type globally and I can't think of a way to effectively pass it around to dependencies/dependents 😅)
- The performance/memory usage is extremely questionable, especially considering (unless this has changed recently) that the compiler does not garbage collect unused data
In my opinion, any sort of RTTI-ish solution would greatly benefit from this builtin. I imagine Felix sees it the same way, thus why he opened this issue.
About implementation details @rohlem, check out my PR to see how easy it is to implement from the InternPool. In short, the InternPool stores types (and other deduplication-dependent data like default values, memoized calls, etc. though this is not important for this explanation) by inserting them into a std.AutoArrayHashMapUnmanaged(void, void)
which produces a single, unique InternPool.Index
(a u32
-backed enum
) which we can then reuse for @typeId
. If, understandably, the compiler folks wouldn't like exposing InternPool
indices directly, we could simply create a second map, a std.AutoArrayHashMapUnmanaged(InternPool.Index, void)
, which would also be relatively inexpensive.
I think #5459 is directly related (and solved by #19861).
@rohlem Do you expect to use it for all/most types that appear anywhere (including intermediately) in your entire program? Think of every
* integer type, signed and unsigned, * enum, union * pointer with and without mutability, for every alignment * arrays with different lengths * optional
This made me wonder about how the technical implementation would solve something like this (I'm pretending like @typeId
just returns a usize for convenience here. You could insert any necessary @intFromEnum
or whatever to make it work)
fn SelfReferentialStruct(comptime T: type) type {
return struct {
const Self = @This();
const array: [@typeId(Self)] u32 = undefined;
};
}
edit: Nevermind. I just checked and there already is a check for similar transitive failures in the compiler. :)
Am I correct that the idea os basically to split pointer to T
to void pointer and type separately, and to identify type to use it's unique identifier?
If that's correct, that's very interesting feature. But I would like to extend it even further. If it's stored separately, we can save this information to disk and restore back. But only if we have stable guarantee not only within one build.
Do I'd to take into account this feature too with this proposal. Where this might be useful? I think in databases, where there information about types is stored on disk, like in PostgreSQL where is used Oid type which uniquely identifies almost any object in database - types, attributes, tables, etc.
After sharing my first terrible comptime
typeId
hack, I'm back for more:
pub fn typeId(comptime T: type) u32 {
_ = T;
const fn_name = @src().fn_name;
return std.fmt.parseInt(u32, fn_name[std.mem.lastIndexOfScalar(u8, fn_name, '_').? + 1 ..], 10) catch unreachable;
}
This one even exposes the InternPool.Index
of the memoized call - enjoy! :^) (this one is runtime only though :()
stable guarantee not only within one build
This is not practical or even really possible. You can already assign explicit IDs to types manually, through a variety of methods, which is a much better option for serialization usecases.
@rohlem:
Do you expect to use it for all/most types that appear anywhere (including intermediately) in your entire program?
I don't necessarily care about having an ID for every single type in the program, because there are also going to be lots of comptime utility types for which it's not necessarily helpful to have an ID for. And I definitely don't need IDs for almost all of std.
For the specific examples you cited, yes I would want all power-of-two sized integer types signed and unsigned (and some non-power-of-two, but not the whole 64k spectrum, that'd be a little crazy). Var and const pointers and slices for any used type, yes, arrays of varying lengths yes, optionals definitely.
A u8 is definitely not enough to cover my needs, I can easily see needing at least a few thousand IDs to cover all the type variations (working from an "unadorned" set of 200-300 declared types). These would mainly be keys in tables, so even if they were pointer-sized that'd be fine by me. If I want a compressed type ID for sending over the wire or for cutting down on memory usage, I can do that in userspace on top of a builtin @typeId
like you said.
(Then again, maybe this is more of an ergonomics feature than a performance-oriented one?)
This is definitely about ergonomics and "work scaling" (as in, scaling how many people are working on a codebase and reducing LOC needed to implement functionality) for me. A chonky type ID is the cost of doing business, so if it's straightforward to just use an ID from the InternPool and it's always u32 or u64 or w.e., that totally serves my needs. Using something abnormally large like u128 seems excessive (is Rust just using UUIDs? mehhhh) but I'd learn to live with it.
The biggest deal to me is ensuring it's consistent between comptime and runtime. With the current method of taking the address of storage you have to add a wrapper function to not mess that up, it can be tricky if you're not just directly using someone else's userland typeId
.
For ref, here's how I handle keeping runtime and comptime consistent (the type id stays a pointer, and I have an extra enum type to convert to/from for smuggling purposes):
pub const TypeID = *const Type;
pub const TypeIntID = enum(usize) {
invalid = 0,
_,
pub fn from(tid: TypeID) @This() {
return @enumFromInt(@intFromPtr(tid));
}
pub fn toTypeID(self: @This()) TypeID {
return @ptrFromInt(@intFromEnum(self));
}
};
pub const Type = opaque {
pub const id: fn (comptime T: type) TypeID = struct {
inline fn tid(comptime T: type) TypeID {
const TypeIDSlot = struct {
var slot: u8 = undefined;
comptime {
_ = T;
}
};
return @ptrCast(&TypeIDSlot.slot);
}
fn typeID(comptime T: type) TypeID {
return comptime tid(T);
}
}.typeID;
pub fn toIntId(self: *const Type) TypeIntID {
return TypeIntID.from(self);
}
};
I've been using it for a while, I think InK or Vexu helped me arrive at this based on ikskuh's typeId.
The return value must not be consistent inbetween builds, so a second build might return completly different numbers for the same types.
Is the intention that the returned value is comptime or runtime? If the former, this essentially introduces undeterminism in the type system. I'd suggest to make the ID at least stable between compilations of the same source code, to ensure reproducibility of builds, and not forcibly require randomness here for debugging purposes.
@sno2 Using RLS here is a bit tricky. There are two options:
- The same integer value is returned regardless of the type. In this case, that integer must fit in some minimum size (e.g.
u32
), and so there is no difference to just returning that size integer!- The integer value differs based on the result type. I presume this is what you intend. The issue here is that it's a bit of a footgun; if you accidentally pass the result type as
u32
and upcast tou64
, or vice versa, you might accidentally get a different ID to a different part of your codebase! This would probably lead to quite tricky bugs.If this proposal is accepted, I definitely think the returned integer should have a fixed size (probably 32 bits). 32 bits is a sweet spot: 16 is too few to represent the number of types which might exist in a large application, but 64 (or usize) is very excessive (in fact, the canonical compiler implementation won't let you have more than 2^32 distinct types today, and I am beyond certain nobody will ever hit this limitation).
But what about microcontrollers, for example 8 bit? usize is universal.
@sibkit
I dare you to measure it, less than usize will not stand out
Ehhhh I'd be careful about the absolute language here. 64bit vs 32bit identifiers can totally show up in memory profiles depending on how they're used.
Common memory optimization pattern: given two 32bit identifiers comprising some sort of combined category+id tag value, if you are reasonably certain you can mask & shift them onto a single 32bit field, and you have tens of thousands of instances of structs containing one or more of these, you're looking at multiple megabytes of memory saved. I care about saving multiple megabytes, and I'm not even writing for microcontrollers. Sometimes you need to save memory, and good targets aren't even the things that take up the most memory overall, but can give you a short enough haircut to meet your target without having to suffer other tradeoffs.
If we know the type ID value space is dense and predictably fills up from the bottom, those kinds of optimizations are possible even if the width is usize. But I'd be inclined to treat the value space here as a black box unless somebody on the core team says otherwise.
But what about microcontrollers, for example 8 bit? usize is universal.
usize
on AVR should be 32 bit, while register types would be 8 bit and @sizeOf(*void)
should be 16 bit
After some extra thinking, Im completelt against possible nondeterminism at compile time. For a runtime value I think that it's fine provided that the result is stable across compiler runs (ideally even across different computers).
It would be fine if the ID can be spec'd somehow, but that would probably be quite hard / can be done just as well in user code. I suggest to make the return value runtime instead to avoid that issue.
@Snektron Making the return value of a this builtin runtime-only would hinder major use cases for this feature which are currently already possible via userland hacks.
My need for this feature is to be able to store type IDs at comptime while building lookup structures and then use those values later to cross-reference types and as keys in hashtables. Ideally I'd also like to be able to create lookup tables at comptime.
If this restriction could be: the actual typeID values can be stored and compare for equality but are otherwise opaque until runtime, that's a restriction I could live with (that's the status quo for my current typeID hack), provided the stored typeIDs are fixed-up in the final stage of compilation to avoid the case where typeID equality stays consistent between comptime and runtime. It's easy to break that property currently with the userland hacks if you aren't careful.
I can understand not wanting non-deterministic compilation. Where would the non-determinism come from when allowing these values to exist at comptime?
If the value is determinstic across runs, compiler versions, etc, I dont have a problem with it. In Auguste's proof of concept, the value is derived from a compiler-internal type database key, and is depended on the order that the compiler processes the code. This is an implementation detail. My main concern is that when the compiler is parallelized further, this processing becomes unstable.
@ikskuh pointed out to me that intFromError has the same issue (and this is also what makes the current hack work), and I think that's a problem too.
That makes sense. I guess incremental compilation would also introduce non-determinism. I wouldn't really care about IDs shifting around between subsequent builds of the same source, but like I said I can see how that's a desirable property. I wonder how many uses of @typeName
would have the same issue.
What about the constraint I mentioned: typeIDs can be checked for equality at comptime, but otherwise can't be observed or compared--including logging out the std.fmt representation, which could just be an opaque thing like "TypeID(comptime)"
. Equality is the primary thing that matters, as well as knowing that if I store one that the equality will stay consistent between comptime and runtime. Sort of similar to how @intFromPtr
currently works.
That would be a big loss since it would mean we can't hash them at comptime, but it would at least provide a "blessed" version of the current hacks which wouldn't be at risk of breaking since it would be an explicit language feature.
@Snektron Making the return value of a this builtin runtime-only would hinder major use cases for this feature which are currently already possible via userland hacks.
...
If this restriction could be: the actual typeID values can be stored and compare for equality but are otherwise opaque until runtime, that's a restriction I could live with (that's the status quo for my current typeID hack), provided the stored typeIDs are fixed-up in the final stage of compilation to avoid the case where typeID equality stays consistent between comptime and runtime. It's easy to break that property currently with the userland hacks if you aren't careful.
I really love that. It has the best of both worlds:
- We get comptime values that are not completely useless
- We have stable runtime values (Compiler can just do a single sort post sema)
If this restriction could be: the actual typeID values can be stored and compare for equality but are otherwise opaque until runtime
type
already supports ==
?
runtime-only means this fits better in the stdlib (if anywhere) and not a builtin
I need to compare and store those identifiers into const decls at comptime and then compare them later at runtime, so comparing type equality doesn't work. I need a consistent correlation that goes across the comptime/runtime boundary.
At first sight Rust's approach of using u128
s with a cryptographic generation (basically uuids) might seem overkill. But there are also cases where you might want this:
Suppose you are writing a ECS library which is later linked to your application. If the library keeps track of its Data with comptime generated ids we are going to have a problem: there might be internal structs with a type id in the library which the compiler can not know of leading to ambigous type ids. In This case UUIDs might come in handy since it's way less likely of getting a type collision.
Though I would still choose another approach:
My idea is to add in fact 3 more builtin functions:
1. @genericId()
This one works in combination with a generic function (generator function). It should result in each version of this generic function getting a different Id. e.g.
pub fn type_id(comptime T: type) usize {
return @genericId();
}
assert(type_id(u32) != type_id(*const bool))
assert(type_id(u32) == type_id(u32))
this may expand to
pub fn type_id(u32) usize {
return 0;
}
pub fn type_id(*const bool) usize {
return 1;
}
In regard of the return type, the amount of different versions version_count
of the generic function can be counted and the return type is infered
with a minimum of log2 version_count
of bits. This way it is possible to have low memory usage of type ids if you work with embeded systems and you need only a few type ids.
Furthermore this allows type ids to be scoped. Meaning the generator function describes the meaning of the type id and only this generator function should be used for this single purpose.
// type_id generator function added in the std
std.builtin.type_id(u32);
// this may be the same value since they implement two different generator functions
myLib.type_id(bool);
// with
const myLib = struct {
// use u8 for my very efficient embeded library
pub fn type_id(comptime T: type) u8 {
return @genericId();
}
};
If any other generator funtion is used there will be an overlap, but this is the users fault. If you want a global (default scoped) type id, just add a generator function to the std and use this.
2 @useVal(comptime t: anytype)
This one might not be necessary but if generic funtions are lazy and do not generate different versions if T
of the generator function is not used or if T
is disgarded via _ = T
then @useVal(T);
would force the generic function to generate a version for each different T
which is passed to it.
Hence the real implementation would look like
pub fn type_id(comptime T: type) usize {
@useVal(T);
return @genericId();
}
Still, this is only needed if this does not work
pub fn type_id(comptime T: type) usize {
_ = T;
return @genericId();
}
3. @UUID()
Afaik we still have no possibility of generating uuids at compile time and this would come in handy in some ways (e.g. if you want to be pretty sure there are no type collisions with external libraries, if you need an random number at comptime, etc.). This builtin function should work again via inline but can also be used outside of generic functions to create a uuid without any type context and there need to be no scopes since they would not matter anyways. In terms of the return type i am still not sure which one would suit it the best but in my examples i am just using u128
.
pub fn type_uuid(comptime T: type) u128 {
@useVal(T);
return @UUID();
}
const program_uuid = @UUID();
assert(type_uuid(u42) != type_uuid(i37));
assert(type_uuid(u16) == type_uuid(u16));
Now to solve the hypthetical issue of the linked ECS library you could either use the uuid approach or expose a type id generator function implementation in your zig binding to that library which adds the amount of internal types to the type id value to avoid any collision.
A last thing which would be nice to have, is a way to figure out how many versions of a generic function are actually generated.
I know adding 3 new builtin functions for a single problem would be a little bit over the top, though i hope this idea serves to find an optimal solution while maintining the philosophy of zig:
* Only one obvious way to do things.
* Incremental improvements.
* Reduce the amount one must remember.
* Focus on code rather than style.
...
Generating type ids using hash functions is definitely the way to go, because we get a stable interface between platforms. So, for example, if you export your library as a dynamic one, you can make an interface that provides a type id, and these type ids will match between different libraries.
Here is the code I used:
fn typeId(comptime T: type) u128 {
const Type = struct {
const id: u128 = result: {
const a = std.hash.Wyhash.hash(3832269059401244599, @typeName(T));
const b = std.hash.Wyhash.hash(5919152850572607287, @typeName(T));
break :result @bitCast([2]u64{ a, b });
};
};
return Type.id;
}