zig icon indicating copy to clipboard operation
zig copied to clipboard

Add new builtin: @typeId

Open ikskuh opened this issue 9 months ago • 29 comments

Add a new builtin called @typeId:

@typeId(comptime T: type) usize

This builtin returns a unique integer for each type passed, and will return the same integer for the same type.

The return value must not be consistent inbetween builds, so a second build might return completly different numbers for the same types.

An alternative variant might return u32 or u64 to have a stable interface between different platforms.

Use cases

  • Type erasure
  • Additional programmatic type safety
  • Variadic in-memory serialization

Prior art:

User-land implementation

The following version is runtime only, as we can't perform intFromPtr at compiletime:

fn typeId(comptime T: type) TypeId {
    const Tag = struct {
        var name: u8 = @typeName(T)[0]; // must depend on the type somehow!
        inline fn id() TypeId {
            return @enumFromInt(@intFromPtr(&name));
        }
    };
    return Tag.id();
}

ikskuh avatar May 04 '24 11:05 ikskuh

One small request: it would be really nice if this returned u32 or another smaller integer type, instead of usize

silversquirl avatar May 04 '24 17:05 silversquirl

It could also take inspiration from the @intFromEnum behavior and return an integer denoted as anytype which may have the smallest integer type possible. This leaves room for shrinking the return type in the future because I'm not sure how/if this would affect incremental story.

sno2 avatar May 04 '24 18:05 sno2

There is actually an at-comptime solution for typeId that works today but it is absolutely horrible:

fn typeId(comptime T: type) u32 {
    return @intFromError(@field(anyerror, @typeName(T)));
}

Bad status quo solutions have helped back changes such as this one before so just wanted to share. :)

SuperAuguste avatar May 04 '24 18:05 SuperAuguste

@sno2 Using RLS here is a bit tricky. There are two options:

  • The same integer value is returned regardless of the type. In this case, that integer must fit in some minimum size (e.g. u32), and so there is no difference to just returning that size integer!
  • The integer value differs based on the result type. I presume this is what you intend. The issue here is that it's a bit of a footgun; if you accidentally pass the result type as u32 and upcast to u64, or vice versa, you might accidentally get a different ID to a different part of your codebase! This would probably lead to quite tricky bugs.

If this proposal is accepted, I definitely think the returned integer should have a fixed size (probably 32 bits). 32 bits is a sweet spot: 16 is too few to represent the number of types which might exist in a large application, but 64 (or usize) is very excessive (in fact, the canonical compiler implementation won't let you have more than 2^32 distinct types today, and I am beyond certain nobody will ever hit this limitation).

mlugg avatar May 04 '24 19:05 mlugg

@sno2 Using RLS here is a bit tricky. There are two options:

  • The same integer value is returned regardless of the type. In this case, that integer must fit in some minimum size (e.g. u32), and so there is no difference to just returning that size integer!

I was more thinking of the compiler counting how many types we have defined and log2 that into an integer type as the TypeId integer type. Although, this is now seeming like a hard task with many different edge cases depending on when you call @typeId so I think using u32 as you said is the best option.

Also, ~~possibly useless~~ side note but Rust's type id uses a u128 but I wasn't able to find any reasoning or investigations into shrinking it anywhere.

sno2 avatar May 04 '24 20:05 sno2

32 bits is a sweet spot: 16 is too few to represent the number of types which might exist in a large application, [...]

I can't think of a use case I would ever use this builtin for (it's a bit at odds with my fundamental design philosophy), but for everyone here who seems to have use cases: Do you expect to use it for all/most types that appear anywhere (including intermediately) in your entire program? Think of every

  • integer type, signed and unsigned,
  • enum, union
  • pointer with and without mutability, for every alignment
  • arrays with different lengths
  • optional

That multiplies into a really big number.

I would expect many use cases actually only use this builtin when serializing a rather small set of types (and maybe their fields' types, recursively) over particular interfaces. Therefore it might be less work for the compiler to only assign ids to types that have been passed to @typeId. Doing that, maybe a u8 would be enough for some use cases, and we can guarantee only code which uses this feature "pays" for it (assuming this isn't something we already get for basically free via the intern pool - which it might be).

(Then again, maybe this is more of an ergonomics feature than a performance-oriented one? Plus, deduplicating ids in userland is also possible, even if it poses some of the same global-type-list challenges this proposal fundamentally tries to move into the compiler.)

rohlem avatar May 05 '24 10:05 rohlem

Could someone give a solid use-case for this, I have nothing in my head. And never came across situation I need this even remotely.

likern avatar May 05 '24 12:05 likern

The one reason I've wanted it in the past is for safety on *anyopaque. Assuming @typeName is guaranteed to be unique, it can be used instead.

const AnyPtr = struct {
    type_id: usize, // alternatively, [*:0]u8
    ptr: *anyopaque,
    pub fn from(item: AnyPtr, comptime T: type, ptr: *T) AnyPtr {
        return .{
            .type_id = @typeId(T), // alternatively @typeName(T).ptr
            .ptr = @ptrCast(@alignCast(ptr)),
        };
    }
    pub fn readAs(item: AnyPtr, comptime T: type) *T {
        if(item.type_id != @typeId(T)) unreachable; // alternatively `item.type_id != @typeName(T).ptr`
        return @ptrCast(@alignCast(item.ptr));
    }
};

pfgithub avatar May 05 '24 17:05 pfgithub

Could someone give a solid use-case for this, I have nothing in my head. And never came across situation I need this even remotely.

Basically type checking (see linked any-pointer project) when doing type erasure, then see #19859 where you need to store a user-defined type in a non-generic datastructure (think HashMap(TypeId, *anyopaque), you can implement RTTI with, ...

ikskuh avatar May 05 '24 17:05 ikskuh

As @MasterQ32 indicated in the original issue, his any-pointer is a great example of an exact usecase, but another explicit example of where @typeId would also be useful is in an ECS. The hack I shared above was actually created while attempting to solve an enum identification system with @slimsag where values could be detached and reattached from their respective enum types to identify components and events and store their identities easily.

Of course, all of these issues can be solved with userspace hacks but:

  • The runtime tricks are of course runtime-only, which didn't solve the ECS problem above
  • The comptime tricks are all awful hacks
  • The runtime tricks are also hacky, albeit less awful

To not use any hacks while obtaining unique type identifiers, you can do something like this, but:

  • It would require passing it to all code (perhaps this is acceptable considering Zig's dislike of global state)
  • It requires tying all code that requires shared type ID together, including dependencies/dependents, which is a huge pain if you care about extensibility (and I'm not sure if it's even possible as you can't store this comptime-only type globally and I can't think of a way to effectively pass it around to dependencies/dependents 😅)
  • The performance/memory usage is extremely questionable, especially considering (unless this has changed recently) that the compiler does not garbage collect unused data

In my opinion, any sort of RTTI-ish solution would greatly benefit from this builtin. I imagine Felix sees it the same way, thus why he opened this issue.

About implementation details @rohlem, check out my PR to see how easy it is to implement from the InternPool. In short, the InternPool stores types (and other deduplication-dependent data like default values, memoized calls, etc. though this is not important for this explanation) by inserting them into a std.AutoArrayHashMapUnmanaged(void, void) which produces a single, unique InternPool.Index (a u32-backed enum) which we can then reuse for @typeId. If, understandably, the compiler folks wouldn't like exposing InternPool indices directly, we could simply create a second map, a std.AutoArrayHashMapUnmanaged(InternPool.Index, void), which would also be relatively inexpensive.

SuperAuguste avatar May 05 '24 18:05 SuperAuguste

I think #5459 is directly related (and solved by #19861).

yzrmn avatar May 06 '24 12:05 yzrmn

@rohlem Do you expect to use it for all/most types that appear anywhere (including intermediately) in your entire program? Think of every

* integer type, signed and unsigned,

* enum, union

* pointer with and without mutability, for every alignment

* arrays with different lengths

* optional

This made me wonder about how the technical implementation would solve something like this (I'm pretending like @typeId just returns a usize for convenience here. You could insert any necessary @intFromEnum or whatever to make it work)

fn SelfReferentialStruct(comptime T: type) type {
    return struct {
        const Self = @This();
        const array: [@typeId(Self)] u32 = undefined;
    };
}

edit: Nevermind. I just checked and there already is a check for similar transitive failures in the compiler. :)

greytdepression avatar May 06 '24 15:05 greytdepression

Am I correct that the idea os basically to split pointer to T to void pointer and type separately, and to identify type to use it's unique identifier?

If that's correct, that's very interesting feature. But I would like to extend it even further. If it's stored separately, we can save this information to disk and restore back. But only if we have stable guarantee not only within one build.

Do I'd to take into account this feature too with this proposal. Where this might be useful? I think in databases, where there information about types is stored on disk, like in PostgreSQL where is used Oid type which uniquely identifies almost any object in database - types, attributes, tables, etc.

likern avatar May 06 '24 17:05 likern

After sharing my first terrible comptime typeId hack, I'm back for more:

pub fn typeId(comptime T: type) u32 {
    _ = T;
    const fn_name = @src().fn_name;
    return std.fmt.parseInt(u32, fn_name[std.mem.lastIndexOfScalar(u8, fn_name, '_').? + 1 ..], 10) catch unreachable;
}

This one even exposes the InternPool.Index of the memoized call - enjoy! :^) (this one is runtime only though :()

SuperAuguste avatar May 14 '24 08:05 SuperAuguste

stable guarantee not only within one build

This is not practical or even really possible. You can already assign explicit IDs to types manually, through a variety of methods, which is a much better option for serialization usecases.

silversquirl avatar May 14 '24 08:05 silversquirl

@rohlem:

Do you expect to use it for all/most types that appear anywhere (including intermediately) in your entire program?

I don't necessarily care about having an ID for every single type in the program, because there are also going to be lots of comptime utility types for which it's not necessarily helpful to have an ID for. And I definitely don't need IDs for almost all of std.

For the specific examples you cited, yes I would want all power-of-two sized integer types signed and unsigned (and some non-power-of-two, but not the whole 64k spectrum, that'd be a little crazy). Var and const pointers and slices for any used type, yes, arrays of varying lengths yes, optionals definitely.

A u8 is definitely not enough to cover my needs, I can easily see needing at least a few thousand IDs to cover all the type variations (working from an "unadorned" set of 200-300 declared types). These would mainly be keys in tables, so even if they were pointer-sized that'd be fine by me. If I want a compressed type ID for sending over the wire or for cutting down on memory usage, I can do that in userspace on top of a builtin @typeId like you said.

(Then again, maybe this is more of an ergonomics feature than a performance-oriented one?)

This is definitely about ergonomics and "work scaling" (as in, scaling how many people are working on a codebase and reducing LOC needed to implement functionality) for me. A chonky type ID is the cost of doing business, so if it's straightforward to just use an ID from the InternPool and it's always u32 or u64 or w.e., that totally serves my needs. Using something abnormally large like u128 seems excessive (is Rust just using UUIDs? mehhhh) but I'd learn to live with it.

The biggest deal to me is ensuring it's consistent between comptime and runtime. With the current method of taking the address of storage you have to add a wrapper function to not mess that up, it can be tricky if you're not just directly using someone else's userland typeId.

michaelbartnett avatar May 19 '24 18:05 michaelbartnett

For ref, here's how I handle keeping runtime and comptime consistent (the type id stays a pointer, and I have an extra enum type to convert to/from for smuggling purposes):

pub const TypeID = *const Type;

pub const TypeIntID = enum(usize) {
    invalid = 0,
    _,

    pub fn from(tid: TypeID) @This() {
        return @enumFromInt(@intFromPtr(tid));
    }

    pub fn toTypeID(self: @This()) TypeID {
        return @ptrFromInt(@intFromEnum(self));
    }
};

pub const Type = opaque {
    pub const id: fn (comptime T: type) TypeID = struct {
        inline fn tid(comptime T: type) TypeID {
            const TypeIDSlot = struct {
                var slot: u8 = undefined;
                comptime {
                    _ = T;
                }
            };
            return @ptrCast(&TypeIDSlot.slot);
        }

        fn typeID(comptime T: type) TypeID {
            return comptime tid(T);
        }
    }.typeID;

    pub fn toIntId(self: *const Type) TypeIntID {
        return TypeIntID.from(self);
    }
};

I've been using it for a while, I think InK or Vexu helped me arrive at this based on ikskuh's typeId.

michaelbartnett avatar May 19 '24 19:05 michaelbartnett

The return value must not be consistent inbetween builds, so a second build might return completly different numbers for the same types.

Is the intention that the returned value is comptime or runtime? If the former, this essentially introduces undeterminism in the type system. I'd suggest to make the ID at least stable between compilations of the same source code, to ensure reproducibility of builds, and not forcibly require randomness here for debugging purposes.

Snektron avatar May 27 '24 06:05 Snektron

@sno2 Using RLS here is a bit tricky. There are two options:

  • The same integer value is returned regardless of the type. In this case, that integer must fit in some minimum size (e.g. u32), and so there is no difference to just returning that size integer!
  • The integer value differs based on the result type. I presume this is what you intend. The issue here is that it's a bit of a footgun; if you accidentally pass the result type as u32 and upcast to u64, or vice versa, you might accidentally get a different ID to a different part of your codebase! This would probably lead to quite tricky bugs.

If this proposal is accepted, I definitely think the returned integer should have a fixed size (probably 32 bits). 32 bits is a sweet spot: 16 is too few to represent the number of types which might exist in a large application, but 64 (or usize) is very excessive (in fact, the canonical compiler implementation won't let you have more than 2^32 distinct types today, and I am beyond certain nobody will ever hit this limitation).

But what about microcontrollers, for example 8 bit? usize is universal.

sibkit avatar Jul 15 '24 21:07 sibkit

@sibkit

I dare you to measure it, less than usize will not stand out

Ehhhh I'd be careful about the absolute language here. 64bit vs 32bit identifiers can totally show up in memory profiles depending on how they're used.

Common memory optimization pattern: given two 32bit identifiers comprising some sort of combined category+id tag value, if you are reasonably certain you can mask & shift them onto a single 32bit field, and you have tens of thousands of instances of structs containing one or more of these, you're looking at multiple megabytes of memory saved. I care about saving multiple megabytes, and I'm not even writing for microcontrollers. Sometimes you need to save memory, and good targets aren't even the things that take up the most memory overall, but can give you a short enough haircut to meet your target without having to suffer other tradeoffs.

If we know the type ID value space is dense and predictably fills up from the bottom, those kinds of optimizations are possible even if the width is usize. But I'd be inclined to treat the value space here as a black box unless somebody on the core team says otherwise.

michaelbartnett avatar Jul 16 '24 01:07 michaelbartnett

But what about microcontrollers, for example 8 bit? usize is universal.

usize on AVR should be 32 bit, while register types would be 8 bit and @sizeOf(*void) should be 16 bit

ikskuh avatar Jul 16 '24 07:07 ikskuh

After some extra thinking, Im completelt against possible nondeterminism at compile time. For a runtime value I think that it's fine provided that the result is stable across compiler runs (ideally even across different computers).

It would be fine if the ID can be spec'd somehow, but that would probably be quite hard / can be done just as well in user code. I suggest to make the return value runtime instead to avoid that issue.

Snektron avatar Jul 21 '24 13:07 Snektron

@Snektron Making the return value of a this builtin runtime-only would hinder major use cases for this feature which are currently already possible via userland hacks.

My need for this feature is to be able to store type IDs at comptime while building lookup structures and then use those values later to cross-reference types and as keys in hashtables. Ideally I'd also like to be able to create lookup tables at comptime.

If this restriction could be: the actual typeID values can be stored and compare for equality but are otherwise opaque until runtime, that's a restriction I could live with (that's the status quo for my current typeID hack), provided the stored typeIDs are fixed-up in the final stage of compilation to avoid the case where typeID equality stays consistent between comptime and runtime. It's easy to break that property currently with the userland hacks if you aren't careful.

I can understand not wanting non-deterministic compilation. Where would the non-determinism come from when allowing these values to exist at comptime?

michaelbartnett avatar Jul 21 '24 16:07 michaelbartnett

If the value is determinstic across runs, compiler versions, etc, I dont have a problem with it. In Auguste's proof of concept, the value is derived from a compiler-internal type database key, and is depended on the order that the compiler processes the code. This is an implementation detail. My main concern is that when the compiler is parallelized further, this processing becomes unstable.

@ikskuh pointed out to me that intFromError has the same issue (and this is also what makes the current hack work), and I think that's a problem too.

Snektron avatar Jul 21 '24 16:07 Snektron

That makes sense. I guess incremental compilation would also introduce non-determinism. I wouldn't really care about IDs shifting around between subsequent builds of the same source, but like I said I can see how that's a desirable property. I wonder how many uses of @typeName would have the same issue.

What about the constraint I mentioned: typeIDs can be checked for equality at comptime, but otherwise can't be observed or compared--including logging out the std.fmt representation, which could just be an opaque thing like "TypeID(comptime)". Equality is the primary thing that matters, as well as knowing that if I store one that the equality will stay consistent between comptime and runtime. Sort of similar to how @intFromPtr currently works.

That would be a big loss since it would mean we can't hash them at comptime, but it would at least provide a "blessed" version of the current hacks which wouldn't be at risk of breaking since it would be an explicit language feature.

michaelbartnett avatar Jul 21 '24 16:07 michaelbartnett

@Snektron Making the return value of a this builtin runtime-only would hinder major use cases for this feature which are currently already possible via userland hacks.

...

If this restriction could be: the actual typeID values can be stored and compare for equality but are otherwise opaque until runtime, that's a restriction I could live with (that's the status quo for my current typeID hack), provided the stored typeIDs are fixed-up in the final stage of compilation to avoid the case where typeID equality stays consistent between comptime and runtime. It's easy to break that property currently with the userland hacks if you aren't careful.

I really love that. It has the best of both worlds:

  • We get comptime values that are not completely useless
  • We have stable runtime values (Compiler can just do a single sort post sema)

ikskuh avatar Jul 21 '24 20:07 ikskuh

If this restriction could be: the actual typeID values can be stored and compare for equality but are otherwise opaque until runtime

type already supports == ? runtime-only means this fits better in the stdlib (if anywhere) and not a builtin

nektro avatar Jul 21 '24 20:07 nektro

I need to compare and store those identifiers into const decls at comptime and then compare them later at runtime, so comparing type equality doesn't work. I need a consistent correlation that goes across the comptime/runtime boundary.

michaelbartnett avatar Jul 21 '24 21:07 michaelbartnett

At first sight Rust's approach of using u128s with a cryptographic generation (basically uuids) might seem overkill. But there are also cases where you might want this: Suppose you are writing a ECS library which is later linked to your application. If the library keeps track of its Data with comptime generated ids we are going to have a problem: there might be internal structs with a type id in the library which the compiler can not know of leading to ambigous type ids. In This case UUIDs might come in handy since it's way less likely of getting a type collision. Though I would still choose another approach: My idea is to add in fact 3 more builtin functions:

1. @genericId()

This one works in combination with a generic function (generator function). It should result in each version of this generic function getting a different Id. e.g.

pub fn type_id(comptime T: type) usize {
	return @genericId();
}

assert(type_id(u32) != type_id(*const bool))
assert(type_id(u32) == type_id(u32))

this may expand to

pub fn type_id(u32) usize {
	return 0;
}
pub fn type_id(*const bool) usize {
	return 1;
}

In regard of the return type, the amount of different versions version_count of the generic function can be counted and the return type is infered with a minimum of log2 version_count of bits. This way it is possible to have low memory usage of type ids if you work with embeded systems and you need only a few type ids. Furthermore this allows type ids to be scoped. Meaning the generator function describes the meaning of the type id and only this generator function should be used for this single purpose.

// type_id generator function added in the std
std.builtin.type_id(u32);
// this may be the same value since they implement two different generator functions
myLib.type_id(bool);

// with
const myLib = struct {
	// use u8 for my very efficient embeded library
	pub fn type_id(comptime T: type) u8 {
		return @genericId();	
	}
};

If any other generator funtion is used there will be an overlap, but this is the users fault. If you want a global (default scoped) type id, just add a generator function to the std and use this.

2 @useVal(comptime t: anytype)

This one might not be necessary but if generic funtions are lazy and do not generate different versions if T of the generator function is not used or if T is disgarded via _ = T then @useVal(T); would force the generic function to generate a version for each different T which is passed to it. Hence the real implementation would look like

pub fn type_id(comptime T: type) usize {
	@useVal(T);
	return @genericId();
}

Still, this is only needed if this does not work

pub fn type_id(comptime T: type) usize {
	_ = T;
	return @genericId();
}

3. @UUID()

Afaik we still have no possibility of generating uuids at compile time and this would come in handy in some ways (e.g. if you want to be pretty sure there are no type collisions with external libraries, if you need an random number at comptime, etc.). This builtin function should work again via inline but can also be used outside of generic functions to create a uuid without any type context and there need to be no scopes since they would not matter anyways. In terms of the return type i am still not sure which one would suit it the best but in my examples i am just using u128.

pub fn type_uuid(comptime T: type) u128 {
	@useVal(T);
	return @UUID();
}

const program_uuid = @UUID();
assert(type_uuid(u42) != type_uuid(i37));
assert(type_uuid(u16) == type_uuid(u16));

Now to solve the hypthetical issue of the linked ECS library you could either use the uuid approach or expose a type id generator function implementation in your zig binding to that library which adds the amount of internal types to the type id value to avoid any collision.

A last thing which would be nice to have, is a way to figure out how many versions of a generic function are actually generated.

I know adding 3 new builtin functions for a single problem would be a little bit over the top, though i hope this idea serves to find an optimal solution while maintining the philosophy of zig:

* Only one obvious way to do things.
* Incremental improvements.
* Reduce the amount one must remember.
* Focus on code rather than style.
...

Emmmmllll avatar Aug 02 '24 14:08 Emmmmllll

Generating type ids using hash functions is definitely the way to go, because we get a stable interface between platforms. So, for example, if you export your library as a dynamic one, you can make an interface that provides a type id, and these type ids will match between different libraries.

Here is the code I used:

fn typeId(comptime T: type) u128 {
    const Type = struct {
        const id: u128 = result: {
            const a = std.hash.Wyhash.hash(3832269059401244599, @typeName(T));
            const b = std.hash.Wyhash.hash(5919152850572607287, @typeName(T));
            break :result @bitCast([2]u64{ a, b });
        };
    };
    return Type.id;
}

engusmaze avatar Sep 22 '24 08:09 engusmaze