godot Add BitMap methods for bitwise boolean operations

Created methods for BitMap for boolean operators AND, OR, XOR, and NOT.

The methods for the binary operators (AND, OR, XOR) all have a parameter "other" for the secondary bitmap (the one being compared to). ~~Additionally, there is a BitwiseSizeMode enum which controls the size of the new bitmap being made, in the case that the bitmaps are not the same size.~~ (Edit: For performance reasons, this proposed size mode implementation was scrapped.) The method bitwise_not is unary, and thus does not have any parameters.

This is an uninstrusive QOL change for handling bitmaps. Combining and operating on bitmaps is a common operation in many games that use them. I, for one, am working on a puzzle game where geometry is generated from bitmaps you edit. Ordinarily, comparing two bitmaps can be tedious, usually involving some sort of nested double for loop. This can obviously bloat your game's codebase. The new methods in this PR will make combining bitmaps much simpler for developers.

Bugsquad edit, closes: https://github.com/godotengine/godot-proposals/issues/9609

Apr 24 '24 01:04 basilicon

Thanks for opening a pull request :slightly_smiling_face:

Feature pull requests should be associated to a feature proposal to justify the need for implementing the feature in Godot core. Please open a proposal on the Godot proposals repository (and link to this pull request in the proposal's body).

Apr 26 '24 22:04 Calinou

Alright, I just opened a proposal and linked this PR. Sorry; it's my first time committing to this engine; I've worked on another engine before but it was niche and the organizer could personally review each PR. I have no idea what is going on with these failed test cases. None of them seem relevant to BitMap, which is the only thing which I've changed.

Apr 27 '24 08:04 basilicon

Okiedokie, now, it would be nice to squash your commits! See the Pull Request Workflow.

Jul 09 '24 18:07 Mickeon

OK, I just squashed the commits. I think I did the right thing, but this is my first time doing this, so I may have screwed it up.

Jul 22 '24 20:07 basilicon

Please remove unnecessary information from the squashed message and change the title to match that of the PR, things like steps done in review isn't relevant for the message, only things like what it does

Jul 23 '24 08:07 AThousandShips

I'm not entirely clear why this works on 8 bits at a time instead of 32 or 64 (like most non-compressed bit set implementations I've seen). Is this a technical limitation of BitMap? If there's an advantage to working with 8 bits instead of 1 bit, then not only would working on 32 or 64 bits be an extra advantage, it also can benefit from word alignment.

Aug 02 '24 20:08 tommyettinger

Currently, the implementation of BitMap uses a uint8_t array and not an int (i.e. 32 or 64 bits) array. I imagine that when it was made initially, they used uint8_t with the intention to optimize a tiny bit of memory, but I'm not even sure that works on most architectures (as far as word alignment goes).

~~There isn't any reason I didn't do 32 or 64 bits, and I agree it would be optimal, but I would have to refactor most of bit_map.cpp and bit_map.h to support that instead of the 8 bit implementation.~~

Edit: Actually, I think I can do 32 or 64 bits. I'll try it later tonight.

Aug 02 '24 23:08 basilicon

Unless changes to the bit width is strictly involving the access methods the change from uint8_t to uint32_t or uint64_t is entirely out of scope for this PR, and would have to be done carefully and would warrant a separate discussion

Aug 04 '24 20:08 AThousandShips

No, I think I can cast the pointer to a uint64_t so I can access the uint8_t vector like a uint64_t one. Then, I can just use my old method for the leftover bits (since the length of the uint8_t vector isn't guaranteed to be divisible by 8).

Aug 05 '24 04:08 basilicon

Good idea, that's a trick I have seen often in SMHasher's C code that tests hashing functions. Those functions receive effectively an array of bytes, but operate on it internally as the largest word size they can use efficiently. Endian-ness is a frequent concern in that space, though. I don't know how Godot handles big-endian vs. little-endian architectures. Bit sets/arrays/vectors definitely seem like a great candidate for the approach you're using, since a bulk operation that ANDs a pair of 64-bit blocks of bits will likely take much less time than eight separate operations to AND eight pairs of 8-bit blocks. And it's perfectly fine for it to be internal, since only bulk operations like these have a real benefit from it, and this may cover all the relevant bulk operations.

Aug 05 '24 05:08 tommyettinger

Does anything else need to happen before we can merge?

Aug 14 '24 16:08 basilicon

It'd need a good review by the core team, but otherwise it's mostly that we're waiting for the new release cycle following 4.3

Aug 14 '24 16:08 AThousandShips

I'm considering adding an in-place option for each of the new methods, which defaults as false. Right now, it always creates a new bitmap of the correct size, which takes up some memory and time to allocate. Most of the time, this will be fine, but for very large bitmaps, this could be a little costly, especially if you use several boolean operations in sequence. I think it would be useful to have the option to just operate on the current bitmap in-place, which would save that extra memory and time.

Oct 09 '24 21:10 basilicon

I'd also expect these methods to work in-place, given this use case can be performance-critical. If you need to create a new BitMap as a copy of the current BitMap, you can do it before calling the method.

Oct 11 '24 09:10 Calinou

OK, I just modified the behavior so that instead of making a new bitmap, it performs the boolean operations in-place. The original behavior can be achieved by calling .duplicate() before any given operation.

Oct 20 '24 17:10 basilicon

A quick look with godbolt suggests clang with -O3 might already autovectorize byte bitwise operations to SIMD instructions, and handle the leftovers automatically (it's using pxor on xmm registers). As such when you try and do this manually there is the potential it can be making it harder for the autovectorizer.

That's a bit of a nitpick though I'm sure it's a lot faster than it was. :smile: (Note I haven't used SSE for bitwise operations, so I don't know what is available in SSE2 / Neon, the intel manual isn't clear, but worth checking, sounds as if it might be 128 bit from Grok. Ah yes _mm_or_ps was available from SSE1.)

Apr 07 '25 08:04 lawnjelly

A quick look with godbolt suggests clang with -O3 might already autovectorize byte bitwise operations to SIMD instructions, and handle the leftovers automatically (it's using pxor on xmm registers).

I had a feeling that was the case, but i wouldn't expect GCC or more, MSVC to be as insightful. Explicit expansion probably doesn't hurt anyway, at least that was my gut feeling about it.

Apr 07 '25 09:04 Ivorforce

I had a feeling that was the case, but i wouldn't expect GCC or more, MSVC to be as insightful.

Just tried, GCC seems to autovectorize on -O3 and MSVC and clang on -O2 (and above). Notable that GCC doesn't seem to on -O2, I don't know what 4.x is compiled as, I remember there was some problem with -O3 a while back and it could possibly be -O2. (If it is still -O2 we should probably revisit this and see if it can be fixed.)

For ARM, I suspect it will depend what the min spec is for the target and it will probably do the most appropriate instruction.

Apr 07 '25 09:04 lawnjelly

I had a feeling that was the case, but i wouldn't expect GCC or more, MSVC to be as insightful.

Just tried, GCC seems to autovectorize on -O3 and MSVC and clang on -O2 (and above). Notable that GCC doesn't seem to on -O2, I don't know what 4.x is compiled as, I remember there was some problem with -O3 a while back and it could possibly be -O2. (If it is still -O2 we should probably revisit this and see if it can be fixed.)

Here's what we use for GCC/Clang:

https://github.com/godotengine/godot/blob/a210fe6dbd27cc01bffacb0ce2816e0cf303af5e/SConstruct#L785-L797

Official builds are made with optimize=speed_trace for debug builds (editor, template_debug) and optimize=speed for release builds (template_release). So we do use -O3 for release templates.

Apr 07 '25 13:04 akien-mga

Official builds are made with optimize=speed_trace for debug builds (editor, template_debug) and optimize=speed for release builds (template_release). So we do use -O3 for release templates.

That sounds fine to me (as editor not auto-vectorizing is not the end of the world).

Going a little off topic, but related, Grok suggests that using the -ftree_vectorize switch might help turn on vectorization even at -O2 for GCC (this appears to work so we should consider doing for editor builds), and -fopt-info-vec or -fopt-info-vec-missed can show what GCC did or why it failed (can't get these to work yet, flags appear to have changed at GCC 9).

There's also some pragmas to help with this (I wasn't aware of these, maybe they are new):

#pragma simd
#pragma vector always
#pragma ivdep

Apr 07 '25 13:04 lawnjelly

Happy birthday, pull request! 🎂 You've been open one whole year! Old enough to walk and talk!

Apr 24 '25 06:04 tommyettinger

We discussed this PR this morning in the Core meeting. We didn't reach a consensus on merging this. There were a couple factors weighing against merging this PR:

It is adding a bunch of code into the core engine in order to save a few lines of GDScript code. In general, we want to keep core as streamlined as possible to avoid the engine becoming bloating.
The situation of applying the exact same operation on the entire bitmap is kind of niche as many games will need to apply the operation to only a region of the Bitmap, or a masked out area. In both cases they would need to fall back to a loop in scripting anyway
There has been relatively low user demand for this feature since it can just be done in script anyway. The only downside to using a scripting language is performance
If performance is a concern, then this can also be done in a GDExtension

There are a few factors weighing in favour of merging:

This is way faster than doing the loop in GDScript
This is still faster than doing the loop in a GDExtension (most likely)

Ultimately, we agreed that we need to wait and see if there is anyone who is running into performance problems with the current approach (looping in script). The proposal only mentions that writing the full loop in GDScript is tedious and they want to write fewer lines of code. While writing fewer lines of code can be a big benefit, it isn't sufficient reason for us to bloat core. Many contributors are excited about the potential performance benefits from this PR though, so if we can find real use-cases where performance is a significant concern and this PR fully resolves them, then we can reconsider.

May 01 '25 06:05 clayjohn