integration in warm start
pull integration into the constraint solving loop, so solving constraints also integrates the involved bodies when needed. an 'integration responsibility' mask is maintained for this, and the on-demand checks are often avoided by knowing ahead of time that an entire batch has no integration responsibilities at all. the bandwidth associated with integration then vanishes entirely, since all the data was being loaded by the solver anyway; measurements showed it was roughly free for any simulation with a lot of constraints (as expected, since the solver is bandwidth limited)
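a toy sketch of the idea, not the actual bepuphysics2 code: everything here (`Body`, `integrate`, `warm_start_batch`, the 1D state, the gravity value) is made up for illustration. each constraint carries a pair of flags saying whether it is responsible for integrating each of its bodies, and a per-batch flag known ahead of time lets batches with no responsibilities skip the checks entirely.

```python
from dataclasses import dataclass


@dataclass
class Body:
    position: float = 0.0
    velocity: float = 0.0


def integrate(body, dt, gravity=-10.0):
    # Toy 1D semi-implicit Euler step; a stand-in for real pose/velocity integration.
    body.velocity += gravity * dt
    body.position += body.velocity * dt


def warm_start_batch(bodies, constraints, responsibilities, batch_integrates, dt, apply_impulse):
    """constraints: list of (a, b) body indices.
    responsibilities: matching list of (integrate_a, integrate_b) flags.
    batch_integrates: precomputed; False means no constraint in the batch integrates anything."""
    if not batch_integrates:
        # Fast path: no per-constraint responsibility checks at all.
        for a, b in constraints:
            apply_impulse(bodies[a], bodies[b])
        return
    for (a, b), (integrate_a, integrate_b) in zip(constraints, responsibilities):
        # The body data is being touched by the solver anyway, so integrate it here.
        if integrate_a:
            integrate(bodies[a], dt)
        if integrate_b:
            integrate(bodies[b], dt)
        apply_impulse(bodies[a], bodies[b])


# Body 1 shows up in both batches but is only integrated by the first one.
bodies = [Body(), Body(), Body()]
noop = lambda a, b: None  # placeholder for the actual warm start impulse
warm_start_batch(bodies, [(0, 1)], [(True, True)], True, 1.0, noop)
warm_start_batch(bodies, [(1, 2)], [(False, True)], True, 1.0, noop)
```

every body ends up integrated exactly once per step, by whichever constraint sees it first.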
it's partially on-demand, but it operates on the bitmasks that are incrementally maintained for coloring (they represent the body handles that are referenced within a constraint batch; for a given bit, a 1 means the corresponding body is present in the batch).
the on-demand part scans through these and merges the bitmasks together; comparing the new merged mask against the previous mask tells you which bodies are first represented in that batch: https://github.com/bepu/bepuphysics2/blob/master/BepuPhysics/Solver_Solve.cs#L1126 (note that there's actually a bug there i need to fix; the clear should be on the merged handles, not batch 0!)
the scalar fallback can still process 64 bodies at once, while the avx path can handle 256. typically takes microseconds.
couldn't actually multithread the merge phase, since the sync costs were way higher than just slamming through it on one thread.
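a minimal sketch of the merge step, assuming each batch's referenced handles fit in one integer bitmask. `first_appearance_masks` is a hypothetical name, not something from the linked file, and python ints stand in for the 64-bit words (or 256-bit avx vectors) that would be walked in chunks.

```python
def first_appearance_masks(batch_masks):
    """batch_masks[i] has bit h set if body handle h is referenced in batch i.
    Returns one mask per batch with bit h set only in the first batch that references h."""
    merged = 0  # handles seen in any earlier batch
    first_seen = []
    for mask in batch_masks:
        new_merged = merged | mask
        # merged is a subset of new_merged, so XOR leaves exactly the newly added bits.
        first_seen.append(new_merged ^ merged)
        merged = new_merged
    return first_seen


# Three batches over four body handles (bit 0 is handle 0).
example = first_appearance_masks([0b0111, 0b1101, 0b0011])
```

here the first batch owns handles 0 to 2, the second only picks up handle 3, and the third has nothing new, so it could skip integration checks entirely.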
Currently integration is super fast. Not sure this is worth it.