Friflo.Engine.ECS: Suggestion to add an approach where each chunk is processed by a separate thread

public QueryJob<T1> ForEach(Action<Chunk<T1>, ChunkEntities> action)  => new (this, action);

For this parallel processing, I suggest adding a new approach in which each chunk is processed by a separate thread (similar to sharding). While a chunk is being processed, operations can be carried out based on the current segmentation status.

Sep 01 '25 08:09 WayneHalake

The current approach splits a big chunk (thousands of components) into smaller segments. Each segment is processed in a separate thread. In this case multithreading is beneficial.

The penalties introduced by multithreading are:

- Additional resources to schedule work on different queues
- The fork / join of multiple threads: ~2000 ns

The additional overhead is significant in the following case.

If a query matches multiple archetypes (archetype fragmentation), the expectation is that the number of components within each archetype is small (1-100), so its chunks are also small.

The overhead of distributing these small chunks to multiple threads is significant, so it's very likely that the overall execution time is worse. It would also require more CPU resources.

In this case, processing the chunks on a single thread is likely to be faster.
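A minimal sketch of this trade-off as a cost model, not part of Friflo.Engine.ECS. The class `ParallelHeuristic` and its methods are hypothetical names, and the per-component cost passed in is an assumption; only the ~2000 ns fork/join figure comes from the estimate above.

```csharp
public static class ParallelHeuristic
{
    // Fork/join cost per parallel run, taken from the ~2000 ns estimate above.
    public const double ForkJoinNs = 2000;

    // Estimated time to process all components on a single thread.
    public static double SequentialNs(int count, double nsPerComponent)
        => count * nsPerComponent;

    // Estimated time when the work is split evenly across threads, plus fork/join.
    public static double ParallelNs(int count, double nsPerComponent, int threads)
        => count * nsPerComponent / threads + ForkJoinNs;

    // True only if the parallel estimate beats the sequential one.
    public static bool ShouldRunParallel(int count, double nsPerComponent, int threads)
        => ParallelNs(count, nsPerComponent, threads) < SequentialNs(count, nsPerComponent);
}
```

With an assumed cost of 5 ns per component and 4 threads, an archetype with 10 components is estimated at 50 ns sequentially versus about 2012 ns in parallel, so it should stay on one thread. An archetype with 100000 components is estimated at 500000 ns versus 127000 ns, so it benefits from running in parallel.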

Sep 01 '25 11:09 friflo

When there are multiple archetypes and the data in each archetype is relatively small, it may indeed lead to worse performance. However, if the amount of data in each archetype is relatively large, I believe that allocating a separate thread to each archetype may be more efficient. When the amount of data in an archetype is relatively small, we can check it against a suitable threshold and process it directly on the main thread.

Sep 01 '25 12:09 WayneHalake

An example. A query with 10 matching archetypes. Their sizes:

- 1 archetype  - 100000 components
- 9 archetypes - 10 components

var runner = new ParallelJobRunner(4); // 4 threads

The archetype with 100000 components is split into 4 sub-chunks of 25000 components each, which are executed in 4 parallel threads. This also ensures that the workload is evenly distributed among those 4 threads. The 9 archetypes with 10 components are executed single-threaded.

Feel free to comment on this example or create a different one.
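The split described above can be sketched as follows. This is an illustration only, not the library's code: `SectionSplitter`, `Split` and the `minParallelLength` threshold are hypothetical names, and the threshold value used in the example is an assumption.

```csharp
using System;
using System.Collections.Generic;

public static class SectionSplitter
{
    // Splits [0, length) into at most `threads` contiguous sections of near-equal size.
    // Archetypes shorter than minParallelLength stay in one section (single-threaded).
    public static List<(int Start, int Length)> Split(int length, int threads, int minParallelLength)
    {
        var sections = new List<(int Start, int Length)>();
        if (length <= 0) return sections;

        int parts = length < minParallelLength ? 1 : Math.Max(1, Math.Min(threads, length));
        int baseSize  = length / parts;
        int remainder = length % parts;   // the first `remainder` sections get one extra element
        int start = 0;
        for (int i = 0; i < parts; i++)
        {
            int size = baseSize + (i < remainder ? 1 : 0);
            sections.Add((start, size));
            start += size;
        }
        return sections;
    }
}
```

With 4 threads and an assumed threshold of 1000, `Split(100000, 4, 1000)` yields 4 sections of 25000 components each, while `Split(10, 4, 1000)` yields a single section of 10 components that runs on one thread.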

Sep 01 '25 12:09 friflo

I think the comparison should be made based on the same total amount of data.

E.g.:

- 1 archetype  - 400000 components
- 4 archetypes - 100000 components    // 400000 components

var runner = new ParallelJobRunner(4); // 4 threads

I also think these two solutions should be complementary; it does not have to be only one or the other.

Sep 01 '25 13:09 WayneHalake

The current implementation does this:

- 1 archetype  - 400000 components
- 4 archetypes - 100000 components    // 400000 components

var runner = new ParallelJobRunner(4); // 4 threads

1. 100000 in 4 threads + fork/join
2.  25000 in 4 threads + fork/join
3.  25000 in 4 threads + fork/join
4.  25000 in 4 threads + fork/join
5.  25000 in 4 threads + fork/join

Please describe how the workloads should be distributed in an alternative implementation.

An alternative solution should avoid creating uneven workloads, e.g. one thread has to process 400000 while the remaining threads process only 100000. The effect is that one thread requires a long execution time while the other threads finish much sooner, so the overall execution time is dominated by a single thread.
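To make the uneven-workload concern concrete, here is a rough model that counts how many components the busiest thread has to process under each strategy. It is a sketch under stated assumptions: `WorkloadModel` and its methods are hypothetical, the per-archetype assignment uses a simple greedy largest-first rule, and fork/join overhead is ignored.

```csharp
using System;
using System.Linq;

public static class WorkloadModel
{
    // Per-archetype scheduling: each archetype runs whole on one thread.
    // Greedy assignment, largest archetype first, always to the least-loaded thread.
    // Returns the load of the busiest thread.
    public static long PerArchetypeMakespan(int[] sizes, int threads)
    {
        var load = new long[threads];
        foreach (int size in sizes.OrderByDescending(s => s))
        {
            int idx = Array.IndexOf(load, load.Min());
            load[idx] += size;
        }
        return load.Max();
    }

    // Splitting: every archetype is divided evenly across all threads, one archetype after another.
    // Returns the total number of components each thread processes (rounded up per archetype).
    public static long SplitMakespan(int[] sizes, int threads)
        => sizes.Sum(s => (long)((s + threads - 1) / threads));
}
```

For the sizes 400000, 100000, 100000, 100000, 100000 and 4 threads, the per-archetype model leaves one thread with 400000 components, while splitting gives every thread 200000 components in total.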

Sep 01 '25 13:09 friflo

- 1 archetype  - 400000 components
- 4 archetypes - 100000 components    // 400000 components

var runner = new ParallelJobRunner(4); // 4 threads

//  current implementation
1 archetype          400000 in 4 threads + fork/join    


// alternative implementation  every archetype 1 thread
archetype1         100000 in thread1
archetype2         100000 in thread2
archetype3         100000 in thread3
archetype4         100000 in thread4

thread1  thread2  thread3  thread4   fork/join

I am aware that this kind of solution may lead to unbalanced thread workloads, so I believe these two solutions can complement each other.
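A rough sketch of the per-archetype idea, written against plain arrays rather than the Friflo.Engine.ECS API. `PerArchetypeRunner` is a hypothetical name, each `float[]` stands in for one archetype's component data, and the work is scheduled as thread-pool tasks rather than dedicated threads.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

public static class PerArchetypeRunner
{
    // Runs `process` once per archetype, each as its own task, then waits for all of them.
    // Each float[] is a hypothetical stand-in for one archetype's component data.
    public static void Run(float[][] archetypes, Action<float[]> process)
    {
        Task[] tasks = archetypes
            .Select(components => Task.Run(() => process(components)))
            .ToArray();
        Task.WaitAll(tasks);   // the single fork/join point for the whole query
    }
}
```

Because each archetype is touched by exactly one task, the callback needs no synchronization as long as it only writes to the array it was given.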

Sep 01 '25 13:09 WayneHalake

Thanks, understood.

In this case your scheduling would be better, right. But as you mentioned, in the general case this kind of distribution is unlikely.

Several factors have a negative impact, and they can vary a lot. E.g.:

- You have only 1, 2 or 3 archetypes with 100000 components but 4 threads.
- Thread count != 4
- The size of each archetype is very different, which is the typical case.

Covering all these cases for a general solution would make scheduling very complex.

Sep 01 '25 14:09 friflo