godot icon indicating copy to clipboard operation
godot copied to clipboard

Replace `List` with `LocalVector` if used as simple buffer

Open YYF233333 opened this issue 1 year ago • 9 comments

A byproduct when I go through List. These list are used as simple buffer to store value temporarily, which means:

  • They are local variables.
  • They get value by push_back.
  • Values are consumed by iterating the List.

So I change them to LocalVector, as they do not depend on container implementation.

There are other kinds of usages which can be replaced, but need some prepare first:

  • Same as above, but iterate from tail to head. Need a reverse iterator for LocalVector.
  • Used as stack. Need a pop_back() method for LocalVector.
  • Used as queue. Shall we implement a dedicated Queue?

YYF233333 avatar Dec 02 '24 18:12 YYF233333

This needs some checking to ensure these are safe uses, the swap is only valid if the container size is not changed while iterating, ~the code in Node for example looks unsafe~

AThousandShips avatar Dec 03 '24 11:12 AThousandShips

Also important to evaluate is the tradeoff between locality and reduced overhead in memory on one hand, and the cost of reallocation and moving of data on the other

One of the big benefits of List is that changing the size reallocates no memory, so it would depend on the data type stored and the cost of repeatedly reallocating data depending on how the size of the container changes and the overhead there

AThousandShips avatar Dec 03 '24 11:12 AThousandShips

This needs some checking to ensure these are safe uses, the swap is only valid if the container size is not changed while iterating, the code in Node for example looks unsafe

My bad, that's a queue. Double checked, no other resize during iterating.

One of the big benefits of List is that changing the size reallocates no memory, so it would depend on the data type stored and the cost of repeatedly reallocating data depending on how the size of the container changes and the overhead there

All these containers gather value by push_back, for List it is an alloc per call. I think alloc times matters more than alloc size.

If performance turns out to be problem, we can call reserve().

YYF233333 avatar Dec 03 '24 12:12 YYF233333

We should be using reserve everywhere that we can. I.e. for places where we know the total number of elements, reserve exactly enough for the total number of elements, and where we only know a possible range, we should allocate enough to cover the range.

clayjohn avatar Dec 03 '24 21:12 clayjohn

We should be using reserve everywhere that we can. I.e. for places where we know the total number of elements, reserve exactly enough for the total number of elements, and where we only know a possible range, we should allocate enough to cover the range.

That should be another PR I guess, we should check all LocalVector usage, not just these. Will implement that if have time. Or you just prefer refactor these first?

YYF233333 avatar Dec 04 '24 06:12 YYF233333

I imagine lawnjelly's FixedArray would be beneficial in some cases, if the size is known:

https://github.com/godotengine/godot/blob/71bbe3b00c40986859045e0397f6a3afe311c3e2/core/templates/fixed_array.h

It allocates on the stack, avoiding a heap allocation. It would be useful if VectorView accepted it as it would mean there are numerous cases where it could be used to create things like uniform sets without allocating on the heap.

stuartcarnie avatar Dec 04 '24 06:12 stuartcarnie

Just see #99328 and I think I have to clarify something with this performance tag.

The first priority of this PR is not to improve performance (If these List cause problem we should have found them long ago).

My major concern here is List is generally not considered to play well with modern processors (cache stuff). If you use it, you probably have some special needs (e.g. insert in the middle). If not, like in this PR, the use of List may confuse someone when they touch the code.

So this should be considered a codestyle stuff first, and I did some test to show it is not going to cause systematic regression.

MicroBenchmark

push_back one by one from a std::vector and iterating through to do some nonsense stuff. Also test if you sort before iterate because many code do this.

| relative |               ns/op |                op/s |    err% |     total | benchmark
|---------:|--------------------:|--------------------:|--------:|----------:|:----------
|   100.0% |              121.56 |        8,226,243.55 |    0.1% |      0.01 | `List        itemsize: 4 length: 1 sort: 0`
|   207.7% |               58.53 |       17,085,847.77 |    0.1% |      0.01 | `LocalVector itemsize: 4 length: 1 sort: 0`
|   100.0% |              484.70 |        2,063,120.98 |    0.2% |      0.01 | `List        itemsize: 4 length: 7 sort: 0`
|   195.7% |              247.67 |        4,037,581.08 |    0.1% |      0.01 | `LocalVector itemsize: 4 length: 7 sort: 0`
|   100.0% |            3,865.53 |          258,696.92 |    0.2% |      0.05 | `List        itemsize: 4 length: 63 sort: 0`
|   812.5% |              475.78 |        2,101,827.68 |    0.1% |      0.01 | `LocalVector itemsize: 4 length: 63 sort: 0`
|   100.0% |           34,327.83 |           29,130.88 |    1.1% |      0.41 | `List        itemsize: 4 length: 511 sort: 0`
| 3,671.7% |              934.93 |        1,069,603.20 |    0.7% |      0.01 | `LocalVector itemsize: 4 length: 511 sort: 0`
|   100.0% |          254,799.90 |            3,924.65 |    0.2% |      3.05 | `List        itemsize: 4 length: 4095 sort: 0`
| 8,035.9% |            3,170.79 |          315,378.95 |    1.0% |      0.04 | `LocalVector itemsize: 4 length: 4095 sort: 0`
|   100.0% |              128.63 |        7,774,033.51 |    0.2% |      0.01 | `List        itemsize: 24 length: 1 sort: 0`
|   174.8% |               73.58 |       13,590,218.24 |    0.0% |      0.01 | `LocalVector itemsize: 24 length: 1 sort: 0`
|   100.0% |              508.46 |        1,966,727.22 |    0.2% |      0.01 | `List        itemsize: 24 length: 7 sort: 0`
|   163.0% |              311.90 |        3,206,153.31 |    0.1% |      0.01 | `LocalVector itemsize: 24 length: 7 sort: 0`
|   100.0% |            4,053.12 |          246,723.21 |    0.3% |      0.05 | `List        itemsize: 24 length: 63 sort: 0`
|   368.0% |            1,101.36 |          907,966.79 |    0.2% |      0.01 | `LocalVector itemsize: 24 length: 63 sort: 0`
|   100.0% |           32,916.59 |           30,379.81 |    0.2% |      0.39 | `List        itemsize: 24 length: 511 sort: 0`
|   543.4% |            6,057.23 |          165,092.06 |    1.3% |      0.07 | `LocalVector itemsize: 24 length: 511 sort: 0`
|   100.0% |          284,490.54 |            3,515.06 |    0.2% |      3.40 | `List        itemsize: 24 length: 4095 sort: 0`
|   680.3% |           41,816.05 |           23,914.26 |    0.5% |      0.50 | `LocalVector itemsize: 24 length: 4095 sort: 0`
|   100.0% |              121.97 |        8,198,987.54 |    0.1% |      0.01 | `List        itemsize: 4 length: 1 sort: 1`
|   197.9% |               61.64 |       16,223,083.55 |    0.1% |      0.01 | `LocalVector itemsize: 4 length: 1 sort: 1`
|   100.0% |              561.59 |        1,780,657.10 |    0.6% |      0.01 | `List        itemsize: 4 length: 7 sort: 1`
|   219.0% |              256.46 |        3,899,191.63 |    0.1% |      0.01 | `LocalVector itemsize: 4 length: 7 sort: 1`
|   100.0% |            4,212.65 |          237,380.50 |    0.4% |      0.05 | `List        itemsize: 4 length: 63 sort: 1`
|   678.8% |              620.59 |        1,611,370.31 |    0.2% |      0.01 | `LocalVector itemsize: 4 length: 63 sort: 1`
|   100.0% |           38,429.21 |           26,021.87 |    0.2% |      0.46 | `List        itemsize: 4 length: 511 sort: 1`
| 1,569.3% |            2,448.83 |          408,357.83 |    0.5% |      0.03 | `LocalVector itemsize: 4 length: 511 sort: 1`
|   100.0% |          309,527.59 |            3,230.73 |    0.2% |      3.70 | `List        itemsize: 4 length: 4095 sort: 1`
| 1,573.5% |           19,671.17 |           50,835.82 |    0.3% |      0.23 | `LocalVector itemsize: 4 length: 4095 sort: 1`
|   100.0% |              128.69 |        7,770,760.97 |    1.3% |      0.01 | `List        itemsize: 24 length: 1 sort: 1`
|   154.5% |               83.30 |       12,004,334.72 |    4.2% |      0.01 | `LocalVector itemsize: 24 length: 1 sort: 1`
|   100.0% |              738.00 |        1,355,013.55 |    1.8% |      0.01 | `List        itemsize: 24 length: 7 sort: 1`
|   138.9% |              531.46 |        1,881,610.87 |    0.2% |      0.01 | `LocalVector itemsize: 24 length: 7 sort: 1`
|   100.0% |            6,569.49 |          152,218.86 |    0.1% |      0.08 | `List        itemsize: 24 length: 63 sort: 1`
|   150.1% |            4,377.90 |          228,420.02 |    0.2% |      0.05 | `LocalVector itemsize: 24 length: 63 sort: 1`
|   100.0% |           69,331.06 |           14,423.55 |    0.2% |      0.83 | `List        itemsize: 24 length: 511 sort: 1`
|   132.0% |           52,521.50 |           19,039.82 |    0.2% |      0.63 | `LocalVector itemsize: 24 length: 511 sort: 1`
|   100.0% |          724,062.43 |            1,381.10 |    0.1% |      8.65 | `List        itemsize: 24 length: 4095 sort: 1`
|   123.5% |          586,161.22 |            1,706.02 |    0.1% |      7.01 | `LocalVector itemsize: 24 length: 4095 sort: 1`
Source Code

Just try to run MyTest() in any godot environment. I hack test_main.cpp, thanks @Ivorforce for the idea.

Using this nice small lib nanobench.h

#define ANKERL_NANOBENCH_IMPLEMENT
#include "nanobench.h"
#include "core/templates/list.h"
#include "core/templates/local_vector.h"
#include <ostream>
#include <sstream>
#include <vector>

template<typename T>
class Test
{
    T placehold;
    List<T> list;
    LocalVector<T> vector;

public:
    void bench_list(const std::vector<T> &source, bool sort)
    {
        for (auto &item : source)
        {
            list.push_back(item);
        }
        if (sort)
        {
            list.sort();
        }
        for (auto &item : list)
        {
            placehold = item;
        }
    }

    void bench_local_vector(const std::vector<T> &source, bool sort)
    {
        for (auto &item : source)
        {
            vector.push_back(item);
        }
        if (sort)
        {
            vector.sort();
        }
        for (auto &item : vector)
        {
            placehold = item;
        }
    }

    static void bench(bool sort)
    {
        for (int i = 1; i < 4096 + 1; i *= 8)
        {
            int n = i == 1 ? 1 : i-1;
            std::ostringstream name;
            name << "itemsize: " << sizeof(T) << " length: " << n << " sort: " << sort;
            std::vector<T> source(n);
            auto bench = ankerl::nanobench::Bench()
                .minEpochIterations(1000)
                .relative(true);
            bench.run("List        " + name.str(), [&]{
                Test test;
                test.bench_list(source, sort);
            });
            bench.run("LocalVector " + name.str(), [&]{
                Test test;
                test.bench_local_vector(source, sort);
            });
        }
    }
};

void mytest(void)
{
    Test<int>::bench(false);
    Test<Variant>::bench(false);
    Test<int>::bench(true);
    Test<Variant>::bench(true);
}

YYF233333 avatar Dec 05 '24 13:12 YYF233333

Add reserve to be aligned with #100251.

YYF233333 avatar Dec 12 '24 09:12 YYF233333

That being said — this is a pretty big change, and I gotta say if I were the code maintainer I would be uncomfortable to merge it because of potential regressions. I would therefore advocate to split the PR and address each use separately, to make them small, comfortable and somewhat tested chunks. It's annoying, but it's safe.

Yes, but it is already split from a bigger one :) Despite being somewhat scattered, these changes are essentially the same pattern.

Safety should probably be ok, I double checked they follow the restriction on top, but I cannot prove that to you (unless you read the code).

I'm waiting to see if #100251 can be merged efficiently. If that can not, I'd better give up.

YYF233333 avatar Dec 12 '24 16:12 YYF233333

Any feedback on this? I feel like I've had too many pending PRs and I'd like to clean up some. Can I do something to get it merged?

YYF233333 avatar Jan 04 '25 15:01 YYF233333

Any feedback on this? I feel like I've had too many pending PRs and I'd like to clean up some. Can I do something to get it merged?

As I mentioned before, this PR (as some of your others) has quite a few changes, and there is no proof of correctness or superiority over the previous implementation. Both factors combined likely lead to unwillingness to spend time doing reviews or generally risk the code churn of a merge.

Personally I fully support and appreciate your efforts for the List to LocalVector transition, but you should probably start with smaller steps that you can fully prove to be correct and measure a performance benefit for :)

Ivorforce avatar Jan 04 '25 16:01 Ivorforce

Ok, I see @AThousandShips react with thumbs up so I assume maintainers are not generally against this idea. I'll try to continue this work.

there is no proof of correctness or superiority over the previous implementation.

What kind of proof are you expecting? I definitely cannot formally verify these code. 😅 If the CI can't convince you, then I'm afraid there's not much I can do to self-prove. About superiority, I think LocalVector beats List in both API simplicity and performance (as long as you don't insert). If we haven't reach a consensus on this point, I should create a proposal for further discussion.

YYF233333 avatar Jan 05 '25 12:01 YYF233333

Not generally against it, just agreeing with some of the opinions

AThousandShips avatar Jan 05 '25 12:01 AThousandShips

What kind of proof are you expecting? I definitely cannot formally verify these code. 😅

Here's what I would do:

  • Find an example of a function (ideally an important one) that is practically suffering from its use of List
  • Benchmark calls of this function, showing that replacing the List with LocalVector leads to measurable benefits (even among its other performance costs)
  • Analyze surrounding code and describe why I think the change will never be slower or somehow incorrect. This should be easy with a single function, but difficult with tens of functions like here.

This should be enough to get a merge. After the merge, you can add a new PR addressing a few similar functions, testing one of them explicitly and describing why the rest of them are all analogous. Given you've extensively described and proven the general benefits at this point, it's more likely this can be approved. Perhaps after that, you can start an even larger PR with less closely related functions because the benefit has been even better established. Depending on your actual benchmark results of course :)

Ivorforce avatar Jan 05 '25 12:01 Ivorforce

showing that replacing the List with LocalVector leads to measurable benefits

Well I think that's a pure misunderstanding caused by that performance tag. These changes are present here only because they don't cause a performance problem. I propose this change because I think LocalVector is more "correct" than List in these place and I'd like to do a thorough cleanup of these usecases. This inevitably involves risks and won’t bring many immediate benefits so I'm checking maintainer‘s willingness.

YYF233333 avatar Jan 05 '25 13:01 YYF233333

I think LocalVector is more "correct" than List in these place

I'm not sure I follow, can you explain why List is incorrect? So far youve only talked performance considerations. I'm confused you don't consider this to be a performance PR.

Ivorforce avatar Jan 05 '25 13:01 Ivorforce

We have extensive use of List, with only a small portion becoming performance bottlenecks. For those that do become bottlenecks, we can handle them through the usual performance optimization process. However, for those that don’t, I believe they pose two potential issues:

  • They might force nearby hotspot code to use List, leading to performance issues.
  • They could put pressure on the memory allocator (e.g., more fragmentation, reduced locality), indirectly affecting other parts of the system.

Since these concerns remain hypothetical, I think labeling them as performance PRs are inappropriate. The direct impact of these improvements might be only a few microseconds or even less.

Translated by GPT :)

YYF233333 avatar Jan 05 '25 13:01 YYF233333

You are making some interesting and valid points. If this is the primary concern behind your PRs, i recommend you gather evidence and / or articles for this to link, and open a Godot proposal. Let people weigh in, and if a consensus is reached that this is important, a big clean up PR can be made to address it once and for all. If you're just making these PRs with no consensus they should be merged, you're in a dead end. Theoretical concerns are well and good, but evidence is needed to actually warrant the risks and effort merging PRs brings.

Ivorforce avatar Jan 05 '25 13:01 Ivorforce

I’m going to close this PR as I’m currently evaluating the overall changes from List to LocalVector. Once I’ve gathered enough information, I’ll recreate this one with a better approach and scope.

YYF233333 avatar Jan 06 '25 05:01 YYF233333