kaitai_struct_cpp_stl_runtime icon indicating copy to clipboard operation
kaitai_struct_cpp_stl_runtime copied to clipboard

Process XOR/ROL optimisations

Open arekbulski opened this issue 6 years ago • 7 comments

Identical code as in C# and Python runtimes, well almost identical.

I am unable to run tests so someone may need to fix type cases. Please assist.

arekbulski avatar Apr 09 '18 18:04 arekbulski

All points made by @webbnh were resolved, either fixed or disregarded, aside of one: whether std::string is passed around as cheap O(1) or by full copy in O(N). As I am not proficient with C++, I wouldnt know. Oh two things actually: also whether optional arguments are declared properly.

Last commit amended.

arekbulski avatar Apr 10 '18 01:04 arekbulski

I would be highly surprised that any precalculation like that would be faster for C++. Anything related to C++ generally needs rigorous testing, and especially these performance-related things. So, it's not even a question of running a benchmark, it needs a lot of benchmarks to be considered — i.e. different OSes / compilers combinations, x86 / x86_64 targets, etc. There's a huge potential of improvement, but

The suggested changes are not harmless, unfortunately. You allocate global array of 2KB, which would cause trouble with dynamically linked library, and it could cause trouble with super low-RAM embedded systems where we can't allow to eat extra 2KB of RAM and unconditionally allocating 2KB of static memory is a bad idea as well. There are also numerous coding style flaws, but that's relatively minor.

GreyCat avatar Apr 10 '18 11:04 GreyCat

I would be highly surprised that any precalculation like that would be faster for C++.

In C# benchmarks, array lookup is definately faster than bitwise << | >> so it stands to reason that it would also be with C++. But I will get you a benchmark to confirm.

You allocate global array of 2KB, which would cause trouble with dynamically linked library, and it could cause trouble with super low-RAM embedded systems where we can't allow to eat extra 2KB of RAM

I think that is just balooney. It may be that someone working on embedded systems code would want to use Kaitai but that doesnt mean that we have to support it at the cost of making ridiculous restrictions. If they want to use Kaitai in embedded code, they can fork it and strip it of any non-used functionality. We just dont need to go to that much lenghts. Most processes on the desktop you are using have several MB of working space. 2KB is like next to nothing.

If you demand, I can change the code to generate that translation table on usage but it will lower benchmarks where the data is short, <256 bytes. Its not worth it.

arekbulski avatar Apr 11 '18 03:04 arekbulski

In C# benchmarks, array lookup is definately faster than bitwise << | >> so it stands to reason that it would also be with C++. But I will get you a benchmark to confirm.

Benchmarking VM-powered languages like C# is hard, microbenchmarking such intricate matters as a single inlinable method is super hard. It's certainly not as simple as running a single executable with a Stopwatch. At the very least, it should be using something like https://github.com/dotnet/BenchmarkDotNet, ensure that we've past the warmup phase, that the code got compiled, etc.

GreyCat avatar Apr 11 '18 10:04 GreyCat

Benchmark for C++, results identical to C# with respect to "which of the two is slower".

It shows (again) that array lookup is faster than several bitwise shifts/ors but its not faster than a simple xor. It also shows the same that modulo is slower than check-and-wrap. But... it also discovered that passing string is by copy and reference is equally cheap!

#include <iostream>
#include <string>
#include <boost/timer/timer.hpp>
using namespace std;
using namespace boost::timer;

string pass_copy(string data) 
{
    data[0] = 1;
    return data;
}

string & pass_reference(string & data) 
{
    data[0] = 1;
    return data;
}

string loop_xor(string data) 
{
    size_t len = data.length();
    string result(len, ' ');
    uint8_t key = 1;

    for (size_t i = 0; i < len; i++) {
        result[i] = data[i] ^ key;
    }

    return result;
}

string loop_shifts(string data) 
{
    size_t len = data.length();
    string result(len, ' ');
    int amount = 1;
    int antiamount = 7;

    for (size_t i = 0; i < len; i++) {
        uint8_t x = data[i];
        result[i] = (x << amount)|(x >> antiamount);
    }

    return result;
}

string loop_lookup(string data) 
{
    size_t len = data.length();
    string result(len, ' ');
    uint8_t translate[256];

    for (size_t i = 0; i < len; i++) {
        result[i] = translate[data[i]];
    }

    return result;
}

string loop_modulo(string data) 
{
    size_t len = data.length();
    string result(len, ' ');
    size_t keylen = 10;
    uint8_t key[10];

    for (size_t i = 0; i < len; i++) {
        result[i] = data[i] ^ key[i % keylen];
    }

    return result;
}

string loop_checkwrap(string data) 
{
    size_t len = data.length();
    string result(len, ' ');
    size_t keylen = 10;
    uint8_t key[10];

    size_t k = 0;
    for (size_t i = 0; i < len; i++) {
        result[i] = data[i] ^ key[k];
        k++;
        if (k == keylen)
            k = 0;
    }

    return result;
}

int main() 
{
    string data(1024*1024*1024, ' ');

    {
        cout << "Pass data by copy (current impl)" << endl;
        auto_cpu_timer timer;

        string result = pass_copy(data);
    }
    {
        cout << "Pass data by reference" << endl;
        auto_cpu_timer timer;

        string result = pass_reference(data);
    }
    {
        cout << "Loop with ^ key" << endl;
        auto_cpu_timer timer;

        string result = loop_xor(data);
    }
    {
        cout << "Loop with << >> |" << endl;
        auto_cpu_timer timer;

        string result = loop_shifts(data);
    }
    {
        cout << "Loop with translate[...]" << endl;
        auto_cpu_timer timer;

        string result = loop_lookup(data);
    }
    {
        cout << "Loop with data[i %% keylen]" << endl;
        auto_cpu_timer timer;

        string result = loop_modulo(data);
    }
    {
        cout << "Loop with data[k]; if(k==keylen)k=0" << endl;
        auto_cpu_timer timer;

        string result = loop_checkwrap(data);
    }
}

$ g++ --version
g++ (Ubuntu 7.2.0-8ubuntu3.2) 7.2.0
$ g++ test.cpp -lboost_timer
$ ./a.out 
Pass data by copy (current impl)
 0.731110s wall, 0.350000s user + 0.380000s system = 0.730000s CPU (99.8%)
Pass data by reference
 0.731649s wall, 0.310000s user + 0.420000s system = 0.730000s CPU (99.8%)
Loop with ^ key
 5.903469s wall, 5.060000s user + 0.830000s system = 5.890000s CPU (99.8%)
Loop with << >> |
 7.911201s wall, 7.080000s user + 0.830000s system = 7.910000s CPU (100.0%)
Loop with translate[...]
 6.368772s wall, 5.550000s user + 0.810000s system = 6.360000s CPU (99.9%)
Loop with data[i %% keylen]
 16.802076s wall, 15.970000s user + 0.790000s system = 16.760000s CPU (99.7%)
Loop with data[k]; if(k==keylen)k=0
 6.545717s wall, 5.700000s user + 0.830000s system = 6.530000s CPU (99.8%)

arekbulski avatar Apr 18 '18 13:04 arekbulski

Code was updated according to benchmarks, and is therefore READY TO MERGE, although it would be prudent for someone to check if the runtime compiles properly with some KSY-compiled cpp, just to be sure.

arekbulski avatar Apr 18 '18 13:04 arekbulski

I'm not sure the code is ready for merge: I started looking at it and I think I found an error. (I'll do a more detailed review and post comments.)

Also, I'm not convinced that your program actually demonstrates that passing by copy and reference is equally cheap: since your code doesn't do anything with the results of the subroutines, I suspect that the compiler is within its rights to optimize away the whole call (that is, the definitions are in-scope, and the implementations have no side-effects, so the compiler could be able to tell that it doesn't matter whether the functions are actually called or not).

webbnh avatar Apr 18 '18 15:04 webbnh