kaitai_struct_python_runtime
XOR benchmarks
Follow-up of the discussion in 04289cc13d6501153cab61e875d9c4e2ccdaa5d8
Python 2 bytearray mutations
== run_benchmark_process_xor.py
max = 2147483543
34.8218421936
0.414654970169
Python 3 bytearray mutations
== run_benchmark_process_xor.py
max = 2147483543
33.529802142000335
0.39478560099996685
Python 3 iterating bytes
== run_benchmark_process_xor.py
max = 2147483543
32.42236607199993
0.39769275300022855
Python 2 numpy bitwise_xor
== run_benchmark_process_xor.py
max = 2147483543
27.3349659443
0.424569129944
Python 3 numpy bitwise_xor
== run_benchmark_process_xor.py
max = 2147483543
24.56664780200026
0.4790448710000419
Conclusions:
- On a real-world example, the difference between bytearray mutations and mapping over bytes in Python 3 is not big enough to be worth maintaining a separate code path.
- numpy's bitwise_xor gives a significant performance increase; however, it requires manually wrapping (tiling) the key, because bitwise_xor expects arrays of the same size.
Since numpy contains C extensions and may require compilation, I suggest adding numpy as an optional dependency and falling back to native Python bytearray mutations + xor if numpy is not installed (see the sketch below).
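For reference, here is a minimal sketch of that approach, assuming a hypothetical helper named process_xor_many (not the runtime's actual API): it uses numpy's bitwise_xor over a tiled copy of the key when numpy is available, and falls back to plain bytearray mutation otherwise.

try:
    import numpy as np
except ImportError:
    np = None

def process_xor_many(data, key):
    if np is not None:
        data_arr = np.frombuffer(data, dtype=np.uint8)
        key_arr = np.frombuffer(key, dtype=np.uint8)
        # Tile the key so it covers the whole data buffer, then trim it to
        # the exact length: bitwise_xor needs arrays of the same size.
        repeats = -(-len(data) // len(key))  # ceiling division
        key_tiled = np.tile(key_arr, repeats)[:len(data)]
        return np.bitwise_xor(data_arr, key_tiled).tobytes()
    # Fallback: pure-Python bytearray mutation (works on Python 2 and 3).
    result = bytearray(data)
    key_bytes = bytearray(key)
    key_len = len(key_bytes)
    for i in range(len(result)):
        result[i] ^= key_bytes[i % key_len]
    return bytes(result)

The np.tile call is the "manual key wrapping" mentioned above: the key is repeated enough times to cover the data and then trimmed to the same length.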
Great comparison, thanks!
I have yet another humble suggestion: would it be possible to submit something like bitwise_xor_cyclic to numpy, and then use it here? That would probably make it even faster.
There's also this large battle for XOR at SO. I can't estimate how hard it is to include & maintain such code in our library, and/or whether it's worth it.
A lot of the answers suggested in that SO post are impractical (hand-writing SSE2 instructions, depending on lots of third-party libraries), but the general takeaway is that the only way to gain a meaningful performance improvement is to write native code. The best options for Python are cffi and Cython: cffi is a better fit for small functions, while Cython is better when you need complex C<->Python interop, so I picked cffi for my test. My current C implementation looks like this (I know only basic C, so it might be suboptimal); a sketch of the cffi wiring follows the code:
#include <stddef.h>

/* XOR every byte of data with a single-byte key. The length is passed in
   explicitly (rather than taken from strlen) because the data is arbitrary
   binary and may contain null bytes. */
void xor_one(const char *data, size_t data_len, unsigned char key, char *result)
{
    size_t idx;
    for (idx = 0; idx < data_len; idx++)
    {
        result[idx] = data[idx] ^ key;
    }
}

/* XOR data with a multi-byte key, cycling over the key bytes. */
void xor_many(const char *data, size_t data_len, const char *key, size_t key_len, char *result)
{
    size_t idx;
    for (idx = 0; idx < data_len; idx++)
    {
        result[idx] = data[idx] ^ key[idx % key_len];
    }
}
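For completeness, a rough sketch of how these functions could be wired up with cffi in API (out-of-line) mode. The module name _xor_cffi, the file name xor.c, and the wrapper function are my assumptions, not the actual benchmark code.

# build_xor.py -- hypothetical build script; run once to compile the extension.
from cffi import FFI

ffibuilder = FFI()

# Declarations must match the C functions above.
ffibuilder.cdef("""
    void xor_one(const char *data, size_t data_len, unsigned char key, char *result);
    void xor_many(const char *data, size_t data_len, const char *key, size_t key_len, char *result);
""")

# Compile the C source above (assumed to be saved as xor.c) into _xor_cffi.
with open("xor.c") as f:
    ffibuilder.set_source("_xor_cffi", f.read())

if __name__ == "__main__":
    ffibuilder.compile()

At runtime, the generated module exposes ffi and lib:

from _xor_cffi import ffi, lib

def process_xor_many(data, key):
    # Allocate an output buffer of the same length as the input.
    result = ffi.new("char[]", len(data))
    lib.xor_many(data, len(data), key, len(key), result)
    # Copy the C buffer back into a Python byte string.
    return ffi.buffer(result, len(data))[:]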
Benchmark results:
Python 2
== run_benchmark_process_xor.py
max = 2144703287
26.3223340511
0.44394993782
Python 3
== run_benchmark_process_xor.py
max = 2144703287
23.45923235600094
0.4029908089996752
Pros:
- Users don't need to install the full numpy package
- Slightly faster than numpy
Cons:
- Needs compilation and binary distributions for different platforms and architectures
Submitting bitwise_xor_cyclic to numpy doesn't look realistic: their bitwise_xor is just one of their generic array operations, so it requires arrays of the same size in any case.
In my opinion, the performance improvement of the cffi code is not worth the build complexity it adds.
Adding Cython as either a dependency or a runtime is being debated in https://github.com/kaitai-io/kaitai_struct/issues/311 . If it gets added, reimplementing xor there will take just a few lines.
EDIT: Cython runtime was withdrawn.