
bytes for loop performance


In the context of zeek/zeek#3580 it was observed that a for loop over a bytes object to XOR-decode its content takes ~30% of the total runtime in that setting. This ticket is meant as a generic "speed up bytes iteration" request.

The test iterates 500,000 times over a 134-byte bytes instance in %init. The SPICY_THREADED=1 version creates a short-lived thread in spicy-driver's main to prevent glibc from using optimized non-atomic shared_ptrs (#1600). This better reflects the performance within an application like Zeek, and it makes performance significantly worse. Python is included as a baseline and executes significantly faster.

Not sure how relevant this is; looping over large bytes values might not be a common use case.
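For reference, the original use case is roughly a per-byte XOR decode like the following (a minimal Python sketch; the single-byte key 0x41 is a made-up placeholder, not the one from zeek/zeek#3580):

```python
# Minimal sketch of the XOR-decode pattern that motivated this ticket.
# The key 0x41 is a placeholder, not the real key.
def xor_decode(data: bytes, key: int = 0x41) -> bytes:
    out = bytearray()
    for c in data:           # per-byte iteration -- the hot loop in question
        out.append(c ^ key)
    return bytes(out)

print(xor_decode(b"AAAA"))   # 0x41 ^ 0x41 == 0x00 for every byte
```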

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `echo "x" \| ../build/bin/spicy-driver loop.hlto` | 4.922 ± 0.022 | 4.903 | 4.945 | 1.76 ± 0.01 |
| `echo "x" \| SPICY_THREADED=1 ../build/bin/spicy-driver loop.hlto` | 7.147 ± 0.015 | 7.132 | 7.161 | 2.56 ± 0.01 |
| `python3 loop.py` | 2.791 ± 0.007 | 2.783 | 2.797 | 1.00 |

The non-DWARF flamegraph (I suspect std::shared_ptr above the flat bars, as in zeek/zeek#3580).

[flamegraph image]

# loop.spicy
module Loop;

global xs: bytes = b"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA";

public type X = unit {

 on %init {
  local i = 0;
  local sum  = 0;
  while ( i < 500000 ) {
    for ( c in xs )
      sum += c;
    i++;
  }
  print "sum", sum;
 }
};
# loop.py
xs = b"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"


def main():
    i = 0
    s = 0
    while i < 500000:
        for c in xs:
            s += c
        i += 1

    print("sum", s)


if __name__ == "__main__":
    main()
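For comparison, the Python baseline also spends its time in the per-byte loop; a variant that pushes the inner loop into the built-in sum() gives the same result with far less per-element overhead, which hints at how much the per-element iteration costs:

```python
# Variant of loop.py: same computation, but the inner per-byte loop
# runs inside the C-level built-in sum() instead of interpreted code.
xs = b"A" * 134


def main():
    s = 0
    for _ in range(500_000):
        s += sum(xs)  # sum of the byte values, computed in C
    print("sum", s)


if __name__ == "__main__":
    main()
```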

awelzel · Jan 25 '24 14:01

I think the more general issue is that iteration over collections of known size can be expensive because we always emit runtime checks. In general this makes sense: even with a range-based for loop, the underlying collection could be changed under the covers (e.g., imagine if every second iteration emptied the collection).
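As an illustration of why such runtime checks exist, the analogous situation in Python raises at runtime when a dict is mutated mid-iteration (a sketch; Spicy's checks work differently, but they guard against the same class of problem):

```python
# Mutating a container while a range-style loop is iterating over it
# invalidates the iteration; Python detects this for dicts at runtime.
d = {"a": 1, "b": 2, "c": 3}
try:
    for k in d:
        del d[k]  # modifies the collection mid-iteration
except RuntimeError as e:
    print("caught:", e)  # "dictionary changed size during iteration"
```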

To be faster we would need a way to detect that the collection itself is not modified. This is non-trivial to do automatically since we would have to make sure that no called code modifies the collection (e.g., changes element order, or removes elements); this could even happen in called functions, which do not even need to be implemented in Spicy, so we might be unable to inspect their bodies. I think we talked in the past about collection methods like transform or for_each which would either compute a new collection or modify collection elements in place, but the function argument (or, since Spicy currently doesn't have closures, a hypothetical closure) could still modify the collection itself, so I fear that approach would still not allow faster iteration. It looks like the only way to accomplish this is very global data-flow analysis.
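To make that concrete, here is a Python sketch of why a hypothetical for_each would not help on its own: the callback can still reach the collection and mutate it mid-iteration, so the iteration still cannot be assumed safe:

```python
# Hypothetical for_each: even with a method-style API, the callback can
# close over the collection and mutate it during iteration.
def for_each(collection, fn):
    for elem in collection:
        fn(elem)

xs = [1, 2, 3, 4]

def evil(elem):
    if elem == 2:
        xs.clear()  # empties the collection mid-iteration

for_each(xs, evil)
print(xs)  # the list is now empty; iteration silently stopped early
```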

If such iteration is a bottleneck for you right now, you could implement a custom C++ function that uses unchecked iteration, after making sure you can uphold the invariants required for that.

bbannier · Jan 25 '24 15:01