event-ruler icon indicating copy to clipboard operation
event-ruler copied to clipboard

Number Benchmarks are slow

Open baldawar opened this issue 2 years ago • 8 comments

Describe the bug

The test setup surrounding Numeric Benchmarks was fixed as part of https://github.com/aws/event-ruler/pull/18 and we found that the tests were actually slow. Documenting this comment to investigate later https://github.com/aws/event-ruler/pull/18#discussion_r948337339.

I'm pretty sure I know why numeric benchmarks are slow. I had forgotten that this was true in Ruler, and I implemented a very similar approach in Quamina and was shocked when it ran slowly to a degree very similar to what we see here. So I profiled it and found out the time was being sucked up by whatever the Java equivalent is of Go's strconv.ParseFloat() and Sprintf("%019.0f"...) - it's not that cheap to parse floats or generate a normalized 19-digit representation.

Which I think means we're probably stuck with it. Unless you think you can write better floating-point parse/format code that's faster than what comes with the core libraries.

I don't think we'll do anything but we should look into if there's a way to improve precision.

Additional Context

File for Ruler https://github.com/aws/event-ruler/blob/main/src/main/software/amazon/event/ruler/ComparableNumber.java#L37.

baldawar avatar Aug 22 '22 18:08 baldawar

I looked into this a bit today. Sharing my thoughts before I get busy with operator duties for sometime.

  1. The long poll is the part of Range constructor where we generate ComparableNumber https://github.com/aws/event-ruler/blob/main/src/main/software/amazon/event/ruler/Range.java#L31. This as Tim mentioned in the comment above it affected by the padding + hex conversion. Benchmarks show its around 80~90% of the overall latency in CL2Benchmark benchmark for Numeric matchers.
  2. We can make significant improvements (> 20%) in the benchmark if we pass static pre-generated byte arrays instead of Constants for lessThan, greaterThan, and so on methods https://github.com/aws/event-ruler/blob/main/src/main/software/amazon/event/ruler/Range.java#L57. Real benefits will vary based on how often clients use in-between X & Y range or CIDR.
  3. We've also over-biased towards CIDR / Hex matching. We should investigate if directly converting long to bytes is sufficient. On the surface, seems "only" need to update digitSequence method but I've run out time to dig into the potential benefits.
  4. While we likely can't write code that's faster than core libraries, doesn't hurt to try as we have need a very limited set of features. It's worth looking if bitmasking or look-up tables lead to any benefits. One place more investigation is needed.

baldawar avatar Sep 07 '22 05:09 baldawar

Related discussion in https://github.com/aws/event-ruler/pull/37

baldawar avatar Sep 08 '22 05:09 baldawar

huh, so the issue is actually in the rules used within the benchmarks. It's not obvious unless you dig into the cityplots test data but there's multiple layers of arrays involved before we get to co-ordinates. This slows down rule-matching because of Array Consistency checks.

When I update cityplots2 file to also have a "firstCoordinates" key (takes the top most JSON w x/y/z values[1]), then match the rules against this new key then the benchmarks show significant improvement

Reading citylots2
Read 213068 events
EXACT events/sec: 213281.3
WILDCARD events/sec: 168967.5
PREFIX events/sec: 202728.8
SUFFIX events/sec: 217638.4
EQUALS_IGNORE_CASE events/sec: 198572.2
NUMERIC events/sec: 124820.2
ANYTHING-BUT events/sec: 136845.2
COMBO events/sec: 87899.3

[1] Sample JSON:

{
  "type": "Feature",
  "properties": {
    "MAPBLKLOT": "0001001",
    "BLKLOT": "0001001",
    "BLOCK_NUM": "0001",
    "LOT_NUM": "001",
    "FROM_ST": "0",
    "TO_ST": "0",
    "STREET": "UNKNOWN",
    "ST_TYPE": null,
    "ODD_EVEN": "E"
  },
  "geometry": {
    "type": "Polygon",
    "coordinates": [
      [
        {
          "x": -122.42200352825247,
          "y": 37.80848009696725,
          "z": 0
        },
        {
          "x": -122.42207601332528,
          "y": 37.808835019815085,
          "z": 0
        },
        {
          "x": -122.42110217434863,
          "y": 37.808803534992904,
          "z": 0
        },
        {
          "x": -122.42106256906727,
          "y": 37.80860105681815,
          "z": 0
        },
        {
          "x": -122.42200352825247,
          "y": 37.80848009696725,
          "z": 0
        }
      ]
    ],
    "firstCoordinates": {
      "x": -122.42200352825247,
      "y": 37.80848009696725,
      "z": 0
    }
  }
}

baldawar avatar Sep 09 '22 16:09 baldawar

Yes, but the benchmark is doing exactly what it was designed to - exercise the hell out of the flattener and number-processor and so on. One effect is that we're testing a worst-case, most people's real applications are going to run faster than the benchmark, which is a good thing. Having said that, The CityLots data is real live data from a real live dataset, so probably a realistic problem. At AWS I had one group complain to me about how matching fields that were arrays of numbers was so much slower than matching ARNs or detail-types, so the kind of change we just checked in is still useful.

BTW the Quamina array-consistency design is totally different from Ruler's. One of them is probably faster but I have no idea which one.

timbray avatar Sep 09 '22 17:09 timbray

I did some exploration around this, using the AnythingButPerformanceBenchmark (there is a slow down around the 10k event mark, caused by a few very large events) and found some events where the number of coordinates inside the geometry of the event is very high, the sample I was testing with had

Event.fields: 2462 

ACTask.stepQueue: 3029492 <- this is the max size the stepQueue gets handling that event, feels a little high

the worst offender I found, had more than 10k coordinates, and did got to around 17 million size on stepQueue, i have the impression something is exploding to get that many steps, I remain curious and will keep learning a little more about how the whole thing works to see if I can find other ways to help.

rudygt avatar Sep 09 '22 20:09 rudygt

Yeah, the basic algorithm is in ACFinder.java, like this: You have a sorted list of fields, i.e. name/value pairs, in the event, in this case 2462 long. You have a certain number of steps N in the state machine, dunno what it is. So, you start with the first state and drop an entry on the StepQueue for each field, because the state machine might start matching at any field. Suppose you get matches for first state at fields 2 (transitioning to state X) and 4 (trans to state Y). So then you're going to drop entries on the stepQ for state X and all the fields starting with 3, and for state Y and all the fields starting with 5. I.e you keep adding things to the stepQ and the central loop is "while stepQ not empty".

OK, I acknowledge this is a bit indirect, not the most intuitive thing. Anyhow, per the above, if you have 2.4K fields and get a few interim matches, it's not crazy that the stepQueue could get long. On the other hand, it is totally possible that we're doing something dumb here and unnecessarily loading it up. A fresh pair of eyes would be welcome.

timbray avatar Sep 09 '22 20:09 timbray

And I've realized, at some point in the intervening years, that the whole thing could be done recursively without using the stepQueue at all. Which my intuition says would probably be faster but I wouldn't bet much on it. But the basic concept is the same, wherever you are in the state machine you have to give all the remaining fields a chance to match the current state.

timbray avatar Sep 09 '22 22:09 timbray

I don't want to interrupt the discussion here too much but just a minor note tied my previous comment that I'm planning on adding a change, that 1/ rename the current rules around numeric matches to be called "array consistent matchers: and 2/ have a different set of rules for numeric matches that aren't affected by arrays. Mostly because I got very close to this state while digging into this issue, so not a lot of effort to make the change. It avoids any confusion around ruler's numeric matching performance (like being slow because of sprintf) and makes it clear that current bottleneck is in ACFinder.

baldawar avatar Sep 09 '22 22:09 baldawar