veniq icon indicating copy to clipboard operation
veniq copied to clipboard

Make creation of extraction opportunities faster.

Open aravij opened this issue 3 years ago • 3 comments

Changed algorithm of creation of extraction opportunities to speed it up.

Now algorithm is working in the following way:

  1. Calculate for each statement location of next similar statement. We store a list of steps (int), where adding a step to statement index, we get index of next similar statement. Not all statements may have next similar statements.
  2. Create initial statement ranges. Split a sequence of statements int a non overlapping sorted sequence of statements ranges without gaps between them. Initial ranges are ranges where each statement, except first one, is similar to the previous one. That way we split all statements and add such ranges to extraction opportunities. They correspond to opportunities created during step one.
  3. Collect all similarity gaps - statements, which next similar statement does not follow them immediately.
  4. For each such gap:
    1. Identify ranges of statements where first and second statements belong.
    2. Merge those two ranges and all between them into a single one.
    3. If previous opportunity, created due to handling gap of the same size, starts from the same statement as newly created one, overwrite that opportunity with new one. Otherwise append new one, to already created. This step is done, because some gaps may overlap, i.e. range of second statement of first gap is equals to range of first statement of second gap. If that happens, both such gaps should belong to the same opportunity, as running previous version of algorithm would pass through them at once, because they are of the same size. We identify overlapping of gaps, as second opportunity would be large than the first one, but starts from the same statement. So if newly created opportunity starts from the same statements, created during handling of the gap of the same size, we simply overwrite that opportunity with newly created one, as it contains both gaps.

Applying new version of algorithm we get the following gain:

  • For file InternalMetaDataParser with 1721 methods the average speed up of create_extraction_opportunities step was 88.6% or 0.0086 seconds. The total time saved on that step is 14.8 seconds. The total processing of this file with SEMI algorithm takes 2.5 minutes.
  • For file TomlParser with 87 methods the average speed up of create_extraction_opportunities step was 68.3% or 0.0052 seconds. The total time saved on that step is 0.45 seconds. The total processing of this file with SEMI algorithm takes 7 seconds.

The relative speed up us quite good, while in absolute numbers it is quite irrelevant.

Further speeding up the algorithms might be done through seeding up other steps and, may be, ast framework. Here is comparison of time taken by create_extraction_opportunities to other steps.

step name InternalMetaDataParser
old version
InternalMetaDataParser
new version
TomlParser
old version
TomlParser
new version
Extract semantic 3.4 ms 3.4 ms 4.2 ms 4 ms
Create opportunities 9.4 ms 0.8 ms 5.9 ms 0.7 ms
Filter opportunities 13 ms 14 ms 18.7 ms 18 ms
Rank opportunities 51 ms 52 ms 47.8 ms 47 ms

aravij avatar Nov 12 '20 10:11 aravij