veniq Make creation of extraction opportunities faster.

Changed algorithm of creation of extraction opportunities to speed it up.

Now algorithm is working in the following way:

Calculate for each statement location of next similar statement. We store a list of steps (int), where adding a step to statement index, we get index of next similar statement. Not all statements may have next similar statements.
Create initial statement ranges. Split a sequence of statements int a non overlapping sorted sequence of statements ranges without gaps between them. Initial ranges are ranges where each statement, except first one, is similar to the previous one. That way we split all statements and add such ranges to extraction opportunities. They correspond to opportunities created during step one.
Collect all similarity gaps - statements, which next similar statement does not follow them immediately.
For each such gap:
1. Identify ranges of statements where first and second statements belong.
2. Merge those two ranges and all between them into a single one.
3. If previous opportunity, created due to handling gap of the same size, starts from the same statement as newly created one, overwrite that opportunity with new one. Otherwise append new one, to already created. This step is done, because some gaps may overlap, i.e. range of second statement of first gap is equals to range of first statement of second gap. If that happens, both such gaps should belong to the same opportunity, as running previous version of algorithm would pass through them at once, because they are of the same size. We identify overlapping of gaps, as second opportunity would be large than the first one, but starts from the same statement. So if newly created opportunity starts from the same statements, created during handling of the gap of the same size, we simply overwrite that opportunity with newly created one, as it contains both gaps.

Applying new version of algorithm we get the following gain:

For file InternalMetaDataParser with 1721 methods the average speed up of create_extraction_opportunities step was 88.6% or 0.0086 seconds. The total time saved on that step is 14.8 seconds. The total processing of this file with SEMI algorithm takes 2.5 minutes.
For file TomlParser with 87 methods the average speed up of create_extraction_opportunities step was 68.3% or 0.0052 seconds. The total time saved on that step is 0.45 seconds. The total processing of this file with SEMI algorithm takes 7 seconds.

The relative speed up us quite good, while in absolute numbers it is quite irrelevant.

Further speeding up the algorithms might be done through seeding up other steps and, may be, ast framework. Here is comparison of time taken by create_extraction_opportunities to other steps.

step name	InternalMetaDataParser old version	InternalMetaDataParser new version	TomlParser old version	TomlParser new version
Extract semantic	3.4 ms	3.4 ms	4.2 ms	4 ms
Create opportunities	9.4 ms	0.8 ms	5.9 ms	0.7 ms
Filter opportunities	13 ms	14 ms	18.7 ms	18 ms
Rank opportunities	51 ms	52 ms	47.8 ms	47 ms

Nov 12 '20 10:11 aravij

@aravij Let's discuss it on Monday. It is necessary to test it on large number of files.

Nov 13 '20 13:11 lyriccoder

With increased speed: Elapsed: 7889 secs Soon, I will count without increased speed

Nov 20 '20 13:11 lyriccoder

Without increased speed: Elapsed: 8415

Nov 23 '20 10:11 lyriccoder

veniq veniq copied to clipboard

Make creation of extraction opportunities faster.

veniq
veniq copied to clipboard