veniq
Make creation of extraction opportunities faster.
Changed the algorithm that creates extraction opportunities to speed it up.
The algorithm now works as follows:
- For each statement, calculate the location of the next similar statement. We store a list of steps (int): adding a step to a statement's index gives the index of the next similar statement. Not every statement has a next similar statement.
- Create the initial statement ranges. Split the sequence of statements into a sorted sequence of non-overlapping statement ranges with no gaps between them. An initial range is one where each statement, except the first, is similar to the previous one. We split all statements this way and add the resulting ranges to the extraction opportunities. They correspond to the opportunities created during step one.
- Collect all similarity gaps: statements whose next similar statement does not immediately follow them.
- For each such gap:
  - Identify the ranges to which the gap's first and second statements belong.
  - Merge those two ranges, and all ranges between them, into a single one.
  - If the previous opportunity was created while handling a gap of the same size and starts from the same statement as the new one, overwrite it with the new one. Otherwise, append the new one to those already created. This step is needed because some gaps may overlap, i.e. the range of the second statement of the first gap equals the range of the first statement of the second gap. When that happens, both gaps should belong to the same opportunity, because the previous version of the algorithm would pass through them at once, as they are of the same size. We detect overlapping gaps by the fact that the second opportunity is larger than the first one but starts from the same statement. In that case we simply overwrite the earlier opportunity with the new one, as it contains both gaps.
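The steps above can be sketched roughly as follows. This is an illustration only, not the code from this PR: the function names, the `similar(i, j)` predicate, the inclusive `(first, last)` range tuples and the ordering of gaps by `(size, index)` are all assumptions made for the example. It also assumes that merged ranges persist while later gaps are handled, and that a gap whose two statements already sit in one range adds nothing new.

```python
from bisect import bisect_right
from typing import Callable, List, Optional, Tuple

Range = Tuple[int, int]  # inclusive (first, last) statement indexes


def next_similar_steps(n: int, similar: Callable[[int, int], bool]) -> List[Optional[int]]:
    """steps[i] is the distance to the next statement similar to i, or None."""
    steps: List[Optional[int]] = [None] * n
    for i in range(n):
        for j in range(i + 1, n):  # naive scan, kept simple for the sketch
            if similar(i, j):
                steps[i] = j - i
                break
    return steps


def initial_ranges(steps: List[Optional[int]]) -> List[Range]:
    """Split statements into runs where each one is similar to the previous."""
    ranges: List[Range] = []
    start = 0
    for i in range(1, len(steps)):
        if steps[i - 1] != 1:  # statement i does not continue the run
            ranges.append((start, i - 1))
            start = i
    if steps:
        ranges.append((start, len(steps) - 1))
    return ranges


def find_range(ranges: List[Range], index: int) -> int:
    """Position of the range containing `index` (ranges are sorted, gap-free)."""
    return bisect_right([first for first, _ in ranges], index) - 1


def create_opportunities(n: int, similar: Callable[[int, int], bool]) -> List[Range]:
    steps = next_similar_steps(n, similar)
    ranges = initial_ranges(steps)
    opportunities = list(ranges)
    # Similarity gaps, ordered by gap size and then by position (assumption).
    gaps = sorted((s, i) for i, s in enumerate(steps) if s is not None and s > 1)
    last_step = None
    for step, i in gaps:
        a, b = find_range(ranges, i), find_range(ranges, i + step)
        if a == b:  # assumption: both statements already share a range
            continue
        merged = (ranges[a][0], ranges[b][1])
        ranges[a:b + 1] = [merged]  # merge the two ranges and all between them
        if last_step == step and opportunities[-1][0] == merged[0]:
            opportunities[-1] = merged  # overlapping gaps of the same size
        else:
            opportunities.append(merged)
        last_step = step
    return opportunities
```

For example, with statements using the variables `a, x, a, y, a` (similar when they share a variable), the two gaps of size 2 overlap, so the sketch reports the five single-statement ranges followed by one opportunity `(0, 4)` rather than `(0, 2)` and `(0, 4)` separately.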
Applying the new version of the algorithm gives the following gain:
- For file `InternalMetaDataParser` with 1721 methods, the average speed-up of the `create_extraction_opportunities` step was 88.6%, or 0.0086 seconds. The total time saved on that step is 14.8 seconds. The total processing of this file with the SEMI algorithm takes 2.5 minutes.
- For file `TomlParser` with 87 methods, the average speed-up of the `create_extraction_opportunities` step was 68.3%, or 0.0052 seconds. The total time saved on that step is 0.45 seconds. The total processing of this file with the SEMI algorithm takes 7 seconds.
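As a quick sanity check of the totals above, the time saved per file is the number of methods multiplied by the average saving per method. The helper names below are made up for this illustration.

```python
def total_saved_seconds(method_count: int, avg_saving_seconds: float) -> float:
    """Total time saved on one step across all methods of a file."""
    return method_count * avg_saving_seconds


def share_of_total_percent(saved_seconds: float, total_seconds: float) -> float:
    """Saved time as a percentage of the whole processing time of the file."""
    return 100 * saved_seconds / total_seconds


internal_saved = total_saved_seconds(1721, 0.0086)  # InternalMetaDataParser
toml_saved = total_saved_seconds(87, 0.0052)        # TomlParser

print(round(internal_saved, 1))  # 14.8
print(round(toml_saved, 2))      # 0.45
# Total processing times from the description: 2.5 minutes and 7 seconds.
print(round(share_of_total_percent(internal_saved, 2.5 * 60), 1))  # 9.9
print(round(share_of_total_percent(toml_saved, 7), 1))             # 6.5
```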
The relative speed-up is quite good, while in absolute numbers it is negligible.
Further speed-up of the algorithm might be achieved by speeding up other steps and, maybe, the ast framework.
Here is a comparison of the time taken by `create_extraction_opportunities` with the other steps.
step name | InternalMetaDataParser old version | InternalMetaDataParser new version | TomlParser old version | TomlParser new version |
---|---|---|---|---|
Extract semantic | 3.4 ms | 3.4 ms | 4.2 ms | 4 ms |
Create opportunities | 9.4 ms | 0.8 ms | 5.9 ms | 0.7 ms |
Filter opportunities | 13 ms | 14 ms | 18.7 ms | 18 ms |
Rank opportunities | 51 ms | 52 ms | 47.8 ms | 47 ms |
@aravij Let's discuss it on Monday. It needs to be tested on a large number of files.
With increased speed: Elapsed: 7889 secs
Soon, I will count without increased speed
Without increased speed: Elapsed: 8415