noverify
noverify copied to clipboard
Collect the irconv stats and analyze them
PHP Corpus info:
phpcorpus $ ls
assert flysystem mustache.php string-encoder
async glide options-resolver TrustedProxy
Carbon html owncloud Twig
closure httpful phabricator twitter-api-php
CodeIgniter image php-enum TypeResolver
composer inflector php-html-parser vimeo.php
cphalcon joomla-cms php-pm vkcom-cron
css-selector laravel phpunit vkcom-www
deprecation-contracts link-util polyfill-php80 wikia-app
drupal math psysh woocommerce
entrust Medoo Reflection WordPress
expression-language mime sage yii2
flight moodle stopwatch zttp
It also includes vkcom code.
AST->IR conversion counters based on my PHP corpus (only top-20):
9461162 (15.7%) *scalar.String
8403177 (14.0%) *node.SimpleVar
6141542 (10.2%) *node.Identifier
5525223 (9.2%) *expr.ArrayItem
4955058 (8.2%) *node.Argument
3037509 (5.0%) *name.Name
2995212 (5.0%) *stmt.Expression
2182616 (3.6%) *expr.MethodCall
1848324 (3.1%) *assign.Assign
1446128 (2.4%) *scalar.Lnumber
1354479 (2.3%) *stmt.StmtList
1326921 (2.2%) *expr.PropertyFetch
1068796 (1.8%) *expr.FunctionCall
1025172 (1.7%) *expr.Array
988252 (1.6%) *expr.ArrayDimFetch
824217 (1.4%) *expr.ConstFetch
632163 (1.0%) *stmt.ClassMethod
601275 (1.0%) *node.Parameter
513030 (0.8%) *stmt.If
495436 (0.8%) *stmt.Return
Full List (156 lines)
9461162 *scalar.String
8403177 *node.SimpleVar
6141542 *node.Identifier
5525223 *expr.ArrayItem
4955058 *node.Argument
3037509 *name.Name
2995212 *stmt.Expression
2182616 *expr.MethodCall
1848324 *assign.Assign
1446128 *scalar.Lnumber
1354479 *stmt.StmtList
1326921 *expr.PropertyFetch
1068796 *expr.FunctionCall
1025172 *expr.Array
988252 *expr.ArrayDimFetch
824217 *expr.ConstFetch
632163 *stmt.ClassMethod
601275 *node.Parameter
513030 *stmt.If
495436 *stmt.Return
401763 *binary.Concat
356859 *expr.StaticCall
258342 *expr.New
209579 *stmt.Property
207723 *stmt.PropertyList
183089 *expr.BooleanNot
178007 *stmt.Use
177573 *stmt.UseList
161835 *expr.ClassConstFetch
144986 *name.FullyQualified
115980 *scalar.EncapsedStringPart
114725 *expr.Paren
104492 *node.Root
101088 *stmt.ClassImplements
101088 *stmt.ClassExtends
101088 *stmt.Class
93680 *expr.UnaryMinus
91465 *stmt.Foreach
90810 *stmt.Else
88355 *binary.BooleanAnd
83857 *expr.Isset
74743 *expr.Empty
71963 *binary.Identical
68868 *stmt.InlineHtml
64673 *binary.Equal
60574 *stmt.Namespace
59650 *scalar.Encapsed
58396 *expr.Ternary
57718 *stmt.Echo
56810 *stmt.Case
55797 *stmt.Nop
54797 *assign.Concat
52073 *binary.BooleanOr
43979 *scalar.MagicConstant
42665 *stmt.Throw
41455 *stmt.Constant
40940 *binary.NotIdentical
39694 *stmt.Break
38637 *stmt.Global
37927 *stmt.ElseIf
34762 *expr.StaticPropertyFetch
34745 *binary.Plus
34567 *binary.Minus
28152 *stmt.ClassConstList
24561 *stmt.Function
23954 *expr.Exit
23182 *binary.Greater
22760 *expr.Closure
22760 *expr.ClosureUse
21436 *binary.Smaller
21053 *binary.NotEqual
19061 *expr.RequireOnce
18053 *binary.Mul
17998 *stmt.Unset
17316 *expr.InstanceOf
16151 *expr.PostInc
15429 *scalar.Dnumber
15235 *cast.Int
14757 *stmt.Catch
14496 *stmt.Continue
14210 *expr.List
13870 *assign.Plus
13752 *stmt.Try
12941 *stmt.Declare
11203 *cast.String
10989 *stmt.CaseList
10989 *stmt.Switch
10660 *binary.Div
10382 *binary.LogicalOr
10201 *stmt.For
9910 *cast.Object
8469 *expr.ErrorSuppress
7852 *stmt.TraitUse
7833 *binary.BitwiseAnd
7580 *stmt.While
7336 *binary.LogicalAnd
6830 *binary.GreaterOrEqual
6483 *binary.SmallerOrEqual
6378 *expr.Reference
5988 *stmt.Default
5783 *cast.Array
5019 *cast.Bool
4901 *expr.PreInc
4431 *assign.Reference
4391 *binary.BitwiseOr
3864 *expr.Require
3627 *stmt.Interface
3627 *stmt.InterfaceExtends
3243 *stmt.StaticVar
3230 *scalar.Heredoc
3228 *expr.Clone
3111 *stmt.Static
2681 *binary.Coalesce
2629 *binary.ShiftRight
2458 *expr.Print
2371 *binary.ShiftLeft
2368 *binary.Mod
2151 *assign.Minus
2092 *expr.Yield
1736 *expr.Include
1727 *expr.PostDec
1520 *expr.IncludeOnce
1425 *cast.Double
1270 *stmt.Trait
1210 *node.Nullable
1015 *stmt.Do
853 *expr.PreDec
712 *assign.BitwiseOr
615 *assign.Mul
466 *binary.BitwiseXor
396 *assign.BitwiseAnd
395 *expr.BitwiseNot
395 *assign.Div
366 *stmt.TraitMethodRef
344 *stmt.TraitUseAlias
310 *node.Var
302 *stmt.ConstList
268 *assign.BitwiseXor
264 *stmt.TraitAdaptationList
262 *stmt.Finally
186 *expr.Eval
124 *expr.UnaryPlus
88 *binary.Pow
82 *expr.ShellExec
74 *assign.ShiftRight
72 *binary.LogicalXor
63 *assign.Mod
54 *binary.Spaceship
36 *expr.YieldFrom
36 *assign.ShiftLeft
28 *stmt.Label
22 *stmt.TraitUsePrecedence
18 *stmt.Goto
8 *assign.Pow
8 *stmt.HaltCompiler
4 *stmt.GroupUse
Observations:
- If we sum all binary operators, we'll get the value of 937447. This is still less than
expr.Array
, for example (1025172). But if we speak percentages, it's 1.5% of all nodes, which is quite significant. But it's still not very appealing to merge all binary expression types right now. - top-4 node types cover ~50% of all node types. top-5 nodes cover ~57% of nodes.
Most of the time we consume in irconv
is related to memory allocations.
We create a heap-allocated object for every single node that we convert.
If Go had generics, we could write simple type-bound pools for top-N (~10?) nodes that would allocate on heap only if we're out of pool space. These pools would be worker-bound, so they don't need any sync (this is why sync.Pool
would be an overkill for small objects allocation, we can do it without synchronization).
Without generics, it would require us to write N implementations. Since 5 node types cover more than half of all allocations, we could write pools only for these 5 types.
It's not practical to try re-using 100% of the memory as it may result in a lot of waste. Some files can be huge while majority of the PHP files should fit in a relatively small memory pool. Speaking of "majority", we have stats on it as well.
13959 (25.4%) less than 50 lines (but >= 25)
13001 (23.7%) less than 100 lines (but >=50)
10541 (19.2%) less than 200 lines (but >= 100)
6703 (12.2%) less than 400 lines (but >= 200)
6056 (11.0%) less than 25 lines
2830 (5.1%) less than 800 lines (but >= 400)
1103 (2.0%) less than 1600 lines (but >= 800)
609 (1.1%) more than 1600 lines
Lines include empty lines, comments, etc. It's not SLOC, it's merely a
len(RootWalker.Lines)
.
Note that every "bigger" category does not include the previous one. Here are cumulative stats:
6056 less than 25 lines
20015 less than 50 lines
33016 less than 100 lines
43557 less than 200 lines
46387 less than 800 lines
47490 less than 1600 lines
609 >=1600 lines
We can deduce that:
- 36.5% of files are tiny, they have less than 50 lines.
- 60.2% of files have a relatively small size (<100 lines) that can fit into a pool of IR nodes.
- ~80% of files are smaller than 200 lines.
- Files that are too large to be practically cached do exist (there are examples of files with 8k+ lines).
If we cache ~57% of irconv allocations and apply that to ~60% of files*, we can cut ~34% of the irconv allocations-related overhead.
(*) We can still use these pools for bigger files, but they'll need to heap-allocate everything that doesn't fit into the pool. So, the 34% is a lower bound. In practice, we can save more, but these numbers are harder to estimate.
Typical pool/node cache (or "slab allocator") can look like this:
// Every pool is typed, it's used to allocate the specific node type.
simpleVars := make([]ir.SimpleVar, poolSize)
// Allocation.
if (numAllocated >= poolSize) {
return &ir.SimpleVar{} // Overflow is heap-allocated
}
obj := &simpleVars[numAllocated]
numAllocated++
return obj
// When worker is finished with a file, we reset the counter (for every pool).
numAllocated = 0
// Every irconv routine fills the allocated object, so we don't need to clear the cached nodes.
// They will be initialized by the converter code.
// Here is a typical converter case.
case *node.SimpleVar:
if n == nil {
return (*ir.SimpleVar)(nil)
}
out := &ir.SimpleVar{} // Always allocates
out.FreeFloating = n.FreeFloating
out.Position = n.Position
out.Name = n.Name
return out
// That code can be changed to use a pool.
case *node.SimpleVar:
if n == nil {
return (*ir.SimpleVar)(nil)
}
out := converter.newSimpleVar() // Doesn't allocate for small files
out.FreeFloating = n.FreeFloating
out.Position = n.Position
out.Name = n.Name
return out
IR converter could store these pools while worker can store that IR converter so it gets reused for many files.
workerState {
converter {
simpleVarPool
stringPool
identifierPool
...
}
}
Update: I brought some field experimentation results.
*ir.Argument
40 bytes (2550376 non-nil allocs)
pool size | hits | efficiency | max waste (pool size in bytes) |
---|---|---|---|
64 | 45.0% | 33 106 768 | 2 560 |
128 | 60.0% | 35 609 024 | 5 120 |
256 | 73.0% | 23 270 979 | 10 240 |
*ir.Name
32 bytes (1646720 non-nil allocs)
pool size | hits | efficiency | max waste (pool size in bytes) |
---|---|---|---|
64 | 66.6% | 24 854 896 | 2 048 |
128 | 72.5% | 17 723 904 | 4 096 |
256 | 82.2% | 23 55 322 | 8 128 |
*ir.SimpleVar
32 bytes (4336022 non-nil allocs)
pool size | hits | efficiency | max waste (pool size in bytes) |
---|---|---|---|
128 | 50.8% | 50 006 373 | 4 096 |
192 | 59.7% | 52 115 364 | 6 144 |
256 | 65.9% | 50 478 031 | 8 128 |
320 | 70.3% | 46 343 150 | 10 240 |
512 | 78.7% | 27 278 378 | 16 384 |
640 | 82.1% | 11 515 969 | 20 480 |
*ir.String
40 bytes (4787066 non-nil allocs)
pool size | hits | efficiency | max waste (pool size in bytes) |
---|---|---|---|
64 | 25.2% | 35 453 625 | 2 560 |
128 | 34.5% | 40 461 510 | 5 120 |
256 | 43.2% | 31 520 500 | 10 240 |
512 | 50.5% | -5 701 266 | 20 480 |
With this setup:
conv.alloc.SimpleVar = newSimpleVarAllocator(256)
conv.alloc.String = newStringAllocator(128)
conv.alloc.Identifier = newIdentifierAllocator(128)
conv.alloc.Name = newNameAllocator(128)
conv.alloc.Argument = newArgumentAllocator(256)
I get these results:
// Time spent in irconv phase for the PHP corpus
with slabs: 14.74 -- 15.10 sec
w/o slabs: 17.10 -- 17.76 sec
It looks like we can save ~14% of irconv execution time with slab allocators.