wasm-opt for WASIp2
Over in https://github.com/llvm/llvm-project/issues/147201, @SingleAccretion mentioned that they can't use wasm-opt:
Not all scenarios can (or want to) use wasm-opt. E.g. we don't use it by default because it can alter function indices. All WASIp2-targeting code doesn't use it (and can't, currently at least). So that's the main motivation (along with the fact that if we were, say, to use wasm-opt for this small optimization only, it would be at least an order of magnitude slower than if done in the linker [in terms of this optimization's isolated cost]).
I definitely get the desire to avoid wasm-opt for transforms that are required for correctness rather than optimizations (we have the same principle in Emscripten), but also it's still very nice to have for optimizations. And Binaryen tries to be useful across a lot of different wasm toolchains and use cases, so it made me curious about what the requirements of this scenario are ("all WASIp2-targeting code" sounds like a pretty broadly used use case!). It might make sense to have a mode or feature in Binaryen that could make it work here.
@singleaccretion can you maybe say a bit more about what the problems/requirements are for the use case?
can you maybe say a bit more about what the problems/requirements are for the use case?
Sure! There are two parts to this issue: general and NativeAOT-LLVM-specific.
One of the general issues I mentioned above is -target wasm32-unknown-wasip2 compatibility, i. e. working with components. This is already tracked here: https://github.com/WebAssembly/binaryen/issues/6728. I don't think this part is that interesting since it is by no means a fundamental limitation.
For the NativeAOT-LLVM use case, there exist the following considerations (in no particular order):
- One of our requirements is supporting stack traces, both for exceptions and of a free-standing "capture the current trace here" variety. The way we implement this is by parsing the JS `new Error().stack` string, looking for function indices. We need to do this since our public contract includes a certain amount of structured information (not just "give me the string") - we need to associate these function indices with auxiliary compiled artifacts. Obviously, this is not really compatible with anything that would shift these indices around post-link.
- Since this strategy only works on JS engines, we have a separate one that we use when targeting WASI. However, it has worse user experience characteristics (it can't capture the binary offset and so line info) and worse runtime performance (it makes every method more expensive).
- A strong philosophy of .NET (and therefore our toolchain as well) is "debuggable by default". It means all our builds (Release and otherwise) get compiled with `-g` by default. When DWARF is in play, `wasm-opt` is simply not much better (~3% when I last measured) than a simple `-Wl,--compress-relocations` (which also breaks DWARF, but that is an issue which can in principle be fixed).
- One of the reasons `wasm-opt` can be rather effective for C/C++ code is (in addition to shortening relocations, which is probably the single largest contributor) the fact that cross-module inlining is still tricky, you can't always do LTO right, it breaks sometimes, etc. Our compiler doesn't have this limitation (we feed LLVM code that is already optimized).
- Stepping back a bit, our toolchain already consists of two industrial-grade optimizing compilers that have had many years of work put into them. And yet, `wasm-opt` still finds opportunities to make the code they produce more efficient/compact, despite the fact that lowering to WASM loses information. This comes at a cost in compile time, however, and not a small one. Rather than making this kind of tradeoff, my first instinct would be to find the places upstream where these gains could also be realized, with a smaller penalty.
In the long run, we too will of course have a mode of "optimize everything everywhere, drop all debug info and stack trace info, spend as much time as you want", and wasm-opt will be a part of that. However, it is not going to be the default configuration, and default configurations matter more.
@SingleAccretion Interesting, yeah, if you are already running very powerful optimizers then maybe wasm-opt can't add much.
With that said, I am still surprised it is only 3%. If it's no trouble, is there perhaps some representative wasm output from your toolchain that I could look at, for my own curiosity?
One of our requirements is supporting stack traces, both for exceptions and of a free-standing "capture the current trace here" variety. The way we implement this is by parsing the JS new Error().stack string, looking for function indices. We need to do this since our public contract includes a certain amount of structured information (not just "give me the string") - we need to associate these function indices with auxiliary compiled artifacts. Obviously, this is not really compatible with anything that would shift these indices around post-link.
The general approach to handle that in toolchains using wasm-opt is to keep function names around until the very end, then run `--print-function-map` to list out function indexes and their names, then strip the names for the final binary. You can then use that list to annotate your stack traces.
(but, again, maybe not worth it for you. just fyi)
Stepping back a bit, our toolchain already consists of two industrial-grade optimizing compilers that have had many years of work put into them. And yet, wasm-opt still finds opportunities to make the code they produce more efficient/compact, despite the fact lowering to WASM loses information. This comes at a cost in compile time however, and not a small one. Rather than making this kind of tradeoff, my first instinct would be to find the places upstream where these gains could also be realized - with a smaller penalty.
Makes sense, yeah, these are constant tradeoffs. But it does imply duplication of effort - which is maybe worthwhile - because from the point of view of the larger ecosystem, every optimization we write in wasm-opt ends up available not only for LLVM users but also for Dart, Kotlin, Java, and more.
With that said, I am still surprised it is only 3%.
@kripken At least, that's the number I got for a particular output ~2 years ago. Measuring the output zipped below with a somewhat more modern wasm-opt, I do get similar numbers. It is fairly typical LLVM-produced code, but with lots (and lots) of null checks.
Code diffs
This is a managed application with ~3.3MB of code, of which about 0.3MB is C/C++ code. First, --compress-relocations:
Summary of Code Size diffs:
(Lower is better)
Total bytes of base: 3519392
Total bytes of diff: 3075156
Total bytes of delta: -444236 (-12.62% of base)
Average relative delta: -14.44%
diff is an improvement
average relative diff is an improvement
Top method improvements (percentages):
-8 (-57.14% of base) : 1104.dasm - GetThreadStore()
-8 (-53.33% of base) : 13494.dasm - emscripten_stack_get_free
-4 (-50.00% of base) : 13495.dasm - emscripten_stack_get_base
-4 (-50.00% of base) : 13496.dasm - emscripten_stack_get_end
-4 (-50.00% of base) : 13497.dasm - stackSave
-4 (-50.00% of base) : 1482.dasm - GetCurrentThreadAllocContext()
-4 (-50.00% of base) : 1540.dasm - InitializeGCSelector()
-4 (-50.00% of base) : 13500.dasm - emscripten_stack_get_current
-4 (-50.00% of base) : 1081.dasm - SyncClean::CleanUp()
-7 (-46.67% of base) : 1487.dasm - RaiseFailFastException
-8 (-44.44% of base) : 2604.dasm - RhpGetClasslibFunctionFromCodeAddress
-4 (-44.44% of base) : 1551.dasm - SystemNative_Abort
-4 (-44.44% of base) : 2594.dasm - RhpCallFinallyFunclet
-4 (-44.44% of base) : 2593.dasm - RhpCallFilterFunclet
-4 (-44.44% of base) : 2592.dasm - RhpCallCatchFunclet
-7 (-43.75% of base) : 13428.dasm - sched_yield
-8 (-40.00% of base) : 1099.dasm - RhpGetModuleSection
-4 (-40.00% of base) : 2608.dasm - RhpThrowNativeException
-4 (-40.00% of base) : 1042.dasm - GCToEEInterface::RefCountedHandleCallbacks(Object*)
-4 (-40.00% of base) : 13498.dasm - stackRestore
12503 total methods with Code Size differences (12503 improved, 0 regressed)
Second, with "DWARF-capped" wasm-opt:
Summary of Code Size diffs:
(Lower is better)
Total bytes of base: 3519392
Total bytes of diff: 2982985
Total bytes of delta: -536407 (-15.24% of base)
Average relative delta: -15.13%
diff is an improvement
average relative diff is an improvement
Top method regressions (percentages):
1 (50.00% of base) : 6924.dasm - DynamicGenerics_Dictionaries_Base___ctor
1 (50.00% of base) : 6923.dasm - DynamicGenerics_Dictionaries_GenBase_1<System___Canon>___ctor
1 (50.00% of base) : 6922.dasm - DynamicGenerics_Dictionaries_Gen_1<System___Canon>___ctor
1 (50.00% of base) : 6911.dasm - DynamicGenerics_Dictionaries_SingleUseArrayOnlyGen_1<System___Canon>___ctor
1 (50.00% of base) : 6847.dasm - S_P_CoreLib_System_Collections_Generic_HashSet_1_Enumerator<Char>__Dispose
1 (50.00% of base) : 1419.dasm - WKS::GCHeap::TemporaryEnableConcurrentGC()
1 (50.00% of base) : 1418.dasm - WKS::GCHeap::DiagGetGCSettings(EtwGCSettingsInfo*)
1 (50.00% of base) : 6925.dasm - DynamicGenerics_My___ctor
1 (50.00% of base) : 1420.dasm - WKS::GCHeap::TemporaryDisableConcurrentGC()
1 (50.00% of base) : 6577.dasm - S_P_CoreLib_System_Buffers_ArrayPoolEventSource___ctor_0
1 (50.00% of base) : 1573.dasm - GCToOSInterface::FlushProcessWriteBuffers()
1 (50.00% of base) : 3168.dasm - S_P_CoreLib_System_Threading_ExecutionContext__Dispose
1 (50.00% of base) : 7172.dasm - S_P_CoreLib_System_Collections_Generic_List_1_Enumerator<S_P_CoreLib_System_Collections_Generic_KeyValuePair_2<System___Canon__Bool>>__Dispose
1 (50.00% of base) : 7176.dasm - S_P_CoreLib_System_Collections_Generic_List_1_Enumerator<S_P_CoreLib_System_Collections_Generic_KeyValuePair_2<System___Canon__Int32>>__Dispose
1 (50.00% of base) : 7174.dasm - S_P_CoreLib_System_Collections_Generic_Dictionary_2_KeyCollection_Enumerator<System___Canon__Bool>__Dispose
1 (50.00% of base) : 7175.dasm - S_P_CoreLib_System_Collections_Generic_Dictionary_2_Enumerator<System___Canon__Bool>__Dispose
1 (50.00% of base) : 11754.dasm - S_P_CoreLib_System_Runtime_InteropServices_InAttribute___ctor
1 (50.00% of base) : 7177.dasm - S_P_CoreLib_System_Collections_Generic_Dictionary_2_ValueCollection_Enumerator<System___Canon__Int32>__Dispose
1 (50.00% of base) : 7178.dasm - S_P_CoreLib_System_Collections_Generic_Dictionary_2_KeyCollection_Enumerator<System___Canon__Int32>__Dispose
1 (50.00% of base) : 7179.dasm - S_P_CoreLib_System_Collections_Generic_Dictionary_2_Enumerator<System___Canon__Int32>__Dispose
Top methods only present in diff:
18 ( ∞ of base) : 13819.dasm - __bswap_16.1
5 ( ∞ of base) : 13817.dasm - __DOUBLE_BITS.1
21 ( ∞ of base) : 13816.dasm - fp_barrier.1
3 ( ∞ of base) : 13815.dasm - dummy.2
4 ( ∞ of base) : 13814.dasm - dummy.1
42 ( ∞ of base) : 13813.dasm - CheckPromoted(Object**, unsigned long*, unsigned long, unsigned long).1
24 ( ∞ of base) : 13812.dasm - GCHandleStore::~GCHandleStore().1
5 ( ∞ of base) : 13818.dasm - __cxx_global_array_dtor.1
Top method improvements (percentages):
-82 (-94.25% of base) : 13792.dasm - abort_message
-44 (-66.67% of base) : 1586.dasm - PalGetMaximumStackBounds_SingleThreadedWasm(void**, void**)
-8 (-57.14% of base) : 1128.dasm - GetThreadStore()
-87 (-56.49% of base) : 3338.dasm - S_P_CoreLib_System_Threading_ManualResetEventSlim__UpdateStateAtomically
-4 (-50.00% of base) : 13805.dasm - stackSave
-4 (-50.00% of base) : 1102.dasm - SyncClean::CleanUp()
-4 (-50.00% of base) : 1591.dasm - InitializeGCSelector()
-4 (-50.00% of base) : 1526.dasm - GetCurrentThreadAllocContext()
-144 (-49.15% of base) : 7862.dasm - S_P_CoreLib_System_Exception__ReportAllFramesAsJS
-7 (-46.67% of base) : 1531.dasm - RaiseFailFastException
-4 (-44.44% of base) : 1603.dasm - SystemNative_Abort
-37 (-44.05% of base) : 1001.dasm - main
-7 (-43.75% of base) : 13726.dasm - sched_yield
-5 (-41.67% of base) : 1114.dasm - RhGetThreadStaticStorage
-4 (-40.00% of base) : 1048.dasm - GCToEEInterface::RefCountedHandleCallbacks(Object*)
-8 (-40.00% of base) : 1123.dasm - RhpGetModuleSection
-4 (-40.00% of base) : 2663.dasm - RhpThrowNativeException
-4 (-40.00% of base) : 13806.dasm - stackRestore
-7 (-38.89% of base) : 1433.dasm - GCScan::GcWeakPtrScanBySingleThread(int, int, ScanContext*)
-7 (-38.89% of base) : 1428.dasm - GCScan::GetGcRuntimeStructuresValid()
Top methods only present in base:
-54 (-100.00% of base) : 7748.dasm - S_P_CoreLib_System_Runtime_EH__ThrowClasslibOverflowException
-7 (-100.00% of base) : 2644.dasm - RhpGetLastPreciseVirtualUnwindFrame
-451 (-100.00% of base) : 7730.dasm - S_P_CoreLib_System_Runtime_RuntimeExports__RhUnboxAny
-92 (-100.00% of base) : 12289.dasm - S_P_CoreLib_Internal_TypeSystem_LockFreeReaderHashtableOfPointers_2<System___Canon__S_P_CoreLib_System_Runtime_InteropServices_GCHandle>__TryWriteSentinelToLocation
-5 (-100.00% of base) : 13761.dasm - __DOUBLE_BITS*
-9 (-100.00% of base) : 2645.dasm - RhpCallCatchFunclet
-48 (-100.00% of base) : 1503.dasm - CheckPromoted(Object**, unsigned long*, unsigned long, unsigned long)*
-9 (-100.00% of base) : 2646.dasm - RhpCallFilterFunclet
-155 (-100.00% of base) : 6467.dasm - S_P_CoreLib_System_Random_XoshiroImpl__Next
-9 (-100.00% of base) : 2647.dasm - RhpCallFinallyFunclet
-292 (-100.00% of base) : 4001.dasm - S_P_CoreLib_Internal_IntrinsicSupport_EqualityComparerHelpers__StructOnlyEquals<S_P_CoreLib_System_Collections_Concurrent_ConcurrentUnifierWKeyed_2_Entry<S_P_CoreLib_System_Reflection_Runtime_TypeInfos_NativeFormat_NativeFormatRuntimeGenericParameterTypeInfoForMethods_UnificationKey__System___Canon>>
-29 (-100.00% of base) : 4002.dasm - S_P_CoreLib_Internal_IntrinsicSupport_EqualityComparerHelpers__GetComparerForReferenceTypesOnly<S_P_CoreLib_System_Collections_Concurrent_ConcurrentUnifierWKeyed_2_Entry<S_P_CoreLib_System_Reflection_Runtime_TypeInfos_NativeFormat_NativeFormatRuntimeGenericParameterTypeInfoForMethods_UnificationKey__System___Canon>>
-258 (-100.00% of base) : 2650.dasm - StackFrameIterator::CalculateCurrentMethodState()
-392 (-100.00% of base) : 2651.dasm - StackFrameIterator::InternalInitForStackTrace()
-109 (-100.00% of base) : 12288.dasm - S_P_CoreLib_Internal_TypeSystem_LockFreeReaderHashtableOfPointers_2<System___Canon__S_P_CoreLib_System_Runtime_InteropServices_GCHandle>__VolatileReadNonSentinelFromHashtable
-163 (-100.00% of base) : 4004.dasm - S_P_CoreLib_System_Array__IndexOf_4<S_P_CoreLib_System_Collections_Concurrent_ConcurrentUnifierWKeyed_2_Entry<S_P_CoreLib_System_Reflection_Runtime_TypeInfos_NativeFormat_NativeFormatRuntimeGenericParameterTypeInfoForMethods_UnificationKey__System___Canon>>
-173 (-100.00% of base) : 13078.dasm - S_P_CoreLib_System_Array__IndexOf_4<UInt8>
-170 (-100.00% of base) : 11611.dasm - S_P_TypeLoader_Internal_TypeSystem_ExceptionTypeNameFormatter__GetTypeName
-100 (-100.00% of base) : 7733.dasm - S_P_CoreLib_System_Runtime_InternalCalls__RhEndNoGCRegion
-113 (-100.00% of base) : 12086.dasm - S_P_CoreLib_System_Threading_Thread__StopThread
12820 total methods with Code Size differences (12533 improved, 287 regressed)
The "uninhibited" diffs are then closer to -20%, also fairly similar to what I measured last time.
You can also build the other test cases we have yourself, though it's a bit of an involved process.
Thanks! I took a quick look now. -Oz shrinks by 3.7%, while -O3 only by 1.1%. However, this part of the optimization diff (from --metrics) is interesting:
[funcs] : 8939 -4481
Call : 95157 -7493
CallIndirect : 7376 -15
[vars] : 23027 -6397
It is removing 33% of the functions and 7.3% of calls (likely through inlining). Also 21% of the total declared (non-param) locals. So while the binary size is not much smaller, it is doing useful work which might lead to runtime speedups.
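Those percentages follow from the --metrics output above, reading the first number as the post-optimization count and the second as the delta (so the pre-optimization count is `count - delta`):

```javascript
// Reproduce the quoted reductions from the --metrics deltas.
// First column = count after optimization, second = delta vs. before.
const metrics = {
  funcs: [8939, -4481],
  Call: [95157, -7493],
  vars: [23027, -6397],
};
for (const [name, [after, delta]] of Object.entries(metrics)) {
  const before = after - delta; // count before optimization
  console.log(`${name}: ${((-delta / before) * 100).toFixed(1)}% removed`);
}
// funcs: 33.4% removed
// Call: 7.3% removed
// vars: 21.7% removed
```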
likely through inlining
Doesn't inlining get disabled with DWARF present? With inlining, I would expect to see at least some methods in the "Top method regressions" column (in my diff) with things that look like inlining, but I don't see that. These eliminated methods are indeed quite interesting-looking though; it suggests there is (a lot of?) low-hanging fruit somewhere (I have an idea about where, but it would need confirmation... Maybe it'll lead to yet another linker feature request :)).
Doesn't inlining get disabled with DWARF present?
Oh, it does, yes (since it adds/changes locals). I measured without -g. When I add that, no functions are removed, and wasm-opt hardly helps, which makes sense.
But for a release build you would normally build without DWARF? Note that you can keep function names around for stack traces, but still remove full DWARF (-g --strip-dwarf), which will not prevent inlining or other opts.
But for a release build you would normally build without DWARF?
We do build with DWARF by default. The north star for user experience here is to always build with debug info enabled but then separate that debug info out (like Emscripten's `-gseparate-dwarf`), so it can later be used for offline analysis [of e.g. crash dumps]. That is how all other .NET targets work. We're not there yet, but in any case it would mean that any post-link tool would need to preserve this DWARF.
@SingleAccretion Do you use DWARF for something other than stack traces? (wasm-opt should do a good job of preserving those, but less for things like variable mappings)
Do you use DWARF for something other than stack traces? (wasm-opt should do a good job of preserving those, but less for things like variable mappings)
Yes, stepping through (optimized) code. Variables are currently a bit non-existent due to LLVM issues (which I believe can be fixed). Though to note again, for us the choice is not "should we include DI in all builds by default", it is "what would be the very strong and compelling reason for WASM (one target of many) to be different (from all other targets)".
Makes sense. Yeah, if you want variable watching eventually then you will need full DWARF. But if you only wanted stack traces, you could have gotten them just as good in wasm as in other targets by using source maps, that is, another form of DI (which handles stack traces but not locals; edit: and which wasm-opt has very good support for).