Add details on more execution errors
Added errors:
- Instruction limit exceeded
- Trapped
- Trapped Explicitly
- Wasm module not found
- Out of memory
- Reserved pages for old Motoko
- Slice overrun
- Memory access limit exceeded
- Insufficient cycles in memory grow
- Reserved cycles limit exceeded in memory grow
- Insufficient cycles in message memory grow
- Wasm memory limit exceeded
High-level questions:
- Discovery: how will developers know that this page exists?
- Maintainability: how do we plan on keeping this page up to date? I can already see that the 8GiB limit being added here will change in a currently open PR. AFAICT, these errors are at the level of the interface spec, as they are practically part of the protocol and changing them would be a breaking change. Would it make sense to at least think about whether these errors should be defined as part of the interface and maintained with the same rigor as the interface spec?
cc @dsarlis
For discoverability, `Hypervisor` errors will have an attached `doc_link` field, introduced here, which will provide a link to this page.
Hopefully the errors are also unique enough that a user who pastes one into Google gets a hit here.
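To make the discoverability idea concrete, here is a minimal sketch of an error type that appends a documentation link to its message. Everything in it is an assumption for illustration: the `ExecError` enum, the `DOC_BASE` URL, and the anchor naming scheme are hypothetical and are not taken from the replica code.

```rust
use std::fmt;

/// Hypothetical base URL; the real address of the docs page is not given in this thread.
const DOC_BASE: &str = "https://example.com/docs/execution-errors";

/// Hypothetical subset of the execution errors listed in this PR.
#[derive(Debug, Clone, PartialEq)]
pub enum ExecError {
    InstructionLimitExceeded,
    WasmModuleNotFound,
    OutOfMemory,
}

impl ExecError {
    /// Anchor of this error on the docs page (assumed naming scheme).
    fn anchor(&self) -> &'static str {
        match self {
            ExecError::InstructionLimitExceeded => "instruction-limit-exceeded",
            ExecError::WasmModuleNotFound => "wasm-module-not-found",
            ExecError::OutOfMemory => "out-of-memory",
        }
    }

    /// Link that would be attached to the error.
    pub fn doc_link(&self) -> String {
        format!("{}#{}", DOC_BASE, self.anchor())
    }
}

impl fmt::Display for ExecError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let summary = match self {
            ExecError::InstructionLimitExceeded => "Instruction limit exceeded",
            ExecError::WasmModuleNotFound => "Wasm module not found",
            ExecError::OutOfMemory => "Out of memory",
        };
        // The link is part of the rendered message, so it travels with the reject text.
        write!(f, "{}. See documentation: {}", summary, self.doc_link())
    }
}
```

Because the exhaustive `match` covers every variant, adding a new error without choosing an anchor for it fails to compile, which gives a small nudge toward keeping the page in sync.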
For maintainability, I agree that it's not easy. Maybe some of the limits could be generated by a library coming from the replica code? I could see some of them going into the spec, but I don't know if they all make sense given that, at least in theory, the spec should be for a general protocol and "other implementations" could have other limits.
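As a rough sketch of what "generated from the replica code" could look like: the limit lives in one constant and a docs build step renders the sentence from it. The constant name, the helper functions, and the wording are all hypothetical; the 8 GiB value only mirrors the figure mentioned earlier in this thread.

```rust
const GIB: u64 = 1024 * 1024 * 1024;

/// Hypothetical limit, defined once in code shared with the docs build.
/// The value is an example only, not the replica's actual constant.
pub const EXAMPLE_MEMORY_LIMIT_BYTES: u64 = 8 * GIB;

/// Render a byte count as whole GiB for documentation text.
pub fn format_gib(bytes: u64) -> String {
    format!("{} GiB", bytes / GIB)
}

/// Produce the sentence that a docs build step could substitute into the page.
pub fn render_limit_doc(name: &str, bytes: u64) -> String {
    format!("The {} is currently {}.", name, format_gib(bytes))
}
```

With this shape, changing the constant changes the rendered documentation on the next build, so the page cannot silently drift from the code.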
Also I'm not sure about changes to the errors being breaking changes since they're not returned in a structured way. If someone is parsing the rejections to get details about the errors, it could be argued that they're relying on an unstable interface.
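To illustrate why parsing reject text is fragile, here is a sketch of what a client that does so might look like. The `ErrorKind` enum, the `classify_reject` function, and the substrings it matches are hypothetical; the actual reject wording is not quoted in this thread.

```rust
/// Hypothetical client-side classification of a reject message.
#[derive(Debug, PartialEq)]
pub enum ErrorKind {
    InstructionLimit,
    OutOfMemory,
    /// Anything the client does not recognize, including reworded messages.
    Unknown,
}

/// Classify a reject message by substring match (assumed wording).
pub fn classify_reject(message: &str) -> ErrorKind {
    let lower = message.to_lowercase();
    if lower.contains("instruction limit") {
        ErrorKind::InstructionLimit
    } else if lower.contains("out of memory") {
        ErrorKind::OutOfMemory
    } else {
        ErrorKind::Unknown
    }
}
```

Any rewording on the replica side silently moves a message into `Unknown`, which is the sense in which such a client depends on an unstable interface.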
Oh, and as for including the specific errors in the spec, I wonder if it's realistic to fully specify when each type of error occurs. For example, with the errors about growing memory I guess we would then need to add the notion of the subnet total memory to the spec and then the scheduling algorithm would also need to be included in order to specify which messages succeed and which ones fail when there's a race to allocate the last memory on the subnet.
> I could see some of them going into the spec, but I don't know if they all make sense given that, at least in theory, the spec should be for a general protocol and "other implementations" could have other limits.
I agree. Some of the errors are implementation-specific, which means that we probably don't want to include them in the spec. And, as Adam also mentioned, explaining some of them might require going into implementation details that don't fit in the spec anyway.
Maintaining these error messages would indeed be a challenge -- I think the best we can do is keep each other accountable for updating the developer docs whenever we change error messages (maybe we can also add a comment in the code reminding people that they need to update this page?).
If we could get something more automated, like
> Maybe some of the limits could be generated by a library coming from the replica code?

it would be very nice, but I'm not sure how realistic it is to expect that any time soon.
> Some of the errors are implementation-specific, which means that we probably don't want to include them in the spec. And, as Adam also mentioned, explaining some of them might require going into implementation details that don't fit in the spec anyway.
I don't disagree with documenting these errors as done in this PR, and I agree that maintaining errors is hard. I don't have a clear answer myself, but I wouldn't necessarily agree that these error messages are implementation details. The north star of the interface spec is that external developers can build their own version of the replica, deploy that implementation, even on the same subnet as our own replica, and they would work. Clearly, they'd need to know exactly which error messages to return in which circumstances to keep the states from diverging.
I know we're very far away from this goal, and I'm not sure we'll ever reach it, but if a change we make can break canisters (which can certainly happen by changing these messages), then we shouldn't treat these as purely implementation details.
> The north star of the interface spec is that external developers can build their own version of the replica, deploy that implementation, even on the same subnet as our own replica, and they would work.
Is it really correct that the goal is they should be able to run on the same subnet? In that case we'd need to specify the exact cycle costs of everything. I thought it was supposed to be more that they could build their own subnet.
> Is it really correct that the goal is they should be able to run on the same subnet? In that case we'd need to specify the exact cycle costs of everything. I thought it was supposed to be more that they could build their own subnet.
Perhaps that was just my reading and I got it wrong, but under that interpretation, yes, cycle costs would need to be specified, I agree.
In any case, perhaps this is a bigger discussion than this PR. The documentation you're adding helps for sure, thanks for putting the effort into this.