go-algorand
simulate: resource population
Summary
When a user calls simulate with UnnamedResources enabled, simulate should suggest to the user how they can populate the resource arrays in their transactions to properly send the transaction group to the network.
Test Plan
- [x] Test ResourcePopulator works with simple local (not group sharing) resources
- [x] Test ResourcePopulator with group sharing
- [x] Test ResourcePopulator resource limit detection with group sharing (ie. it is able to find the correct transaction to put a resource in)
- [x] Test Simulate with ResourcePopulator functionality
- [ ] Test /simulate endpoint with ResourcePopulator functionality
- [ ] Write smaller tests for better ledger/simulation/resources.go coverage
Codecov Report
Attention: Patch coverage is 89.18919% with 40 lines in your changes missing coverage. Please review.
Project coverage is 51.78%. Comparing base (c95bb50) to head (e6f0d30). Report is 67 commits behind head on master.
Additional details and impacted files
@@ Coverage Diff @@
## master #6015 +/- ##
==========================================
+ Coverage 51.60% 51.78% +0.18%
==========================================
Files 649 649
Lines 87048 87418 +370
==========================================
+ Hits 44917 45267 +350
- Misses 39269 39287 +18
- Partials 2862 2864 +2
In 99abd2f I added resource population to the simulate API, but for some reason this struct is not being properly encoded in the response:
type PopulatedResourceArrays struct {
Accounts []basics.Address `codec:"accounts"`
Assets []basics.AssetIndex `codec:"assets"`
Apps []basics.AppIndex `codec:"apps"`
Boxes []logic.BoxRef `codec:"boxes"`
}
Boxes is encoded properly, but Accounts, Assets, and Apps are coming up nil. This behavior can be seen when running TestSimulateWithUnnamedResources with the changes in this commit. This seems to be an issue with the handler or encoder because TestPopulateResources shows simulate itself returns references (in this case addresses) as expected.
@jasonpaulos Any idea where the data is getting lost? In general, what is the best way to debug these algod API tests?
In 292a9b9 I updated the API model to avoid using map[int] but for some reason it's not encoding the response properly: msgpack decode error [pos 50]: no matching struct field found when decoding stream map with key PopulatedResourceArrays.
If I print the raw response I see this:
��last-round�txn-groups���PopulatedResourceArrays��app-budget-added���app-budget-consumed�failed-at���failure-message��transaction SSG3ROSUBRSMXPTZYOORYAXCDYURLGM5OJJF5J3LMJ74LFEWJJAA: logic eval error: invalid Account reference CV6S42NRDBJZKQDDQPUVXSD4KUJ5FSYHXENR74ZPTTIEQASD2WBRBFODRY. Details: app=1006, pc=57, opcodes=store 2; load 0; balance�txn-results���app-budget-consumed�txn-result��pool-error��txn��sig�@۔sq��3b�l3���^J�b�=�29�IұH=� �P��I�r����-e��A��}�q���?<�txn��apid��fee��fv�gh� �Gp�y8�d��"��\>-c�P�2P�݆~ݢlv���snd� ��ʏe�Lz�Y��W�[�+!���O���P�Фtype�appl�version
So for some reason PopulatedResourceArrays is present where I would expect it to be omitted (and if it was included I would expect it to be populated-resource-arrays) given the fact that it's defined as
// PreEncodedSimulateTxnResult mirrors model.SimulateTransactionResult
type PreEncodedSimulateTxnResult struct {
Txn PreEncodedTxInfo `codec:"txn-result"`
AppBudgetConsumed *uint64 `codec:"app-budget-consumed,omitempty"`
LogicSigBudgetConsumed *uint64 `codec:"logic-sig-budget-consumed,omitempty"`
TransactionTrace *model.SimulationTransactionExecTrace `codec:"exec-trace,omitempty"`
UnnamedResourcesAccessed *model.SimulateUnnamedResourcesAccessed `codec:"unnamed-resources-accessed,omitempty"`
FixedSigner *string `codec:"fixed-signer,omitempty"`
PopulatedResourceArrays *model.ResourceArrays `codec:"populated-resource-arrays,omitempty"`
}
Any ideas on what might be happening here?
It almost seems like the codec line was completely ignored for encoding, since it has the default name and omitempty was ineffective. Yet, decoding was surprised to see the capitalized form. I don't know the context of your testing - is there any chance you encoded that bytestream before the codec line was added, then decoded it after?
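To make the symptom described above concrete, here is a minimal sketch of how a struct tag changes the encoded key and whether a nil field is emitted. It uses the standard library's encoding/json purely as a stand-in; go-algorand encodes with msgpack through its own codec, which is not exercised here. The types `resourceArrays`, `tagged`, and `untagged` and the helper `encode` are hypothetical names for illustration only.

```go
package main

import "encoding/json"

// resourceArrays is a hypothetical stand-in for model.ResourceArrays.
type resourceArrays struct {
	Apps []uint64 `json:"apps,omitempty"`
}

// tagged declares the field with a name override and omitempty,
// analogous to the codec tag on PreEncodedSimulateTxnResult.
type tagged struct {
	PopulatedResourceArrays *resourceArrays `json:"populated-resource-arrays,omitempty"`
}

// untagged is the same field with no tag, which is what an encoder
// effectively sees if the tag is missing or ignored.
type untagged struct {
	PopulatedResourceArrays *resourceArrays
}

// encode marshals v to JSON and returns the result as a string.
func encode(v interface{}) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}
```

With a nil pointer, `encode(tagged{})` omits the field entirely, while `encode(untagged{})` emits the Go field name `PopulatedResourceArrays` with a null value. That is the same shape as the raw response: default name present, omitempty ineffective.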
At a glance it seems to me like you might have a bug somewhere where you're assigning *simulation.PopulatedResourceArrays to PreEncodedSimulateTxnResult.PopulatedResourceArrays instead of *model.ResourceArrays, but it's not super clear where that would be happening and I don't know the go-algorand code base well enough to say definitively.
Where are you printing the raw response?
Edit2: The error is actually in simulate. TestPopulateResources/mixed_resources is currently failing.
After having this on the backburner for a while, I've come back to working on this and discovered why I was slow to make progress once I started to implement the endpoint. I was making two mistakes:
- I was not building algod before running the e2e tests. In hindsight this seems obvious, but I was used to go test picking up the changes automatically for me. With the e2e tests the built algod is spawned as a separate task, so algod needs to be explicitly rebuilt after any change to it.
- The test cache was not being properly invalidated. Most likely because of the first problem, I was running tests and getting incorrect cached results. This led me to make changes that actually broke things while I was under the impression they were still working. This made debugging harder because I was breaking things without realizing it (see 035ef72, fixed by 41d63dd).
Now with 41d63dd all tests are passing, although I am experiencing an intermittent issue with database tables being locked when testing, which is seemingly causing a tracked app to be missing
--- FAIL: TestPopulatorWithGlobalResources (0.00s)
resources_test.go:431:
Error Trace: /Users/joe/git/algorand/go-algorand/ledger/simulation/resources_test.go:431
Error: elements differ
extra elements in list B:
([]interface {}) (len=1) {
(basics.AppIndex) 3
}
listA:
([]basics.AppIndex) (len=2) {
(basics.AppIndex) 11,
(basics.AppIndex) 5
}
listB:
([]basics.AppIndex) (len=3) {
(basics.AppIndex) 5,
(basics.AppIndex) 11,
(basics.AppIndex) 3
}
Test: TestPopulatorWithGlobalResources
time="2025-01-21T15:46:24.630756 -0500" level=warning msg="db.LoggedRetry: 5 retries (last err: database table is locked: accountbase)" file=dbutil.go function=github.com/algorand/go-algorand/util/db.LoggedRetry line=171
time="2025-01-21T15:46:24.630995 -0500" level=warning msg="db.LoggedRetry: 6 retries (last err: database table is locked: accountbase)" file=dbutil.go function=github.com/algorand/go-algorand/util/db.LoggedRetry line=171
time="2025-01-21T15:46:24.631008 -0500" level=warning msg="db.LoggedRetry: 7 retries (last err: database table is locked: accountbase)" file=dbutil.go function=github.com/algorand/go-algorand/util/db.LoggedRetry line=171
time="2025-01-21T15:46:24.631220 -0500" level=warning msg="db.LoggedRetry: 8 retries (last err: database table is locked: acctrounds)" file=dbutil.go function=github.com/algorand/go-algorand/util/db.LoggedRetry line=171
Here is a gist showing the full output with 2/10 runs failing because of the above: https://gist.github.com/joe-p/860cf28908a99db2f58c5010cb378894
I have not yet tried to reproduce on the e2e tests, but I was running them extensively last week and never saw this issue.
Once this issue is resolved the only remaining work is to make some smaller unit tests to test the "bad" cases and make sure things fail gracefully.
I believe all comments have been addressed at this point and test coverage is near 100%. The only problem is I'm still occasionally getting database table is locked when running tests locally. So far it's only happened with TestPopulatorWithGlobalResources. I tried just running this test and disabling parallel testing but I'm still seeing the same error occasionally. I believe this is just a problem with the test harness so not sure if it should be considered a blocker or not. I'd be interested to know if others can replicate.
~~As I was working on SDK support and testing for full coverage of the API I realized I didn't have a test with full coverage of the API here. I found a bug that seems to be related to extra-resource-arrays. I will write a test for proper coverage of the API and then mark as ready for review once ready~~
Fixed in 791df53. See body for details
The failing CI was from the non-deterministic behavior of the population algorithm. Turns out the table is locked message was correlated but not causal.
From the body of fd2c8dc:
The initial reason for desiring this was that the non-deterministic behavior of maps made testing difficult. When thinking about it more, I realized that having deterministic population will also improve the developer experience, since the same txn group will always get back the same populated resources. This change, however, does expose a potential problem inherent to resource population: order matters. As seen in the modified test, the deterministic order results in an extra resource in the extra resource array. In the future, steps could be taken to try to improve the efficiency, but unless we go over every permutation of resource ordering there will always be cases where algorithmic population will not result in the most efficient resource packing.
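The two points in that commit body can be sketched in isolation. The code below is not the populator from this PR; `sortedApps` and `firstFit` are hypothetical helpers. The first shows one common way to get a stable order out of a Go map (sort the keys instead of ranging over the map directly). The second is a generic first-fit packer, used here only to show that the same items can need a different number of containers depending on the order in which they are placed.

```go
package main

import "sort"

// sortedApps returns the keys of a set of app IDs in ascending order.
// Ranging over the map directly gives no guaranteed order between runs.
func sortedApps(set map[uint64]struct{}) []uint64 {
	ids := make([]uint64, 0, len(set))
	for id := range set {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	return ids
}

// firstFit places each item (given by the number of slots it needs) into
// the first container with enough free slots, opening a new container
// when none fits. It returns how many containers were used.
// It assumes every size is at most capacity.
func firstFit(sizes []int, capacity int) int {
	var free []int // remaining slots per container
	for _, s := range sizes {
		placed := false
		for i := range free {
			if free[i] >= s {
				free[i] -= s
				placed = true
				break
			}
		}
		if !placed {
			free = append(free, capacity-s)
		}
	}
	return len(free)
}
```

With a capacity of 3, `firstFit([]int{1, 1, 1, 2, 2, 2}, 3)` uses 4 containers while `firstFit([]int{2, 1, 2, 1, 2, 1}, 3)` uses 3, even though both inputs hold the same items. A fixed order makes the result reproducible, but not necessarily minimal.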
Relevant issue https://github.com/algorand/go-algorand/issues/5616
@joe-p could you fix the reviewdog and respond to the review?
~~CI is failing due to an unrelated test: TestP2PEnableGossipService_BothDisable~~
Re-triggered test run passing
All comments have been addressed. I also noticed CI failed. I need to step away for a minute but will take a closer look later today.
Also apologies for the long delays between addressing feedback, should be able to focus on this to get it across the finish line now.
Edit: Seem to have introduced a regression.
Edit2: Should be fixed in 0df1e0f. Tests are passing locally; will wait for CI.
You probably need to re-merge master and make api to fix the generated conflicts.
Some simplistic code comments, and some questions about when you add certain things. I think the existing order may be perfectly fine, despite the reservations in my early comments. But take my confusion as a request to find a good place to describe the whole strategy all at once in a comment.
I will add this in a comment but the general philosophy is to go in order of resources with the most restrictions to least. This is why the order is
- Txn-specific resources
- Cross-ref resources (because they need two slots and we don't want single slot resources to potentially use up one of the available two-slot transactions)
- Boxes, because they may require an app ref in addition to the box ref
- Accounts because they have a lower limit than other resources
- Standalone resources
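The ordering above can be sketched as a stable sort on a restriction rank. This is only an illustration of the stated philosophy, not the code in ledger/simulation/resources.go; the `kind` constants, the `resource` type, and `placementOrder` are hypothetical names.

```go
package main

import "sort"

// kind ranks a resource by how constrained its placement is.
// Lower values are placed first. All names here are hypothetical.
type kind int

const (
	txnSpecific kind = iota // must go in one particular transaction
	crossRef                // needs two slots in the same transaction
	box                     // may need an app ref in addition to the box ref
	account                 // accounts have a lower limit than other resources
	standalone              // can go wherever a slot is free
)

// resource is a simplified placeholder for something to be populated.
type resource struct {
	Kind kind
	Name string
}

// placementOrder returns a copy of rs sorted from most to least restricted.
// The sort is stable, so resources of the same kind keep their input order.
func placementOrder(rs []resource) []resource {
	out := append([]resource(nil), rs...)
	sort.SliceStable(out, func(i, j int) bool { return out[i].Kind < out[j].Kind })
	return out
}
```

Placing the most restricted resources first means a single-slot resource never takes the last free slot in a transaction that a two-slot cross-ref could still have used.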
I became worried that you are not accounting for old AVM version apps, which can't use resource sharing. But I think maybe you're OK, because the response from simulate has taken that into account and places the necessary refs in the specific transactions if needed?
Yeah exactly. The logic for transaction-specific references exists upstream in simulate. By the time we reach the resource populator we know txn-specific resources and can handle those first
We've gotten so close to this being ready, but I wonder if the work in #6286 means we should hold off on this. The access list will dramatically simplify resource population so not entirely sure if it's worth merging in this fairly complex algorithm when we know we have a more elegant solution around the corner. Any thoughts?
I'd say it depends entirely on how soon the access PR gets merged.
My expectation is that there will be a go-algorand release in the next week or so that does not have #6286 (and is not a consensus release) and then there will be a consensus release that has tx.Access next, maybe at the end of July.
I'm not sure what the best recommendation is for this code. I could imagine it staying, to support pre-sharing apps, or the very rare cases where the "cross-product" nature of foreign arrays allows a transaction to access more than tx.Access. But I'm not sure the complexity of supporting both (and choosing between them intelligently?) forever is worth it.
Supporting resource population just for App post txn.Access could be an additional incentive to steer the ecosystem (and tools) towards this new resource declaration method.
> I'm not sure the complexity of supporting both (and choosing between them intelligently?) forever is worth it.
Yeah I think this is the main thing to consider.
> to support pre-sharing apps
Population for pre-sharing apps is already trivial because the existing simulate response already gives you unnamed resources per txn.
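As a rough sketch of why that case is trivial: each transaction's own unnamed resources are simply appended to that same transaction's foreign arrays, with no cross-transaction placement decisions. The types `unnamed` and `txnRefs` and the function `populatePerTxn` are hypothetical simplifications, not the go-algorand or SDK types, and boxes are left out for brevity.

```go
package main

// unnamed is a hypothetical, simplified view of the unnamed resources
// that simulate reports for a single transaction.
type unnamed struct {
	Accounts []string
	Apps     []uint64
	Assets   []uint64
}

// txnRefs is a hypothetical, simplified view of an app call's foreign arrays.
type txnRefs struct {
	Accounts []string
	Apps     []uint64
	Assets   []uint64
}

// populatePerTxn returns copies of txns where transaction i has the
// resources in accessed[i] appended to its own arrays. Transactions
// without a matching entry in accessed are copied unchanged.
func populatePerTxn(txns []txnRefs, accessed []unnamed) []txnRefs {
	out := make([]txnRefs, len(txns))
	for i, t := range txns {
		var u unnamed
		if i < len(accessed) {
			u = accessed[i]
		}
		out[i] = txnRefs{
			Accounts: append(append([]string(nil), t.Accounts...), u.Accounts...),
			Apps:     append(append([]uint64(nil), t.Apps...), u.Apps...),
			Assets:   append(append([]uint64(nil), t.Assets...), u.Assets...),
		}
	}
	return out
}
```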
Once we have the access list there will basically be no actual use-case for this code, so I don't think it makes sense to merge it in. As such, closing this PR. Will re-evaluate whether we want this functionality in algod for access list and make another PR if so.