simulate: resource population

Open joe-p opened this issue 1 year ago • 5 comments

Summary

When a user calls simulate with UnnamedResources enabled, simulate should suggest to the user how they can populate the resource arrays in their transactions to properly send the transaction group to the network.

Test Plan

  • [x] Test ResourcePopulator works with simple local (not group sharing) resources
  • [x] Test ResourcePopulator with group sharing
  • [x] Test ResourcePopulator resource limit detection with group sharing (i.e., it is able to find the correct transaction to put a resource in)
  • [x] Test Simulate with ResourcePopulator functionality
  • [ ] Test /simulate endpoint with ResourcePopulator functionality
  • [ ] Write smaller tests for better ledger/simulation/resources.go coverage

joe-p avatar Jun 05 '24 22:06 joe-p

Codecov Report

Attention: Patch coverage is 89.18919% with 40 lines in your changes missing coverage. Please review.

Project coverage is 51.78%. Comparing base (c95bb50) to head (e6f0d30). Report is 67 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| daemon/algod/api/server/v2/utils.go | 0.00% | 26 Missing |
| ledger/simulation/resources.go | 96.98% | 6 Missing and 4 partials |
| ledger/simulation/simulator.go | 66.66% | 2 Missing and 2 partials |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6015      +/-   ##
==========================================
+ Coverage   51.60%   51.78%   +0.18%     
==========================================
  Files         649      649              
  Lines       87048    87418     +370     
==========================================
+ Hits        44917    45267     +350     
- Misses      39269    39287      +18     
- Partials     2862     2864       +2     

codecov[bot] avatar Jun 06 '24 11:06 codecov[bot]

In 99abd2f I added resource population to the simulate API, but for some reason this struct is not being properly encoded in the response:

type PopulatedResourceArrays struct {
	Accounts []basics.Address    `codec:"accounts"`
	Assets   []basics.AssetIndex `codec:"assets"`
	Apps     []basics.AppIndex   `codec:"apps"`
	Boxes    []logic.BoxRef      `codec:"boxes"`
}

Boxes is encoded properly, but Accounts, Assets, and Apps are coming up nil. This behavior can be seen when running TestSimulateWithUnnamedResources with the changes in this commit. This seems to be an issue with the handler or encoder because TestPopulateResources shows simulate itself returns references (in this case addresses) as expected.

@jasonpaulos Any idea where the data is getting lost? In general, what is the best way to debug these algod API tests?

joe-p avatar Jul 24 '24 20:07 joe-p

In 292a9b9 I updated the API model to avoid using map[int] but for some reason it's not encoding the response properly: msgpack decode error [pos 50]: no matching struct field found when decoding stream map with key PopulatedResourceArrays.

If I print the raw response I see this:

��last-round�txn-groups���PopulatedResourceArrays��app-budget-added���app-budget-consumed�failed-at���failure-message��transaction SSG3ROSUBRSMXPTZYOORYAXCDYURLGM5OJJF5J3LMJ74LFEWJJAA: logic eval error: invalid Account reference CV6S42NRDBJZKQDDQPUVXSD4KUJ5FSYHXENR74ZPTTIEQASD2WBRBFODRY. Details: app=1006, pc=57, opcodes=store 2; load 0; balance�txn-results���app-budget-consumed�txn-result��pool-error��txn��sig�@۔sq��3b�l3���^J�b�=�29�IұH=� �P��I�r����-e��A��}�q���?<�txn��apid��fee��fv�gh� �Gp�y8�d��"��\>-c�P�2P�݆~ݢlv���snd� ��ʏe�Lz�Y��W�[�+!���O���P�Фtype�appl�version

So for some reason PopulatedResourceArrays is present where I would expect it to be omitted (and if it were included, I would expect it to be named populated-resource-arrays), given that it is defined as:

// PreEncodedSimulateTxnResult mirrors model.SimulateTransactionResult
type PreEncodedSimulateTxnResult struct {
	Txn                      PreEncodedTxInfo                        `codec:"txn-result"`
	AppBudgetConsumed        *uint64                                 `codec:"app-budget-consumed,omitempty"`
	LogicSigBudgetConsumed   *uint64                                 `codec:"logic-sig-budget-consumed,omitempty"`
	TransactionTrace         *model.SimulationTransactionExecTrace   `codec:"exec-trace,omitempty"`
	UnnamedResourcesAccessed *model.SimulateUnnamedResourcesAccessed `codec:"unnamed-resources-accessed,omitempty"`
	FixedSigner              *string                                 `codec:"fixed-signer,omitempty"`
	PopulatedResourceArrays  *model.ResourceArrays                   `codec:"populated-resource-arrays,omitempty"`
}

Any ideas on what might be happening here?

joe-p avatar Oct 09 '24 13:10 joe-p

It almost seems like the codec line was completely ignored for encoding, since it has the default name and omitempty was ineffective. Yet, decoding was surprised to see the capitalized form. I don't know the context of your testing - is there any chance you encoded that bytestream before the codec line was added, then decoded it after?
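To illustrate the symptom described above, here is a minimal sketch that uses the standard library's encoding/json as a stand-in for the msgpack codec (an assumption made only so the example is self-contained; the real code path uses codec tags). The types resourceArrays, tagged, and untagged are hypothetical. When the tag is honored, a nil pointer with omitempty is dropped; when the tag is absent or ignored, the Go field name is emitted and the nil value is still written.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// resourceArrays is a hypothetical stand-in for model.ResourceArrays.
type resourceArrays struct {
	Apps []uint64 `json:"apps,omitempty"`
}

// tagged carries the struct tag that the encoder is expected to honor.
type tagged struct {
	PopulatedResourceArrays *resourceArrays `json:"populated-resource-arrays,omitempty"`
}

// untagged models what an encoder sees when the tag is missing or ignored.
type untagged struct {
	PopulatedResourceArrays *resourceArrays
}

// encode marshals v and returns the encoded bytes as a string.
func encode(v interface{}) string {
	b, err := json.Marshal(v)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	fmt.Println(encode(tagged{}))   // {}
	fmt.Println(encode(untagged{})) // {"PopulatedResourceArrays":null}
}
```

If the bytes on the wire look like the second case, the encoder most likely never saw the tagged definition, which would be consistent with a stale binary or a different type being encoded.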

jannotti avatar Oct 09 '24 14:10 jannotti

At a glance it seems to me like you might have a bug somewhere where you're assigning *simulation.PopulatedResourceArrays to PreEncodedSimulateTxnResult.PopulatedResourceArrays instead of *model.ResourceArrays, but it's not clear where that would be happening and I don't know the go-algorand code base well enough to say definitively.

Where are you printing the raw response?

kylebeee avatar Oct 09 '24 16:10 kylebeee

Edit2: The error is actually in simulate. TestPopulateResources/mixed_resources is currently failing.

joe-p avatar Jan 17 '25 12:01 joe-p

After having this on the backburner for a while, I've come back to working on this and discovered why I was slow to make progress once I started to implement the endpoint. I was making two mistakes:

  1. I was not building algod before running the e2e tests. In hindsight this seems obvious, but I was used to go test picking up the changes automatically for me. With the e2e tests the built algod is spawned as a separate task, so algod needs to be explicitly rebuilt after any change.

  2. The test cache was not being properly invalidated. Most likely because of the first problem, I was running tests and getting incorrect cached results. This led to me making changes that actually broke things while I was under the impression they were still working. This made debugging breaking changes harder because I was breaking things without realizing it (see 035ef72, fixed by 41d63dd).

Now with 41d63dd all tests are passing, although I am experiencing an intermittent issue with database tables being locked when testing, which is seemingly causing a tracked app to be missing

--- FAIL: TestPopulatorWithGlobalResources (0.00s)
    resources_test.go:431: 
                Error Trace:    /Users/joe/git/algorand/go-algorand/ledger/simulation/resources_test.go:431
                Error:          elements differ
                            
                                extra elements in list B:
                                ([]interface {}) (len=1) {
                                 (basics.AppIndex) 3
                                }
                            
                            
                                listA:
                                ([]basics.AppIndex) (len=2) {
                                 (basics.AppIndex) 11,
                                 (basics.AppIndex) 5
                                }
                            
                            
                                listB:
                                ([]basics.AppIndex) (len=3) {
                                 (basics.AppIndex) 5,
                                 (basics.AppIndex) 11,
                                 (basics.AppIndex) 3
                                }
                Test:           TestPopulatorWithGlobalResources
time="2025-01-21T15:46:24.630756 -0500" level=warning msg="db.LoggedRetry: 5 retries (last err: database table is locked: accountbase)" file=dbutil.go function=github.com/algorand/go-algorand/util/db.LoggedRetry line=171
time="2025-01-21T15:46:24.630995 -0500" level=warning msg="db.LoggedRetry: 6 retries (last err: database table is locked: accountbase)" file=dbutil.go function=github.com/algorand/go-algorand/util/db.LoggedRetry line=171
time="2025-01-21T15:46:24.631008 -0500" level=warning msg="db.LoggedRetry: 7 retries (last err: database table is locked: accountbase)" file=dbutil.go function=github.com/algorand/go-algorand/util/db.LoggedRetry line=171
time="2025-01-21T15:46:24.631220 -0500" level=warning msg="db.LoggedRetry: 8 retries (last err: database table is locked: acctrounds)" file=dbutil.go function=github.com/algorand/go-algorand/util/db.LoggedRetry line=171

Here is a gist showing the full output with 2/10 runs failing because of the above: https://gist.github.com/joe-p/860cf28908a99db2f58c5010cb378894

I have not yet tried to reproduce on the e2e tests, but I was running them extensively last week and never saw this issue.

Once this issue is resolved the only remaining work is to make some smaller unit tests to test the "bad" cases and make sure things fail gracefully.

joe-p avatar Jan 21 '25 20:01 joe-p

I believe all comments have been addressed at this point and test coverage is near 100%. The only problem is I'm still occasionally getting database table is locked when running tests locally. So far it's only happened with TestPopulatorWithGlobalResources. I tried just running this test and disabling parallel testing but I'm still seeing the same error occasionally. I believe this is just a problem with the test harness so not sure if it should be considered a blocker or not. I'd be interested to know if others can replicate.

joe-p avatar Jan 31 '25 12:01 joe-p

~~As I was working on SDK support and testing for full coverage of the API I realized I didn't have a test with full coverage of the API here. I found a bug that seems to be related to extra-resource-arrays. I will write a test for proper coverage of the API and then mark as ready for review once ready~~

Fixed in 791df53. See body for details

joe-p avatar Feb 19 '25 18:02 joe-p

The failing CI was caused by the non-deterministic behavior of the population algorithm. It turns out the "table is locked" message was correlated with the failure but did not cause it.

From the body of fd2c8dc:

The initial reason for desiring this was that the non-deterministic behavior of maps made testing difficult. When thinking about it more, I realized that having deterministic population will also improve the developer experience, since the same txn group will always get back the same populated resources. This change, however, does expose a potential problem inherent to resource population: order matters. As seen in the modified test, the deterministic order results in an extra resource in the extra resource array. In the future, steps could be taken to try to improve the efficiency, but unless we go over every permutation of resource ordering there will always be cases where algorithmic population will not result in the most efficient resource packing.
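As a minimal sketch of the general technique (not the code from fd2c8dc), Go map iteration order is unspecified, so a common way to make output deterministic is to collect the keys and sort them before iterating. The function name sortedAppIDs and the use of plain uint64 in place of basics.AppIndex are assumptions for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// sortedAppIDs returns the keys of the set in ascending order so that
// callers always visit them in the same sequence, regardless of how the
// Go runtime happens to order the map.
func sortedAppIDs(apps map[uint64]struct{}) []uint64 {
	ids := make([]uint64, 0, len(apps))
	for id := range apps {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	return ids
}

func main() {
	// Hypothetical set of app IDs tracked during simulation.
	apps := map[uint64]struct{}{11: {}, 5: {}, 3: {}}
	fmt.Println(sortedAppIDs(apps)) // [3 5 11]
}
```

With a fixed order, a test can compare exact slices instead of relying on order-insensitive matching, at the cost that the chosen order may not give the tightest packing.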

joe-p avatar Feb 20 '25 14:02 joe-p

Relevant issue https://github.com/algorand/go-algorand/issues/5616

algorandskiy avatar Feb 26 '25 17:02 algorandskiy

@joe-p could you fix the reviewdog and respond to the review?

algorandskiy avatar Mar 08 '25 19:03 algorandskiy

~~CI is failing due to an unrelated test: TestP2PEnableGossipService_BothDisable~~

Re-triggered test run passing

joe-p avatar Mar 08 '25 22:03 joe-p

All comments have been addressed. I also noticed CI failed. I need to step away for a minute but will take a closer look later today.

Also apologies for the long delays between addressing feedback, should be able to focus on this to get it across the finish line now.

Edit: Seem to have introduced a regression.

Edit2: Should be fixed in 0df1e0f. Tests passing locally, will wait for CI.

joe-p avatar May 05 '25 18:05 joe-p

You probably need to re-merge master and make api to fix the generated conflicts.

jannotti avatar May 05 '25 23:05 jannotti

> Some simplistic code comments, and some questions about when you add certain things. I think the existing order may be perfectly fine, despite the reservations in my earlier comments. But take my confusion as a request to find a good place to describe the whole strategy all at once in a comment.

I will add this in a comment but the general philosophy is to go in order of resources with the most restrictions to least. This is why the order is

  1. Txn-specific resources
  2. Cross-ref resources (because they need two slots and we don't want single slot resources to potentially use up one of the available two-slot transactions)
  3. Boxes, because they may require an app ref in addition to the box ref
  4. Accounts because they have a lower limit than other resources
  5. Standalone resources
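The ordering above can be sketched as a stable sort on a restriction rank. This is only an illustration of the "most restricted first" idea, not the populator in ledger/simulation/resources.go; the names resourceKind, resource, and placementOrder are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// resourceKind ranks resources from most restricted (lowest value) to least.
type resourceKind int

const (
	txnSpecific resourceKind = iota // must be placed in one particular transaction
	crossRef                        // needs two slots in the same transaction
	box                             // may need an app ref in addition to the box ref
	account                         // accounts have a lower limit than other resources
	standalone                      // fits in any transaction with a free slot
)

// resource is a hypothetical record of something that needs a slot.
type resource struct {
	name string
	kind resourceKind
}

// placementOrder returns a copy of rs sorted so that the most restricted
// resources are placed first. The sort is stable, so resources of the same
// kind keep their input order, which keeps the result deterministic.
func placementOrder(rs []resource) []resource {
	ordered := append([]resource(nil), rs...)
	sort.SliceStable(ordered, func(i, j int) bool {
		return ordered[i].kind < ordered[j].kind
	})
	return ordered
}

func main() {
	rs := []resource{
		{"asset 7", standalone},
		{"box b1", box},
		{"holding (acct A, asset 7)", crossRef},
		{"acct A", account},
		{"app 3 for txn 0", txnSpecific},
	}
	for _, r := range placementOrder(rs) {
		fmt.Println(r.name)
	}
}
```

Placing the constrained items first leaves the flexible ones to fill whatever slots remain, which is the usual rationale for this kind of greedy ordering.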

> I became worried that you are not accounting for old AVM version apps, which can't use resource sharing. But I think maybe you're ok, because the response from simulate has taken that into account and places the necessary refs in the specific transactions if needed?

Yeah exactly. The logic for transaction-specific references exists upstream in simulate. By the time we reach the resource populator we know txn-specific resources and can handle those first

joe-p avatar May 06 '25 22:05 joe-p

We've gotten so close to this being ready, but I wonder if the work in #6286 means we should hold off on this. The access list will dramatically simplify resource population so not entirely sure if it's worth merging in this fairly complex algorithm when we know we have a more elegant solution around the corner. Any thoughts?

joe-p avatar Jun 30 '25 21:06 joe-p

I'd say it depends entirely on how soon the access PR gets merged.

kylebeee avatar Jun 30 '25 22:06 kylebeee

My expectation is that there will be a go-algorand release in the next week or so that does not have #6286 (and is not a consensus release) and then there will be a consensus release that has tx.Access next, maybe at the end of July.

I'm not sure what the best recommendation is for this code. I could imagine it staying, to support pre-sharing apps, or the very rare cases where the "cross-product" nature of foreign arrays allows a transaction to access more than tx.Access. But I'm not sure the complexity of supporting both (and choosing between them intelligently?) forever is worth it.

jannotti avatar Jul 01 '25 13:07 jannotti

Supporting resource population only for apps post-txn.Access could be an additional incentive to steer the ecosystem (and tools) towards this new resource declaration method.

cusma avatar Jul 01 '25 14:07 cusma

> I'm not sure the complexity of supporting both (and choosing between them intelligently?) forever is worth it.

Yeah I think this is the main thing to consider.

> to support pre-sharing apps

Population for pre-sharing apps is already trivial because the existing simulate response already gives you unnamed resources per txn.

Once we have the access list there will basically be no actual use-case for this code, so I don't think it makes sense to merge it in. As such, closing this PR. Will re-evaluate whether we want this functionality in algod for access list and make another PR if so.

joe-p avatar Jul 01 '25 20:07 joe-p