altinn-studio icon indicating copy to clipboard operation
altinn-studio copied to clipboard

Messagebox does not list 3.0 app instances

Open jeevananthank opened this issue 5 years ago • 18 comments

Describe the bug

Messagebox does not list 3.0 app instances at some times when navigating from app frontend to inbox, or first time login. When thr page is refreshed, the 3.0 instances are listed again.

To Reproduce

Steps to reproduce the behavior: Very hard to reproduce this issue, occurs sporadically at different cases.

Expected behavior

The 3.0 instances should be visible in the messagebox all times without requiring a refresh.

Cause/Resolution

There are two possible causes of this issue. If the instance has just been instantiated the cache still might hold a value indicating that the user doesn't have any T3.0 instances. Another cause would be that the response time from storage is too long, causing the messagebox to load without the t3.0 instances. During refresh this response is cached and returning the instances will satisfy the time requirements.

Nothing much we can do in these cases except optimizing our APIs to lower response time.

Analysis

TT02 All slow requests in TT02 are related to the time it takes to retrieve the role list from SBL.

PR Two cases causing slow responces.

  • 30 % authorization and retrieving roles take a long time. The database call itself does not take time. Seems like delay in the communication with the vip. Could this also be because the response i async? image.png

  • 40 % request to get documents from cosmos take time. Delay caused by ClientSideMetrics.PartitionSchedulingTimeSpans.Item.TurnaroundTime & ClientSideMetrics.PartitionSchedulingTimeSpans.Item.ResponseTime. Assumption is that this is caused by the thread being busy and it being an async call to get the documents from cosmos.

Screenshots

inbox reappear

Additional info

Env: At22 Browser: Chrome 83 Time: 07/07/2020 08:55 AM CET

jeevananthank avatar Jul 07 '20 06:07 jeevananthank

First step is to create a panel to get overview of slow responses on platform APIS.

TheTechArch avatar Sep 14 '20 10:09 TheTechArch

Check with @alt-how

TheTechArch avatar Sep 14 '20 10:09 TheTechArch

Example of query in application insights:

requests
| where duration > 2000 and name contains "messagebox"
| project Time=timestamp, StatusCode=resultCode, Name=name, Duration=duration, Id=id
| order by Duration

ghost avatar Sep 14 '20 10:09 ghost

Initial investigation shows that this happens several times a day, also in production. Our hypothesis that the system has been unused for a while and is slow upon startup does not seem to be the cause of the delayed requests.

acn-sbuad avatar Sep 15 '20 10:09 acn-sbuad

TT02 dashboard updated with new tile. Please provide feedback if this is enough to monitor the slow requests.

acn-sbuad avatar Sep 16 '20 06:09 acn-sbuad

@acn-sbuad FYI: Changed the time interval for the tile to reflect the time interval on the dashboard

ghost avatar Sep 16 '20 06:09 ghost

@alt-how I set it to seven days to better get an understanding of the pattern. If it is on a day by day basis will we find the patterns we are looking for?

acn-sbuad avatar Sep 16 '20 06:09 acn-sbuad

Added tiles to tt02 and prod dashboard. Moving back to backlog. Further analysis should be done to identify which part of the requests are slow and if there is any pattern in when these slow responses occur.

acn-sbuad avatar Sep 17 '20 07:09 acn-sbuad

One hypothesis is that setting the app title is time consuming as we connect to cosmos for each instance to retrieve the title. Pr #4813 gets the app title for the cached text resource. lets review if perfomance improves when this is deployed

acn-sbuad avatar Sep 18 '20 08:09 acn-sbuad

Focus should be either a general performance issue or how many instances we should be able to retrieve in the given time frame. What are the requirements for the performance on retrieveing messagebox instances? Check with load testing team.

acn-sbuad avatar Sep 28 '20 10:09 acn-sbuad

@TheTechArch We could start logging some query metrics to figure out what is taking time within the cosmosDB query. We would have to log it at a high enough level for the log to be forwarded to app insights.

acn-sbuad avatar Oct 07 '20 08:10 acn-sbuad

Extended query metrics enabled in prod, but won't be deployed until 20. oct.. Further analysis should be done once the metrics are registered

acn-sbuad avatar Oct 08 '20 14:10 acn-sbuad

The majority of slow requests in producion are due to authorization taking long time. However, for some requests we see that the execution time in cosmos db is short, but that the wait time is quite long. E.g.

"QueryEngineTimes":{
	"IndexLookupTime":"00:00:00.0000900",
	"DocumentLoadTime":"00:00:00.0003300",
	"WriteOutputTime":"00:00:00.0000200",
	"RuntimeExecutionTimes":{
		"SystemFunctionExecutionTime":"00:00:00",
		"UserDefinedFunctionExecutionTime":"00:00:00",
		"TotalTime":"00:00:00.0001800"
	}​
},
"Retries":0,
"ClientSideMetrics":{
	"Retries":0,
	"RequestCharge":3.9,
	"FetchExecutionRanges":
		[
		{"PartitionId":"PKId([\"58537830\"])",
		"ActivityId":"858dd1a4-9675-4c8b-a1da-eebdbb7fd387",
		"StartTime":"2020-10-22T08:29:17.386825Z",
		"EndTime":"2020-10-22T08:29:17.3933807Z",
		"NumberOfDocuments":18,
		"RetryCount":0}
	],
	"PartitionSchedulingTimeSpans":
		[
		{"Item1":"PKId([\"58537830\"])",
		"Item2":{
			"NumPreemptions":0,
			"TurnaroundTime":"00:00:02.5975632",
			"ResponseTime":"00:00:02.5975631",
			"RunTime":"00:00:00",
			"WaitTime":"00:00:02.5975633"
		}​}]
},

@alt-how could we check the correlation between cpu use and slow reponses. ref https://stackoverflow.com/questions/58665671/high-scheduling-metrics-wait-time-in-cosmos-db

acn-sbuad avatar Oct 22 '20 12:10 acn-sbuad

Completing analysis for now. Next step will be to follow up with infrastructure team

acn-sbuad avatar Oct 23 '20 08:10 acn-sbuad

https://tfs.ai-dev.brreg.no/Altinn/Altinn/_workitems/edit/43009

annerisbakk avatar Oct 26 '20 11:10 annerisbakk

Confirmed that issue occurs on multiple of the nodes in the node pool, not a single one.

acn-sbuad avatar Oct 27 '20 12:10 acn-sbuad

Moving to ice box as we are awaiting performance tests

acn-sbuad avatar Oct 28 '20 08:10 acn-sbuad

16 occurences past 30 days. Party affected varies. Only one case of second try of loading messagebox not loading 3.0 elements. This could also be because it wasn't attempted by the end user.

acn-sbuad avatar Sep 17 '21 12:09 acn-sbuad

Closing issue due to age. We have issues in progression dealing with bettering the performance of the messagebox apis in Platform Storage.

acn-sbuad avatar Dec 13 '22 14:12 acn-sbuad