Messagebox does not list 3.0 app instances
Describe the bug
Messagebox sometimes does not list 3.0 app instances when navigating from the app frontend to the inbox, or on first login. When the page is refreshed, the 3.0 instances are listed again.
To Reproduce
Steps to reproduce the behavior: This issue is very hard to reproduce; it occurs sporadically in different cases.
Expected behavior
The 3.0 instances should be visible in the messagebox at all times without requiring a refresh.
Cause/Resolution
There are two possible causes of this issue. If the instance has just been instantiated, the cache might still hold a value indicating that the user doesn't have any 3.0 instances. Another cause could be that the response time from Storage is too long, causing the messagebox to load without the 3.0 instances. By the time the page is refreshed, the response has been cached, so the instances are returned within the time limit.
There is not much we can do in these cases except optimize our APIs to lower the response time.
Analysis
TT02: All slow requests in TT02 are related to the time it takes to retrieve the role list from SBL.
PR: Two cases cause slow responses (see the query sketch after the list):
- 30 %: authorization and retrieving roles take a long time. The database call itself does not take time; there seems to be a delay in the communication with the VIP. Could this also be because the response is async?
- 40 %: requests to get documents from Cosmos take time. The delay is caused by ClientSideMetrics.PartitionSchedulingTimeSpans.Item.TurnaroundTime and ClientSideMetrics.PartitionSchedulingTimeSpans.Item.ResponseTime. The assumption is that this is caused by the thread being busy and the call to get the documents from Cosmos being async.
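As a starting point for separating the two cases, a dependency breakdown in Application Insights could show which downstream calls dominate; a minimal sketch, assuming dependency telemetry is collected under the default table and column names:
dependencies
| where timestamp > ago(7d) and duration > 1000 // 1000 ms threshold is an assumption; adjust to match the slow-request definition
| summarize Count=count(), AvgDuration=avg(duration) by name, type
| order by AvgDuration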
Additional info
Env: AT22, Browser: Chrome 83, Time: 07/07/2020 08:55 AM CET
The first step is to create a panel to get an overview of slow responses on the platform APIs.
Check with @alt-how
Example of a query in Application Insights:
requests
| where duration > 2000 and name contains "messagebox"
| project Time=timestamp, StatusCode=resultCode, Name=name, Duration=duration, Id=id
| order by Duration
Initial investigation shows that this happens several times a day, also in production. Our hypothesis that the system is slow on startup after a period of inactivity does not seem to explain the delayed requests.
TT02 dashboard updated with a new tile. Please provide feedback on whether this is enough to monitor the slow requests.
@acn-sbuad FYI: Changed the time interval for the tile to reflect the time interval on the dashboard
@alt-how I set it to seven days to get a better understanding of the pattern. If it is on a day-by-day basis, will we find the patterns we are looking for?
Added tiles to the TT02 and prod dashboards. Moving back to the backlog. Further analysis should be done to identify which parts of the requests are slow and whether there is any pattern in when these slow responses occur.
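To look for such a pattern, the earlier query can be extended to bin the slow requests by hour; a sketch (the 2000 ms threshold is carried over from the example above):
requests
| where timestamp > ago(30d) and duration > 2000 and name contains "messagebox"
| summarize SlowRequests=count() by bin(timestamp, 1h)
| render timechart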
One hypothesis is that setting the app title is time consuming, as we connect to Cosmos for each instance to retrieve the title. PR #4813 gets the app title from the cached text resource. Let's review whether performance improves when this is deployed.
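One way to review the effect once the PR is deployed could be to plot the daily duration trend and compare the periods before and after the deployment date; a sketch:
requests
| where timestamp > ago(60d) and name contains "messagebox"
| summarize AvgDuration=avg(duration), P95Duration=percentile(duration, 95) by bin(timestamp, 1d)
| render timechart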
Focus should be on either the general performance issue or on how many instances we should be able to retrieve in the given time frame. What are the performance requirements for retrieving messagebox instances? Check with the load testing team.
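As input to that requirements discussion, the current duration distribution could serve as a baseline; a sketch:
requests
| where timestamp > ago(7d) and name contains "messagebox"
| summarize percentiles(duration, 50, 95, 99), MaxDuration=max(duration)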
@TheTechArch We could start logging some query metrics to figure out what is taking time within the Cosmos DB query. We would have to log it at a high enough level for the log to be forwarded to Application Insights.
Extended query metrics are enabled in prod, but won't be deployed until 20 Oct. Further analysis should be done once the metrics are registered.
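Once the metrics are registered, they should be queryable from the forwarded logs; a sketch, assuming the extended query metrics are logged as traces containing the text "QueryMetrics" (the exact log message format is an assumption):
traces
| where timestamp > ago(7d) and message contains "QueryMetrics" // marker string is an assumption
| project timestamp, message, operation_Id
| order by timestamp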
The majority of slow requests in production are due to authorization taking a long time. However, for some requests we see that the execution time in Cosmos DB is short, but the wait time is quite long. E.g.:
"QueryEngineTimes":{
"IndexLookupTime":"00:00:00.0000900",
"DocumentLoadTime":"00:00:00.0003300",
"WriteOutputTime":"00:00:00.0000200",
"RuntimeExecutionTimes":{
"SystemFunctionExecutionTime":"00:00:00",
"UserDefinedFunctionExecutionTime":"00:00:00",
"TotalTime":"00:00:00.0001800"
}
},
"Retries":0,
"ClientSideMetrics":{
"Retries":0,
"RequestCharge":3.9,
"FetchExecutionRanges":
[
{"PartitionId":"PKId([\"58537830\"])",
"ActivityId":"858dd1a4-9675-4c8b-a1da-eebdbb7fd387",
"StartTime":"2020-10-22T08:29:17.386825Z",
"EndTime":"2020-10-22T08:29:17.3933807Z",
"NumberOfDocuments":18,
"RetryCount":0}
],
"PartitionSchedulingTimeSpans":
[
{"Item1":"PKId([\"58537830\"])",
"Item2":{
"NumPreemptions":0,
"TurnaroundTime":"00:00:02.5975632",
"ResponseTime":"00:00:02.5975631",
"RunTime":"00:00:00",
"WaitTime":"00:00:02.5975633"
}}]
},
@alt-how could we check the correlation between CPU use and slow responses? Ref: https://stackoverflow.com/questions/58665671/high-scheduling-metrics-wait-time-in-cosmos-db
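A sketch of how that correlation could be checked in Application Insights, assuming the default performance counters are collected; it lines up slow messagebox requests with average processor time per node in the same five-minute bin:
let slow = requests
| where duration > 2000 and name contains "messagebox"
| summarize SlowCount=count() by bin(timestamp, 5m), cloud_RoleInstance;
performanceCounters
| where name == "% Processor Time" // counter name is an assumption; verify against the collected counters
| summarize AvgCpu=avg(value) by bin(timestamp, 5m), cloud_RoleInstance
| join kind=inner slow on timestamp, cloud_RoleInstance
| project timestamp, cloud_RoleInstance, AvgCpu, SlowCount
| order by SlowCount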
Completing the analysis for now. The next step will be to follow up with the infrastructure team:
https://tfs.ai-dev.brreg.no/Altinn/Altinn/_workitems/edit/43009
Confirmed that the issue occurs on multiple nodes in the node pool, not just a single one.
Moving to the icebox as we are awaiting performance tests.
16 occurrences in the past 30 days. The affected party varies. Only one case where a second attempt at loading the messagebox still did not load the 3.0 elements, though this could also be because the end user did not attempt a reload.
Closing issue due to age. We have issues in progress that deal with improving the performance of the messagebox APIs in Platform Storage.