add timeout arg
My internet router is very buggy at the moment and the connection jumps very frequently. It's very unpleasant, but I practically live on one of our stations, which allows me to pick up bugs that can happen in the field. For example, I had relatively variable prediction times without much reason, so I looked at the details and the main thing that changed was the call to the api due to the bad connection:
2024-06-06 17:06:27,504 | INFO: Camera '169.254.40.2' - No wildfire (confidence: 3.18%)
2024-06-06 17:06:27,504 | INFO: Heartbeat time: 4.098053 seconds
2024-06-06 17:06:27,504 | INFO: Resize time: 0.137569 seconds
2024-06-06 17:06:27,505 | INFO: Inference time: 2.505296 seconds
2024-06-06 17:06:27,505 | INFO: Update states time: 0.000756 seconds
2024-06-06 17:06:27,505 | INFO: Alert time: 0.000002 seconds
2024-06-06 17:06:27,505 | INFO: Total prediction time: 6.742037 seconds
2024-06-06 17:06:27,505 | INFO: Time taken to analyze stream from camera 169.254.40.2: 6.74 seconds
169.254.40.1 (3840, 2160)
2024-06-06 17:06:31,112 | INFO: Camera '169.254.40.1' - No wildfire (confidence: 27.97%)
2024-06-06 17:06:31,113 | INFO: Heartbeat time: 0.223382 seconds
2024-06-06 17:06:31,113 | INFO: Resize time: 0.112550 seconds
2024-06-06 17:06:31,114 | INFO: Inference time: 3.079912 seconds
2024-06-06 17:06:31,114 | INFO: Update states time: 0.001111 seconds
2024-06-06 17:06:31,114 | INFO: Alert time: 0.000003 seconds
2024-06-06 17:06:31,114 | INFO: Total prediction time: 3.417658 seconds
2024-06-06 17:06:31,114 | INFO: Time taken to analyze stream from camera 169.254.40.1: 3.42 seconds
I propose this PR to be able to change the heartbeat timeout by passing an argument. It's not really a problem if the heartbeat fails from time to time, but it mustn't block our loop. I'll probably even change the way I call it in engine
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 88.02%. Comparing base (
3fa4f3f) to head (0973fc1). Report is 35 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #330 +/- ##
==========================================
+ Coverage 88.00% 88.02% +0.01%
==========================================
Files 28 28
Lines 642 643 +1
==========================================
+ Hits 565 566 +1
Misses 77 77
| Flag | Coverage Δ | |
|---|---|---|
| backend | 87.47% <ø> (ø) |
|
| client | 95.45% <100.00%> (+0.10%) |
:arrow_up: |
Flags with carried forward coverage won't be shown. Click here to find out more.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
why don't you initialize the Client with an other value of timeout ?
Hello @RonanMorgan, I was thinking maybe we need a bigger timeout to send a file for example no ?
Hey there :wave:
Trying to give the overall picture before we make a choice:
- if we allow a more flexible timeout, it means we'll potentially hold the Python process on the Raspberry Pi for longer
- the heartbeat is happening often for this reason, and if a device can't complete the request is time (which is supposedly the lightest request to the API), I doubt it will be able to send images as you mentioned
I think we should instead think about the max internet latency allowed to consider the device reachable. The client allows a default timeout, so I'd argue that if we have to set one differently it should only be for the request duration outlier, which would most likely be image upload (detection creation), what do you think?
In short, what I suggest:
- set a small timeout when instantiating the client, which will propagate to all HTTP requests
- optionally override it for the image upload with a value we set in the constructor (dynamically with a default value)
Hi FG, I understand what you're saying, but what you have to take into account is that on a station the internet can fluctuate quite rapidly. Very slow one minute and fast again. A bit like the example above, we go from 4s to 0.2s for the heartbeat very quickly. The purpose of this timeout is to avoid getting stuck in this case.
There's absolutely no problem if you get even one heartbeat out of two.
For images, if the connection is too slow, engine will try again next iteration.
I guess the timeout can be 3s for images and 0.5 for hearbeat
But then aren't we thinking the same thing?
- applying 0.5sec as default client timeout
- overriding timeout for image upload to 3s?
(my point is that if connection fluctuates for heartbeat, it certainly also does for the other routes which are all more intensive) Apologies if I misunderstood :upside_down_face:
Just checking: is this still an issue @MateoLostanlen ?
Hi @frgfm , sorry I missed your reply. Yes, we still have the problem. Sending a heartbit can fail but it shouldn't take too long because we are bound by a 30s loop. The constraint is totally different for images. That's why I'd like a solution to have two distinct hearbits no matter how you implement it. In the meantime, I've created a timeout around the hearbit on engine but it's not optimal
I see what you mean. My only concern is having for instance 1min timeout on the API for image upload means we'd have requests taking the RAM for quite some time. Even though we don't have massive volume, we have several dozens of heartbeat per min for instance.
Another solution potentially is: the API grants temporary access to the camera to upload directly to S3, meaning the bytes of the image don't go through the backend API instance. Let me think a bit about this, curious to hear also what @RonanMorgan thinks
You don't need 1mn for the image, maybe 6s and 1s for the heartbit. If the number of heartbits is a problem, I can modify the engine and send one every 5mn instead of 30s, which is just enough for our needs
Ok I think I understand better what problem remains, just a few questions to make sure:
- the problem is not with image upload but with occasional heartbeat timing out
- the current timeout for the API client is 10sec and it's not overridden here, so that means the faulty heartbeat are taking more than 10sec?
- I think both the simplest and best solution is to edit the engine to: add an exception catch on the call to heartbeat, AND make the heartbeat asynchronous
Increasing the heartbeat here just means we're considering it's an acceptable behaviour to have heartbeat lasting more than 10sec. And to be honest, if that's the case, there is no way the camera will be able to transfer an image 😅
I don't know what the problem is exactly, but there are some pretty big variations, as you can see in my first message here, from 4.1s to 0.22s for the heartbeat in 4s. The network seems variable but the image ends up being sent, it's just that the 4s is too long for the loop.
But don't worry, I'll manage it on the engine side.
Sounds good, can we close this PR then? (trying to do some project "gardening" haha)