Arduino icon indicating copy to clipboard operation
Arduino copied to clipboard

HTTPClient: Split "HTTPC_ERROR_READ_TIMEOUT" up into more detailed error codes

Open M-Reimer opened this issue 4 years ago • 20 comments

I'm currently facing an issue with some "IoT device web server" which at some point seems to be a bit off with how they implemented the HTTP standard. After some runtime in some situation I end up getting "HTTPC_ERROR_READ_TIMEOUT" all the time. Sometimes this "revives" somehow just to reappear later.

The problem is that this error code could actually mean different things. Several checks return this code. We have it at the following locations in the GIT revision I've checked (with GIT revision in URLs to make them "stable"):

  1. https://github.com/esp8266/Arduino/blob/140d0ffde1a22fa978fdf9bde53b2170f0daf78c/libraries/ESP8266HTTPClient/src/ESP8266HTTPClient.cpp#L35
  2. https://github.com/esp8266/Arduino/blob/140d0ffde1a22fa978fdf9bde53b2170f0daf78c/libraries/ESP8266HTTPClient/src/ESP8266HTTPClient.cpp#L673
  3. https://github.com/esp8266/Arduino/blob/140d0ffde1a22fa978fdf9bde53b2170f0daf78c/libraries/ESP8266HTTPClient/src/ESP8266HTTPClient.cpp#L709
  4. https://github.com/esp8266/Arduino/blob/140d0ffde1a22fa978fdf9bde53b2170f0daf78c/libraries/ESP8266HTTPClient/src/ESP8266HTTPClient.cpp#L1075

Where only the last one actually is based on actual measurement of time. The first one may also be timeout related (didn't dig that deep). Point 2 and 3 can also be "malformed data" which would be interesting to know so I can check what my server does wrong here. Probably it would be even possible to make the HTTPClient more robust once I know what is wrong on the server side. The access works every time with a "regular browser" so I guess the HTTP implementation is more forgiving there.

M-Reimer avatar Sep 05 '21 13:09 M-Reimer

FWIW, after upgrading from v2.7.4 Arduino core to v3.0.2 on my ESP8266 WebServer my ESP8266 HTTPClients started getting this very HTTPC_ERROR_READ_TIMEOUT" after some runtime (hours). Sometimes this revives for a moment but then falls back to error -11 (timeout). The access to the webserver works every time with a "regular browser" (just like for you).

I emphasize that this used to work fine for years when the server was built with pre-v3.x Arduino cores so it seems to be related to the server side but I can't figure out what exactly is wrong and where. Restoring v2.7.x webserver fixes the issue. Clients might be v2.7.x based or v3.0.2, both seem to work OK. Very hard to debug as I haven't been able to reproduce it in my lab, it only happens far away in the field.

joysfera avatar Dec 03 '21 17:12 joysfera

If you have any idea on how to reproduce your issues, that will help.

d-a-v avatar Dec 04 '21 17:12 d-a-v

Seems like the key is to have at least three or four "simultaneous" clients (I've got four clients - each fetching data from the server every five seconds). Sometimes it takes several hours to reproduce, sometimes it happens almost instantly - probably when two or more of the clients' requests clash? I don't know, just guessing.

FYI, I described my problem in detail here: https://www.esp8266.com/viewtopic.php?f=32&t=23296

I haven't created my own github issue because I don't have a short code to reproduce it. I just noticed that this github issue describes what I get (HTTP TIMEOUT) so I thought I'd mention my story - perhaps with more detailed error reporting from HTTPClient I'd guess more quickly what's up with the WebServer? Or maybe not.

joysfera avatar Dec 04 '21 18:12 joysfera

Actually , I have similar issue. I have build kind of push button that connect to server through out http protocol. Sometime I have gotten -11 some time get 200 who knows why ? Weird... This caused push button more than one time to get 200...

ozzcet avatar Dec 04 '21 18:12 ozzcet

I 2 it...

kirurui avatar Dec 04 '21 19:12 kirurui

There was a recent related update, currently in master branch: see #8216 Not saying this will solve this bug, but worth giving a try. Can you give a try with master branch ? (Arduino users can try the snapshot release v0.0.1).

d-a-v avatar Dec 04 '21 23:12 d-a-v

I am currently running PostServer and PostClient.

On client, delay between requests is removed, and an error counter is added and increased when retcode is not 200.

Setup:

  • Server on esp8266
  • one client on another esp8266
  • another client on host:
cd test/hosts
make list
make D=1 ../../libraries/ESP8266HTTPClient/examples/PostHttpClient/PostHttpClient

I'll tell you just before I die burning in the 2.4G flow, or if a counter happens to be greater than 0. If any of you has a publishable app to test, the issue will be narrowed faster. You also may enable core and library debug messages so there is more to show.

d-a-v avatar Dec 05 '21 12:12 d-a-v

Isn't the PostClient re-using existing connection? That could mask what I experienced with HTTP 1.0. Also, I'd insert a delay(500) to the PostServer's loop() to simulate that the server node has also some other work to do.

I'll arrange a similar test with 3+ clients when I get some spare time. I just invested half a day to return to core v2.7.4 and surprise surprise - SPIFFS.check() was causing a fatal crash and a boot loop there. Took me several hours to debug it.

joysfera avatar Dec 05 '21 12:12 joysfera

I have updated my test to use HTTP1.0. I have now 3 host clients, 1 esp http1.0 client and 1 esp server with 500ms loop. I'll let it run for some hours. Then if I still have no error, I'll remove the 500ms.

d-a-v avatar Dec 05 '21 13:12 d-a-v

Do you use the v3.0.2 release or master branch? Just asking if it got fixed in the master by any chance. BTW, we kinda hijacked @M-Reimer issue, that's not fair. Though I am very very glad you @d-a-v pay so much attention to my issue, thank you!

joysfera avatar Dec 05 '21 13:12 joysfera

@d-a-v one more note: my server application also keeps a MQTT connection alive (probably consuming at least one network socket all the time) and does a HTTP POST to another server every minute (which requires another socket). That limited amount of sockets might help to cause the issue. On the plus side my clients are not hammering the ESP8266 server that often - only once in 5 seconds.

joysfera avatar Dec 05 '21 13:12 joysfera

There is no limit for number of client sockets apart from free memory. Do you follow your free memory on both client and server ? Do you follow the service time, what's its greatest value ? Are you using some kind of Stream to feed the webclient (= not only sending/receiving strings) ?

These questions are general, and may also help @M-Reimer @kirurui @ozzcet debugging.

d-a-v avatar Dec 05 '21 14:12 d-a-v

Do you use the v3.0.2 release or master branch? Just asking if it got fixed in the master by any chance.

I am currently using the master branch.

d-a-v avatar Dec 05 '21 15:12 d-a-v

In my app I watch the free memory on both client and server (about 25+ kB). As for the service time - when all ESP8266 clients were getting -1 or -11 errors the ESP8266WebServer's handler was not even getting their requests, so service time was 0. Normally it was 2 milliseconds. My server builds a JSON String manually ( reply += "...") (the total length is say 500 bytes) and then sends it at once.

    server.sendHeader("Access-Control-Allow-Origin", " *");
    server.sendHeader("Cache-Control", " no-cache, must-revalidate");
    server.send(200, "application/json", reply);

joysfera avatar Dec 05 '21 15:12 joysfera

was not even getting their requests,

Ah this is a valuable information. Can you modify this constant to something bigger (either by modifying the file or by externally overriding it if you don't use the Arduino IDE with which you can't) ?

Hint: this parameter is not the maximum number of clients. It's the automatic door of the shop. It stays closed when there are 5 customers waiting behind the door, but opens when there are less. Impatient customers outside may give up after their timeout. This is a DOS (free memory) protection.

d-a-v avatar Dec 05 '21 15:12 d-a-v

I can modify that constant easily but please note that in this particular use case I don't want any backlog to grow there. Given that clients are timeouting quickly by themselves (within 1500ms - I could even short that period under 1000 ms) it would be ideal if the webserver was throwing away any older request it couldn't handle in time as soon as a newer request arrives. At least that's my theory for as stable client-server setup as possible.

I admit that I don't know what's happening on the server when the client times out. Does the server know about it or does it still keep the old abandoned request in a queue that keeps filling up because the client is resending its request every few seconds?

Perhaps the key question is how quickly server times out when sending a reply to a client that no longer listens because he has timed out a moment ago?

joysfera avatar Dec 05 '21 15:12 joysfera

This parameter is not managed by the server. It's TCP. It's very like when an incoming customer is blocked outside of the shop. The door is closed. Can't see what's inside. Exactly as if there is no server on the port. For the client which reaches timeout, it's like if there is no server. When backlog was not enabled, memory could be filled by too many incoming connections. So increasing it now is still better than when it was not enabled.

The other option is to increase client timeout.

A further option would be for http client to distinguish between real timeout (server sending incomplete answer) or can't connect to the port (what's discussed above).

The server don't see the clients which have timeouted before the connection is established (they didn't come through the dooor).

Perhaps the key question is how quickly server times out when sending a reply to a client that no longer listens because he has timed out a moment ago?

and

it would be ideal if the webserver was throwing away any older request it couldn't handle in time as soon as a newer request arrives

In master branch, the webserver will close quicker when a client is misbehaving but another is waiting (this is a recent improvement).

d-a-v avatar Dec 05 '21 15:12 d-a-v

I forgot to tell that the "no features" tastes of lwIP do not have tcp-backlog enabled.

d-a-v avatar Dec 06 '21 10:12 d-a-v

I had the same problem hapends around midnight (request every 12 min at time +30s), i add an exception in my code, maybe the server idk.

Buddha6669 avatar Jan 05 '22 23:01 Buddha6669