
InvalidBufferError: HTTP code 429 handling

CptKirk opened this issue 2 years ago • 7 comments

Hi, we are using this library to control user interactions with our Telegram bot. Usually we have no problems with Telegram, but lately we have been getting a vast number of InvalidBufferErrors at night (Invalid buffer Invalid response buffer (HTTP code 429)). The error occurs so often that our bot is simply unavailable at night; the screenshot below shows the resulting log spam.

We looked at the library code and saw that it just reconnects when this error occurs, which in our eyes and for our use case is a suboptimal way of handling it. We are all the more baffled because the error occurs at roughly the same time nearly every day, and it only affects our bot code, which uses a single session file.

We therefore suggest the following change: https://github.com/CptKirk/Telethon/commit/59d8bb03626158b94c977fec1fcfcc628e29f5e0. The idea is to cancel the send loop, wait for x seconds, and then restart it when a 429 InvalidBufferError is encountered. Does anyone have a better solution? Either way, we will probably deploy these changes to our production environment on Monday (27.06) and report back a few days later on whether waiting helps.
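The linked commit touches Telethon internals, but the pattern itself is easy to sketch in plain asyncio. The following is a minimal, hypothetical illustration of "cancel the loop, wait, restart", not Telethon's actual code; all names (send_loop, supervisor, RETRY_DELAY) are made up for the example.

    import asyncio

    RETRY_DELAY = 10  # arbitrary cool-down before restarting the loop

    async def send_loop(queue: asyncio.Queue) -> None:
        """Hypothetical send loop: pulls items off a queue and 'sends' them."""
        while True:
            item = await queue.get()
            print('sending', item)

    async def supervisor(queue: asyncio.Queue, got_429: asyncio.Event) -> None:
        """Cancel the send loop when a 429 is signalled, wait, then start it again."""
        while True:
            task = asyncio.create_task(send_loop(queue))
            await got_429.wait()      # set by whoever detects the 429 in the receive path
            got_429.clear()
            task.cancel()             # stop sending immediately
            try:
                await task
            except asyncio.CancelledError:
                pass
            await asyncio.sleep(RETRY_DELAY)  # give Telegram time to cool down before restarting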

Telethon version: 1.24.0
Python version: 3.9

Checklist

  • [x] The error is in the library's code, and not in my own.
  • [x] I have searched for this issue before posting it and there isn't a duplicate. (rel: #3825 )
  • [x] I ran pip install -U https://github.com/LonamiWebs/Telethon/archive/master.zip and triggered the bug in the latest version.

Traceback

Traceback (most recent call last):
File "/home/projects/our_project/code/bot.py", line 140, in <module>
    loop.run_forever()
File "/usr/lib64/python3.9/asyncio/base_events.py", line 596, in run_forever
    self._run_once()
File "/usr/lib64/python3.9/asyncio/base_events.py", line 1890, in _run_once
    handle._run()
File "/usr/lib64/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
File "/home/projects/our_project/venv/lib64/python3.9/site-packages/telethon/network/mtprotosender.py", line 531, in _recv_loop
    self._log.warning('Invalid buffer %s', e)

CptKirk avatar Jun 24 '22 13:06 CptKirk

So... finally we have something to report. We deployed the changes on Monday (27.06) as announced and waited for the InvalidBufferError. Strangely enough, there were no errors during the first two nights. "Thankfully", it happened again last night, and our fix seems to remedy the problem: as you can see in the screenshot below, there was far less log spam, and the logging output shows the send_loop being cancelled for 10 seconds when the error occurs. Are these changes viable for a PR? If not, tell me what to change and I will implement it.

CptKirk avatar Jun 30 '22 14:06 CptKirk

Thanks for the detailed report. Your linked fix seems to be more of a workaround than an actual fix. Ideally, we would figure out why the 429 happens. That is, answering the question, "is the library going rogue and spamming requests under some strange condition that leads to the 429, or is Telegram the one who sends 429 even if the library behaves correctly?". If the answer is "it's Telegram", then the proposed workaround would be the right fix. Not sure if with those specific changes though. Perhaps a sleep would be enough (and another flag to actually raise the error if we've slept, say, 3 times in a row and it's still not solved).
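As a rough sketch of the sleep-plus-counter idea, assuming a hypothetical place in the code where the 429 is observed (the numbers and the TransportFloodError name are arbitrary, and the counter would need to be reset elsewhere whenever a packet is received successfully):

    import asyncio

    MAX_CONSECUTIVE_SLEEPS = 3   # give up after this many back-to-back 429 sleeps
    SLEEP_SECONDS = 10           # arbitrary cool-down between attempts

    class TransportFloodError(Exception):
        """Hypothetical error raised once sleeping no longer helps."""

    async def handle_429(state: dict) -> None:
        """Sleep on a 429; raise after several consecutive sleeps in a row."""
        state['consecutive'] = state.get('consecutive', 0) + 1
        if state['consecutive'] >= MAX_CONSECUTIVE_SLEEPS:
            raise TransportFloodError('still receiving HTTP 429 after repeated waits')
        await asyncio.sleep(SLEEP_SECONDS)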

Lonami avatar Jun 30 '22 14:06 Lonami

We do not think it's a problem with Telethon. We always get this error at night, when traffic to our bot is at its lowest. We even got it at 3 a.m. when nobody was using the bot (I checked the logs; there were no log entries at all until the error happened). The proposal to add a flag that counts how many times we have slept in a row is good; I will add it. Do you think a simple reconnect would be sufficient once we have waited three times in a row?

Edit: OK, I checked our code, and we have Telethon's debug logs disabled, so maybe it is something in Telethon that we simply do not see. I will check whether we can enable them and monitor the behaviour for a couple of days.

CptKirk avatar Jun 30 '22 14:06 CptKirk

@CptKirk, hello! Any updates on the research?

teplinsky-maxim avatar Jul 13 '22 07:07 teplinsky-maxim

@teplinsky-maxim Hey, I have not had time recently to monitor this behaviour through debug logs. I will try to get it done this week or next.

CptKirk avatar Jul 13 '22 08:07 CptKirk

@CptKirk hello! Any updates on the research?

spiderpanel avatar Jul 30 '22 15:07 spiderpanel

Hi, while we still do not know whether the problem is on Telegram's side or Telethon's, I can report the first bug in my workaround. It seems that if the send_loop is halted and we get a disconnect from Telegram in parallel, the send_loop will not start again and the program has to be terminated manually. The screenshot below shows the logs for that case. Please keep this in mind when using my workaround; as soon as I have more spare time I will try to fix it.

CptKirk avatar Aug 01 '22 07:08 CptKirk

Hi, while we still do not know whether the problem is on Telegram's side or Telethon's, I can report the first bug in my workaround. It seems that if the send_loop is halted and we get a disconnect from Telegram in parallel, the send_loop will not start again and the program has to be terminated manually. The screenshot below shows the logs for that case. Please keep this in mind when using my workaround; as soon as I have more spare time I will try to fix it.

There are two sides to the problem. The first is that Telethon keeps sending packets, ignoring the error, unless you manually stop the _send_loop_handle and _recv_loop_handle coroutines inside MTProtoSender, I believe. It would be great if we could pass a callback to be invoked on a 429 error, or some predefined parameter that tells the library what to do.

The second problem is that continuing to send packets after the error only increases the timeout, so the 429 errors keep appearing for longer. Restarting the loop achieves nothing if you have already sent 10000 packets after the first occurrence of the error. So it would be great to have something (at least a field like client.error_429_occured holding the most recent timestamp and the packet that caused the error) and to let the user handle the error himself. @Lonami, if you support the idea let me know, I will implement it and create a PR
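To make the proposal concrete, here is a rough, purely hypothetical sketch of what such a field and the corresponding check could look like on the consuming side; Error429Info, COOLDOWN and should_send are invented names, and the 60-second value is an assumption (Telegram does not communicate a wait time with this error):

    import time
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Error429Info:
        """Record of the most recent transport-level 429 (the field proposed above)."""
        timestamp: float    # time.monotonic() at the moment the error was observed
        request_repr: str   # textual description of the packet/request that was in flight

    COOLDOWN = 60.0  # seconds to hold off after a 429; arbitrary, Telegram gives no hint

    def should_send(last_429: Optional[Error429Info]) -> bool:
        """True if no 429 was seen recently enough to justify holding back."""
        return last_429 is None or time.monotonic() - last_429.timestamp >= COOLDOWN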

teplinsky-maxim avatar Oct 09 '22 22:10 teplinsky-maxim

Moreover, even the official tdesktop client can run into these errors. It is rarely seen, but it happens. What does the client do to resolve it? It simply does not allow the user to perform any action that would send the same packet that caused the error (the same thing the client does when it meets a flood_wait_error on any action). So, to do that with Telethon, imho we need to start by tracking the action and the time of the error (which could be done using the msg_id param).

teplinsky-maxim avatar Oct 09 '22 22:10 teplinsky-maxim

@Lonami, if you support the idea let me know, I will implement it and create a PR

I don't think a callback or event would really be the right way to handle this. I think the library should simply disconnect and return the error, like it does with -404, if that's the right thing to do.

My only worry was that the library would "unnecessarily" bubble that error up to the user. Did we figure out if the cause for -429 is because the library spammed messages beforehand unintentionally, or can the -429 happen during normal operation? If -429 happens during normal operation, then disconnecting and returning the error to the user is the only right thing to do (other than auto-reconnect, which as can be seen, isn't working very well).

Lonami avatar Oct 10 '22 08:10 Lonami

@Lonami, if you support the idea let me know, I will implement it and create a PR

I don't think a callback or event would really be the right way to handle this. I think the library should simply disconnect and return the error, like it does with -404, if that's the right thing to do.

My only worry was that the library would "unnecessarily" bubble that error up to the user. Did we figure out if the cause for -429 is because the library spammed messages beforehand unintentionally, or can the -429 happen during normal operation? If -429 happens during normal operation, then disconnecting and returning the error to the user is the only right thing to do (other than auto-reconnect, which as can be seen, isn't working very well).

AFAIK it can happen during normal operation, and a simple re-login may not help. It can probably help in a mild case, but not in a 'heavy' case where the user has already done TOO much. I saw the error in the following scenario:

The client works for a month without any interruption, doing simple things like getting updates and mostly sending resolving requests such as messages.ChatFull or users.getFullUser. Then something happens and Telegram decides to send lots of 429 errors. Why lots? Because the log contains only Too many requests: HTTP code 429, which is logged in MTProtoSender._recv_loop, so I see two possibilities:

  1. Telethon sees that a request was not satisfied with a reply and sends it again (I am not sure whether there is a resend mechanism), OR
  2. my code does not realise that a 429 appeared and keeps sending resolving packets, getting 429 again and again.

Either way, the program halts because of the error; no more requests can be sent to the Telegram API. I tried @CptKirk's solution of resetting the loop, and even resetting the client instance. Maybe it helps; there is no way to test it properly, but after another month or so the error appears again. I think the loop reset can help in a soft case, where the client did not spam too many requests after the first occurrence, because it takes time to initialise the client, so there is a gap between the error and the next requests. I am also interested in how the async iterators in Telethon behave, such as iter_dialogs (which I use a lot) and iter_messages: are they 429-safe? As I see in the code, those functions rely on offsets returned by the API. How will they behave in case the error appears? I suppose they will keep sending requests until they hit some limit obtained from a successful request, or even send them endlessly?

In both cases I think it would be great to have some class field that signals the 429 error, so I can stop sending packets in my code or do something else.

teplinsky-maxim avatar Oct 10 '22 09:10 teplinsky-maxim

  1. Telethon sees that a request was not satisfied with a reply and sends it again

Yes, on reconnect messages which were not satisfied are enqueued again to be sent later:

https://github.com/LonamiWebs/Telethon/blob/db29e9b7efb204fbe56fe8d59ccc4d55ceddc165/telethon/network/mtprotosender.py#L395

How will they behave in case the error appears?

Those iterators are "safe" as long as the request eventually succeeds (the offsets they use would return the same results now or a week later, barring messages deleted between the calls). At least iter_messages is; iter_dialogs is different because a new message can bump a dialog behind the current offset, but that's unlikely.
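In practice this means user code can usually resume an interrupted iteration by remembering the last processed message id and passing it back as offset_id. A rough sketch, with ConnectionError standing in for whatever error the library ends up surfacing and process being a user-supplied callback:

    import asyncio
    from telethon import TelegramClient

    async def dump_history(client: TelegramClient, chat, process) -> None:
        """Iterate a chat's history, resuming from the last processed id after an error."""
        offset_id = 0  # 0 means "start from the newest message"
        while True:
            try:
                async for message in client.iter_messages(chat, offset_id=offset_id):
                    process(message)
                    offset_id = message.id   # remember progress; older messages have smaller ids
                return                       # reached the end of the history
            except ConnectionError:          # placeholder for whatever error gets raised
                await asyncio.sleep(60)      # back off, then resume from offset_id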


We have two options:

  • When the error occurs, flag all in-flight requests as "failed with 429". This means, for example, that client.send_message() will be able to raise this error. This is the right option if there is a particular request that fails with 429. If all of them fail, it doesn't make much sense, because the client should just disconnect if it can't do anything. (A user-side sketch of this option follows below.)
  • When the error occurs, treat it the same way as a broken authorization: disconnect, and raise the error through client.run_until_disconnected() so the user can handle it. This would be similar to how other connection errors are handled.

I'd like to try the first option, just so that we can confidently say if the error fails with specific requests or with all of them.

Let me know if you want to implement this and I can offer some guidance if you need.
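If the first option were implemented, user code could look roughly like the following. Whether the exception would surface as telethon.errors.InvalidBufferError or as a new class is still open, so the except clause and the retry/backoff values are assumptions:

    import asyncio
    from telethon.errors import InvalidBufferError  # assumed surface of the error; may change

    async def send_with_backoff(client, entity, text, retries=3):
        """Retry a single request a few times if it fails with the transport-level 429."""
        for attempt in range(retries):
            try:
                return await client.send_message(entity, text)
            except InvalidBufferError:
                # Back off before retrying; the wait time is a guess, Telegram gives no hint.
                await asyncio.sleep(30 * (attempt + 1))
        raise RuntimeError('request kept failing with HTTP 429')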

Lonami avatar Oct 10 '22 09:10 Lonami

The other detail we have to pay attention to is that Telegram does not allow making the same request again once the error has appeared. For example, we send 10000 messages.ChatFull requests at the same time; some answer will be the 429 error, so I suppose we have to cancel all pending messages.ChatFull requests ASAP so it does not get worse. tdesktop does the same: for example, if you start spamming joining chats/channels and a 429 error appears, the client will not allow you to join (the request will not be sent) until the received timeout has expired (you can check this behaviour by enabling debug logs in tdesktop), but you can still send messages, read them, etc. Due to this, the 2nd option is a bad idea, because in that case we will cancel other requests. Moreover, it will happen silently, which is actually bad. Let's stick to the first idea. I see the implementation as follows:

  1. Allow __call__ in UserMethods to raise a new error that tells the user what happened, so that any code like result = await self.client(self.request) can raise it. If there are other ways to send a request to Telegram, those should also be able to raise the error.
  2. After raising the error, self._send_queue should be filtered so that no more requests of the kind that caused the 429 are sent.
  3. Also log the error with the logger.

One of the most important things is that the user should explicitly get the error from any part of the code. I mean, @CptKirk's idea was to patch MTProtoSender._recv_loop, which is bad imho, because the user does not know what actually happened and therefore cannot handle the error himself. I tried throwing an error instead of halting the loop, but did not succeed, because Telethon does not allow such an error to reach the user's code (at least I did not find the correct way), so I guess I will need your help to implement it.

Also, one more problem: how do we determine which packet caused the error? For example, we have errors such as telethon.errors.rpcerrorlist.InviteHashExpiredError: The chat the user tried to join has expired and is not valid anymore (caused by CheckChatInviteRequest). Can we implement something like this to find out the reason for the 429? I am just not sure whether it is 100% guaranteed that we can know which packet caused the error.

teplinsky-maxim avatar Oct 10 '22 10:10 teplinsky-maxim

2nd option is a bad idea, because in that case we will cancel other requests

Both options would essentially cancel all requests. This error is produced at the transport level, so it is not possible to know which request caused it; all requests must be cancelled. The question is whether the error is raised through those requests or treated as a connection error.

it will happen silently

This is the current behavior. We must decide how to treat the error (as a request error or a connection error, even if it's strictly neither). This wasn't done before because perhaps Telethon was spamming requests, and that's what should have been fixed. But since we're assuming this is not the case (it occurs under normal operation), we must assume Telethon is working fine and bubble this error up to the user.

Regarding your implementation plan:

  1. Yes. It should probably be something similar to the other flood errors.
  2. No. The sender should not concern itself with this. The user will be getting an error for some requests, so the user should know to stop sending them. The library should not do this automatically because it is not known when it is okay to resume sending those requests (unlike flood wait, which comes with a time period). This is up to the user (see the sketch after this list).
  3. Yes. It's already logged, but we can customize the message for this particular case.
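On the user side, "this is up to the user" could be as simple as a shared cool-down gate consulted before issuing the affected request type again. Everything below (the class name, the five-minute window) is illustrative and not part of Telethon:

    import time

    class CooldownGate:
        """Tracks when a request type last hit a 429 and blocks it for a while."""

        def __init__(self, cooldown_seconds: float = 300):
            self.cooldown = cooldown_seconds
            self._blocked_until: dict[str, float] = {}

        def report_429(self, request_name: str) -> None:
            """Call this when a request of this type failed with the transport-level 429."""
            self._blocked_until[request_name] = time.monotonic() + self.cooldown

        def allowed(self, request_name: str) -> bool:
            """True if this request type may be sent again."""
            return time.monotonic() >= self._blocked_until.get(request_name, 0.0)

The application would call report_429 wherever it observes the error and check allowed before the next attempt of that request type.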

how do we determine, which packet caused the error?

You cannot. This error is at the transport layer, not the RPC layer.

I tried to do something like throwing a error instead of halting the loop, but did not succeed because telethon does not allow to throw such error to user's code (at least I did not find the correct way), so I guess I will need your help to implement it.

You need to set the pending states to exception, very similar to how they are cancelled when disconnect is called, and then clear the states so they are not sent again:

https://github.com/LonamiWebs/Telethon/blob/db29e9b7efb204fbe56fe8d59ccc4d55ceddc165/telethon/network/mtprotosender.py#L322

You can search set_exception in that file for more examples.
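In generic asyncio terms (deliberately not using Telethon's real attribute names), the pattern being described is roughly the following: fail every in-flight future with the exception, then clear the bookkeeping so nothing gets re-enqueued:

    import asyncio
    from dataclasses import dataclass

    @dataclass
    class PendingRequest:
        """Illustrative stand-in for the sender's per-request state."""
        future: asyncio.Future

    def fail_pending(pending: dict, exc: Exception) -> None:
        """Fail every in-flight request with `exc` and forget about it,
        so nothing is re-enqueued on the next reconnect."""
        for state in pending.values():
            if not state.future.done():      # skip futures already resolved or cancelled
                state.future.set_exception(exc)
        pending.clear()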

Lonami avatar Oct 10 '22 14:10 Lonami

This error is produced at the transport level

What do you mean by 'at the transport level'? Do you mean TCP?

It is not possible to know what request caused it

Can't we use the msg_id param in MTProto to find out which request it was? I think we can, but I did not see any use of such internal fields in the current library logic.

We must decide how to treat the error [...] we must assume Telethon is working fine

Yes, the library works fine; we just need to tell the user that some limits were hit and that he has to do something about it.

You need to set the pending states to exception, very similar to how they are cancelled when disconnect is called, and then clear the states so they are not sent again

Thanks! I will look at it more closely.

So, I guess we now have enough information to put together some kind of concept, right?

teplinsky-maxim avatar Oct 11 '22 06:10 teplinsky-maxim

What do you mean 'at transport level'? Do you mean TCP?

There are different layers. The transport layer sits on top of TCP and is responsible for packing the encrypted data sent to Telegram; this is essentially all of telethon/network/connection. The InvalidBufferError occurs when the packet doesn't contain valid data and instead contains a negative HTTP code (packets shorter than 8 bytes are invalid, and these contain the -429 error):

https://github.com/LonamiWebs/Telethon/blob/18da855dd4dc787b7aab08fecf3066bac80790ff/telethon/network/mtprotostate.py#L157-L158

Can't we use msg_id param in MTProto to find out the request?

You can't use this when InvalidBufferError occurs because there is no msg_id at all. All the library gets is the negative HTTP code. It's not a valid packet that can be decrypted. It's just the negative HTTP code.
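For illustration only (a standalone sketch of the idea, not a copy of Telethon's code): such a 4-byte transport-level error packet is just a little-endian signed integer, and negating it recovers the HTTP-style code:

    import struct

    def decode_transport_error(payload: bytes) -> int:
        """Return the HTTP-style code carried by a 4-byte transport error packet."""
        if len(payload) != 4:
            raise ValueError('not a transport error packet')
        # The packet is simply a little-endian signed int such as -429.
        return -struct.unpack('<i', payload)[0]

    assert decode_transport_error(struct.pack('<i', -429)) == 429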

now we have enough information to create a kind of concept, right?

Yes, we can start by marking all in-flight request states to the exception status and see how it goes. Because I don't remember experiencing this, I can't really test this in a real scenario.

Lonami avatar Oct 11 '22 09:10 Lonami

Yes, we can start by marking all in-flight request states to the exception status and see how it goes. Because I don't remember experiencing this, I can't really test this in a real scenario

Ok, I will probably start implementing this next week; it will also take some time to test the code.

teplinsky-maxim avatar Oct 12 '22 07:10 teplinsky-maxim

Yes, we can start by marking all in-flight request states to the exception status and see how it goes. Because I don't remember experiencing this, I can't really test this in a real scenario

Ok, I will probably start implementing this next week; it will also take some time to test the code.

Hi, is there any news?

Eclipsium avatar Nov 17 '22 04:11 Eclipsium

I'll be closing this under the assumption that this is a duplicate of #4042, which has since been fixed. If you believe this is still an issue, or are not happy with the resolution, please let me know, and I will reopen it.

Lonami avatar Apr 06 '23 12:04 Lonami