
Fix memory allocation for from-server network reads

Open rousskov opened this issue 3 years ago • 12 comments

Use tcp_recv_bufsize (not read_ahead_gap) for the response socket read(2) buffer. This change may increase default Squid memory footprint by ~50KB per to-server connection in most environments and decrease Squid memory footprint by approximately (read_ahead_gap - 64KB) in many performance-sensitive deployments that run with an n*MB read_ahead_gap. The admin now has better control over memory usage and I/O performance. Using tcp_recv_bufsize for both the kernel and Squid receive buffers is natural and simple.

HttpStateData::inBuf is used for socket read(2) when receiving data from an origin server or cache peer. The buffer does not accumulate content between reads except when parsing response headers and chunked encoding metadata. In fact, the buffer is usually empty between reads. Squid accumulates response content in Store or adaptation BodyPipe instead.

Squid was incorrectly using a mostly unrelated read_ahead_gap directive to size inBuf. The default read_ahead_gap value is 16KB. Admins increase read_ahead_gap because accumulating more response data often improves overall Squid performance or user satisfaction. Large increases caused excessive (and unpooled!) inBuf memory allocations. In some cases, these memory allocations blocked the entire worker for a few milliseconds!
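Roughly speaking, the capacity change amounts to the following simplified sketch (hypothetical types and constant; the field names mirror the Config.readAheadGap and Config.tcpRcvBufsz members discussed below, but this is not the actual Squid code):

```cpp
#include <cstddef>

// Hypothetical configuration holder, for illustration only.
struct BufferConfig {
    size_t readAheadGap;  // read_ahead_gap: data Squid may accumulate ahead of the client
    size_t tcpRcvBufsz;   // tcp_recv_bufsize: kernel socket receive buffer size (0 = unset)
};

// Stand-in for the compiled-in SQUID_TCP_SO_RCVBUF default discussed later.
constexpr size_t kCompiledDefault = 64 * 1024;

// Old behavior (simplified): the read(2) buffer tracked read_ahead_gap, so an
// n*MB gap meant an n*MB, unpooled allocation per to-server connection.
size_t oldInBufCapacity(const BufferConfig &cfg) {
    return cfg.readAheadGap;
}

// New behavior: size the read(2) buffer like the kernel receive buffer.
size_t newInBufCapacity(const BufferConfig &cfg) {
    return cfg.tcpRcvBufsz ? cfg.tcpRcvBufsz : kCompiledDefault;
}
```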

We could just replace Config.readAheadGap with Config.tcpRcvBufsz (or another directive) when calculating inBuf capacity, ignoring other problems in the surrounding code, but that would make it impractical to tell the admins what this change really means because there would be too many exceptions where Squid would still not behave as (re)configured.

  • Small read_ahead_gap values stalled response header and body parsing.

  • Squid was deferring response reads when inBuf became full due to reconfiguration but deferring does not always reduce buffer size.

  • Squid was not always deferring response reads when running out of buffer space, stalling some transactions.

  • 64-bit configuration values could overflow int-based local variables.

  • inBuf was grown earlier than the code wanted to grow it (before the socket became ready for reading) and not grown when the new/grown space was actually needed/used (just before reading from the socket).

  • Transactions killed by a Must() failure when a reconfiguration made inBuf "full" just before read(2) in HttpStateData::readReply().

  • Wrong/misleading/stale comments/debugs documenting key decisions.

SBuf implementation and parsing API exacerbate the bug by forcing memory re-allocations for inBuf even when the underlying buffer storage has enough space (and often is completely empty!). We will address these problems separately, but this fix reduces their negative impact.

Also do not wait for the rest of the HTTP response header or chunk metadata if we have no space to store more incoming bytes. The parser does not know how much more buffer space we have (if any). It is our responsibility to check that some progress is still possible.
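A minimal sketch of that progress check (hypothetical names, not the actual HttpStateData code):

```cpp
#include <cstddef>

// Hypothetical sketch: if the parser is blocked waiting for more input but the
// read buffer cannot accept any more bytes, no progress is possible and the
// transaction should be aborted instead of waiting forever.
static bool parsingCanStillProgress(const bool parserNeedsMoreData,
                                    const size_t bufferedBytes,
                                    const size_t bufferLimit)
{
    if (!parserNeedsMoreData)
        return true;                     // the parser is not blocked on input
    return bufferedBytes < bufferLimit;  // more input can still be buffered
}
```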

I replaced "Non-HTTP-compliant" with "cannot parse" because we do not know whether the parsing failure is related to HTTP compliance. For example, buffer space limitations are not a compliance-related problem.

The changes near wasThereAnException look wrong, but it is the existing wasThereAnException name that is "reversed". Compiled code should be OK.

Also fixed commit 1ad6851 to delay buffer allocations until we are ready to read from the server. That old commit made buffer allocations in maybeMakeSpaceAvailable() optional so that they could be delayed until we actually read from the peer socket. Delaying buffer allocation until read(2) is the right thing to do. The callers (correctly) passed a boolean flag to trigger the allocation, but the method interpreted the flag as if it meant "do not allocate". That is, the commit tried to delay allocations but actually allocated early.
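An illustrative reconstruction of that inversion (hypothetical signature and body, not the original maybeMakeSpaceAvailable() code):

```cpp
#include <cstddef>

// Hypothetical sketch of the inverted-flag bug: callers pass grow == true to
// request an allocation just before read(2), but the method treats the flag
// as if it meant "do not allocate". Net effect: the early caller (scheduling
// the read) allocates, and the late caller (about to read) does not -- the
// opposite of the intended delay.
static void maybeMakeSpaceAvailableBuggy(const bool grow, size_t &capacity,
                                         const size_t wantedCapacity)
{
    if (grow)
        return;                  // wrong: bails out exactly when asked to allocate
    capacity = wantedCapacity;   // wrong: "allocates" when the caller asked it not to
}
```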

Why did the 1ad6851 bug go unnoticed since 2015? Outside observers could not see the problem because Squid continued to allocate the buffer, and there was no code that would fiddle with that buffer while Squid was waiting for the socket to become ready for reading (unlike bug 4206, which served as the inspiration for commit 1ad6851).

Also fixed "read_ahead_gap none" (i.e. unlimited gap) handling that was broken since commit 5791270 (at least). I am not sure we should support this -- the admin can just give a large positive value -- but now we do.

Also improved reply_header_max_size validation to help adjust (incorrect) admin assumptions that Squid can handle response headers exceeding SBuf::maxSize.

Also improved tcp_recv_bufsize/read_ahead_gap documentation and validation.

rousskov avatar Oct 07 '20 15:10 rousskov

There are several options for configuring HttpStateData::inBuf capacity:

  1. read_ahead_gap: This is dead wrong, for many reasons (partially detailed in the PR description). This is what Squid is using today. BTW, this PR updates the read_ahead_gap description if you need more information about that directive. I am listing this non-option here for completeness' sake only.

  2. tcp_recv_bufsize: This is what this PR is currently using. It is simple.

  3. getsockopt(SO_RCVBUF) call: Probably as accurate as we can get without major Squid changes. Accommodates dynamic changes in kernel buffering. Requires an extra system call per read(2) system call. We could also call this just once per HTTP transaction with the server, before the first read(). See tcp(7) for more info. (A sketch of this call appears after this list.)

  4. http_read_buffer_max_size: A new dedicated squid.conf directive (the exact name is to be decided). This is probably the simplest option, but it kind of duplicates (if not contradicts) a lot of tcp_recv_bufsize functionality. The only realistic use case where folks should configure http_read_buffer_max_size while leaving tcp_recv_bufsize in its default/unused state, is when they are sure what the kernel socket buffer size is and do not want to pay the price of an extra system call to set that size explicitly on every new socket.

  5. SQUID_TCP_SO_RCVBUF: Simple but has many problems. See PR commit 7de95e2 for a bulleted list.
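As referenced in option 3 above, a minimal sketch of the getsockopt(SO_RCVBUF) probe (error handling and caching policy omitted):

```cpp
#include <sys/socket.h>

// Ask the kernel how big the socket's receive buffer currently is. Note that
// on Linux, getsockopt(SO_RCVBUF) reports the doubled value that the kernel
// keeps to account for bookkeeping overhead; see socket(7).
static int kernelRecvBufferSize(const int fd)
{
    int size = 0;
    socklen_t len = sizeof(size);
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, &len) != 0)
        return -1; // caller falls back to a configured or compiled-in default
    return size;
}
```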

@squid-cache/commiters, which option do you prefer and why? If there are no strong opinions, we will probably go with tcp_recv_bufsize.

rousskov avatar Oct 07 '20 15:10 rousskov

There are several options for configuring HttpStateData::inBuf capacity:

We have many other similar buffers which need to be fixed, for example in ssl/bio.cc, tunnel.cc, and the ICAP code.

2. tcp_recv_bufsize: This is what this PR is currently using. It is simple.

Looks the best. If this is not defined by the user, we can use SQUID_TCP_SO_RCVBUF. Currently, SQUID_TCP_SO_RCVBUF is derived from the system default TCP read buffer value at compile time. Moreover, it is limited by Squid to at most 65535. My system gives 212992 as the default value. Questions:

  1. Should we increase the 65535 maximum?
  2. Is there any real reason for defining SQUID_TCP_SO_RCVBUF at compile time instead of computing this value at Squid startup? In the first case, the TCP configuration of a development machine is used; in the second case, the TCP configuration of the machine running Squid is used (see the sketch below).
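A hypothetical sketch for question 2: reading the running system's default TCP receive buffer once at startup (assuming the Linux /proc layout where net.ipv4.tcp_rmem holds "min default max"; the fallback value is an arbitrary placeholder):

```cpp
#include <cstdio>

static long defaultTcpRcvBufAtStartup()
{
    long rmin = 0, rdef = 0, rmax = 0;
    if (FILE *f = std::fopen("/proc/sys/net/ipv4/tcp_rmem", "r")) {
        const int fields = std::fscanf(f, "%ld %ld %ld", &rmin, &rdef, &rmax);
        std::fclose(f);
        if (fields == 3)
            return rdef; // the per-socket default the kernel starts with
    }
    return 64 * 1024; // placeholder fallback when the proc file is unavailable
}
```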
3. getsockopt(SO_RCVBUF) call: Probably as accurate as we can get without major Squid changes. Accommodates dynamic changes in kernel buffering. Requires an extra system call per read(2) system call. We could also call this just once per HTTP transaction with the server, before the first read(). See [tcp(7)](https://man7.org/linux/man-pages/man7/tcp.7.html) for more info.

Requires two system calls, but in how many cases will a buffer bigger than the default read buffer size be used?

4. http_read_buffer_max_size: A new dedicated squid.conf directive (the exact name is to be decided). This is probably the simplest option, but it kind of duplicates (if not contradicts) a lot of tcp_recv_bufsize functionality. The only realistic use case where folks should configure http_read_buffer_max_size while leaving tcp_recv_bufsize in its default/unused state, is when they are sure what the kernel socket buffer size is and do not want to pay the price of an extra system call to set that size explicitly on every new socket.

I do not think a setsockopt call for each new socket is a huge overhead. But in case it is, Squid can get the default TCP read buffer size at startup if tcp_recv_bufsize is not defined and use it to allocate read buffers. If the read buffer size is the socket default, there is no need for a setsockopt(SO_RCVBUF) call.

@squid-cache/commiters, which option do you prefer and why? If there are no strong opinions, we will probably go with tcp_recv_bufsize.

chtsanti avatar Oct 09 '20 14:10 chtsanti

We have many other similar buffers which need to be fixed, for example in ssl/bio.cc, tunnel.cc, and the ICAP code.

Yes, but I think that is a secondary concern because those other TCP receive buffers can be adjusted in a similar fashion, regardless of the solution we pick here.

If this is not defined by the user, we can use SQUID_TCP_SO_RCVBUF.

Again, I think this is a secondary concern. This PR already uses SQUID_TCP_SO_RCVBUF if tcp_recv_bufsize is unset. And SQUID_TCP_SO_RCVBUF calculation adjustments (including the timing of that calculation) are a separate topic IMO. Clearly, we can improve a lot of things there, but those improvements (or the lack thereof) are unlikely to affect the best choice among the four presented alternatives.

in how many cases will a buffer bigger than the default read buffer size be used?

In a high-performance environment, I would expect many kernel socket read buffers to vary in size with time (by default):

tcp_moderate_rcvbuf (Boolean; default: enabled; since Linux 2.4.17/2.6.7)
    If enabled, TCP performs receive buffer auto-tuning, attempting to automatically size the buffer
    (no greater than tcp_rmem[2]) to match the size required by the path for full throughput.

The presence of these auto-tuning algorithms (and their many parameters) is exactly what makes the getsockopt(SO_RCVBUF) option attractive: If we think that calculating the right buffer size dynamically is usually better than a static setting, then we should ask the kernel for those calculation results and use them. The admin will assist the kernel algorithms using existing OS settings and/or Squid configuration options.

rousskov avatar Oct 09 '20 16:10 rousskov

in how many cases will a buffer bigger than the default read buffer size be used?

In a high-performance environment, I would expect many kernel socket read buffers to vary in size with time (by default):

It would be interesting to see test programs that simulate various cases. For example, to check whether and how the Linux kernel decreases the read buffer sizes. My sense is that it only increases the buffer to meet new needs, when required.

For example, consider the case where Squid, for some reason (a reconfiguration or an external event), did not drain its socket buffers fast enough for a moment. The socket read buffer grows to handle the new data; then Squid returns, retrieves all the data, and operates normally again, and the data in the buffers remains small. But Squid may still allocate huge buffers because of the increased socket buffer sizes.

I must note that the read buffers are big (currently 65535 bytes, and we may want to grow them), they may become huge (the maximum allowed socket read buffer size is 6MB on my system), and these buffer sizes are not covered by Squid memory pools.

In practice, the problem this issue is trying to solve comes from huge allocations before any read. And part of this issue comes from SBuf re-allocations in cases we believed were covered very well by SBuf's cheap copy/substring behavior.

I have not seen any other software that uses getsockopt(SO_RCVBUF) before each read to retrieve data. I think it would be more practical to use recv() with MSG_PEEK to implement such behaviour, but I am still not a fan of this implementation.
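For reference, a minimal sketch of that recv(MSG_PEEK) idea (illustration only; MSG_DONTWAIT keeps the probe from blocking on an empty queue):

```cpp
#include <sys/socket.h>
#include <sys/types.h>

// Hypothetical probe: look at up to probeSize queued bytes without consuming
// them, as a way to learn how much data is already waiting in the kernel
// buffer before sizing (or reusing) the Squid-side read buffer.
static ssize_t peekQueuedBytes(const int fd, char *probe, const size_t probeSize)
{
    return recv(fd, probe, probeSize, MSG_PEEK | MSG_DONTWAIT);
}
```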

Personally, I strongly believe that we need to implement one Squid read buffer memory allocation before we start downloading any data from the remote server. We need to allocate the best size, but it should be allocated once. Just an opinion, however.

chtsanti avatar Oct 10 '20 08:10 chtsanti

read_ahead_gap is clearly documented as the limitation on how many bytes are read "when retrieving an object from another server". I do not see how it can be "mostly unrelated" to server I/O buffer limit when it has that purpose.

read_ahead_gap is the Squid internal buffer size for server I/O. Renaming the directive to something more easily understood and/or adjusting its default value seems appropriate to me. Dropping it entirely does not seem right.

tcp_recv_bufsize is itself badly named since Squid (ab)uses it to size two buffers within the kernel (SO_RCVBUF and SO_SNDBUF). Switching the Squid inBuf to also use it will just further increase the memory allocation above what admins expect to see. It is reasonable to expect that setting a buffer limit to N will only allocate those N bytes, not a minimum of 3*N.

yadij avatar Oct 17 '20 10:10 yadij

TLDR: @yadij, my question is about HttpStateData::inBuf capacity. It is not clear to me what you mean by "server I/O buffer", but it feels like you are talking about something else. I will try to resolve that miscommunication below, but please try to stay on topic of HttpStateData::inBuf (i.e. individual network read(2) buffer) sizing, which is quite far from the topic of read-ahead data accumulation in Squid (that accumulation happens elsewhere).

read_ahead_gap is clearly documented as the limitation on how many bytes are read "when retrieving an object from another server".

The above paraphrasing and the excessive focus on the least important (quoted) part of the documentation can be misleading IMO. The read_ahead_gap directive is actually documented as The amount of data the cache will buffer ahead of what has been sent to the client when retrieving an object from another server. That official phrasing is imprecise and can mislead as well, but its careful interpretation in the right context does reveal the true intent behind the directive. That intent focuses on accumulation of data across multiple network reads and is mostly unrelated to individual network reads or the buffer used for those individual network reads. The read_ahead_gap is mostly about things like accommodating slow client(s)/fast server combinations without running out of RAM over the course of client transaction(s), not about an individual network read(2) call.

I do not see how it can be "mostly unrelated" to server I/O buffer limit

It all depends on what you call "server I/O buffer". My question was about HttpStateData::inBuf, which is a buffer used for individual network read(2) calls and not meant for read-ahead accumulation of data for client consumption (beyond short-term parsing and other special HttpStateData needs essentially irrelevant to my question).

The following simple examples illustrate the difference between two mostly unrelated concepts, the read-ahead data accumulation and capacity of a buffer used for individual network read(2) call:

  1. A proxy can buffer ~10 MB worth of (read-ahead) bytes using a 1-byte read(2) buffer. If the client is not ready to receive a 20MB response until the entire response is received by Squid, accumulating that data will take 20 million reads, and the read ahead gap will become 20 MB in size after the last single-byte read(2). On average, the read ahead gap will be about 10 MB in size before the entire response is received: (1+2+3+... 20 MB)/20M = 10MB.

  2. A proxy can buffer ~0 bytes worth of (read-ahead) bytes using a 20 MB I/O buffer: If the client "immediately" consumes everything the proxy has to offer, and the server very slowly supplies data in 20 MB pieces, then there will be no data accumulation beyond that single read(2) call. While the read-ahead gap will be 20 MB in size immediately after each read(2) call, the average read ahead gap size will be about zero due to "immediate" client consumption of the read data.

read_ahead_gap is the Squid internal buffer size for server I/O.

Yes, but only if you interpret "buffer size for server I/O" as the amount of data Squid may accumulate across multiple network reads for the purpose of feeding it to the clients. This is unrelated to HttpStateData::inBuf my question is about. I do not recommend this terminology because, when seeing "buffer for ... I/O", many folks will think about a buffer used for individual network I/Os, like HttpStateData::inBuf.

Renaming the directive to something more easily understood and/or adjusting its default value seems appropriate to me. Dropping it entirely does not seem right.

The read_ahead_gap name is acceptable and removal of that key directive is impossible. My question is not about read_ahead_gap though.

tcp_recv_bufsize is itself badly named since Squid (ab)uses it to size two buffers within kernel (SO_RCVBUF and SO_SNDBUF).

Well, I would not blame the tcp_recv_bufsize directive for that (ab)use in commSetTcpRcvbuf(). Let's just add tcp_send_bufsize to avoid contaminating the already complex issue with SO_SNDBUF concerns. That addition is probably best done in a dedicated follow-up PR. We can just keep that disentanglement in mind while answering the pending question about inBuf.

Switching the Squid inBuf to also use it will just further increase the memory allocation above what admin expect to see. It is reasonable to expect that setting a buffer limit to N will only allocate those N bytes, not a minimum of 3*N.

If we decide to use tcp_recv_bufsize for inBuf sizing, then the updated directive documentation will make it clear that setting tcp_recv_bufsize will affect both the kernel socket read buffer and Squid read(2) buffer. Keeping the two closely-related buffers in sync is natural, and a reasonable admin should easily understand what is going on. SO_SNDBUF red herring aside, no matter what we do in the foreseeable future, there will be two read(2) "buffers" (one in kernel and one in Squid), so 2*N (or, with a new explicitly set option, N + M) is something we have to live with.

Currently, Squid does not have a directive dedicated to control of the size of Squid buffer used for individual server network read(2) calls -- HttpStateData::inBuf. The buggy official code abuses a read_ahead_gap directive meant (and used!) for a completely different concept. Reusing tcp_recv_bufsize for inBuf configuration is one of the options on the table. There are others. If you have a preference among those options, then please announce and support your choice. If you can suggest a better option, then please do that.

rousskov avatar Oct 17 '20 19:10 rousskov

Christos: It would be interesting to see test programs that simulate various cases. For example, to check whether and how the Linux kernel decreases the read buffer sizes. My sense is that it only increases the buffer to meet new needs, when required.

I speculate that your suspicion is correct -- the Linux kernel is not my area of expertise, but I could not quickly find any kernel code that decreases the socket buffer size for a given socket. I only found code that increases that size.

I strongly believe that we need to implement one Squid read buffer memory allocation before we start downloading any data from the remote server. We need to allocate the best size, but it should be allocated once.

While it is not clear (to me) why you believe that allocate-once is always the best strategy, I hope that this particular PR will continue to follow a very similar strategy just because it is much simpler than persistent dynamic resizing. Our focus in this PR should be on fixing the known wrong-inBuf-capacity bug IMO. That fix alone is rather difficult to get right due to several other bugs in the surrounding code! If at all possible, persistent dynamic resizing schemes (if any) and SBuf handling optimizations (to prevent needless reallocations) should be investigated separately. Options 1, 3, and 4 in my question follow this strategy.

rousskov avatar Oct 18 '20 14:10 rousskov

While it is not clear (to me) why you believe that allocate-once is always the best strategy

Not always. For the Squid case, and also taking into account Squid's existing code and current issues. A possible issue is the memory allocation for large buffers over 65535 bytes, which are not covered by Squid memory pools. I know we can make a special object pool to keep such buffers. But on the other hand, a burst of 10K connections will result in about 640MB of memory for buffers, which will remain in the pools as unused but allocated memory after the number of connections decreases. (Probably we need to implement auto-freeing of objects from memory/object pools when they are no longer required.)

chtsanti avatar Oct 19 '20 14:10 chtsanti

I do not plan on making more changes and have removed the Draft/WIP designation from this PR. This complex PR deserves another pair of eyes, so I will keep it open for at least ten more days. I updated the PR description to reflect the current changes.

As far as the earlier configuration question is concerned, the PR continues to use tcp_recv_bufsize for HttpStateData::inBuf sizing. Both @chtsanti and I liked that simple option. AFAICT, @yadij's primary objection and even previous changes like commit 5791270 were based on a misunderstanding of what inBuf is, and of where read-ahead content accumulation actually happens in Squid. Nobody else voiced an opinion.

All the newly added TODOs and XXXs in this PR are meant for future work. They are outside this PR's scope. Most are either difficult to address properly or require a dedicated PR because they change Squid operation in a significant way that should be tracked by its own commit. Some may require additional research/discussions. I foresee at least two near-term PRs addressing some of those newly documented problems.

Finally, there will be another PR dedicated to SBuf optimizations that were previously located in this PR. I yanked those changes because they are not closely linked to the primary problem being fixed in this PR, they affect SBuf uses beyond this PR, and they would be much easier to review in a dedicated PR. Besides, this PR is already large/complex enough without SBuf optimizations.

rousskov avatar Nov 03 '20 03:11 rousskov

AFAICT, @yadij's primary objection and even previous changes like commit 5791270 were based on a misunderstanding of what inBuf is, and of where read-ahead content accumulation actually happens in Squid. Nobody else voiced an opinion.

I don't think I am misunderstanding the directive(s) mentioned. I agree the code could use them better. But the complete detachment of input from read_ahead_gap still looks wrong to me; I want to get this clarified and agreed before this goes any further.

...

Currently, Squid does not have a directive dedicated to control of the size of Squid buffer used for individual server network read(2) calls -- HttpStateData::inBuf. The buggy official code abuses a read_ahead_gap directive meant (and used!) for a completely different concept.

NP: the size of inBuf is a sub-set of the message bytes received and not yet sent to client. The read-ahead concept as I understand it forbids that buffer being larger than read_ahead_gap value. In most cycles that buffer should actually be considered as not being able to read/resize (too many bytes pending for ICAP etc).

TLDR: @yadij, my question is about HttpStateData::inBuf capacity. It is not clear to me what you mean by "server I/O buffer", but it feels like you are talking about something else.

I am talking about inBuf. see below.

read_ahead_gap is clearly documented as the limitation on how many bytes are read "when retrieving an object from another server".

The above paraphrasing and the excessive focus on the least important (quoted) part of the documentation can be misleading IMO. The read_ahead_gap directive is actually documented as The amount of data the cache will buffer ahead of what has been sent to the client when retrieving an object from another server. That official phrasing is imprecise and can mislead as well, but its careful interpretation in the right context does reveal the true intent behind the directive. That intent focuses on accumulation of data across multiple network reads and is mostly unrelated to individual network reads or the buffer used for those individual network reads. The read_ahead_gap is mostly about things like accommodating slow client(s)/fast server combinations without running out of RAM over the course of client transaction(s), not about an individual network read(2) call.

NP: A lot of those extremely old documentation texts use the words "the cache" to mean Squid as a whole. I am interpreting the description in light of that and how it is used through Squid.

read-ahead concept places a limit on whether an individual read(2) call is possible, and how many bytes that read(2) is allowed to receive. Each of those read(2) is a "retrieval" of bytes from the server.

Note also that read_ahead_gap is only used for the read(2) of payload bytes on messages which exceed 64KB. Smaller messages use the max header size (64KB) for sizing inBuf.

Currently the inBuf math is assuming that bytes in-transit are not pruned out of inBuf. Which is consistent with the SBuf model ... more below on that. BUT, since SBuf is not yet universal, the logic consume()'ing from inBuf may be causing wrong accounting of how many bytes are in transit.

I do not see how it can be "mostly unrelated" to server I/O buffer limit

It all depends on what you call "server I/O buffer". My question was about HttpStateData::inBuf, which is a buffer used for individual network read(2) calls and not meant for read-ahead accumulation of data for client consumption (beyond short-term parsing and other special HttpStateData needs essentially irrelevant to my question).

inBuf is what I am calling "server I/O buffer". Being an SBuf it is an object shared between the read(2), Parser, and the code delivering to the next outbound hop (client, ICAP, or cache storage). Thus the current code assumes that not-yet-delivered data remains visible in that buffer between read(2) operations.

There are currently some implications of the SBuf transition not yet being finished: inBuf may incorrectly have bytes consumed() out of it when they are passed to a non-SBuf object's storage.

The following simple examples illustrate the difference between two mostly unrelated concepts, the read-ahead data accumulation and capacity of a buffer used for individual network read(2) call:

1. A proxy can buffer ~10 MB worth of (read-ahead) bytes using a 1-byte read(2) buffer. If the client is not ready to receive a 20MB response until the entire response is received by Squid, accumulating that data will take 20 million reads, and the read ahead gap will become 20 MB in size after the last single-byte read(2). On average, the read ahead gap will be about 10 MB in size before the entire response is received: (1+2+3+... 20 MB)/20M = 10MB.

This is "not a problem" ... transfer happens (albeit unacceptably slow) and Squid remains with admin configured memory limits.

2. A proxy can buffer ~0 bytes worth of (read-ahead) bytes using a 20 MB I/O buffer: If the client "immediately" consumes everything the proxy has to offer, and the server very slowly supplies data in 20 MB pieces, then there will be no data accumulation beyond that single read(2) call. While the read-ahead gap will be 20 MB in size immediately after each read(2) call, the average read ahead gap size will be about zero due to "immediate" client consumption of the read data.

The first problem is that Squid read(2) calls occur as soon as any bytes are available from the server. Squid may be allocating a 20MB buffer and filling in 1 byte from the server each cycle - exactly the bad behaviour of the previous case. But worse: the memory allocations are huge, mostly unused, and non-pooled because they are too big.

  • All we can do in this case is limit the buffer to something close to the read-ahead gap, to ensure that server read(2) calls get throttled by client write(2) calls (or equivalent). That helps the kernel hopefully accumulate more bytes between read(2) attempts.
  • We have several bug reports today where this memory bloating behaviour apparently happens already despite all the buffering bugs that reduce operation sizes.

The second problem is that reality has shown there to be a third, quite common case: reverse-proxy and CDN deployments, where servers are fast and the clients slow the server down. TCP may deliver that whole 20MB in a few dozen jumbo packets, then has to wait a long time for Squid to empty the buffer down to the read-ahead gap before the next read(2) is allowed to be performed. The longer that wait, the more chance TCP or NAT will timeout the connection. IIRC this was one of the issues driving the update to use read_ahead_gap on inBuf.

read_ahead_gap is the Squid internal buffer size for server I/O.

Yes, but only if you interpret "buffer size for server I/O" as the amount of data Squid may accumulate across multiple network reads for the purpose of feeding it to the clients. This is unrelated to HttpStateData::inBuf my question is about. I do not recommend this terminology because, when seeing "buffer for ... I/O", many folks will think about a buffer used for individual network I/Os, like HttpStateData::inBuf.

Whether you interpret it as the I/O buffer size, or the total of all in-transit bytes - the read(2) size cannot be allowed to be much larger.

Renaming the directive to something more easily understood and/or adjusting its default value seems appropriate to me. Dropping it entirely does not seem right.

The read_ahead_gap name is acceptable and removal of that key directive is impossible. My question is not about read_ahead_gap though.

Huh? That does not follow.

SO_SNDBUF red herring aside, no matter what we do in the foreseeable future, there will be two read(2) "buffers" (one in kernel and one in Squid), so 2*N (or, with a new explicitly set option, N + M) is something we have to live with.

If you are going to insist on those 4 options being the only ones implemented, then option #4 is the closest to correct. We still have the issue of read_ahead_gap causing timeouts in the server TCP connection if that is much smaller than inBuf when clients are slow.

yadij avatar Nov 09 '20 03:11 yadij

the size of inBuf is a sub-set of the message bytes received and not yet sent to client.

The above statement is technically true but is largely irrelevant because inBuf is (and should be) usually empty. Thinking of inBuf as a buffer that becomes empty immediately after each read(2) may help understand what is actually going on by ignoring noise and focusing on the essence of the concepts we are dealing with here.

The read-ahead concept as I understand it forbids that buffer being larger than read_ahead_gap value.

Same thing: A technically true but largely unimportant fact. Yes, minor caveats aside, the read-ahead limit is one of the upper limits on the inBuf capacity. Naturally, one does not want to read more than one is allowed to accumulate. What is important here is that the read-ahead limit is just one of the limits -- it does not determine a desirable inBuf capacity in many cases; it only limits it.

In most cycles that buffer should actually be considered as not being able to read/resize (too many bytes pending for ICAP etc).

Minor caveats aside: In most cycles, inBuf is empty. If the client is faster than the server, then a new read(2) into inBuf is usually possible. Resizing should indeed be rare except during the headers-to-body transition (because of our current header and body parsing code implementation specifics).

A lot of those extremely old documentation texts use the words "the cache" to mean Squid as a whole. I am interpreting the description in light of that and how it is used through Squid.

Sure, that is fine, but since inBuf is usually empty, its role is not really important in understanding of "the cache" concept, for any reasonable interpretation of "the cache". The primary data storage and accumulation (that read-ahead is focusing on) happens outside of inBuf. In the closer-to-ideal implementation, inBuf would be a local variable that exists just for the read(2) call. In the ideal implementation, inBuf probably would not exist at all -- the kernel would read into Squid storage directly.

read-ahead concept places a limit on whether an individual read(2) call is possible, and how many bytes that read(2) is allowed to receive. Each of those read(2) is a "retrieval" of bytes from the server.

Yes. And there are many other upper limits like that read-ahead limit. None of them can be ignored, of course, but no single one of those limits can determine, in isolation, the inBuf capacity (in the general case). They are just that -- upper limits.

Perhaps it will help to look at this from a different angle: Imagine a situation where the admin does not want to limit read ahead at all (or limits it to some huge value like 100GB). What should inBuf (and maximum single read(2) size) be in that case? That is what my question was about. Not the other (external to inBuf) upper limits on reading.
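A toy sketch of that point (hypothetical names, not Squid code): the read-ahead allowance is only one of several clamps on how much a single read(2) may fetch, and none of them alone dictates the buffer capacity:

```cpp
#include <algorithm>
#include <cstdint>

static int64_t maxSingleReadSize(const int64_t freeBufferSpace,     // space left in the read(2) buffer
                                 const int64_t readAheadAllowance,  // read_ahead_gap minus bytes already buffered ahead of the client
                                 const int64_t otherLimit)          // any other transaction-specific cap (hypothetical)
{
    // The smallest applicable upper bound wins; no single bound determines
    // the read size (or the buffer capacity) on its own.
    return std::min({freeBufferSpace, readAheadAllowance, otherLimit});
}
```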

Currently the inBuf math is assuming that bytes in-transit are not pruned out of inBuf.

Currently, inBuf is pruned, as it has always been. It is usually empty. That is not going to change in the foreseeable future. The current code is broken in many ways, possibly because it has been modified using wrong assumptions.

inBuf is what I am calling "server I/O buffer". Being an SBuf it is an object shared between the read(2), Parser, and the code delivering to the next outbound hop (client, ICAP, or cache storage).

Yes, I hope the above discussion clarifies why this is not important.

The following simple examples illustrate the difference between two mostly unrelated concepts, the read-ahead data accumulation and capacity of a buffer used for individual network read(2) call:

  1. A proxy can buffer ~10 MB worth of (read-ahead) bytes using a 1-byte read(2) buffer.

This is "not a problem" ... transfer happens (albeit unacceptably slow) and Squid remains with admin configured memory limits.

Yes, I was not documenting a problem. I was providing an illustration where there is virtually no connection between the read-ahead limit and inBuf size.

  2. A proxy can buffer ~0 bytes worth of (read-ahead) bytes using a 20 MB I/O buffer

Again, not a problem. Just an illustration that there is no one-determines-another relationship between the two concepts.

The second problem is that reality has shown there to be a third, quite common case: reverse-proxy and CDN deployments, where servers are fast and the clients slow the server down. TCP may deliver that whole 20MB in a few dozen jumbo packets, then has to wait a long time for Squid to empty the buffer down to the read-ahead gap before the next read(2) is allowed to be performed. The longer that wait, the more chance TCP or NAT will timeout the connection. IIRC this was one of the issues driving the update to use read_ahead_gap on inBuf.

The decision to make inBuf as big as the read_ahead_gap was wrong regardless of what use case it was trying to accommodate. I am not sure it would help with the above use case, but it would certainly hurt in other use cases. A read_ahead_gap value is just an upper bound and should be treated as such.

Whether you interpret it as the I/O buffer size, or the total of all in-transit bytes - the read(2) size cannot be allowed to be much larger.

We agree on that.

Renaming the directive to something more easily understood and/or adjusting its default value seems appropriate to me. Dropping it entirely does not seem right.

The read_ahead_gap name is acceptable and removal of that key directive is impossible. My question is not about read_ahead_gap though.

Huh? That does not follow.

I see no problems with my statement. Hopefully, the discussion above will clarify why my question about inBuf capacity was not about read_ahead_gap.

If you are going to insist on those 4 options being the only ones implemented.

I am always open to better options (that do not require implementing serious new features -- we are just fixing bugs here), but none was offered so far.

Then option #4 is the closest to correct.

I do not understand how a compiled-in constant can be closest to the correct value, which certainly varies from one use case to another. See PR commit 7de95e2 for a bulleted list of problems with that option.

We still have the issue of read_ahead_gap causing timeouts in the server TCP connection if that is much smaller than inBuf when clients are slow.

Using tcp_recv_bufsize would allow the admin to configure read_ahead_gap independently of inBuf capacity, so if the problem you want to solve has a solution at all (within Squid capabilities), then the admin would be able to configure Squid to match that solution.

rousskov avatar Nov 09 '20 04:11 rousskov

This PR may fix bug 5170 as well, although these changes were not focused on or driven by that specific problem (which was reported a year after the last PR comment at https://github.com/squid-cache/squid/pull/736#issuecomment-723746840).

rousskov avatar Oct 26 '21 14:10 rousskov