ureq
into_string might double memory usage
If the charset feature is not enabled, Response::into_string uses String::from_utf8_lossy to convert the body &[u8] to a string.
https://github.com/algesten/ureq/blob/cfaca317c61d6ba4307c67789b309adcdf940848/src/response.rs#L365
Ok(String::from_utf8_lossy(&buf).to_string())
from_utf8_lossy returns a Cow<str>. In the cases where the original &[u8] is valid utf-8, the result is a Cow::Borrowed; however, if it encounters illegal utf-8, it inserts the replacement character (U+FFFD) to make the string valid. The replacement character requires 3 bytes in UTF-8, which means it can't be inserted "in place" into an existing Vec<u8>.
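For illustration, the Cow behavior looks like this (a standalone sketch, not ureq code):

```rust
use std::borrow::Cow;

fn main() {
    // Valid UTF-8: from_utf8_lossy borrows, no allocation happens.
    let valid: &[u8] = b"hello";
    assert!(matches!(String::from_utf8_lossy(valid), Cow::Borrowed(_)));

    // 0xFC is 'ü' in ISO-8859-1 but invalid UTF-8 on its own.
    let invalid: &[u8] = &[b'h', 0xFC, b'i'];
    let fixed = String::from_utf8_lossy(invalid);
    assert!(matches!(fixed, Cow::Owned(_)));

    // The replacement character U+FFFD is 3 bytes in UTF-8,
    // so one invalid byte grew into three.
    assert_eq!(&*fixed, "h\u{FFFD}i");
    assert_eq!(fixed.len(), 2 + 3);
}
```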
Replacement chars aside, though, as it stands we use to_string(), which forces a full copy, and thus double memory usage, every time.
In #351 we size limit bodies to 10MB, but into_string() will in practice, for a short period, hold both the original Vec<u8> and the String in memory, pushing the actual memory usage up to 20MB.
One improvement would be to check whether the entire body is valid utf-8, and if so use from_utf8(buf).unwrap() to convert the Vec<u8>. This will, however, not solve the case where we encounter invalid utf-8.
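A sketch of that check (the helper name is illustrative, not actual ureq code). String::from_utf8 consumes the Vec<u8> and hands the allocation back inside the error on failure, so the valid-UTF-8 path costs nothing extra:

```rust
// Hypothetical helper: zero-copy when the body is valid UTF-8,
// falling back to the lossy (allocating) path otherwise.
fn body_to_string(buf: Vec<u8>) -> String {
    match String::from_utf8(buf) {
        // Valid UTF-8: reuse the Vec's allocation, no copy made.
        Ok(s) => s,
        // Invalid UTF-8: the error still owns the original bytes,
        // so we can run the lossy conversion on them.
        Err(e) => String::from_utf8_lossy(e.as_bytes()).into_owned(),
    }
}

fn main() {
    assert_eq!(body_to_string(b"hello".to_vec()), "hello");
    assert_eq!(body_to_string(vec![b'h', 0xFC]), "h\u{FFFD}");
}
```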
It might be tempting to take a hardline position of "invalid utf-8 is invalid, just reject it"; however, related to this is an issue @Shnatsel kindly reported in my other project: https://github.com/algesten/hreq/issues/38 – "25718 websites out of the top million from Feb 3 Tranco list are affected" – which might be a little too frequent to ignore.
Maybe the intermediate solution is to convert the Vec<u8> in place when we can, and accept the double memory usage when we can't. A more elaborate solution could be to recreate the relevant parts of from_utf8_lossy, but instead of the replacement char, use something in the ascii range so that the utf-8 can be fixed in place.
In hreq I tried writing a from_utf8_lossy_replacement where the caller provides a US-ASCII replacement char.
https://github.com/algesten/hreq/blob/4e97e7b6939ee060f940b6063e1f0c3c8cac1deb/src/from_utf8.rs
Not sure it's a great idea, but it avoids the problem by mangling the content. Example site that has this problem: https://www.c64games.de – this site doesn't declare any character encoding, neither in an HTTP header nor in HTML meta tags. It only displays correctly when a browser assumes a default of latin-1.
Also curl seems to replace üäö etc in that example with ?.
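The in-place idea can be sketched roughly like this (a simplified illustration of the approach, not the actual hreq from_utf8_lossy_replacement):

```rust
// Rough sketch: overwrite every byte that can't participate in
// valid UTF-8 with a single ASCII replacement byte, so the Vec<u8>
// can be repaired in place without ever growing.
fn fix_utf8_in_place(mut buf: Vec<u8>, replacement: u8) -> String {
    assert!(replacement.is_ascii());
    let mut i = 0;
    while i < buf.len() {
        match std::str::from_utf8(&buf[i..]) {
            Ok(_) => break, // the rest is already valid
            Err(e) => {
                let valid = e.valid_up_to();
                // error_len() is None only for a sequence truncated
                // at the very end of the input.
                let bad = e.error_len().unwrap_or(buf.len() - i - valid);
                for b in &mut buf[i + valid..i + valid + bad] {
                    *b = replacement;
                }
                i += valid + bad;
            }
        }
    }
    // Every byte is now valid UTF-8 (ASCII replacements included).
    String::from_utf8(buf).unwrap()
}

fn main() {
    // 0xFC is 'ü' in ISO-8859-1, invalid as UTF-8.
    let latin1 = vec![b'g', b'e', b'p', b'r', 0xFC, b'f', b't'];
    assert_eq!(fix_utf8_in_place(latin1, b'?'), "gepr?ft");
}
```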
> In #351 we size limit bodies to 10MB, but into_string() will in practice for a short period hold both the original Vec<u8> and the String in memory pushing the actual memory usage up to 20MB.
I am not concerned about this. We don't try to guarantee that memory usage will never exceed 10MB, just that it is bounded.
> It might be tempting to take a hardline "invalid utf-8 is invalid, just reject it", however related to this is an issue @Shnatsel kindly reported in my other project. algesten/hreq#38 – "25718 websites out of the top million from Feb 3 Tranco list are affected." – which might be a little too frequent to ignore.
I find this pretty tempting! Note that the problem in algesten/hreq#38 is the panic, rather than the fact that hreq rejects such websites. The problem with the lossy approach is that it may round-trip incorrectly. For instance, imagine a JSON API where you fetch an object, change one field, and POST it back. If there was an unrelated field with invalid UTF-8, when you POST the object back, you might be inadvertently corrupting that other field (this example is imperfect because you would hopefully be using into_json()).
Or for a more concrete example, http://netclusive.de/ linked in algesten/hreq#38 actually declares an encoding in the body: <?xml version="1.0" encoding="ISO-8859-1" ?>. By using from_utf8_lossy, we incorrectly decode that site and hide the error. The more correct thing to do would be for a user of hreq to read the body into a Vec<u8>, parse a little bit looking for encoding declarations, and then, if applicable, try to decode as ISO-8859-1.
into_string() is a convenience method for the happy path when everything is UTF-8. I think it's okay to return an error for encodings that we can't see at the HTTP layer, and ask the hreq user to do something more sophisticated.
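For reference, the ISO-8859-1 fallback such a user would write is small, since every Latin-1 byte maps to the Unicode code point of the same value (an illustrative sketch, not an API of ureq or hreq):

```rust
// ISO-8859-1 (Latin-1) to String: each byte is exactly the Unicode
// code point of the same value, so this conversion can never fail.
fn latin1_to_string(buf: &[u8]) -> String {
    buf.iter().map(|&b| b as char).collect()
}

fn main() {
    // 0xF6 = 'ö' in ISO-8859-1.
    let body = [b'k', 0xF6, b'n', b'n', b'e', b'n'];
    assert_eq!(latin1_to_string(&body), "können");
}
```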
> Example site that has this: https://www.c64games.de – this site doesn't provide any content encoding. Neither in HTTP header, nor in html meta tags. It only displays correctly when a browser assumes a default of latin-1.
This is an interesting example! It has <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> as the doctype.
Here's what the HTML 4.0 spec says: https://www.w3.org/TR/html40/charset.html#h-5.2.2
> The HTTP protocol ([RFC2616], section 3.7.1) mentions ISO-8859-1 as a default character encoding when the "charset" parameter is absent from the "Content-Type" header field. In practice, this recommendation has proved useless because some servers don't allow a "charset" parameter to be sent, and others may not be configured to send the parameter. Therefore, user agents must not assume any default value for the "charset" parameter.
The more modern https://tools.ietf.org/html/rfc7231#appendix-B says:
> The default charset of ISO-8859-1 for text media types has been removed; the default is now whatever the media type definition says. Likewise, special treatment of ISO-8859-1 has been removed from the Accept-Charset header field. (Section 3.1.1.3 and Section 5.3.3)
Also, in writing this up it occurred to me that we should perhaps be sending Accept-Charset: utf-8, or Accept-Charset: <list>, when built with the charset feature. But looking at the docs, it appears that no modern browser sends Accept-Charset, so I suspect most servers don't bother to honor it either.
> The problem with the lossy approach is that it may round-trip incorrectly.
Not necessarily disagreeing, but what do you think about curl doing this mangling?
> Not necessarily disagreeing, but what do you think about curl doing this mangling?
I was surprised to learn about it! I'm going to go read some curl docs and see what facilities it has regarding encodings.
I think curl actually passes the ISO-8859-1 bytes through unchanged, but my terminal turns those into � Unicode Character 'REPLACEMENT CHARACTER' (U+FFFD).
Here's a screenshot of part of the output for curl https://www.c64games.de/ | less. less processes non-ASCII characters itself, and turns them into e.g. <FC> and <F6>.

[screenshot]

Here's a screenshot of the same section, where I just ran curl https://www.c64games.de/ and let my terminal render it. Terminal is gnome-terminal, LC_ALL=en_US.UTF-8.

[screenshot]

And here is the same section, after I set my terminal encoding to ISO-8859-1 and re-ran curl. It correctly renders e.g. "geprüfte" and "können":

[screenshot]