http icon indicating copy to clipboard operation
http copied to clipboard

`content_length` raises an exception instead of `nil` or `Fixnum`

Open chief opened this issue 5 years ago • 14 comments

In some rare cases the method def content_length raises TypeError instead of nil or Fixnum. For example, you can check the following scenario:

Reproduction code

url = "https://imagery.pragprog.com/products/595/tpp20.jpg"
response = HTTP.get(url)
response.content_length 

Expected behavior

Expected nil or Fixnum instead I get

TypeError: can't convert Array into Integer

The response object is:

 #<HTTP::Response/1.1 200 OK {"Accept-Ranges"=>["bytes", "bytes"], "Age"=>"4590838", "Cache-Control"=>"public max-age=315360000", "Content-Length"=>["2802494", "2802494"], "Date"=>"Fri, 20 Sep 2019 14:33:37 GMT", "Etag"=>"\"8121dc48ec937ecf919bc2c54aa961a4\"", "Expires"=>"Thu, 31 Dec 2037 23:55:55 GMT", "Last-Modified"=>"Mon, 06 May 2019 20:19:24 GMT", "Server"=>"imagery3", "Via"=>"1.1 varnish", "X-Varnish"=>"312290075 306038822", "X-Amz-Id-2"=>"H6BrRPBmJ2AKray6L+xApTdXlP511cVEXJ6DnDMEYVPcmaWJCGpzs3DTNiRez2x+SuLsQxZcUfI=", "X-Amz-Request-Id"=>"7ED8CC65AA4FCF29", "Connection"=>"Close"}>

The value of Content-Length is "Content-Length"=>["2802494", "2802494"] which is suspicious.

chief avatar Sep 20 '19 15:09 chief

Looks like the server is returning the Content-Length header twice:

$ http https://imagery.pragprog.com/products/595/tpp20.jpg
HTTP/1.1 200 OK
ACCEPT_RANGES: bytes
Accept-Ranges: bytes
Age: 4603039
CONTENT_LENGTH: 2802494
Cache-Control: public max-age=315360000
Connection: keep-alive
Content-Length: 2802494
Date: Fri, 20 Sep 2019 17:56:57 GMT
ETAG: "8121dc48ec937ecf919bc2c54aa961a4"
Expires: Thu, 31 Dec 2037 23:55:55 GMT
LAST_MODIFIED: Mon, 06 May 2019 20:19:24 GMT
Server: imagery3
Via: 1.1 varnish
X-Varnish: 312319203 306038822
X_AMZ_ID_2: H6BrRPBmJ2AKray6L+xApTdXlP511cVEXJ6DnDMEYVPcmaWJCGpzs3DTNiRez2x+SuLsQxZcUfI=
X_AMZ_REQUEST_ID: 7ED8CC65AA4FCF29

paul avatar Sep 20 '19 18:09 paul

I believe we should use #last value of content-length headers.

ixti avatar Sep 21 '19 21:09 ixti

Wow, that's crazy. Any prior art on what other HTTP clients do in this case?

tarcieri avatar Sep 21 '19 21:09 tarcieri

Yeah. Would be nice to know which is default behaviour other clients use. I see there are 3 ways on how to act in given scenario:

  1. consider first value only
  2. consider last value only
  3. treat as response with no Content-Length

some sort of mixture of the above variant:

def content_length
  values = headers.get("Content-Length").uniq
  values.first.to_i if 1 == values.size
end

Kinda pseudo-code...

ixti avatar Sep 21 '19 22:09 ixti

I seem to recall reading somewhere that some headers allow multiple values separated by some delimiter, and if one of those headers is specified multiple times then the client should consider that equivalent to it being specified once with the values separated by the delimiter.

In the case of headers that do not allow multiple values, then later duplicate header values should replace ones specified earlier.

I don't know if there's a comprehensive list of which headers allow multiple values and their associated delimiters (because the delimiters can vary between headers). I also don't know if we want to deal with any of that, or just punt and special-case the Content-Length header with values.is_a?(Array) ? values.last : values.

https://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html Section 4.2:

Multiple message-header fields with the same field-name MAY be present in a message if and only if the entire field-value for that header field is defined as a comma-separated list [i.e., #(values)].

https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.13 Content-Length is not one of the multliple value headers.

paul avatar Sep 22 '19 02:09 paul

I did some experiments with Faraday and Httparty.

Faraday

It seems Faraday keeps only one value at headers.

url = "https://imagery.pragprog.com/products/595/tpp20.jpg"
response = Faraday.get url
puts response.headers['content_length'] #=> "2802494"

HTTParty

Same with this gem. It keeps only one value

url = "https://imagery.pragprog.com/products/595/tpp20.jpg"
response  = HTTParty.get url
puts response.headers['content_lenght'] #=> "2802494"

I can examine more clients if you want to

chief avatar Sep 22 '19 07:09 chief

I’m curious what curl and major notable browsers do

tarcieri avatar Sep 22 '19 13:09 tarcieri

I've wrote a small testing server:

# frozen_string_literal: true

require "socket"
server = TCPServer.new(3000)

loop do
  Thread.new(server.accept) do |io|
    status  = 200
    headers = []

    case io.gets[%r{^GET /(\d+) HTTP}, 1]
    when "1"
      headers << "Content-Length: 3"
      headers << "Content-Length: 3"
    when "2"
      headers << "Content-Length: 9"
      headers << "Content-Length: 3"
    when "3"
      headers << "Content-Length: 3"
      headers << "Content-Length: 9"
    when "4"
      headers << "Content-Length: 9"
      headers << "Content-Length: 9"
    else
      status = 500
    end

    io << "HTTP/1.1 #{status}\r\n"
    headers.each { |v| io << v << "\r\n" }
    io << "\r\n123456789"
  rescue => e
    warn e
  ensure
    io.close
  end
end

And that's the behaviour of different clients:

HTTPie

Ensures that if there are multiple Content-Length - all are the same

image

CURL

Uses last given Content-Length

image

Firefox

Similar behaviour to HTTPie

  • If both Content-Lengths are the same - use it
  • When they mismatch - fail:

image

ixti avatar Sep 22 '19 14:09 ixti

Personally I prefer failing in case if multiple different content length values are given:

def content_length
  values = headers.get("Content-Length").uniq

  raise "WTF? o_O" if 1 < values.size

  values.first&.to_i
end

ixti avatar Sep 22 '19 14:09 ixti

Worth to test how they handle absolutely undxpected value of content length as well.

ixti avatar Sep 22 '19 14:09 ixti

Enhanced dummy server with couple of more scenarios:

# frozen_string_literal: true

require "socket"
server = TCPServer.new(3000)

loop do
  Thread.new(server.accept) do |io|
    status  = 200
    headers = []

    case io.gets[%r{^GET /(\d+) HTTP}, 1]
    when "1"
      headers << "Content-Length: 3"
      headers << "Content-Length: 3"
    when "2"
      headers << "Content-Length: 9"
      headers << "Content-Length: 3"
    when "3"
      headers << "Content-Length: 3"
      headers << "Content-Length: 9"
    when "4"
      headers << "Content-Length: 9"
      headers << "Content-Length: 9"
    when "5"
      headers << "Content-Length: X"
      headers << "Content-Length: 9"
    when "6"
      headers << "Content-Length: 9"
      headers << "Content-Length: X"
    when "7"
      headers << "Content-Length: 3"
      headers << "Content-Length: X"
      headers << "Content-Length: 9"
    when "8"
      headers << "Content-Length: 9"
      headers << "Content-Length: X"
      headers << "Content-Length: 3"
    when "9"
      headers << "Content-Length: 9"
      headers << "Content-Length: X"
      headers << "Content-Length: 9"
    else
      status = 500
    end

    io << "HTTP/1.1 #{status}\r\n"
    headers.each { |v| io << v << "\r\n" }
    io << "\r\n123456789" unless 500 == status
  rescue => e
    warn e
  ensure
    io.close
  end
end

HTTPie

image

curl

image

Firefox

Allows multiple values as long as result is unambiguous (all values are the same).

ixti avatar Sep 22 '19 20:09 ixti

The response includes a Content-Length and CONTENT_LENGTH header. It seems like http.rb is treating them as equivalent (probably due to the normalizing), but I believe it should treat them as two separate, unrelated headers.

britishtea avatar Sep 25 '19 08:09 britishtea

The response includes a Content-Length and CONTENT_LENGTH header. It seems like http.rb is treating them as equivalent (probably due to the normalizing), but I believe it should treat them as two separate, unrelated headers.

Right. So we have to tackle 2 problems here, I believe.

ixti avatar Sep 25 '19 22:09 ixti

This seems relevant:

https://tools.ietf.org/html/rfc7230#section-3.3.2

If a message is received that has multiple Content-Length header fields with field-values consisting of the same decimal value, or a single Content-Length header field with a field value containing a list of identical decimal values (e.g., "Content-Length: 42, 42"), indicating that duplicate Content-Length header fields have been generated or combined by an upstream message processor, then the recipient MUST either reject the message as invalid or replace the duplicated field-values with a single valid Content-Length field containing that decimal value prior to determining the message body length or forwarding the message.

tarcieri avatar Sep 25 '19 22:09 tarcieri