Enhanced Trailing Whitespace Handling in HTTP Headers
Screenshot of Test Results
Updated Code
Original Code
Test: test_parse_line_with_trailing_spaces
def test_parse_line_with_trailing_spaces(self):
headers = HTTPHeaders()
# Test header with multiple trailing spaces
headers.parse_line("Content-Length: 0 ")
self.assertEqual(headers.get('content-length'), '0')
# Test header with leading and trailing spaces
headers.parse_line(" Content-Length : 123 ")
self.assertEqual(headers.get('content-length'), '123') # Ensure value is overwritten
# Test continuation line with trailing spaces
headers.parse_line("Content-Length: 42")
headers.parse_line(" ") # Test multi-line continuation
self.assertEqual(headers.get('content-length'), '42') # Ensure spaces don't affect value
Description:
The modifications made to the parse_line method in the HTTPHeaders class effectively resolve the issue of improper handling of trailing and leading whitespace in HTTP headers, specifically addressing a critical edge case involving the Content-Length header. This issue was highlighted in a GitHub discussion (#3321), where it was noted that trailing spaces in the header value could lead to errors during processing, especially when Content-Length is the last header in a request.
How the Code Solves the Issue:
-
Robust Whitespace Management:
- The new implementation of
parse_lineremoves leading and trailing whitespace from the header line usingline.strip(). This ensures that any extra spaces do not affect the stored value, thus preventing potential formatting issues when accessing the header later. - The use of
strip()prevents cases likeContent-Length: 0from storing an unintended value with trailing spaces.
- The new implementation of
-
Specific Handling of
Content-Length:- The code explicitly checks for the
Content-Lengthheader and ensures that the value after the colon is correctly extracted and stripped of whitespace. For example, parsingContent-Length : 123would result in the correct value of'123', completely ignoring any spaces. - This targeted handling avoids common pitfalls such as retaining leading spaces (e.g.,
42from a continuation line) or appending unintended spaces due to continuation line logic.
- The code explicitly checks for the
-
Continuation Line Handling:
- The revised implementation also addresses the issue of continuation lines. If a continuation line is encountered (indicating an extension of a previous header), the new code checks for non-empty values before appending them. This prevents scenarios where an empty continuation line could erroneously append whitespace to the previous value, maintaining the integrity of the header data.
- For example, when parsing a continuation line like
" ", the new logic ensures that this line is effectively ignored, thus maintaining the integrity of the previously stored header value.
-
Error Prevention:
- By ensuring that values are stripped of whitespace and validating conditions before updating header values, the code reduces the risk of
ValueErroroccurrences that arise from malformed headers. This makes the header processing more resilient and less prone to user errors in HTTP requests.
- By ensuring that values are stripped of whitespace and validating conditions before updating header values, the code reduces the risk of
Conclusion:
The enhancements made in the parse_line method significantly improve the handling of HTTP headers, specifically addressing the issues related to trailing and leading whitespace in header values. By implementing a robust approach to whitespace management and providing specific handling for critical headers like Content-Length, the code mitigates potential parsing errors and ensures adherence to HTTP standards, thereby improving overall functionality and reliability. This change effectively resolves the edge cases discussed in the issue, enhancing the usability of the HTTPHeaders class in real-world applications.
@bdarnell can you please review my tests ??
@bdarnell In response to: Discussion & Pull Request Review
I have updated the handling of the Content-Length header to enforce stricter validation rules, ensuring compliance with RFC 7230 guidelines. The prior approach in parse_line allowed headers to be updated without rigorous validation, which could lead to compliance issues. Given this, I have reverted the previous changes to parse_line and introduced stricter validation.
In this revised implementation:
- Any malformed, negative, or conflicting
Content-Lengthheaders will now raise an exception. This prevents incorrect or potentially unsafe values from being processed, ensuring the header's value complies with HTTP/1.1 standards.
This update resolves several shortcomings in Tornado’s prior implementation by correctly handling edge cases that were previously unaddressed. Specifically, the original code had the following issues in tests:
test_multiple_content_length_headers: Allowed multipleContent-Lengthheaders with differing values without raising an error.test_invalid_content_length: Failed to raise an exception for non-integer values inContent-Length.test_negative_content_length: Did not handle negativeContent-Lengthvalues appropriately.
The updated code now adheres to RFC 7230 specifications, ensuring these cases are handled correctly and improving Tornado's robustness.
Errors in the Original Code:
-
test_multiple_content_length_headers:- The test expected a single value
"123", but the code returned"123,123". - Error message:
AssertionError: '123,123' != '123' - 123,123 + 123
- The test expected a single value
-
test_invalid_content_length:- The code did not raise an exception for an invalid
Content-Lengthvalue like"abc". - Error message:
AssertionError: HTTPInputError not raised
- The code did not raise an exception for an invalid
-
test_negative_content_length:- The code failed to raise an error for a negative
Content-Lengthvalue. - Error message:
AssertionError: HTTPInputError not raised
- The code failed to raise an error for a negative
Updated Test Code:
def test_multiple_content_length_headers(self):
headers = HTTPHeaders()
headers.parse_line("Content-Length: 123")
headers.parse_line("Content-Length: 123")
self.assertEqual(headers.get("content-length"), "123")
with self.assertRaises(HTTPInputError):
headers.parse_line("Content-Length: 456") # Should raise an error due to conflicting values
def test_invalid_content_length(self):
headers = HTTPHeaders()
with self.assertRaises(HTTPInputError):
headers.parse_line("Content-Length: abc") # Should raise an error due to non-integer value
def test_negative_content_length(self):
headers = HTTPHeaders()
with self.assertRaises(HTTPInputError):
headers.parse_line("Content-Length: -123") # Should raise an error due to negative value
def test_leading_trailing_whitespace(self):
headers = HTTPHeaders()
headers.parse_line("Content-Length: 123 ")
self.assertEqual(headers.get('content-length'), '123') # Should handle trailing whitespace correctly
def test_zero_content_length(self):
headers = HTTPHeaders()
headers.parse_line("Content-Length: 0")
self.assertEqual(headers.get('content-length'), '0') # Should correctly handle zero
Context for Content-Length Validation
According to RFC 7230, HTTP/1.1 specifies that:
-
Header Field Name Syntax: Field names must consist of valid tokens without any leading, trailing, or internal whitespace:
field-name = token token = 1*tchar -
Whitespace in Headers: Whitespace is allowed only after the colon separating the field name and value. Leading whitespace in field names (e.g.,
" Content-Length: 123") renders the header invalid under HTTP/1.1. -
Specific Rules for
Content-Length:Content-Lengthvalues must be numeric and must not have any whitespace in the field name itself.
This means Tornado's current behavior in parse_line—which raises an error on encountering leading whitespace in the field name—is both expected and correctly handled:
def test_space_in_content_length_key(self):
headers = HTTPHeaders()
with self.assertRaises(HTTPInputError):
headers.parse_line(" Content-Length: 1") # Invalid due to leading space in the key
Updated Code
Original Code
Instead, in the add method, I will not allow the addition of a Content-Length header unless it meets the necessary criteria.
I don't think HTTPHeaders.add is the right place to enforce this - HTTPHeaders is more of a dumb container, and semantic issues are covered in HTTP1Connection (If we do start enforcing this in HTTPHeaders, the redundant checks in HTTP1Connection should probably be removed).
In any case, I may be getting mixed up but I don't know if modifying add() without parse_line() actually solves the problem from #3321. When continuation lines are used, parse_line calls add() on the first line (which contains only digits), then modifies the internal dictionary directly on the second line (which contains only whitespace). This bypasses the checks in add().
This is a different approach than I described in https://github.com/tornadoweb/tornado/issues/3321#issuecomment-2150213338. Why?
I reiterate my earlier comment, which you didn't answer. This change leaves the problematic handling of whitespace in continuation lines alone, which can be a problem for more headers than content-length.
I've got my own version of a fix for this in #3477