UTF-8 Byte Order Mark Breaks Test

Open rabbtekejos opened this issue 1 year ago • 6 comments

I have been investigating why our site got a low score in the standard files sub-category and found that webperf can't handle a UTF-8 BOM (byte order mark).

If the robots.txt file begins with a BOM and the first row contains the sitemap: instruction, then webperf will not fetch and process the sitemap.
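To illustrate the robots.txt case, here is a minimal sketch of a BOM-tolerant lookup of sitemap: lines. The function name `find_sitemaps` and the example URL are hypothetical and this is not webperf_core's actual code. It only shows why a leading U+FEFF makes the first line stop matching, and one way to avoid that.

```python
def find_sitemaps(robots_txt: str) -> list[str]:
    """Return sitemap URLs from robots.txt text (hypothetical helper)."""
    # A decoded UTF-8 BOM shows up as U+FEFF at the start of the text.
    # Without this strip, the first line would be "\ufeffSitemap: ..."
    # and would not match the "sitemap" key below.
    text = robots_txt.lstrip("\ufeff")
    sitemaps = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            sitemaps.append(value.strip())
    return sitemaps


# Example input (assumed): BOM followed by the sitemap instruction on row 1.
robots = "\ufeffSitemap: https://example.com/sitemap.xml\nUser-agent: *\nDisallow:\n"
found = find_sitemaps(robots)
```

Stripping only the leading U+FEFF keeps the rest of the file untouched, so files without a BOM are handled the same way as before.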

If the sitemap.xml file begins with a BOM, then get_root_element fails to find the root element.
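For the sitemap.xml case, a possible workaround is to drop the BOM from the raw bytes before parsing. The sketch below uses the standard library's ElementTree; the helper name `parse_sitemap_root` and the sample sitemap are assumptions for illustration, and how get_root_element actually parses its input is not shown in this issue.

```python
import codecs
import xml.etree.ElementTree as ET


def parse_sitemap_root(raw: bytes) -> ET.Element:
    """Parse sitemap bytes, ignoring a leading UTF-8 BOM (hypothetical helper)."""
    if raw.startswith(codecs.BOM_UTF8):
        # Remove the three BOM bytes (EF BB BF) so the document
        # starts directly with the XML declaration.
        raw = raw[len(codecs.BOM_UTF8):]
    return ET.fromstring(raw)


# Assumed sample: a sitemap served with a BOM in front of it.
sitemap = codecs.BOM_UTF8 + (
    b'<?xml version="1.0" encoding="utf-8"?>'
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b"<url><loc>https://localhost:5000</loc></url>"
    b"</urlset>"
)
root = parse_sitemap_root(sitemap)
```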

This likely affects other areas of this and other tests as well.

rabbtekejos avatar May 02 '24 10:05 rabbtekejos

@rabbtekejos Please provide at least one URL (preferably 5-10 different ones) that we can reproduce this bug against.

7h3Rabbit avatar May 02 '24 10:05 7h3Rabbit

Notes for when we have test URLs: It could be caused by missing or malformed encoding info in the response headers, resulting in the wrong encoding being used when reading the file. We use get_http_content to get the sitemap(s), which uses response.text for XML sitemaps. According to the documentation, it uses Unicode if the encoding can't be determined from the response headers.
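As a rough illustration of why the chosen encoding matters, the sketch below decodes the same BOM-prefixed bytes three ways using only the standard library. The sample payload is an assumption. Which codec response.text actually picks for a given response is not verified here.

```python
import codecs

# Assumed payload: UTF-8 BOM followed by a tiny XML document.
raw = codecs.BOM_UTF8 + b"<urlset/>"

# Plain UTF-8 keeps the BOM as an invisible U+FEFF character.
as_utf8 = raw.decode("utf-8")

# A single-byte codec such as ISO-8859-1 turns the BOM into three visible characters.
as_latin1 = raw.decode("iso-8859-1")

# The "utf-8-sig" codec removes a leading BOM if one is present.
as_utf8_sig = raw.decode("utf-8-sig")

print(repr(as_utf8[:1]), repr(as_latin1[:3]), repr(as_utf8_sig))
```

In the first two cases the text does not start with `<`, so any code that expects the document to begin with markup has to account for the extra characters.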

7h3Rabbit avatar May 02 '24 10:05 7h3Rabbit

I don't know much about Python or encodings, but the only encoding-related HTTP header I can find tells whether the content is compressed and with which algorithm (brotli, gzip, etc.), and that does not look like the issue. When I added a print of the sitemap_content variable in get_root_element, the text following the BOM was a correct sitemap, so the content can be decoded and decompressed.

As for how to reproduce this, here is a minimal ASP.NET 8 application, if that's acceptable

using System.Text;
using System.Xml;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

app.MapGet("/", () => Results.Ok("This is the start page, the sitemap can be found at the URL /sitemap.xml"));

app.MapGet("/sitemap.xml", async (HttpResponse Response) =>
{
	XmlWriterSettings settings = new()
	{
		Encoding = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true), // Toggle the BOM on/off here
		Async = true
	};

	Response.Headers.ContentType = "text/xml";
	XmlWriter writer = XmlWriter.Create(Response.Body, settings);

	await writer.WriteStartDocumentAsync();
	await writer.WriteStartElementAsync(null, "urlset", "http://www.sitemaps.org/schemas/sitemap/0.9");
	await writer.WriteAttributeStringAsync("xmlns", "xhtml", null, "http://www.w3.org/1999/xhtml");

	await writer.WriteStartElementAsync(null, "url", null);
	await writer.WriteElementStringAsync(null, "loc", null, "https://localhost:5000");
	await writer.WriteElementStringAsync(null, "lastmod", null, DateTime.UtcNow.ToString("yyyy-MM-ddTHH:mm:ss")); // HH = 24-hour clock

	await writer.WriteStartElementAsync("xhtml", "link", null);
	await writer.WriteAttributeStringAsync(null, "rel", null, "alternate");
	await writer.WriteAttributeStringAsync(null, "hreflang", null, "sv");
	await writer.WriteAttributeStringAsync(null, "href", null, "https://localhost:5000");
	await writer.WriteEndElementAsync(); //end xhtml:link

	await writer.WriteEndElementAsync(); //end url

	await writer.WriteEndElementAsync(); // end urlset
	await writer.WriteEndDocumentAsync();

	await writer.FlushAsync();

	return Results.Empty;
}
);

app.Run();

Toggling between true and false at the marked line controls whether a BOM is present.

rabbtekejos avatar May 03 '24 13:05 rabbtekejos

@rabbtekejos I would have preferred a URL, but I think I see the problem/missing part in your code example.

You are not specifying the encoding/charset on this line: Response.Headers.ContentType = "text/xml"; As a result, the receiving party MUST use the default charset (read: us-ascii).

As you are encoding your XML as UTF-8, you need to specify in the Content-Type header that the XML uses the UTF-8 charset/encoding. If you change that line to the following, it should work:

Response.Headers.ContentType = "text/xml; charset=utf-8";

Let me know if that solves your problem :)
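On the receiving side, one way a client could combine both signals is sketched below: prefer a BOM if the body has one, otherwise use the charset parameter from the Content-Type header, otherwise fall back to a default. The helper name `pick_encoding` and the UTF-8 default are assumptions for illustration, not what webperf_core or its HTTP library does.

```python
import codecs
from email.message import Message


def pick_encoding(content_type: str, body: bytes, default: str = "utf-8") -> str:
    """Choose a codec name for decoding an HTTP body (hypothetical helper)."""
    # A UTF-8 BOM in the body is treated as the strongest hint;
    # "utf-8-sig" also strips the BOM while decoding.
    if body.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # Let the standard library parse the charset parameter, if any.
    header = Message()
    header["Content-Type"] = content_type
    return header.get_content_charset() or default


with_charset = pick_encoding("text/xml; charset=utf-8", b"<urlset/>")
with_bom = pick_encoding("text/xml", codecs.BOM_UTF8 + b"<urlset/>")
text = (codecs.BOM_UTF8 + b"<urlset/>").decode(with_bom)
```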

7h3Rabbit avatar May 03 '24 21:05 7h3Rabbit

It doesn't seem like I can get webperf_core working when running it against a local site, so I can't test what happens when I change or remove the Content-Type header.

Removing the BOM from the sitemap in our testing environment and running webperf_core improved our score.

rabbtekejos avatar May 06 '24 07:05 rabbtekejos

@rabbtekejos As long as it is not protected behind a login, it should be possible to access the local website.

7h3Rabbit avatar May 06 '24 18:05 7h3Rabbit

Closed, as there was no new info in the issue for a week.

7h3Rabbit avatar May 12 '24 12:05 7h3Rabbit