astro-seo

Title with unicode doesn't work

Open mrcnski opened this issue 1 year ago • 6 comments

Thanks for the utility! I think I've found a bug. It seems that setting the title prop of the <SEO> component to a string containing non-ASCII Unicode characters breaks the generated HTML:

[Image: Screenshot 2024-02-22 at 15 26 40]

mrcnski avatar Feb 22 '24 14:02 mrcnski

I think this might be because astro-seo currently puts the <meta charset="UTF-8" /> tag after the title tag. I'm going to create a separate issue for that.

ttmc avatar Feb 24 '24 19:02 ttmc

Thank you for reporting this @mrcnski and really good catch @ttmc, we'll discuss possible solutions to this in #91

Just to confirm that this is actually the cause: do you set the charset tag via astro-seo, @mrcnski?

jonasmerlin avatar Feb 25 '24 08:02 jonasmerlin

Hey @jonasmerlin, I didn't set charset because I thought that it would default to UTF-8 if not provided. If that's not true, maybe the docs could be clarified?

mrcnski avatar Feb 25 '24 11:02 mrcnski

@mrcnski While UTF-8 is by far the most common encoding on the web and browsers like Chrome will often end up choosing it, there is no single guaranteed default for pages without an explicit declaration. It's a multi-step process with multiple fallback options... here's what Chrome does:

  1. Byte Order Mark (BOM): Chrome first checks if the website content starts with a Byte Order Mark, which is a sequence of bytes indicating the specific encoding used. If a UTF-8 BOM is present, the browser assumes UTF-8 encoding.
  2. HTTP Headers: If no BOM is found, Chrome looks for the Content-Type header in the HTTP response. This header can explicitly specify the charset used, and if it mentions UTF-8, that will be used.
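  3. Meta Tag: If neither BOM nor the Content-Type header provides a clear answer, Chrome checks for a <meta charset="utf-8"> tag within the HTML document itself. If present, this explicitly declares UTF-8 as the encoding. The HTML specification requires this tag to appear within the first 1024 bytes of the document, which is why it should sit as early in the head as possible.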
  4. Heuristic Detection: If none of the above methods provide a clear indication, Chrome attempts to "guess" the charset based on heuristics and statistical analysis of the content itself. This involves looking for patterns and similarities with known encodings, but it's not always accurate and can lead to misinterpretations, especially for content containing characters from multiple languages.
  5. Fallback Default: If all attempts to identify the charset fail, Chrome resorts to a fallback default encoding. This is implementation-dependent and can vary across different browsers and even browser versions. In practice the fallback generally follows the user's locale rather than being a fixed value; for many Western locales it is windows-1252 (browsers also treat the label ISO-8859-1 as windows-1252).
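The lookup order above can be sketched roughly as follows. This is only an illustration of the precedence, not Chrome's actual implementation: the names detectEncoding and asciiBytes are hypothetical, the heuristic step is skipped entirely, the meta scan is a simple regex rather than the real prescan algorithm, and the windows-1252 fallback is an assumed locale default.

```typescript
// Hypothetical sketch of the precedence described above; not Chrome's code.
type EncodingSource = "bom" | "http-header" | "meta-tag" | "fallback";

interface EncodingGuess {
  encoding: string;
  source: EncodingSource;
}

// Helper for building test input from an ASCII-only string.
function asciiBytes(text: string): Uint8Array {
  const bytes = new Uint8Array(text.length);
  for (let i = 0; i < text.length; i++) bytes[i] = text.charCodeAt(i) & 0x7f;
  return bytes;
}

function detectEncoding(
  bytes: Uint8Array,
  contentType: string | null,
  fallback: string = "windows-1252", // assumption: a Western locale default
): EncodingGuess {
  // 1. A UTF-8 byte order mark is the byte sequence EF BB BF.
  if (bytes.length >= 3 && bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) {
    return { encoding: "utf-8", source: "bom" };
  }

  // 2. The charset parameter of the Content-Type response header.
  const headerMatch = contentType?.match(/charset\s*=\s*"?([^\s;"]+)/i);
  if (headerMatch) {
    return { encoding: headerMatch[1].toLowerCase(), source: "http-header" };
  }

  // 3. A <meta charset> declaration within the first 1024 bytes.
  //    The bytes are read one-per-character, which is enough to find ASCII markup.
  let head = "";
  const limit = Math.min(bytes.length, 1024);
  for (let i = 0; i < limit; i++) head += String.fromCharCode(bytes[i]);
  const metaMatch = head.match(/<meta\s+charset\s*=\s*["']?([^\s"'>\/]+)/i);
  if (metaMatch) {
    return { encoding: metaMatch[1].toLowerCase(), source: "meta-tag" };
  }

  // 4. and 5. Heuristic detection is omitted here; go straight to the fallback.
  return { encoding: fallback, source: "fallback" };
}
```

For a page with no BOM, no charset in the header, and no meta tag, this sketch lands on the fallback, which is the situation where a non-ASCII title can end up garbled.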

However, relying on browser guessing and fallback defaults is strongly discouraged for several reasons:

  • Inconsistency: Different browsers and user systems can have different fallback defaults, leading to inconsistent rendering of a website across different platforms.
  • Incorrect Interpretation: Misinterpreting the encoding can lead to garbled text, broken layouts, and potential security vulnerabilities.
  • Unnecessary Re-parsing: If the browser only discovers the real encoding after it has started parsing with a guessed one, it may have to decode and parse the document again, which wastes resources and can slow down page loading.

Therefore, it's essential for website developers to explicitly declare the character encoding using either the Content-Type header or the <meta charset> tag, preferably using UTF-8 due to its widespread adoption and compatibility.
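To make the "garbled text" failure concrete, here is a small sketch of what happens when UTF-8 bytes are read with a single-byte fallback encoding. The helper decodeAsLatin1 is a hypothetical name; it decodes one byte per character (ISO-8859-1 style), which gives the same result as windows-1252 for the two byte values used here.

```typescript
// "é" (U+00E9) is encoded in UTF-8 as the two bytes C3 A9.
const utf8Bytes = new Uint8Array([0xc3, 0xa9]);

// Hypothetical helper: treat every byte as one character, as a single-byte
// encoding would. This mimics a browser applying the wrong fallback.
function decodeAsLatin1(bytes: Uint8Array): string {
  let out = "";
  for (let i = 0; i < bytes.length; i++) out += String.fromCharCode(bytes[i]);
  return out;
}

const garbled = decodeAsLatin1(utf8Bytes);
console.log(garbled); // prints "Ã©" instead of "é"
```

Each non-ASCII character in a title turns into two or more unrelated characters this way, which matches the kind of breakage reported at the top of this issue.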

ttmc avatar Feb 25 '24 22:02 ttmc

Thanks for the detailed explanation @ttmc. For some reason I thought that astro-seo would set a default of UTF-8 if this field was omitted. I can see why we wouldn't want to set any default values, as they may already be set outside of the <SEO /> component. My mistake, but maybe this part of the README could be clarified slightly?

Set the charset of the document. In almost all cases this should be UTF-8.

mrcnski avatar Feb 27 '24 09:02 mrcnski

@mrcnski Over on issue #91, one suggestion has been to make the astro-seo integration check if the charset gets set to UTF-8 (by something or someone other than the astro-seo integration), and if it hasn't been set, then the integration will inject the charset declaration at the top of the head.
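As a rough idea of what that suggestion could look like, here is a sketch operating on a rendered HTML string. Everything in it is an assumption for illustration: ensureCharset is a hypothetical name, it is not astro-seo's implementation, it only checks whether any charset declaration exists (not that it is UTF-8 specifically), and a real integration would more likely work on the parsed document than with regular expressions.

```typescript
// Hypothetical helper, not part of astro-seo.
function ensureCharset(html: string): string {
  // If some meta tag already declares a charset, leave the document alone.
  if (/<meta\s[^>]*charset\s*=/i.test(html)) return html;

  // Otherwise inject the declaration directly after the opening <head> tag,
  // so it comes before <title> and any other head content.
  return html.replace(
    /<head(\s[^>]*)?>/i,
    (openTag) => `${openTag}<meta charset="UTF-8" />`,
  );
}

const fixed = ensureCharset("<html><head><title>Caf\u00e9</title></head></html>");
console.log(fixed);
```

Documents without a head element are returned unchanged by this sketch.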

ttmc avatar Feb 27 '24 14:02 ttmc