react icon indicating copy to clipboard operation
react copied to clipboard

head > meta > content escaping issue

Open oliviertassinari opened this issue 5 years ago • 29 comments

Do you want to request a feature or report a bug?

I'm guessing it's a bug.

What is the current behavior?

The following source code,

<meta property="og:image" content="https://onepixel.imgix.net/60366a63-1ac8-9626-1df8-9d8d5e5e2601_1000.jpg?auto=format&q=80&mark=watermark%2Fcenter-v5.png&markalign=center%2Cmiddle&h=500&w=500&s=60ec785603e5f71fe944f76b4dacef08" />

, is being escaped once server side rendered:

<meta property="og:image" content="https://onepixel.imgix.net/60366a63-1ac8-9626-1df8-9d8d5e5e2601_1000.jpg?auto=format&amp;q=80&amp;mark=watermark%2Fcenter-v5.png&amp;markalign=center%2Cmiddle&amp;h=500&amp;w=500&amp;s=60ec785603e5f71fe944f76b4dacef08"/>

You can reproduce the behavior like this:

const React = require("react");
const ReactDOMServer = require("react-dom/server");
const http = require("http");

const doc = React.createElement("html", {
  children: [
    React.createElement("head", {
      children: React.createElement("meta", {
        property: "og:image",
        content:
          "https://onepixel.imgix.net/60366a63-1ac8-9626-1df8-9d8d5e5e2601_1000.jpg?auto=format&q=80&mark=watermark%2Fcenter-v5.png&markalign=center%2Cmiddle&h=500&w=500&s=60ec785603e5f71fe944f76b4dacef08"
      })
    }),
    React.createElement("body", { children: "og:image" })
  ]
});

//create a server object:
http
  .createServer(function(req, res) {
    res.write("<!DOCTYPE html>" + ReactDOMServer.renderToStaticMarkup(doc)); //write a response to the client
    res.end(); //end the response
  })
  .listen(8080); //the server object listens on port 8080

editor: https://codesandbox.io/s/my299jk7qp output : https://my299jk7qp.sse.codesandbox.io/

What is the expected behavior?

I would expect the content not being escaped. It's related to https://github.com/zeit/next.js/issues/2006#issuecomment-355917446. I'm using the og:image meta element so my pages can have nice previews within Facebook :).

capture d ecran 2018-10-12 a 14 15 26

Which versions of React, and which browser / OS are affected by this issue? Did this work in previous versions of React? 16.5.2

oliviertassinari avatar Oct 12 '18 12:10 oliviertassinari

Has this ever worked as you intend? Can you send a fix?

gaearon avatar Nov 01 '18 19:11 gaearon

We are solving the problem this way:

import Entities from 'html-entities/lib/html5-entities'

const entities = new Entities()
const contentRegExp = /content="([^"]+)"/g
const handleContent = (match, content) => {
  return `content="${entities.decode(content)}"`
}

html = html.replace(contentRegExp, handleContent)

We spend ~1ms per request in the path. It's not too bad. I can give it a look at some point.

oliviertassinari avatar Nov 01 '18 19:11 oliviertassinari

I have found this related issue: #6873. Digging into the implementation, the behavior comes from https://github.com/facebook/react/blob/0005d1e3f54b79fe4707fbccc44b89e0fb4ce565/packages/react-dom/src/server/DOMMarkupOperations.js#L61 ⬇️ https://github.com/facebook/react/blob/b87aabdfe1b7461e7331abb3601d9e6bb27544bc/packages/react-dom/src/server/quoteAttributeValueForBrowser.js#L17 ⬇️ https://github.com/facebook/react/blob/b87aabdfe1b7461e7331abb3601d9e6bb27544bc/packages/react-dom/src/server/escapeTextForBrowser.js#L108


Now, all the escaping tests I could find are covering the children use case: https://github.com/facebook/react/blob/b87aabdfe1b7461e7331abb3601d9e6bb27544bc/packages/react-dom/src/tests/escapeTextForBrowser-test.js#L23-L24 I have limited knowledge of web escaping related security issues. I don't see any harm potential with:

 const response = ReactDOMServer.renderToString(<span data-src={'&'}></span>); 
 expect(response).toMatch('<span data-reactroot="" data-src="&"></span>'); 

oliviertassinari avatar Jan 14 '19 14:01 oliviertassinari

I have the same problem in the content of <style> elements:

const React = require("react");
const ReactDOMServer = require("react-dom/server");

console.log(ReactDOMServer.renderToStaticMarkup(
  <html>
    <head>
      <link
        href="https://fonts.googleapis.com/css?family=Source+Sans+Pro"
        rel="stylesheet"
      />
      <style>{`
        html {
          font-family: "Source Sans Pro", sans-serif;
        }
      `}</style>
    </head>
    <body>
      <p>Test.</p>
    </body>
  </html>
));

This outputs:

<html><head><link href="https://fonts.googleapis.com/css?family=Source+Sans+Pro" rel="stylesheet"/><style>
        html {
          font-family: &quot;Source Sans Pro&quot;, sans-serif;
        }
      </style></head><body><p>Test.</p></body></html>

By the parsing rules in the HTML spec (I'm consulting WHATWG here), the contents of elements style, xmp and iframe (as well as noscript, noframes and noembed when they're not being rendered) are parsed with the RAWTEXT tokenizer state, which treats everything as plaintext until it finds a matching closing tag.

Escaping the contents of style elements is, however, valid (in fact, mandatory for angled brackets) in the XML syntax of HTML; and indeed, adding an xmlns="http://www.w3.org/1999/xhtml" attribute to the <html> element results in valid XML. But if the intention of ReactDOMServer is indeed to render XML syntax, that should be explicitly noted in the documentation, because there are a number of tools (such as Next.js) which serve the output of these functions with content-type text/html.

andreubotella avatar Mar 06 '19 22:03 andreubotella

@andreubotella This is a different problem, you should use dangerouslySetInnerHTML. Can an admin mark the comments as "resolved"?

oliviertassinari avatar Mar 06 '19 22:03 oliviertassinari

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contribution.

stale[bot] avatar Jan 10 '20 06:01 stale[bot]

Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!

stale[bot] avatar Jan 17 '20 06:01 stale[bot]

This is not a bug in React. Using an entity reference for & (e.g. &amp;) is the correct behavior for xhtml documents:

In both SGML and XML, the ampersand character ("&") declares the beginning of an entity reference (e.g., ® for the registered trademark symbol "®"). Unfortunately, many HTML user agents have silently ignored incorrect usage of the ampersand character in HTML documents - treating ampersands that do not look like entity references as literal ampersands. XML-based user agents will not tolerate this incorrect usage, and any document that uses an ampersand incorrectly will not be "valid", and consequently will not conform to this specification. In order to ensure that documents are compatible with historical HTML user agents and XML-based user agents, ampersands used in a document that are to be treated as literal characters must be expressed themselves as an entity reference (e.g. "&amp;"). For example, when the href attribute of the a`` element refers to a CGI script that takes parameters, it must be expressed as http://my.site.dom/cgi-bin/myscript.pl?class=guest&amp;name=user rather than as http://my.site.dom/cgi-bin/myscript.pl?class=guest&name=user.

In the HTML spec you do not need to use a character reference for & as long as what follows it is not a string that forms a named character reference.

The example they give is:

<a href="?bill&ted">Bill and Ted</a> <!-- &ted is ok, since it's not a named character reference -->
<a href="?art&amp;copy">Art and Copy</a> <!-- the & has to be escaped, since &copy is a named character reference -->

Personally, I feel like React made the right call with escaping & since that works in both XHTML and HTML5.

jbraithwaite avatar Aug 17 '20 21:08 jbraithwaite

In meta tags escaped paths don't work... otherwise, this bug would not have be opened.

equinusocio avatar Aug 18 '20 05:08 equinusocio

This is the change needed to get the behavior you expect:

Replace https://github.com/facebook/react/blob/ee409ea3b577f9ff37d36ccbfc642058ad783bb0/packages/react-dom/src/server/ReactPartialRenderer.js#L383

with an escape hatch:

if (tagVerbatim === 'meta' && propKey === 'content') {
  markup = 'content="' + propValue + '"';
} else {
  markup = createMarkupForProperty(propKey, propValue);
}

This would explicitly exempt the meta tag's content attribute from being properly escaped which wouldn't help @oliviertassinari's issue of wanting <span data-src={'&'}></span>.

A more generic solution would involve having something like dangerouslySetAttributes

<span
  dangerouslySetAttributes={{__attributes: [{name: 'data-src', value: '&'}]}}
/>

This could easily lead to parsing errors and unexpected results if any value after the & is a named character reference e.g. &copy (without the ;)

Again, the issue was with the HTML parser FB was using for the Sharing Debugger, not React. It is properly parsing the escaped paths now.

jbraithwaite avatar Aug 18 '20 05:08 jbraithwaite

@equinusocio Anything reading the meta tags from the HTML should handle decoding the HTML encoded characters that React produces here. It's a bug with anything consuming the HTML if it doesn't decode the &amp;s correctly. If you're trying to test your work and want to read HTML-decoded value, then in dev tools element inspector, pick the meta tag element, and in the console run $0.content. That will return the proper value because the browser handles HTML decoding properly as anything parsing the HTML should.

Macil avatar Mar 24 '21 23:03 Macil

Hello, I'm having this problem and can't find any fix. I'm on React 17.0.1. Would anyone have a hint ?

YoannBuzenet avatar May 04 '21 09:05 YoannBuzenet

I’ve spent the last little while trying to figure out this issue since the encoded URLs cause LinkedIn to not see Open Graph images.

I’ve come up with a workaround for our Gatsby site that uses ReactDOMServer.renderToStaticMarkup, replacing &amp; to & in the string, then rendering using dangerouslySetInnerHTML:

const headHtml = ReactDOMServer.renderToStaticMarkup(headComponents).replace(
  /&amp;/g,
  '&'
)

return (
  <html>
    <head dangerouslySetInnerHTML={{ __html: headHtml }} />

...

And here’s the full html.js for a Gatsby project: https://github.com/notsidney/gatsby-meta-encoded-url-workaround/blob/main/src/html.js#L6-L25

notsidney avatar Jun 11 '21 03:06 notsidney

Do I understand correctly that the issue is primarily due to the content scrapers that don’t consider encoded characters? So the React output is technically correct but some scrapers are too naïve? Or is there a reason why React behavior is technically incorrect?

gaearon avatar Sep 06 '21 22:09 gaearon

Yes and no. For example https://resoc.io/setup will fail showing you an image if you provide encoded characters (that means this service is unusable for React users). Same behavior is for Next.JS.

Inspired by https://github.com/vercel/next.js/issues/2006#issuecomment-900696618 <meta property="og:image" content='/api/share?one=1&two=2' /> will become <meta property="og:image" content='/api/share?one=1&amp;two=2' />.

In the api/share.ts, (automatically) parsed request (req.query), will become an object with keys { one: 1, 'amp;two': 2 }. It is incorrectly parsed and unusable. Libraries should handle this fine, but not all of them do. It is a bigger problem, but quite unique.

Unfortunately there is no easy solution (dangerouslySetInnerHTML, decoding escaped HTML... - it does not suit for every problem).

ihmpavel avatar Sep 06 '21 22:09 ihmpavel

Yes and no

"No" to which part?

For example https://resoc.io/setup will fail showing you an image if you provide encoded characters (that means this service is unusable for React users).

I understand that particular services may not work. What I'm asking is — technically would you agree that the problem is on their end, or is the problem unambiguously on our end? I don't mean to turn this into "not our problem, bye" kind of argument, but I think we need to get clarity on the ideal situation before we can move forward with any plan here. So is React producing technically correct output (and resoc's parser is wrong) or is React producing technically incorrect output? I understand this question may not be important to your use case (it doesn't work either way) but it is important to me.

Same behavior is for Next.JS.

I don't know what you mean by this. Next.JS is using React so it's expected that it would have the same behavior. You mention "same behavior" after resoc.io which makes it sound like you're saying Next.JS is also incorrectly parsing something, but Next.JS is producing, not consuming, and it's producing via React, so I'm not sure if there's any extra meaning I'm missing here. I'd expect Next.JS to not be relevant in this issue because the issue is about React behavior itself, and discussing React alone should be enough to figure out the path forward here. But maybe I missed something!

Inspired by vercel/next.js#2006 (comment) <meta property="og:image" content='/api/share?one=1&two=2' /> will become <meta property="og:image" content='/api/share?one=1&amp;two=2' />.

OK so maybe this is what you mean by "for Next.JS". I think you're saying: "when the URL is encoded, the request received by my API built with Next.JS contains incorrect parameter". However, this assertion is missing an important detail: what exactly is sending a request to your Next.JS API? Is this Facebook? Twitter? LinkedIn? resoc.io? This is important — I need to understand which case exactly you are trying to solve, if there is a concrete one.

In the api/share.ts, (automatically) parsed request (req.query), will become an object with keys { one: 1, 'amp;two': 2 }. It is incorrectly parsed and unusable. Libraries should handle this fine, but not all of them do.

It is unclear which libraries you're referring to here. Next.JS? I believe Next.JS's behavior is correct here — if the request itself has an encoded string, Next.JS isn't supposed to unencode it. I believe unencoding should've been done by whatever thing that parsed your meta tag. This is why I'm asking what is the exact service you're trying to make this work with. And whether the same problem exists for other services now. E.g. if FB and Twitter treat this correctly, I think this could be a strong enough argument that you should report the problem to the faulty service instead, and have them fix it.

gaearon avatar Sep 06 '21 23:09 gaearon

In my view it's just naïve scrapers, some if which are run by Facebook, see https://github.com/facebook/react/issues/6873#issuecomment-403089412

depoulo avatar Sep 07 '21 14:09 depoulo

fwiw afaict this makes you unable to write inline scripts in NextJS.

devanshj avatar Nov 03 '21 01:11 devanshj

Whilst React's approach might be technically correct, it doesn't work in the real world, because we have naive scrapers, from massive organisations like LinkedIn, Slack and WhatsApp. So you can choose to be technically correct and pedantic or to have your code work, in the real world, right now. I for one choose the latter.

I'm trying to get open graph images with query params to work, now I need to look for a hacky workaround because I can't go and force these massive tech organisations to "fix" their code.

phawk avatar Feb 23 '22 10:02 phawk

I'm wondering if most of the people who believe this is an issue for themselves are mistaken. I first found this issue because I was trying to figure out why Twitter wasn't showing an image from one of my <meta> tags. I viewed the HTML my server was rendering with React, copied the URL from the content="..." part of my meta tag, and found that it didn't work in my browser, so I assumed the reason that didn't work was the same reason Twitter wasn't loading my image. However, the reason that copied URL didn't work in my browser was just because I copied the raw HTML instead of the HTML-decoded string, and it was not the reason that Twitter wasn't loading my image. The actual issue was that I wasn't using the proper set of <meta> tags that Twitter respected. (There were no entries in my request logs from Twitter. If the issue had been that they weren't properly doing HTML decoding, then I would have seen entries in the request logs for URLs containing &amp; where there should have been &.)

I also wonder if some people are improperly HTML-encoding some strings before giving them to React and then thinking the double-encoding is caused by React doing HTML-encoding. People of both of these situations are present in the related thread https://github.com/facebook/react/issues/6873, so it seems likely that at least a decent portion of people in this thread have the same confusions.

Before suggesting putting workarounds in React for other services' parsers, we should try to be sure the problem is not one of these other dev-side confusions, we should figure out a list of services we're sure have the issue, we should try to report the issue to them first, and only then move onto figuring out workarounds if the list of services with problems is still large.


Several people have suggested that LinkedIn is unable to handle properly HTML-encoded meta tags, but from my tests they seem to handle things fine. I created this test HTML page:

<!DOCTYPE html>
<html>
  <head>
    <title>test page</title>
    <meta property="og:title" content="Title of the article" />
    <meta property="og:image" content="/1234567.png?a=b&amp;c=e" />
    <meta
      property="og:description"
      content="Description that will show in the preview"
    />
  </head>
  <body>
    this is a test page
  </body>
</html>

The HTML for the meta og:image tag has "/1234567.png?a=b&amp;c=e", which should decode to "/1234567.png?a=b&c=e".

I made a directory, put that file as "index.html" in it, put a random image named "1234567.png" in the directory too, I ran npx http-server . to host that directory through a local http server, and then I ran ngrok http 8080 to get a public URL that leads to that HTTP server. I then put the URL into https://www.linkedin.com/post-inspector/.

My local HTTP server's logs immediately showed the following, confirming that it correctly decoded:

[2022-02-23T23:33:31.661Z]  "GET /" "LinkedInBot/1.0 (compatible; Mozilla/5.0; Apache-HttpClient +http://www.linkedin.com)"
[2022-02-23T23:33:32.181Z]  "GET /1234567.png?a=b&c=e" "LinkedInBot/1.0 (compatible; Mozilla/5.0; Apache-HttpClient +http://www.linkedin.com)"

Additionally, the LinkedIn Post Inspector page correctly showed the detected image URL as https://a126-omitted-3fb7.ngrok.io/1234567.png?a=b&c=e.

It seems like LinkedIn is handling things correctly and working with the standard HTML-encoding behavior React is following, and that no change or workaround in React is necessary to work here with them. If there are other parts of LinkedIn that do have an issue, then the issue could be demonstrated with a test like this.

Macil avatar Feb 23 '22 23:02 Macil

Same issue, using og:image where the url uses query parameters, twitter sees these and sends them to my server which doesn't recognize &amp;query=value instead of &query=value.

apecollector avatar May 24 '22 08:05 apecollector

@apecollector I had the similar experience with renderToStaticMarkup method, which escapes html tags, new line etc. I used unescape from lodash to fix it.

_.unescape('&amp;query=value');

outputs

&query=value

talentedandrew avatar Jun 16 '22 09:06 talentedandrew

Hi all, I know this is a 4 year old issue and probably most of you have probably moved on, but I'm new to it and feel I must be missing something obvious.

I suspect we're all adding content security policies to our React apps by now, and those policies have a content value often with single quotes in them. I'm running Next.JS with static site generation, and their advice is to add it into /pages/_document.tsx. But this happens:

<meta httpEquiv="Content-Security-Policy" content="default-src 'self'" />

becomes

<meta httpEquiv="Content-Security-Policy" content="default-src &amp;self&amp;" />

What's the recommended way to do this in 2022?

sc0ttdav3y avatar Sep 09 '22 07:09 sc0ttdav3y

Hi, I have the same issue as @sc0ttdav3y has, but in my case single quotes are escaping with &#x27; (Next.JS 12.3.0, ssg).

devilportez avatar Sep 14 '22 06:09 devilportez

(I just tried the same thing with NextJS and the exact output I got is <meta http-equiv="Content-Security-Policy" content="default-src &#x27;self&#x27;"/>, consistent with https://github.com/facebook/react/issues/13838#issuecomment-1246305603, so I'm going to assume the different value in https://github.com/facebook/react/issues/13838#issuecomment-1241593329 was a copy-paste error of some kind.)

@sc0ttdav3y @devilportez It should be fine that the value is html-encoded like that. The browser will decode html entities when reading the html and interpret the value of the meta tag the same way as "default-src 'self'". You should see the CSP policy be in effect: if you make the page use something denied by the pollicy (like inline styles and scripts, which NextJS uses by default) then an error will appear in the console.

Macil avatar Sep 14 '22 07:09 Macil