"Paths" in __NEXT_DATA__ are picked up by Google Search Console as internal links
Verify canary release
- [X] I verified that the issue exists in the latest Next.js canary release
Provide environment information
Operating System:
Platform: linux
Arch: x64
Version: #1 SMP Wed Feb 19 06:37:35 UTC 2020
Binaries:
Node: 12.17.0
npm: 6.14.4
Yarn: 1.22.10
pnpm: N/A
Relevant packages:
next: 12.2.0
eslint-config-next: N/A
react: 17.0.2
react-dom: 17.0.2
What browser are you using? (if relevant)
N/A
How are you deploying your application? (if relevant)
N/A
Describe the Bug
I recognize that this might not even be something Next.js can "fix", but I just want to highlight a problem we're seeing with our Next.js sites. I know there are people from Google who actively support Next.js, and they don't expose their own services to issues / feature requests, so I thought I might as well try here.
tl;dr: anything that even remotely resembles a path, returned as part of `__NEXT_DATA__`, will be picked up by Google Search Console and used as part of its crawling scope.

The simplest `__NEXT_DATA__` is something that looks like

`{ ... "page":"/docs/[[...slug]]" ...}`

as seen here: https://nextjs.org/docs/basic-features/data-fetching/get-static-props.

During its crawls, Google matches, within `__NEXT_DATA__`, seemingly anything that starts with a slash, and interprets those matches as internal links. In Search Console, this manifests as a non-resolvable page, e.g. https://nextjs.org/docs/[[...slug]], with a referrer of https://nextjs.org/docs/basic-features/data-fetching/get-static-props (and any other page under /docs).
Note that in the case of this nextjs.org page, it might not manifest as I've described, since /docs is a valid page. For my particular use case, using a custom server, we're seeing the issue come up as https://blog.com/post with referrers of blog posts (e.g. https://blog.com/the-post-slug) -- /post is our Next.js page (pages/post.js), but is otherwise not a valid URL. Let me know if you want some concrete examples; I've tried to make this as generic as possible.
Why is this an issue?
Internal 404s are interpreted as a site quality signal. Too many of them and Google figures your site is not being maintained and lowers your overall search ranking.
I really don't want to have to implement dummy pages for /post and other slash-prefixed strings we have in `__NEXT_DATA__`.
Expected Behavior
Path-like strings in `__NEXT_DATA__` should not be considered valid internal links by Google.
Link to reproduction
https://nextjs.org/docs/basic-features/data-fetching/get-static-props
To Reproduce
- build and deploy NextJS site
- have site crawled by Google
- note that Google finds and crawls internal Next.js page paths, and anything else returned from data-fetching functions that looks like a path
- these pages are marked as non-crawlable in Google Search Console
There is also a discussion about this at https://github.com/vercel/next.js/discussions/39377.
For a potential fix inside Next: in similar circumstances, I recall that escaping the forward slashes (`"\/docs\/[[...slug]]"`) was sufficient to prevent Google from attempting to crawl the embedded URLs.
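To illustrate the idea with a hypothetical helper (not a Next.js API): JSON permits `\/` as an escape for `/`, so slashes can be escaped in the serialized payload without any client-side changes.

```javascript
// Sketch: escape forward slashes in the serialized JSON so the raw HTML
// no longer contains bare path-like strings.
const escapeSlashes = (json) => json.replace(/\//g, "\\/");

const data = JSON.stringify({ page: "/docs/[[...slug]]" });
const escaped = escapeSlashes(data);
// escaped: {"page":"\/docs\/[[...slug]]"}

// "\/" is a valid JSON escape, so the payload still parses to the original:
JSON.parse(escaped).page; // "/docs/[[...slug]]"
```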
@karoun Can you please provide a recipe for how I can escape with forward slashes?
@p0zi not for Next, unfortunately. The workaround implemented was for a different framework, and involved doing `urls.replaceAll('/', '"')` on the server and `urls.replaceAll('"', '/')` on the client. That way, the contents of the HTML didn't include crawlable URLs, but at JavaScript runtime the URLs were hydrated and correct.
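For concreteness, that round-trip can be sketched as below (this assumes the URL strings never legitimately contain a double quote, which the swap would destroy):

```javascript
// Server side: hide slashes so the rendered HTML contains no path-like strings.
const obfuscate = (urls) => urls.replaceAll("/", '"');

// Client side: restore the original URLs at runtime, before they are used.
const restore = (urls) => urls.replaceAll('"', "/");

const original = "/docs/basic-features/data-fetching";
const hidden = obfuscate(original); // '"docs"basic-features"data-fetching'
restore(hidden) === original;       // true
```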
Thank you very much for the explanation!
So far I added rel="nofollow" and will observe how the Google crawler behaves.
<script id="__NEXT_DATA__" type="application/json" rel="nofollow">{"props": ... }</script>
@p0zi hi, how exactly did you add this to the next app?
@d-vorobyov At the moment the solution is dirty: I modified the module source code in node_modules/next/dist/pages/_document.js, in render(). This is valid for Next.js v12.
```js
render() {
    const { assetPrefix, inAmpMode, buildManifest, unstable_runtimeJS, docComponentsRendered, devOnlyCacheBusterQueryString, disableOptimizedLoading, crossOrigin } = this.context;
    const disableRuntimeJS = unstable_runtimeJS === false;
    docComponentsRendered.NextScript = true;
    if (process.env.NEXT_RUNTIME !== "edge" && inAmpMode) {
        if (process.env.NODE_ENV === "production") {
            return null;
        }
        const ampDevFiles = [
            ...buildManifest.devFiles,
            ...buildManifest.polyfillFiles,
            ...buildManifest.ampDevFiles,
        ];
        return /*#__PURE__*/ _react.default.createElement(_react.default.Fragment, null, disableRuntimeJS ? null : /*#__PURE__*/ _react.default.createElement("script", {
            id: "__NEXT_DATA__",
            type: "application/json",
            nonce: this.props.nonce,
            crossOrigin: this.props.crossOrigin || crossOrigin,
            dangerouslySetInnerHTML: {
                __html: NextScript.getInlineScriptSource(this.context)
            },
            "data-ampdevmode": true
        }), ampDevFiles.map((file)=>/*#__PURE__*/ _react.default.createElement("script", {
            key: file,
            src: `${assetPrefix}/_next/${file}${devOnlyCacheBusterQueryString}`,
            nonce: this.props.nonce,
            crossOrigin: this.props.crossOrigin || crossOrigin,
            "data-ampdevmode": true
        })));
    }
    if (process.env.NODE_ENV !== "production") {
        if (this.props.crossOrigin) console.warn("Warning: `NextScript` attribute `crossOrigin` is deprecated. https://nextjs.org/docs/messages/doc-crossorigin-deprecated");
    }
    const files = getDocumentFiles(this.context.buildManifest, this.context.__NEXT_DATA__.page, process.env.NEXT_RUNTIME !== "edge" && inAmpMode);
    return /*#__PURE__*/ _react.default.createElement(_react.default.Fragment, null, !disableRuntimeJS && buildManifest.devFiles ? buildManifest.devFiles.map((file)=>/*#__PURE__*/ _react.default.createElement("script", {
        key: file,
        src: `${assetPrefix}/_next/${encodeURI(file)}${devOnlyCacheBusterQueryString}`,
        nonce: this.props.nonce,
        crossOrigin: this.props.crossOrigin || crossOrigin
    })) : null, disableRuntimeJS ? null : /*#__PURE__*/ _react.default.createElement("script", {
        id: "__NEXT_DATA__",
        type: "application/json",
        nonce: this.props.nonce,
        crossOrigin: this.props.crossOrigin || crossOrigin,
        dangerouslySetInnerHTML: {
            __html: NextScript.getInlineScriptSource(this.context)
        },
        rel: "nofollow" // the added attribute
    }), disableOptimizedLoading && !disableRuntimeJS && this.getPolyfillScripts(), disableOptimizedLoading && !disableRuntimeJS && this.getPreNextScripts(), disableOptimizedLoading && !disableRuntimeJS && this.getDynamicChunks(files), disableOptimizedLoading && !disableRuntimeJS && this.getScripts(files));
}
```
Any luck?
Unfortunately it did not help. But I am considering another approach:
- Backend side:
a) Data source with an obfuscated link that will feed an overridden "next/link" component. Since I am using JSON:API from Drupal, I can use a path field enhancer to provide obfuscation with, for example, a base64 encoder, preferably with some salt so Google cannot easily decode it.
- Frontend side:
a) Since our incoming data is already obfuscated, we do not worry about `__NEXT_DATA__`.
b) Now it's time to write our custom (overriding the default) "next/link" component.
tsconfig.json:

```json
{
  "compilerOptions": {
    "paths": {
      "next/link": ["soft4net/components/Link/index.tsx"]
    }
  }
}
```
next.config.js:

```js
const nextConfig = {
  webpack: (config) => {
    config.resolve.alias = {
      ...config.resolve.alias,
      'next/link': path.resolve(__dirname, 'soft4net/components/Link/'),
    }
    return config
  },
};
```
soft4net/components/Link/index.tsx:

```tsx
import React from 'react'
import { useRouter } from 'next/router'
import Link from 'node_modules/next/dist/client/link' // import Link from 'next/link'

const ENCODE_SALT = '1qaz2wsx3edc4rfv5tgb'; // should match the backend value; length is a compromise to minimize the size of __NEXT_DATA__

export const urlEncode = (url) => {
  return btoa(url);
}

export const urlDecode = (urlEncoded) => {
  return atob(urlEncoded);
}

export function isBase64(str) {
  if (str === '' || str.trim() === '') {
    return false;
  }
  try {
    return btoa(atob(str)) == str;
  } catch (err) {
    return false;
  }
}

const CustomLink = ({ href, passHref = null, children, ...props }) => {
  const { locale } = useRouter();

  let _href = href;
  if (isBase64(_href)) {
    const hrefEncoded = _href;
    const hrefDecoded = urlDecode(hrefEncoded);
    _href = hrefDecoded;
  }

  return (
    <Link
      {...props}
      href={_href}
    >
      {children}
    </Link>
  )
}

export * from 'node_modules/next/dist/client/link';
export default CustomLink;
```
The idea is simple: if a path is obfuscated, we know it should be decoded, and we can also use this component for standard, non-obfuscated paths.
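As a quick sanity check, the helper trio above round-trips in plain Node (btoa/atob are globals since Node 16). The expected values assume the helpers exactly as defined in the component; note that isBase64 is a heuristic and can false-positive on short strings made only of base64-alphabet characters.

```javascript
// Same helper logic as in the CustomLink component above.
const urlEncode = (url) => btoa(url);
const urlDecode = (urlEncoded) => atob(urlEncoded);
const isBase64 = (str) => {
  if (str === "" || str.trim() === "") return false;
  try {
    return btoa(atob(str)) == str;
  } catch (err) {
    return false;
  }
};

const encoded = urlEncode("/post"); // "L3Bvc3Q="
isBase64(encoded);          // true  -> CustomLink decodes it
isBase64("/the-post-slug"); // false -> plain paths pass through untouched
urlDecode(encoded);         // "/post"
```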
any news?
The whole `__NEXT_DATA__` should be obfuscated, IMO. Google has already picked up information on the Redux state on our site, and translation phrases for parts of the website that are not available before logging in.

I know I can do that on the server by modifying `getInlineScriptSource`, but is there a similar function that is used to read the content? It seems to happen in the client/index.tsx file, not something we can easily modify except by patching the library itself.
On further tests, the following might solve your problem:
```tsx
import Document, { Head, Html, Main, NextScript } from "next/document";
import React from "react";

export default class MyDocument extends Document {
  public render(): JSX.Element {
    return <Html>
      <Head />
      <body>
        <Main />
        <NextScript />
        <script
          lang="javascript"
          type="text/javascript"
          defer={false}
          async={false}
          dangerouslySetInnerHTML={
            {
              __html: `
                const element = document.getElementById("__NEXT_DATA__");
                element.innerText = atob(element.innerText);
              `,
            }
          }
        />
      </body>
    </Html>;
  }
}

// eslint-disable-next-line @typescript-eslint/unbound-method
const nextInlineScriptSource = NextScript.getInlineScriptSource;
NextScript.getInlineScriptSource = (props) => {
  const value = nextInlineScriptSource(props);
  return Buffer.from(value).toString("base64");
};
```
Put this in your _document.tsx file. Feel free to replace the base64 with something more robust, but I think this is enough for me personally. If anyone can think of ways for this to not work as intended or to break, I'd really appreciate knowing.
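A minimal check of the round-trip this patch relies on: the server encodes the inline JSON with Buffer, and the injected `<script>` decodes it with `atob`. The sketch uses a stand-in ASCII-only payload; the Unicode caveat comes up further down the thread.

```javascript
// Server side, as in the getInlineScriptSource override:
const payload = JSON.stringify({ page: "/post", props: { ok: true } });
const encoded = Buffer.from(payload).toString("base64");

// Client side, as in the inline script (atob is also a global in Node >= 16):
const decoded = atob(encoded);
decoded === payload; // true for ASCII payloads
```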
@falahati any workaround for the App Router?
@falahati, I want to fix this issue, but I'm getting a Hydration error when adding the code you provided.
What is the error? The `nextInlineScriptSource` is only used for the content of the `__NEXT_DATA__` tag, from what I see in the source code, and the `<script>` should be executed before any Next-related JavaScript. So I don't know how it got picked up by the hydration code before it got executed, if that is in fact the issue.

Could it be that you were not using a _document.tsx file and somehow its mere presence broke your code? What happens if you remove the `nextInlineScriptSource` mutation and the script tag but keep the _document.tsx file intact?
Full disclosure: I am on a fairly old version (v12) of Next in this project and we are not using any of the new features; the App Router was added in v13, I think. However, I don't see how it could break your code. At least from the source code, I don't see any other place depending on the `nextInlineScriptSource` method.

Unless the App Router uses the same method somewhere else; in that case, we have to find a way to only obfuscate the first response (check the `props` variable and find a way to detect other calls). I don't really know how it works and I have not used the App Router, but I am sure checking the network activity of the application could help you find the reason. In any case, I hope this piece of code, if not useful in its entirety, could at least point toward some sort of final workaround.
If anyone tries to use the code above, be advised that `atob` is not friendly with Unicode characters. I had a few problems with it.
To prevent Unicode issues while encoding a string to base64, you can first use encodeURIComponent(), and then decode using decodeURIComponent():

```js
// Encoding
const encodedString = encodeURIComponent(string);
const base64 = Buffer.from(encodedString).toString("base64");
return base64;
```

```js
// Decoding
const encodedString = atob(base64);
const string = decodeURIComponent(encodedString);
return string;
```
And for me it also resolves react-hydration-error.
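Assembled into a pair of runnable functions (hypothetical names; the thread shows only the two fragments):

```javascript
// Encode: URI-escape first so every code point becomes plain ASCII, then base64.
const encode = (string) => {
  const encodedString = encodeURIComponent(string);
  return Buffer.from(encodedString).toString("base64");
};

// Decode: base64 back to the URI-escaped ASCII, then unescape.
const decode = (base64) => {
  const encodedString = atob(base64);
  return decodeURIComponent(encodedString);
};

const original = JSON.stringify({ title: "żółć ünïcode" });
decode(encode(original)) === original; // true
```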
Unfortunately, Base64 will always increase the size of the encoded information.

I have personally changed the client-side script to this, solving the issue:
```js
const element = document.getElementById("__NEXT_DATA__");
const data = atob(element.innerText);
const bytes = new Uint8Array(data.length);
for (let b = 0; b < bytes.length; ++b) {
  bytes[b] = data.charCodeAt(b);
}
element.innerText = new TextDecoder('utf-8').decode(bytes);
```
The server-side code doesn't need to be changed: `Buffer` already encodes the data as binary. This converts the decoded ASCII string back into binary and this time decodes it properly as UTF-8 binary, producing a UTF-8 string.
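A self-contained check of this binary-safe variant (runs in Node, where atob and TextDecoder are globals, with a stand-in payload instead of the real `__NEXT_DATA__` element):

```javascript
// Server side (unchanged): Buffer.from() produces UTF-8 bytes before base64.
const payload = JSON.stringify({ title: "héllo ünïcode" });
const encoded = Buffer.from(payload).toString("base64");

// Client side: atob yields one char per byte; rebuild the bytes, decode as UTF-8.
const data = atob(encoded);
const bytes = new Uint8Array(data.length);
for (let b = 0; b < bytes.length; ++b) {
  bytes[b] = data.charCodeAt(b);
}
const decoded = new TextDecoder("utf-8").decode(bytes);
decoded === payload; // true, even with non-ASCII characters
```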