"Paths" in __NEXT_DATA__ are picked up by Google Search Console as internal links

Open tills13 opened this issue 2 years ago • 2 comments

Verify canary release

  • [X] I verified that the issue exists in the latest Next.js canary release

Provide environment information

Operating System:
  Platform: linux
  Arch: x64
  Version: #1 SMP Wed Feb 19 06:37:35 UTC 2020
Binaries:
  Node: 12.17.0
  npm: 6.14.4
  Yarn: 1.22.10
  pnpm: N/A
Relevant packages:
  next: 12.2.0
  eslint-config-next: N/A
  react: 17.0.2
  react-dom: 17.0.2

What browser are you using? (if relevant)

N/A

How are you deploying your application? (if relevant)

N/A

Describe the Bug

I recognize that this might not even be something NextJS can "fix" but I just want to highlight a problem we're seeing with our NextJS sites. I know there are people from Google who actively support NextJS, and they don't expose their own services to issues / feature requests, so I thought I might as well try here.

tl;dr anything that even remotely resembles a path, returned as part of __NEXT_DATA__, will be picked up by Google Search Console and used as part of its crawling scope.

The simplest __NEXT_DATA__ is something that looks like

{ ... "page":"/docs/[[...slug]]" ...}

As seen here: https://nextjs.org/docs/basic-features/data-fetching/get-static-props.

During its crawls, Google matches within __NEXT_DATA__ seemingly anything that starts with a slash and interprets it as an internal link. In the Search Console, this manifests as a non-resolvable page, e.g. https://nextjs.org/docs/[[...slug]], with a referrer of https://nextjs.org/docs/basic-features/data-fetching/get-static-props (and any other page under /docs).

Note that in the case of this NextJS page, it might not manifest as I've described as /docs is a valid page. For my particular use case, using a custom server, we're seeing the issue come up as https://blog.com/post with referrers of blog posts (e.g. https://blog.com/the-post-slug) -- /post is our NextJS page (pages/post.js), but is otherwise not a valid URL. Let me know if you want some concrete examples, I've tried to make this as generic as possible.
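
To get a feel for how much of a page's payload is exposed this way, a small audit helper can walk the parsed __NEXT_DATA__ and list every slash-prefixed string. This is only a sketch: what Google actually matches is not documented, so the "starts with a slash" rule is an assumption taken from the description above, and findPathLikeStrings and the sample payload are made-up names and values.

```typescript
// Hypothetical audit helper: collects every slash-prefixed string found
// anywhere inside a parsed __NEXT_DATA__ payload.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

function findPathLikeStrings(value: Json, found: string[] = []): string[] {
  if (typeof value === "string") {
    // Assumption: the crawler treats any string starting with "/" as a link.
    if (value.charAt(0) === "/") found.push(value);
  } else if (Array.isArray(value)) {
    for (const item of value) findPathLikeStrings(item, found);
  } else if (value !== null && typeof value === "object") {
    for (const key of Object.keys(value)) findPathLikeStrings(value[key], found);
  }
  return found;
}

// Example payload shaped like the one described above (values are made up).
const sample: Json = {
  page: "/post",
  query: { slug: "the-post-slug" },
  props: { pageProps: { related: ["/another-post", "not-a-path"] } },
};

console.log(findPathLikeStrings(sample)); // [ '/post', '/another-post' ]
```

Running something like this against the JSON inside the __NEXT_DATA__ script tag of a deployed page shows which values are candidates for the bogus internal links.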

Why is this an issue?

Internal 404s are interpreted as a site quality signal. Too many of them and Google figures your site is not being maintained and lowers your overall search ranking.

I really don't want to have to implement dummy pages for /post and other slash-prefixed strings we have in __NEXT_DATA__

Expected Behavior

Path like strings in __NEXT_DATA__ should not be considered valid internal links by Google.

Link to reproduction

https://nextjs.org/docs/basic-features/data-fetching/get-static-props

To Reproduce

  1. build and deploy NextJS site
  2. have site crawled by Google
  3. note that Google finds and crawls internal NextJS pages, and anything else returned from data-fetching fns that looks like a path
  4. these pages are marked as non-crawlable in Google Search Console

tills13 avatar Aug 31 '22 20:08 tills13

there is a discussion about this at https://github.com/vercel/next.js/discussions/39377 as well

andremendonca avatar Oct 28 '22 18:10 andremendonca

For a potential fix inside Next, in similar circumstances, I recall that escaping the forward slashes ("\/docs\/[[...slug]]") was sufficient to prevent Google from attempting to crawl the embedded URLs.
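
A minimal sketch of that escaping, independent of Next: JSON allows "\/" as an escape for "/", so the escaped text still parses back to the original values. serializeWithEscapedSlashes is a hypothetical helper name, and whether this is enough to stop Google from crawling the strings is the claim from the comment above, not something this code can verify.

```typescript
// Hypothetical helper: serialize data to JSON with every "/" written as "\/".
function serializeWithEscapedSlashes(data: unknown): string {
  // JSON permits "\/" as an escape for "/", so JSON.parse restores the original.
  return JSON.stringify(data).replace(/\//g, "\\/");
}

const payload = { page: "/docs/[[...slug]]" };
const serialized = serializeWithEscapedSlashes(payload);

console.log(serialized); // {"page":"\/docs\/[[...slug]]"}
console.log(JSON.parse(serialized).page); // /docs/[[...slug]]
```

Because the change lives entirely in the serialized text, nothing has to change on the reading side: any standard JSON parser undoes the escaping.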

karoun avatar Oct 28 '22 20:10 karoun

@karoun Can you please provide a recipe for how I can escape the forward slashes?

p0zi avatar Nov 03 '23 16:11 p0zi

@p0zi not for Next, unfortunately. The workaround implemented was for a different framework, and involved doing urls.replaceAll('/', '"') on the server and urls.replaceAll('"', '/') on the client. That way, the contents of the HTML didn't include crawl-able URLs, but at JavaScript runtime the URLs were hydrated and correct.
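
The swap described here can be sketched as a pair of helpers, one for the server and one for the client. The "|" placeholder and the names maskUrl / unmaskUrl are assumptions for illustration; the only real requirement is that the placeholder never occurs in the actual URLs.

```typescript
// Hypothetical placeholder; it must never occur in the real URLs.
const PLACEHOLDER = "|";

// Server side: replace every "/" so the HTML contains no crawlable path.
function maskUrl(url: string): string {
  return url.split("/").join(PLACEHOLDER);
}

// Client side: restore the slashes before the URL is used.
function unmaskUrl(masked: string): string {
  return masked.split(PLACEHOLDER).join("/");
}

const masked = maskUrl("/blog/the-post-slug");
console.log(masked); // |blog|the-post-slug
console.log(unmaskUrl(masked)); // /blog/the-post-slug
```

split/join is used instead of replaceAll only so the sketch also compiles against older TypeScript targets.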

karoun avatar Nov 03 '23 17:11 karoun

Thank you very much for explanation!

p0zi avatar Nov 05 '23 21:11 p0zi

So far I have added rel="nofollow" and will observe how the Google crawler behaves.

<script id="__NEXT_DATA__" type="application/json" rel="nofollow">{"props": ... }</script>

p0zi avatar Nov 09 '23 13:11 p0zi

@p0zi hi, how exactly did you add this to the next app?

d-vorobyov avatar Nov 09 '23 14:11 d-vorobyov

@d-vorobyov At the moment the solution is dirty: I modified the module source code in node_modules/next/dist/pages/_document.js, in render(). This is valid for Next.js v12.

render() {
    const {
        assetPrefix,
        inAmpMode,
        buildManifest,
        unstable_runtimeJS,
        docComponentsRendered,
        devOnlyCacheBusterQueryString,
        disableOptimizedLoading,
        crossOrigin,
    } = this.context;
    const disableRuntimeJS = unstable_runtimeJS === false;
    docComponentsRendered.NextScript = true;
    if (process.env.NEXT_RUNTIME !== "edge" && inAmpMode) {
        if (process.env.NODE_ENV === "production") {
            return null;
        }
        const ampDevFiles = [
            ...buildManifest.devFiles,
            ...buildManifest.polyfillFiles,
            ...buildManifest.ampDevFiles,
        ];
        return /*#__PURE__*/ _react.default.createElement(_react.default.Fragment, null,
            disableRuntimeJS ? null : /*#__PURE__*/ _react.default.createElement("script", {
                id: "__NEXT_DATA__",
                type: "application/json",
                nonce: this.props.nonce,
                crossOrigin: this.props.crossOrigin || crossOrigin,
                dangerouslySetInnerHTML: {
                    __html: NextScript.getInlineScriptSource(this.context)
                },
                "data-ampdevmode": true
            }),
            ampDevFiles.map((file) => /*#__PURE__*/ _react.default.createElement("script", {
                key: file,
                src: `${assetPrefix}/_next/${file}${devOnlyCacheBusterQueryString}`,
                nonce: this.props.nonce,
                crossOrigin: this.props.crossOrigin || crossOrigin,
                "data-ampdevmode": true
            })));
    }
    if (process.env.NODE_ENV !== "production") {
        if (this.props.crossOrigin) console.warn("Warning: `NextScript` attribute `crossOrigin` is deprecated. https://nextjs.org/docs/messages/doc-crossorigin-deprecated");
    }
    const files = getDocumentFiles(this.context.buildManifest, this.context.__NEXT_DATA__.page, process.env.NEXT_RUNTIME !== "edge" && inAmpMode);
    return /*#__PURE__*/ _react.default.createElement(_react.default.Fragment, null,
        !disableRuntimeJS && buildManifest.devFiles ? buildManifest.devFiles.map((file) => /*#__PURE__*/ _react.default.createElement("script", {
            key: file,
            src: `${assetPrefix}/_next/${encodeURI(file)}${devOnlyCacheBusterQueryString}`,
            nonce: this.props.nonce,
            crossOrigin: this.props.crossOrigin || crossOrigin
        })) : null,
        disableRuntimeJS ? null : /*#__PURE__*/ _react.default.createElement("script", {
            id: "__NEXT_DATA__",
            type: "application/json",
            nonce: this.props.nonce,
            crossOrigin: this.props.crossOrigin || crossOrigin,
            dangerouslySetInnerHTML: {
                __html: NextScript.getInlineScriptSource(this.context)
            },
            rel: "nofollow" // the added attribute
        }),
        disableOptimizedLoading && !disableRuntimeJS && this.getPolyfillScripts(),
        disableOptimizedLoading && !disableRuntimeJS && this.getPreNextScripts(),
        disableOptimizedLoading && !disableRuntimeJS && this.getDynamicChunks(files),
        disableOptimizedLoading && !disableRuntimeJS && this.getScripts(files));
}

p0zi avatar Nov 09 '23 14:11 p0zi

Any luck?

wieseljonas avatar Jan 19 '24 22:01 wieseljonas

Unfortunately it did not help, but I am considering another approach:

  1. Backend side:
     a) The data source provides obfuscated links that feed an overridden "next/link" component. Since I am using JSON:API from Drupal, I can use a path field enhancer to obfuscate the paths, for example with a Base64 encoder, preferably with some salt so Google cannot easily decode them.
  2. Frontend side:
     a) Since the incoming data is already obfuscated, we do not need to worry about __NEXT_DATA__.
     b) Write a custom "next/link" component that overrides the default one.

tsconfig.json:

{
    "compilerOptions": {
        "paths": {
            "next/link": [ "soft4net/components/Link/index.tsx" ]
        }
    }
}

next.config.js:

const path = require('path');

const nextConfig = {
    webpack: (config) => {
        config.resolve.alias = {
            ...config.resolve.alias,
            'next/link': path.resolve(__dirname, 'soft4net/components/Link/'),
        }
        return config
    },
};

module.exports = nextConfig;

soft4net/components/Link/index.tsx:

import React from 'react'
import { useRouter } from 'next/router'
import Link from 'node_modules/next/dist/client/link'
// import Link from 'next/link'

const ENCODE_SALT = '1qaz2wsx3edc4rfv5tgb'; // this should match the backend value; compromise on length to minimize the size of __NEXT_DATA__

export const urlEncode = (url) => {
    return btoa(url);
}

export const urlDecode = (urlEncoded) => {
    return atob(urlEncoded);
}

export function isBase64(str) {
    if (str === '' || str.trim() === '') {
	return false;
    }

    try {
	return btoa(atob(str)) == str;
    } catch (err) {
	return false;
    }
}

const CustomLink = ({ href, passHref = null, children, ...props }) => {
    const { locale } = useRouter();

    let _href = href;
    if (isBase64(_href)) {
	const hrefEncoded = _href;
	const hrefDecoded = urlDecode(hrefEncoded);
	_href = hrefDecoded;
    }

    return (
	<Link 
	    {...props}
	    href={_href} 
	>
	    {children}
	</Link>
    )

}

export * from 'node_modules/next/dist/client/link';

export default CustomLink;

The idea is simple: if a path is obfuscated, we know it should be decoded. We can also use this component for standard, non-obfuscated paths.
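
One thing to watch with detecting obfuscation by "does it look like Base64": some ordinary paths are themselves valid Base64. For example "/abc" round-trips through atob/btoa unchanged, so the isBase64 check above would treat it as encoded and decode it into garbage. A possible variation, sketched below, tags obfuscated values with an explicit prefix instead. The "b64:" marker and the names encodeHref / decodeHref are assumptions, not part of the setup described above, and the sketch assumes ASCII-only paths because btoa rejects characters outside Latin-1.

```typescript
// Hypothetical marker; any string that cannot start a real path would work.
const OBFUSCATED_PREFIX = "b64:";

// Backend-equivalent step: obfuscate a path and tag it with the marker.
function encodeHref(href: string): string {
  return OBFUSCATED_PREFIX + btoa(href);
}

// Frontend step: decode only values that carry the marker.
function decodeHref(href: string): string {
  if (href.indexOf(OBFUSCATED_PREFIX) !== 0) return href; // plain path, leave as-is
  return atob(href.slice(OBFUSCATED_PREFIX.length));
}

// "/abc" is a valid Base64 string, so a round-trip check alone misclassifies it.
console.log(btoa(atob("/abc")) === "/abc"); // true

console.log(decodeHref("/abc")); // /abc
console.log(decodeHref(encodeHref("/post"))); // /post
```

Inside the custom link component, decodeHref would take the place of the isBase64 / urlDecode pair, and plain hrefs pass through untouched.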

p0zi avatar Jan 24 '24 09:01 p0zi

any news?

muslu avatar Feb 07 '24 08:02 muslu

the whole __NEXT_DATA__ should be obfuscated IMO. Google has already picked up information on the redux state on our site and translation phrases for parts of the website that are not available before logging in.

I know I can do that on the server by modifying getInlineScriptSource, but is there a similar function that is used to read the content? it seems to happen in the client/index.tsx file, which is not something we can easily modify except by patching the library itself.

falahati avatar Jun 07 '24 01:06 falahati

on further tests, the following might solve your problem:

import Document, { Head, Html, Main, NextScript } from "next/document";
import React from "react";

export default class MyDocument extends Document {
    public render(): JSX.Element {
        return <Html>
            <Head />
            <body>
                <Main />
                <NextScript />
                <script
                    lang="javascript"
                    type="text/javascript"
                    defer={false}
                    async={false}
                    dangerouslySetInnerHTML={
                        {
                            __html: `
                                const element = document.getElementById("__NEXT_DATA__");
                                element.innerText = atob(element.innerText);
                            `,
                        }
                    }
                />
            </body>
        </Html>;
    }
}

// eslint-disable-next-line @typescript-eslint/unbound-method
const nextInlineScriptSource = NextScript.getInlineScriptSource;
NextScript.getInlineScriptSource = (props) => {
    const value = nextInlineScriptSource(props);
    return Buffer.from(value).toString("base64");
};

put this in your _document.tsx file. feel free to replace the base64 with something more concrete. but this I think is enough for me personally.

if anyone can think of ways this might not work as intended or might break, I would really appreciate knowing.

falahati avatar Jun 07 '24 02:06 falahati

@falahati any workaround for the app router?

c0b41 avatar Jun 07 '24 10:06 c0b41

@falahati, I want to fix this issue, but I'm getting a hydration error when adding the code you provided.

utkusezici avatar Jun 07 '24 13:06 utkusezici

what is the error?

the nextInlineScriptSource is only used for the content of the __NEXT_DATA__ tag from what I see in the source code, and the <script> should be executed before any Next-related JavaScript. so I don't know how the hydration code picked it up before the script ran, if that is in fact the issue.

could it be that you were not using a _document.tsx file and somehow its mere presence broke your code? what happens if you remove the nextInlineScriptSource mutation and the script tag but keep the _document.tsx file intact?

full disclosure, I am on a fairly old version (v12) of next in this project and we are not using any of the new features. for example, app router was added in v13 I think. however, I still don't see how it could break your code. at least from the source code, I don't see any other place that depends on the nextInlineScriptSource method.

falahati avatar Jun 07 '24 13:06 falahati

unless app router uses the same nextInlineScriptSource method somewhere else. in that case, we have to find a way to only obfuscate the first response (check the props variable and find a way to detect other calls). now I don't really know how it works and I have not used app router. but I am sure checking the network activity of the application could help you find the reason. in any case, I hope this piece of code if not useful in its entirety could at least be used to point to some sort of final workaround.

falahati avatar Jun 07 '24 13:06 falahati

if anyone tried to use the code above be advised that atob is not friendly with Unicode chars. I had a few problems with it.

falahati avatar Jun 10 '24 11:06 falahati

To prevent Unicode issues when encoding the string to Base64, you can first apply encodeURIComponent() on the server and then decodeURIComponent() after decoding on the client:

// Encoding
const encodedString = encodeURIComponent(string);
const base64 = Buffer.from(encodedString).toString("base64");
return base64;

// Decoding
const encodedString = atob(base64);
const string = decodeURIComponent(encodedString);
return string;

For me it also resolves the React hydration error.

Unfortunately, Base64 always increases the size of the encoded data.

p0zi avatar Jun 11 '24 09:06 p0zi

I have personally changed the client-side script to the following, which solved the issue:

const element = document.getElementById("__NEXT_DATA__");
const data = atob(element.innerText);
const bytes = new Uint8Array(data.length);
for (let b = 0; b < bytes.length; ++b) {
   bytes[b] = data.charCodeAt(b);
}
element.innerText = new TextDecoder('utf-8').decode(bytes);

the server-side code doesn't need to be changed. Buffer already encodes the data as UTF-8 bytes. the client code converts the string returned by atob (one character per byte) back into bytes and then decodes those bytes properly as UTF-8.
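
For anyone who wants to check the round trip outside of Next, here is a self-contained sketch of both directions using only TextEncoder, TextDecoder, btoa and atob. toBase64Utf8 and fromBase64Utf8 are hypothetical names; the encoder is meant to produce the same result as Buffer.from(value).toString("base64") on the server, and the decoder mirrors the client-side script above.

```typescript
// Encode a string as UTF-8 bytes, then as Base64.
function toBase64Utf8(text: string): string {
  const bytes = new TextEncoder().encode(text);
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]); // one character per byte
  }
  return btoa(binary);
}

// Decode Base64 to bytes, then interpret the bytes as UTF-8.
function fromBase64Utf8(base64: string): string {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return new TextDecoder("utf-8").decode(bytes);
}

const original = '{"title":"Zażółć gęślą jaźń"}';
const encoded = toBase64Utf8(original);

console.log(fromBase64Utf8(encoded) === original); // true
console.log(atob(encoded) === original); // false: atob alone mangles non-ASCII text
```

The second log line is the failure mode mentioned earlier in the thread: atob returns one character per byte, so multi-byte UTF-8 sequences come out as several wrong characters unless they are decoded again.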

falahati avatar Jun 12 '24 17:06 falahati