
Unable to scrape URI - Medium posts

Open kmskrishna opened this issue 2 years ago • 3 comments

This tool fails when a Medium user's posts have been added to a publication that uses a custom domain.

For example, https://medium.com/@anangsha/i-owe-an-apology-to-mostly-harmless-c5503f1f2acc redirects to https://baos.pub/i-owe-an-apology-to-mostly-harmless-c5503f1f2acc, which belongs to a Medium publication with its own custom domain.

The tool fails for all such posts.

kmskrishna · Nov 12 '22 14:11

I figured out that it fails because of a redirect loop through Medium's global-identity endpoint:

https://medium.com/m/global-identity?redirectUrl=https%3A%2F%2Fbaos.pub%2Fdont-start-these-3-brilliant-books-on-a-holiday-ed84c3b143c9
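The custom-domain address is carried in the `redirectUrl` query parameter of that URL. A minimal sketch of decoding it with Node's built-in `URL` class follows; the helper name `extractRedirectTarget` is made up for illustration and is not part of mg-webscraper.

```javascript
// Hypothetical helper: pull the destination out of a Medium global-identity URL.
function extractRedirectTarget(identityUrl) {
    const parsed = new URL(identityUrl);
    // searchParams.get() returns the percent-decoded value, or null if absent.
    return parsed.searchParams.get('redirectUrl');
}

const loopUrl = 'https://medium.com/m/global-identity?redirectUrl=https%3A%2F%2Fbaos.pub%2Fdont-start-these-3-brilliant-books-on-a-holiday-ed84c3b143c9';

console.log(extractRedirectTarget(loopUrl));
// prints: https://baos.pub/dont-start-these-3-brilliant-books-on-a-holiday-ed84c3b143c9
```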

The request needs cookies. I tried setting them in mg-webscraper, but it sends cookies only on the first request, not on the redirects that follow.
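One way to get the cookie onto every hop without a proxy would be to follow the redirects by hand. The sketch below is only an illustration, not what mg-webscraper or tinyreq do: it assumes Node 18+ (global `fetch`), and the names `fetchWithCookie`, `maxHops`, and `fetchImpl` are invented here.

```javascript
// Sketch: follow redirects manually so the Cookie header is sent on every hop.
// Assumes Node 18+ (global fetch); `fetchImpl` can be swapped out for testing.
async function fetchWithCookie(startUrl, cookie, { maxHops = 10, fetchImpl = fetch } = {}) {
    let currentUrl = startUrl;
    for (let hop = 0; hop <= maxHops; hop++) {
        const res = await fetchImpl(currentUrl, {
            redirect: 'manual', // do not let fetch follow redirects itself
            headers: { 'user-agent': 'Crawler/1.0', cookie }
        });
        const location = res.headers.get('location');
        if (res.status < 300 || res.status >= 400 || !location) {
            return res; // final response (or an error status)
        }
        // Location may be relative, so resolve it against the current URL.
        currentUrl = new URL(location, currentUrl).toString();
    }
    throw new Error(`Too many redirects (more than ${maxHops})`);
}
```

Note that this sends the same cookie string to every host in the chain, including the custom domain, so it is only reasonable for a one-off migration with your own account.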

So, as a hacky workaround, I set up a match-and-replace rule in a Burp proxy that swaps the header User-Agent: Crawler/1.0 for Cookie: XXX, using my own Medium cookies as the value.

I then edited index.js in tinyreq/lib/ to route the requests through the Burp proxy:

// Route all outgoing HTTP(S) requests through the local Burp proxy.
const proxy = require("node-global-proxy").default;

proxy.setConfig({
    http: "http://127.0.0.1:8080",
    https: "http://127.0.0.1:8080",
});
proxy.start();

Since HTTPS requests through the proxy would fail certificate verification, I disabled it for this run with the following command:

NODE_TLS_REJECT_UNAUTHORIZED='0' yarn dev medium folder.zip

With this setup the migration ran successfully, with no issues.

kmskrishna · Nov 13 '22 07:11

I was able to work around this by changing the request options at https://github.com/TryGhost/migrate/blob/4fb9144fbc6e24e935f64e277157e8d45e64d803/packages/mg-webscraper/lib/WebScraper.js#L109-L114 as follows:

const reqOpts = {
    url: url.replace('https://medium.com/@someSlug', 'https://your.customdomain.com'),
    headers: {
        'user-agent': 'Crawler/1.0',
        'cookie': 'your cookie string here',
    }
};
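To make the substitution concrete, here is a small standalone sketch using the example post from the top of this issue. The helper `toCustomDomain` is hypothetical; it does the same prefix swap as the `replace` call, but leaves URLs from other authors untouched.

```javascript
// Hypothetical helper: swap a Medium author prefix for a publication's custom domain.
function toCustomDomain(url, authorPrefix, customDomain) {
    // Leave URLs that do not belong to this author untouched.
    if (!url.startsWith(authorPrefix + '/')) {
        return url;
    }
    return customDomain + url.slice(authorPrefix.length);
}

const original = 'https://medium.com/@anangsha/i-owe-an-apology-to-mostly-harmless-c5503f1f2acc';

console.log(toCustomDomain(original, 'https://medium.com/@anangsha', 'https://baos.pub'));
// prints: https://baos.pub/i-owe-an-apology-to-mostly-harmless-c5503f1f2acc
```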

jknight12882 · Nov 16 '22 18:11

This will not work if the Medium author publishes their articles in several publications rather than on their own domain, since a single hardcoded replacement cannot cover all of them.

kmskrishna · Dec 02 '22 12:12