migrate
Unable to scrape URI - Medium posts
This tool fails when a Medium user's posts have been added to a publication that uses a custom domain.
For example, https://medium.com/@anangsha/i-owe-an-apology-to-mostly-harmless-c5503f1f2acc redirects to https://baos.pub/i-owe-an-apology-to-mostly-harmless-c5503f1f2acc, which belongs to a Medium publication with its own custom domain.
The tool fails for all such posts.
I figured out that it was failing because of a redirect loop through Medium's global-identity URL:
https://medium.com/m/global-identity?redirectUrl=https%3A%2F%2Fbaos.pub%2Fdont-start-these-3-brilliant-books-on-a-holiday-ed84c3b143c9
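To see where that URL is trying to send the scraper, the redirectUrl query parameter can be decoded. This is only a small sketch using Node's built-in URL class; the variable names are illustrative and not part of the migrate tooling.

```javascript
// Decode the target of the global-identity redirect with Node's built-in URL class.
const loopUrl = new URL(
    'https://medium.com/m/global-identity?redirectUrl=https%3A%2F%2Fbaos.pub%2Fdont-start-these-3-brilliant-books-on-a-holiday-ed84c3b143c9'
);

// searchParams.get() returns the percent-decoded value of the parameter
const target = loopUrl.searchParams.get('redirectUrl');
console.log(target);
// https://baos.pub/dont-start-these-3-brilliant-books-on-a-holiday-ed84c3b143c9
```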
The redirect needs cookies. I tried to set them in mg-webscraper, but it sends cookies only on the first request, not on the redirects that follow.
So, as a hacky solution, I used a Burp proxy match-and-replace rule to swap the header User-Agent: Crawler/1.0 for Cookie: XXX, where XXX is my own Medium cookie string.
I then edited index.js in tinyreq/lib/ to route the requests through the Burp proxy:
// Route all outgoing HTTP(S) requests through the local Burp proxy
const proxy = require("node-global-proxy").default;
proxy.setConfig({
    http: "http://127.0.0.1:8080",
    https: "http://127.0.0.1:8080",
});
proxy.start();
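Since HTTPS requests would otherwise fail (presumably on certificate validation behind the intercepting proxy), I ran the tool with Node's TLS certificate checks disabled:
NODE_TLS_REJECT_UNAUTHORIZED='0' yarn dev medium folder.zip
With this setup the migration ran successfully and there were no issues.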
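A less invasive alternative to the proxy might be to follow the redirects by hand and re-send the Cookie header on every hop. The following is only a sketch under assumptions, not how mg-webscraper or tinyreq work: fetchWithCookies is a hypothetical helper, and it assumes Node 18+ where a global fetch exists and returns the raw 3xx response when called with redirect: 'manual'.

```javascript
// Hypothetical helper: follow redirects manually so the cookie is sent on each hop.
// Note that the same cookie goes to every host in the redirect chain.
async function fetchWithCookies(startUrl, cookie, {maxRedirects = 10, fetchImpl = globalThis.fetch} = {}) {
    let url = startUrl;

    for (let hop = 0; hop <= maxRedirects; hop += 1) {
        const res = await fetchImpl(url, {
            redirect: 'manual', // do not let fetch follow redirects itself
            headers: {
                'user-agent': 'Crawler/1.0',
                cookie
            }
        });

        const location = res.headers.get('location');
        const isRedirect = res.status >= 300 && res.status < 400 && location;

        if (!isRedirect) {
            // Final response reached; report the URL it was served from
            return {url, response: res};
        }

        // Location may be relative, so resolve it against the current URL
        url = new URL(location, url).toString();
    }

    throw new Error(`Gave up after ${maxRedirects} redirects starting from ${startUrl}`);
}
```

The maxRedirects cap is there so that a genuine loop, like the one described above, ends with an error instead of spinning forever.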
I was able to work around this by making the following change at https://github.com/TryGhost/migrate/blob/4fb9144fbc6e24e935f64e277157e8d45e64d803/packages/mg-webscraper/lib/WebScraper.js#L109-L114:
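const reqOpts = {
    // Point the scraper at the publication's custom domain instead of the medium.com profile URL
    url: url.replace('https://medium.com/@someSlug', 'https://your.customdomain.com'),
    headers: {
        'user-agent': 'Crawler/1.0',
        // Medium session cookie, sent with the request
        'cookie': 'your cookie string here'
    }
};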
This will not work if the Medium author publishes their articles across multiple publications rather than on a single custom domain, because the replacement hard-codes one target domain.
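For anyone applying the same workaround, the rewrite could be pulled into a small helper so the profile prefix, domain, and cookie are passed in rather than edited inline. This is just a sketch: buildReqOpts and its option names are hypothetical and do not exist in mg-webscraper, and it shares the single-domain limitation noted above.

```javascript
// Hypothetical helper: build request options for one author/custom-domain pair.
function buildReqOpts(url, {profilePrefix, customDomain, cookie}) {
    // Only rewrite URLs that sit under the author's profile prefix
    const rewritten = url.startsWith(profilePrefix + '/')
        ? customDomain + url.slice(profilePrefix.length)
        : url;

    return {
        url: rewritten,
        headers: {
            'user-agent': 'Crawler/1.0',
            cookie
        }
    };
}

const opts = buildReqOpts('https://medium.com/@someSlug/my-post-abc123', {
    profilePrefix: 'https://medium.com/@someSlug',
    customDomain: 'https://your.customdomain.com',
    cookie: 'your cookie string here'
});
console.log(opts.url);
// https://your.customdomain.com/my-post-abc123
```

URLs that do not start with the given prefix are passed through unchanged, so posts outside that profile are still requested at their original address.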