metascraper
[metascraper-twitter] Add specific connector
"metascraper": "^5.10.6", "metascraper-author": "^5.10.6", "metascraper-clearbit": "^5.10.6", "metascraper-date": "^5.10.6", "metascraper-description": "^5.10.6", "metascraper-image": "^5.10.6", "metascraper-logo": "^5.10.6", "metascraper-publisher": "^5.10.6", "metascraper-title": "^5.10.6", "metascraper-url": "^5.10.6", "metascraper-video": "^5.10.6", "metascraper-youtube": "^5.10.6",
Twitter URLs return NULL for all fields except logo
Example URL: https://twitter.com/realDonaldTrump/status/1222907250383245320
Expected behaviour
meta data returned
Actual behaviour
no data returned except for logo URL and publisher
Twitter rewrote their client website using React + Webpack.
metascraper is revealing that the new Twitter website has very poor HTML markup, which is why it can get almost nothing useful from it.
The solution: add a new metascraper-twitter package that specifies HTML markup rules for getting the right data.
It will be very similar to metascraper-soundcloud or metascraper-youtube.
OK thanks for the feedback, Twitter is a pain to work with and they now require you to answer an extensive set of questions simply to get a developer API. I am currently using their widgets.js code which does not integrate well with React, so it's slightly surprising to learn that their new web platform is react-based. Cheers :)
I hope to write the Twitter connector next week.
In the meantime, consider using metascraper-iframe, especially if you are consuming the data on the frontend; the iframe is the standard way that providers offer to embed their content.
If you want a zero-pain solution, consider using Microlink SDK.
Looks like there is a way to force Twitter to serve the old interface.
You need to set the following request user-agent:
Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko
I confirm it works!
https://api.microlink.io/?url=https://twitter.com/realDonaldTrump/status/1222907250383245320
Although I'm still interested in adding a specific Twitter package, this should work in the meantime.
Thanks! I have now been approved for the Twitter API, so I think I will use that for now, but your solution looks viable too. I prefer to load the data on the server and store it in MongoDB because that way I can preload the React component and avoid scroll issues due to delayed content rendering, which is my main concern with Twitter's widgets.js client-side solution.
@Hereward In that case, you may be interested in Microlink SDK, since you can load external data using setData.
@Kikobeats The user agent hack doesn't seem to work anymore. It looks like Twitter dropped support for that browser, and none of the data is in the returned HTML. Perhaps we need to use the API, like in metascraper-media-provider?
@JakeCoxon you're right; there's a new workaround to access the old version, passing X-Requested-With:
https://github.com/taspinar/twitterscraper/issues/296#issuecomment-637637929
That workaround comes at a cost, though. Ideally we want to access the current version with no tradeoffs; we need to investigate how to land a better solution there.
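The two header workarounds mentioned in this thread can be sketched as a plain request-headers builder. This is only an illustration: `legacyTwitterHeaders` is a hypothetical helper name, the `XMLHttpRequest` value for `X-Requested-With` is an assumption based on how that header is conventionally used, and, as reported above, Twitter may no longer honor either header.

```javascript
// User agent quoted earlier in this thread (Internet Explorer 11 on Windows 8.1).
const LEGACY_USER_AGENT =
  'Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko'

// Hypothetical helper: builds the request headers for both workarounds.
// The 'XMLHttpRequest' value is an assumption, not confirmed by this thread.
const legacyTwitterHeaders = ({ xhr = false } = {}) => {
  const headers = { 'user-agent': LEGACY_USER_AGENT }
  if (xhr) headers['x-requested-with'] = 'XMLHttpRequest'
  return headers
}

// Usage sketch (Node.js 18+ ships a global fetch; this performs a network request):
// const res = await fetch(tweetUrl, { headers: legacyTwitterHeaders({ xhr: true }) })
// const html = await res.text()
```

The resulting HTML would then be passed to metascraper as usual; whether it contains useful markup depends entirely on what Twitter returns for those headers.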
@Kikobeats Thanks for maintaining this helpful library!
Twitter seems to be one of the most important sites to be able to scrape previews from.
Would you be open to gauging the community's interest in sponsoring your development of the Twitter connector?
Perhaps noting that if the community can sponsor your time and effort for this feature at $n, we take it on?
I would be willing to contribute sponsorship towards this effort.
Hey, @watlandc
I'd prefer that someone take the initiative. My time is limited, and I will be happy to add new contributors to the project.
Note that the Twitter detection is not as bad as it was when the original issue was reported, but it could definitely be better:
https://api.microlink.io/?url=https://twitter.com/BytesAndHumans/status/1532772903523065858
Do you miss something specific? How are you using metascraper? What do you need?
@Kikobeats are you strictly using metascraper with the User-Agent and X-Requested-With headers to get the result shown in the image above?
I'm asking because the X-Requested-With workaround doesn't seem to work.
What's microscraper? This project is metascraper, and the hosted version is microlink.io.
I didn't use any custom header. It looks like you are experiencing issues related to getting the content, which is outside metascraper's scope. Metascraper just applies rules over the content, so getting the content is a precondition. As long as the content is right, the metascraper output will be accurate.
In any case, I recommend you take a look at html-get!
Whoops, I definitely meant metascraper; I must have mixed up microlinkhq and metascraper. My bad.
But yeah, using html-get definitely gets the right HTML! I will need to dig into why that's the case, i.e. what html-get is doing differently from a vanilla fetch.
Do you miss something specific?
As you noted here, we were having issues getting the content from Twitter. We ended up finding a workaround to get it working again.
How are you using metascraper?
Simply getting link preview content for an app that helps you stay organized https://sprout.io.
What do you need?
Looks like we're good for now (we're getting the same result as the screenshot above).
Although, returning the media inside the tweet would be a huge bonus.
Thanks!
@watlandc
I wrote a full example using metascraper + html-get, targeting a tweet with media:
const createBrowserless = require('browserless')
const getHTML = require('html-get')

// Spawn the Chromium process once
const browserlessFactory = createBrowserless()

// Kill the process when Node.js exits
process.on('exit', () => {
  console.log('closing resources!')
  browserlessFactory.close()
})

const getContent = async url => {
  // Create a browser context inside the Chromium process
  const browserContext = browserlessFactory.createContext()
  const getBrowserless = () => browserContext
  const result = await getHTML(url, { getBrowserless })
  // Close the browser context after it's used. getBrowserless ignores its
  // arguments, so the context is awaited and destroyed directly.
  await (await browserContext).destroyContext()
  return result
}

// Rule bundles are evaluated in the order they are listed
const metascraper = require('metascraper')([
  require('metascraper-author')(),
  require('metascraper-date')(),
  require('metascraper-description')(),
  require('metascraper-image')(),
  require('metascraper-logo')(),
  require('metascraper-clearbit')(),
  require('metascraper-publisher')(),
  require('metascraper-title')(),
  require('metascraper-url')()
])

getContent('https://twitter.com/BytesAndHumans/status/1532772903523065858')
  .then(async ({ html, url }) => {
    const metadata = await metascraper({ html, url })
    console.log(metadata)
    process.exit()
  })
  .catch(error => {
    console.error(error)
    process.exit(1)
  })
output:
{
  author: null,
  date: '2022-06-07T21:42:24.000Z',
  description: '“What a week”',
  image: 'https://pbs.twimg.com/media/FUWAUW7XoAAxuP_.jpg:large',
  logo: 'https://logo.clearbit.com/twitter.com',
  publisher: 'Twitter',
  title: 'Elena on Twitter',
  url: 'https://twitter.com/BytesAndHumans/status/1532772903523065858'
}
So it is possible. Is the problem still getting the content on your side?
In case you need to detect the author, then the tweet URL https://twitter.com/:username/status/:id can be considered a pattern for getting the username as author.
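The URL pattern above could be turned into an author rule roughly as follows. This is a minimal sketch, not the planned metascraper-twitter package: `getTweetAuthor` and `twitterAuthorRules` are hypothetical names, the accepted hostnames and the 15-character handle limit are assumptions, and the value returned is the handle from the URL, not the display name.

```javascript
// Matches /:username/status/:id with an optional trailing slash.
// The handle character set and length limit are assumptions.
const TWEET_PATH = /^\/([A-Za-z0-9_]{1,15})\/status\/(\d+)\/?$/

// Hypothetical helper: returns the username from a tweet URL, or null.
const getTweetAuthor = url => {
  let parsed
  try {
    parsed = new URL(url)
  } catch (_) {
    return null // not a valid absolute URL
  }
  if (!/^(www\.|mobile\.)?twitter\.com$/.test(parsed.hostname)) return null
  const match = TWEET_PATH.exec(parsed.pathname)
  return match ? match[1] : null
}

// Shaped like a metascraper rules bundle: each property maps to a list of
// rule functions that receive `{ htmlDom, url }`.
const twitterAuthorRules = () => ({
  author: [({ url }) => getTweetAuthor(url)]
})
```

For the tweet used in the example above, `getTweetAuthor` yields `BytesAndHumans`; URLs that don't match the pattern yield `null`, so other author rules can still apply.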
It's still worth writing a specific Twitter package!
I wrote a full example using metascraper + html-get, targeting a tweet with media:
Very nice!
So it is possible. Is the problem still getting the content on your side?
We're able to get the content now.
@Kikobeats I was testing the above solution, and it works for many scenarios; however, I am still having issues getting the image for some URLs such as https://twitter.com/Twitter/status/1483427748500717573
Checking the Microlink API, it appears to also have an issue getting the image from this domain.
$ curl -sL 'https://api.microlink.io?url=https://twitter.com/Twitter/status/1483427748500717573' | jq
{
  "status": "success",
  "data": {
    "title": "Tweet / Twitter",
    "description": "Don’t miss what’s happening",
    "lang": "en",
    "author": null,
    "publisher": "Twitter",
    "image": null,
    "date": "2022-06-30T13:00:05.000Z",
    "url": "https://twitter.com/twitter/status/1483427748500717573",
    "logo": {
      "url": "https://abs.twimg.com/responsive-web/client-web/icon-ios.b1fc7278.png",
      "type": "png",
      "size": 8582,
      "height": 1024,
      "width": 1024,
      "size_pretty": "8.58 kB"
    }
  },
  "statusCode": 200,
  "headers": {
    "cache-control": "no-cache, no-store, must-revalidate, pre-check=0, post-check=0",
    "content-encoding": "gzip",
    "content-security-policy": "connect-src 'self' blob: https://*.pscp.tv https://*.video.pscp.tv https://*.twimg.com https://api.twitter.com https://api-stream.twitter.com https://ads-api.twitter.com https://aa.twitter.com https://caps.twitter.com https://pay.twitter.com https://sentry.io https://ton.twitter.com https://twitter.com https://upload.twitter.com https://www.google-analytics.com https://accounts.google.com/gsi/status https://accounts.google.com/gsi/log https://app.link https://api2.branch.io https://bnc.lt wss://*.pscp.tv https://vmap.snappytv.com https://vmapstage.snappytv.com https://vmaprel.snappytv.com https://vmap.grabyo.com https://dhdsnappytv-vh.akamaihd.net https://pdhdsnappytv-vh.akamaihd.net https://mdhdsnappytv-vh.akamaihd.net https://mdhdsnappytv-vh.akamaihd.net https://mpdhdsnappytv-vh.akamaihd.net https://mmdhdsnappytv-vh.akamaihd.net https://mdhdsnappytv-vh.akamaihd.net https://mpdhdsnappytv-vh.akamaihd.net https://mmdhdsnappytv-vh.akamaihd.net https://dwo3ckksxlb0v.cloudfront.net ; default-src 'self'; form-action 'self' https://twitter.com https://*.twitter.com; font-src 'self' https://*.twimg.com; frame-src 'self' https://twitter.com https://mobile.twitter.com https://pay.twitter.com https://cards-frame.twitter.com https://accounts.google.com/ https://client-api.arkoselabs.com/ https://iframe.arkoselabs.com/ https://recaptcha.net/recaptcha/ https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/; img-src 'self' blob: data: https://*.cdn.twitter.com https://ton.twitter.com https://*.twimg.com https://analytics.twitter.com https://cm.g.doubleclick.net https://www.google-analytics.com https://www.periscope.tv https://www.pscp.tv https://media.riffsy.com https://*.giphy.com https://media.tenor.com https://c.tenor.com https://*.pscp.tv https://*.periscope.tv https://prod-periscope-profile.s3-us-west-2.amazonaws.com https://platform-lookaside.fbsbx.com https://scontent.xx.fbcdn.net https://scontent-sea1-1.xx.fbcdn.net https://*.googleusercontent.com https://imgix.revue.co; manifest-src 'self'; media-src 'self' blob: https://twitter.com https://*.twimg.com https://*.vine.co https://*.pscp.tv https://*.video.pscp.tv https://dhdsnappytv-vh.akamaihd.net https://pdhdsnappytv-vh.akamaihd.net https://mdhdsnappytv-vh.akamaihd.net https://mdhdsnappytv-vh.akamaihd.net https://mpdhdsnappytv-vh.akamaihd.net https://mmdhdsnappytv-vh.akamaihd.net https://mdhdsnappytv-vh.akamaihd.net https://mpdhdsnappytv-vh.akamaihd.net https://mmdhdsnappytv-vh.akamaihd.net https://dwo3ckksxlb0v.cloudfront.net; object-src 'none'; script-src 'self' 'unsafe-inline' https://*.twimg.com https://recaptcha.net/recaptcha/ https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/ https://client-api.arkoselabs.com/ https://www.google-analytics.com https://twitter.com https://app.link https://accounts.google.com/gsi/client https://appleid.cdn-apple.com/appleauth/static/jsapi/appleid/1/en_US/appleid.auth.js 'nonce-NDhlMTBkYWYtY2JjYy00OTlmLWFiZDUtOTgxYmMyODZiMWZi'; style-src 'self' 'unsafe-inline' https://accounts.google.com/gsi/style https://*.twimg.com; worker-src 'self' blob:; report-uri https://twitter.com/i/csp_report?a=O5RXE%3D%3D%3D&ro=false",
    "content-type": "text/html; charset=utf-8",
    "cross-origin-embedder-policy": "unsafe-none",
    "cross-origin-opener-policy": "same-origin-allow-popups",
    "date": "Thu, 30 Jun 2022 13:00:05 GMT",
    "expiry": "Tue, 31 Mar 1981 05:00:00 GMT",
    "last-modified": "Thu, 30 Jun 2022 13:00:05 GMT",
    "pragma": "no-cache",
    "server": "tsa_b",
    "set-cookie": "guest_id_marketing=v1%3A165659400581116843; Max-Age=63072000; Expires=Sat, 29 Jun 2024 13:00:05 GMT; Path=/; Domain=.twitter.com; Secure; SameSite=None\nguest_id_ads=v1%3A165659400581116843; Max-Age=63072000; Expires=Sat, 29 Jun 2024 13:00:05 GMT; Path=/; Domain=.twitter.com; Secure; SameSite=None\npersonalization_id=\"v1_8EK5HpiZcmdp/iJUPUtSKA==\"; Max-Age=63072000; Expires=Sat, 29 Jun 2024 13:00:05 GMT; Path=/; Domain=.twitter.com; Secure; SameSite=None\nguest_id=v1%3A165659400581116843; Max-Age=63072000; Expires=Sat, 29 Jun 2024 13:00:05 GMT; Path=/; Domain=.twitter.com; Secure; SameSite=None",
    "strict-transport-security": "max-age=631138519",
    "x-connection-hash": "f744deb62fd346d8e4ea08b01e182bbb049039ceaffeca5a3f8472612fa1bdc1",
    "x-content-type-options": "nosniff",
    "x-frame-options": "DENY",
    "x-powered-by": "Express",
    "x-response-time": "35",
    "x-xss-protection": "0"
  }
}
Looking at the metascraper-image package, I'm confused as to why this line isn't picking up the og:image tag:
toImage(($: any) => $('meta[property="og:image"]').attr('content')),

Have there been any updates on this?
Hello everyone,
The PR is ready: https://github.com/microlinkhq/metascraper/pull/608
I'd appreciate your feedback; otherwise it will be merged shortly.
Thanks for taking a look at it @Kikobeats - testing it out I'm noticing improvements for things like avatars on profiles, but am still having issues with many links such as the one above (https://twitter.com/Twitter/status/1483427748500717573/). Can you confirm that for you, this link is parsed correctly?
This also happens for some images. I saved the output HTML to a file to debug, and it appears that the page does not load correctly most of the time: it is missing at least some essential metadata such as the og:image tags, though sometimes they do appear. Can you confirm whether you can replicate this? Regardless, I don't believe the issue lies directly with metascraper, but I am wondering whether, in your opinion, it could relate to html-get or perhaps Browserless.
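When debugging saved HTML like this, a quick check for the og:image tags can show whether the problem is in the fetched content or in the rules. The following is a rough sketch with a hypothetical helper name (`findOgImages`); it uses a regular expression only to stay dependency-free, assumes quoted attribute values, and is not a substitute for a real HTML parser.

```javascript
// Hypothetical debugging helper: list the og:image URLs found in raw HTML.
// Assumes attributes are quoted and that tags contain no literal '>'.
const findOgImages = html => {
  const images = []
  for (const [tag] of html.matchAll(/<meta\b[^>]*>/gi)) {
    // Only exact og:image properties, not og:image:width and similar
    if (!/\bproperty\s*=\s*["']og:image["']/i.test(tag)) continue
    const content = /\bcontent\s*=\s*(?:"([^"]*)"|'([^']*)')/i.exec(tag)
    if (content) images.push(content[1] ?? content[2])
  }
  return images
}

// Usage sketch: read the saved file and inspect the result.
// const html = require('fs').readFileSync('tweet.html', 'utf8')
// console.log(findOgImages(html))
```

An empty array for a given fetch suggests the page content itself is missing the tag, which would point at the fetching step rather than at metascraper-image.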
@Kikobeats, I have got more links working - including the one above and some others that I was having issues with.
I've found that it now works when I turn off pre-rendering for Twitter and set a Googlebot user agent. I'd be happy to put forward a PR to add Twitter to the auto-domains file in your html-get package if that's something you would want; for now, I'm just going to fork it. Note, though, that it does not work without setting the user agent, which is something I don't think you'd want to set at the package level.
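The configuration described above might look roughly like this without forking. It is a sketch under assumptions: `htmlGetOverrides` is a hypothetical helper, the exact Googlebot user agent string is a guess since the thread doesn't state it, and it relies on html-get accepting `prerender` and `headers` options.

```javascript
// Example Googlebot user agent (an assumption; the thread doesn't say which
// exact string was used).
const GOOGLEBOT_USER_AGENT =
  'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'

// Hypothetical helper: per-URL overrides to merge into the html-get options.
const htmlGetOverrides = url => {
  const { hostname } = new URL(url)
  const isTwitter = hostname === 'twitter.com' || hostname.endsWith('.twitter.com')
  if (!isTwitter) return {}
  return {
    prerender: false, // fetch the HTML directly, skipping the headless browser
    headers: { 'user-agent': GOOGLEBOT_USER_AGENT }
  }
}

// Usage sketch with html-get (not executed here):
// const result = await getHTML(url, { getBrowserless, ...htmlGetOverrides(url) })
```

Keeping the override in application code leaves the user agent choice with the caller rather than baking it into the package.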
One thing to note about the metascraper-twitter package is that I had to place it below metascraper-image; otherwise, for some tweets where metascraper-image WAS correctly grabbing the image, it grabs a generic Twitter bird instead.
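The ordering effect described above follows from rules acting as fallbacks: for each property, the first rule that returns a value wins. Below is a deliberately simplified model of that behaviour, not metascraper's actual implementation; `resolveMetadata`, `imageRules` and `twitterRules` are hypothetical stand-ins for the real packages.

```javascript
// Simplified model: bundles are visited in the order given, and for each
// property the first rule that returns a non-empty value wins.
const resolveMetadata = (bundles, context) => {
  const result = {}
  for (const bundle of bundles) {
    for (const [prop, rules] of Object.entries(bundle)) {
      if (result[prop]) continue // an earlier bundle already resolved it
      for (const rule of rules) {
        const value = rule(context)
        if (value) {
          result[prop] = value
          break
        }
      }
    }
  }
  return result
}

// Stand-in for metascraper-image: uses the og:image value when present.
const imageRules = { image: [({ ogImage }) => ogImage] }
// Stand-in for a Twitter bundle that always finds a generic image.
const twitterRules = { image: [() => 'https://example.com/generic-bird.png'] }
```

With `[imageRules, twitterRules]` the tweet's own image is kept whenever it exists and the generic image is only a fallback; reversing the order makes the generic image win every time.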