metascraper [metascraper-twitter] Add specific connector

"metascraper": "^5.10.6", "metascraper-author": "^5.10.6", "metascraper-clearbit": "^5.10.6", "metascraper-date": "^5.10.6", "metascraper-description": "^5.10.6", "metascraper-image": "^5.10.6", "metascraper-logo": "^5.10.6", "metascraper-publisher": "^5.10.6", "metascraper-title": "^5.10.6", "metascraper-url": "^5.10.6", "metascraper-video": "^5.10.6", "metascraper-youtube": "^5.10.6",

Twitter URLs return NULL for all fields except logo

Example URL: https://twitter.com/realDonaldTrump/status/1222907250383245320

Expected behaviour

meta data returned

Actual behaviour

no data returned except for logo URL and publisher

twitter_meta_20100131

Jan 31 '20 13:01 Hereward

Twitter rewrote their client website using React + Webpack.

metascraper reveals that the new Twitter website has very poor HTML markup, which is why it can get almost nothing useful from it.

The solution: add a new metascraper-twitter package that specifies HTML markup rules for getting the right data.

It will be very similar to metascraper-soundcloud or metascraper-youtube.

Jan 31 '20 23:01 Kikobeats

OK thanks for the feedback, Twitter is a pain to work with and they now require you to answer an extensive set of questions simply to get a developer API. I am currently using their widgets.js code which does not integrate well with React, so it's slightly surprising to learn that their new web platform is react-based. Cheers :)

Feb 01 '20 06:02 Hereward

I hope to write the Twitter connector next week.

In the meantime, consider using metascraper-iframe, especially if you are consuming the data on the frontend; the iframe is the standard way providers offer to embed their content.

If you want a zero pain solution, consider using Microlink SDK 🙂

Feb 01 '20 09:02 Kikobeats

Looks like there is a way to force the old Twitter interface.

You need to set the following request user-agent:

Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko

I confirm it works!

https://api.microlink.io/?url=https://twitter.com/realDonaldTrump/status/1222907250383245320

Although I'm still interested in adding a specific Twitter package, this should work in the meantime 🙂
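As a minimal sketch of the workaround described above: the user-agent string is the one quoted in this comment, while the helper names `legacyRequestOptions` and `fetchLegacyHTML` are hypothetical, and the usage assumes Node.js 18+ where `fetch` is global. A later comment in this thread reports that the trick stopped working, so treat it as an illustration of how to override the header rather than a working recipe.

```javascript
// User-agent string quoted in the comment above.
const LEGACY_USER_AGENT =
  'Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko'

// Hypothetical helper: builds request options carrying the legacy user-agent.
// Any extra headers are kept, but the user-agent always wins.
const legacyRequestOptions = (extraHeaders = {}) => ({
  headers: { ...extraHeaders, 'user-agent': LEGACY_USER_AGENT }
})

// Hypothetical usage: fetch the raw HTML with the overridden user-agent.
// Assumes a global `fetch` (Node.js 18+); it is defined here but not called.
const fetchLegacyHTML = async url => {
  const response = await fetch(url, legacyRequestOptions())
  return response.text()
}
```

The resulting HTML string could then be passed to metascraper together with the URL.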

Feb 02 '20 14:02 Kikobeats

Thanks! I have now been approved for the Twitter API, so I think I will use that for now, but your solution looks viable too. I prefer to load the data on the server and store it in MongoDB, because that way I can preload the React component and avoid scroll issues caused by delayed content rendering, which is my main concern with Twitter's widgets.js client-side solution.

Feb 03 '20 04:02 Hereward

@Hereward In that case you may be interested in Microlink SDK, since you can load external data using setData.

Feb 03 '20 09:02 Kikobeats

@Kikobeats The user agent hack doesn't seem to work anymore. It looks like Twitter dropped support for that browser, and none of the data is in the returned HTML. Perhaps we need to use the API, like in metascraper-media-provider?

Jun 04 '20 13:06 JakeCoxon

@JakeCoxon you're right; there's a new workaround to access the old version, passing X-Requested-With:

https://github.com/taspinar/twitterscraper/issues/296#issuecomment-637637929

It isn't a great long-term option, though. Ideally we want to access the current version with no tradeoffs; we need to investigate how to land a better solution there.

Jun 04 '20 13:06 Kikobeats

@Kikobeats Thanks for maintaining this helpful library 🙌

Twitter seems to be one of the most important sites to be able to scrape previews from.

Would you be open to gauging the community's interest in sponsoring your development of the Twitter connector?

Perhaps noting that if the community can sponsor your time and effort for this feature at $n, we take it on?

I would be willing to contribute sponsorship towards this effort 👍

Jun 03 '22 16:06 watlandc

Hey, @watlandc

I'd prefer that someone take the initiative. My time is limited, and I will be happy to add new contributors to the project.

Note that Twitter detection is not as bad as when the issue was originally reported, but it could definitely be better:

https://api.microlink.io/?url=https://twitter.com/BytesAndHumans/status/1532772903523065858

Do you miss something specific? How are you using metascraper? What do you need?

🙂

Jun 03 '22 18:06 Kikobeats

@Kikobeats are you strictly using metascraper with the headers User-Agent and X-Requested-With to get the result shown in the image above?

I'm asking because the X-Requested-With workaround doesn't seem to work.

Jun 04 '22 02:06 cusxio

What's microscraper? This project is metascraper, and the hosted version is microlink.io 😛

I didn't use any custom header. It looks like you are experiencing issues with getting the content, which is outside metascraper's scope. Metascraper just applies rules over the content, so getting the content is a precondition. As long as the content is right, the metascraper output will be accurate.

In any case, I recommend you take a look at html-get!

Jun 04 '22 10:06 Kikobeats

Whoops, definitely meant metascraper; I must have mixed up microlinkhq and metascraper. My bad.

But yeah, using html-get definitely gets the right HTML! I will need to dig into why that's the case, i.e. what html-get is doing differently from a vanilla fetch.

Jun 04 '22 11:06 cusxio

Do you miss something specific?

As you noted here, we were having issues getting the content from Twitter. We ended up finding a work around to get it working again.

How are you using metascraper?

Simply getting link preview content for an app that helps you stay organized https://sprout.io.

What do you need?

Looks like we're good for now (we're getting the same result as the screenshot above).

Although, returning the media inside the tweet would be a huge bonus.

Thanks!

Jun 07 '22 17:06 watlandc

@watlandc

I wrote a full example using metascraper + html-get, targeting a tweet with media:

const createBrowserless = require('browserless')
const getHTML = require('html-get')

// Spawn Chromium process once
const browserlessFactory = createBrowserless()

// Kill the process when Node.js exits
process.on('exit', () => {
  console.log('closing resources!')
  browserlessFactory.close()
})

const getContent = async url => {
  // create a browser context inside Chromium process
  const browserContext = browserlessFactory.createContext()
  const getBrowserless = () => browserContext
  const result = await getHTML(url, { getBrowserless })
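  // close the browser context after it's used
  // (getBrowserless takes no callback, so resolve the context and destroy it directly)
  const browserless = await browserContext
  await browserless.destroyContext()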
  return result
}

const metascraper = require('metascraper')([
  require('metascraper-author')(),
  require('metascraper-date')(),
  require('metascraper-description')(),
  require('metascraper-image')(),
  require('metascraper-logo')(),
  require('metascraper-clearbit')(),
  require('metascraper-publisher')(),
  require('metascraper-title')(),
  require('metascraper-url')()
])

getContent('https://twitter.com/BytesAndHumans/status/1532772903523065858')
  .then(async ({ html, url }) => {
    const metadata = await metascraper({ html, url })
    console.log(metadata)
    process.exit()
  })
  .catch(error => {
    console.error(error)
    process.exit(1)
  })

output:

{
  author: null,
  date: '2022-06-07T21:42:24.000Z',
  description: '“What a week 🐣❤️📈”',
  image: 'https://pbs.twimg.com/media/FUWAUW7XoAAxuP_.jpg:large',
  logo: 'https://logo.clearbit.com/twitter.com',
  publisher: 'Twitter',
  title: 'Elena on Twitter',
  url: 'https://twitter.com/BytesAndHumans/status/1532772903523065858'
}

So it is possible. Is the problem still getting the content on your side?

Jun 07 '22 21:06 Kikobeats

In case you need to detect the author, then the tweet URL https://twitter.com/:username/status/:id can be considered a pattern for getting the username as author.
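A minimal sketch of that idea, using only Node's built-in `URL` class. The helper name `getTweetAuthor` is hypothetical, and both the accepted hostnames and the username character set (letters, digits and underscores, up to 15 characters) are assumptions rather than rules taken from metascraper.

```javascript
// Hypothetical helper: returns the username from a tweet URL shaped like
// https://twitter.com/:username/status/:id, or null when it doesn't match.
const getTweetAuthor = input => {
  let parsed
  try {
    parsed = new URL(input)
  } catch (_) {
    return null // not a valid absolute URL
  }

  // Assumption: treat www. and mobile. subdomains as the same site.
  const host = parsed.hostname.replace(/^(www|mobile)\./, '')
  if (host !== 'twitter.com') return null

  // Assumption: usernames are 1-15 characters of letters, digits or underscores.
  const match = parsed.pathname.match(
    /^\/([A-Za-z0-9_]{1,15})\/status\/\d+\/?$/
  )
  return match ? match[1] : null
}
```

For the tweet used in the example above, `getTweetAuthor` would return `'BytesAndHumans'`, which could fill in the `author` field that came back as `null`.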

It's still worth writing a specific Twitter package!

Jun 07 '22 22:06 Kikobeats

I wrote a full example using metascraper + html-get, targeting a tweet with media:

Very nice!

So it is possible. Is the problem still getting the content on your side?

We're able to get the content now.

Jun 10 '22 16:06 watlandc

@Kikobeats I was testing the above solution, and it works for many scenarios; however, I am still having issues getting the image for some URLs, such as https://twitter.com/Twitter/status/1483427748500717573

Checking the Microlink API, it appears to also have an issue getting the image from this domain.

$ curl -sL 'https://api.microlink.io?url=https://twitter.com/Twitter/status/1483427748500717573' | jq

{
  "status": "success",
  "data": {
    "title": "Tweet / Twitter",
    "description": "Don’t miss what’s happening",
    "lang": "en",
    "author": null,
    "publisher": "Twitter",
    "image": null,
    "date": "2022-06-30T13:00:05.000Z",
    "url": "https://twitter.com/twitter/status/1483427748500717573",
    "logo": {
      "url": "https://abs.twimg.com/responsive-web/client-web/icon-ios.b1fc7278.png",
      "type": "png",
      "size": 8582,
      "height": 1024,
      "width": 1024,
      "size_pretty": "8.58 kB"
    }
  },
  "statusCode": 200,
  "headers": {
    "cache-control": "no-cache, no-store, must-revalidate, pre-check=0, post-check=0",
    "content-encoding": "gzip",
    "content-security-policy": "connect-src 'self' blob: https://*.pscp.tv https://*.video.pscp.tv https://*.twimg.com https://api.twitter.com https://api-stream.twitter.com https://ads-api.twitter.com https://aa.twitter.com https://caps.twitter.com https://pay.twitter.com https://sentry.io https://ton.twitter.com https://twitter.com https://upload.twitter.com https://www.google-analytics.com https://accounts.google.com/gsi/status https://accounts.google.com/gsi/log https://app.link https://api2.branch.io https://bnc.lt wss://*.pscp.tv https://vmap.snappytv.com https://vmapstage.snappytv.com https://vmaprel.snappytv.com https://vmap.grabyo.com https://dhdsnappytv-vh.akamaihd.net https://pdhdsnappytv-vh.akamaihd.net https://mdhdsnappytv-vh.akamaihd.net https://mdhdsnappytv-vh.akamaihd.net https://mpdhdsnappytv-vh.akamaihd.net https://mmdhdsnappytv-vh.akamaihd.net https://mdhdsnappytv-vh.akamaihd.net https://mpdhdsnappytv-vh.akamaihd.net https://mmdhdsnappytv-vh.akamaihd.net https://dwo3ckksxlb0v.cloudfront.net ; default-src 'self'; form-action 'self' https://twitter.com https://*.twitter.com; font-src 'self' https://*.twimg.com; frame-src 'self' https://twitter.com https://mobile.twitter.com https://pay.twitter.com https://cards-frame.twitter.com https://accounts.google.com/ https://client-api.arkoselabs.com/ https://iframe.arkoselabs.com/  https://recaptcha.net/recaptcha/ https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/; img-src 'self' blob: data: https://*.cdn.twitter.com https://ton.twitter.com https://*.twimg.com https://analytics.twitter.com https://cm.g.doubleclick.net https://www.google-analytics.com https://www.periscope.tv https://www.pscp.tv https://media.riffsy.com https://*.giphy.com https://media.tenor.com https://c.tenor.com https://*.pscp.tv https://*.periscope.tv https://prod-periscope-profile.s3-us-west-2.amazonaws.com https://platform-lookaside.fbsbx.com https://scontent.xx.fbcdn.net https://scontent-sea1-1.xx.fbcdn.net 
https://*.googleusercontent.com https://imgix.revue.co; manifest-src 'self'; media-src 'self' blob: https://twitter.com https://*.twimg.com https://*.vine.co https://*.pscp.tv https://*.video.pscp.tv https://dhdsnappytv-vh.akamaihd.net https://pdhdsnappytv-vh.akamaihd.net https://mdhdsnappytv-vh.akamaihd.net https://mdhdsnappytv-vh.akamaihd.net https://mpdhdsnappytv-vh.akamaihd.net https://mmdhdsnappytv-vh.akamaihd.net https://mdhdsnappytv-vh.akamaihd.net https://mpdhdsnappytv-vh.akamaihd.net https://mmdhdsnappytv-vh.akamaihd.net https://dwo3ckksxlb0v.cloudfront.net; object-src 'none'; script-src 'self' 'unsafe-inline' https://*.twimg.com https://recaptcha.net/recaptcha/ https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/ https://client-api.arkoselabs.com/ https://www.google-analytics.com https://twitter.com https://app.link https://accounts.google.com/gsi/client https://appleid.cdn-apple.com/appleauth/static/jsapi/appleid/1/en_US/appleid.auth.js  'nonce-NDhlMTBkYWYtY2JjYy00OTlmLWFiZDUtOTgxYmMyODZiMWZi'; style-src 'self' 'unsafe-inline' https://accounts.google.com/gsi/style https://*.twimg.com; worker-src 'self' blob:; report-uri https://twitter.com/i/csp_report?a=O5RXE%3D%3D%3D&ro=false",
    "content-type": "text/html; charset=utf-8",
    "cross-origin-embedder-policy": "unsafe-none",
    "cross-origin-opener-policy": "same-origin-allow-popups",
    "date": "Thu, 30 Jun 2022 13:00:05 GMT",
    "expiry": "Tue, 31 Mar 1981 05:00:00 GMT",
    "last-modified": "Thu, 30 Jun 2022 13:00:05 GMT",
    "pragma": "no-cache",
    "server": "tsa_b",
    "set-cookie": "guest_id_marketing=v1%3A165659400581116843; Max-Age=63072000; Expires=Sat, 29 Jun 2024 13:00:05 GMT; Path=/; Domain=.twitter.com; Secure; SameSite=None\nguest_id_ads=v1%3A165659400581116843; Max-Age=63072000; Expires=Sat, 29 Jun 2024 13:00:05 GMT; Path=/; Domain=.twitter.com; Secure; SameSite=None\npersonalization_id=\"v1_8EK5HpiZcmdp/iJUPUtSKA==\"; Max-Age=63072000; Expires=Sat, 29 Jun 2024 13:00:05 GMT; Path=/; Domain=.twitter.com; Secure; SameSite=None\nguest_id=v1%3A165659400581116843; Max-Age=63072000; Expires=Sat, 29 Jun 2024 13:00:05 GMT; Path=/; Domain=.twitter.com; Secure; SameSite=None",
    "strict-transport-security": "max-age=631138519",
    "x-connection-hash": "f744deb62fd346d8e4ea08b01e182bbb049039ceaffeca5a3f8472612fa1bdc1",
    "x-content-type-options": "nosniff",
    "x-frame-options": "DENY",
    "x-powered-by": "Express",
    "x-response-time": "35",
    "x-xss-protection": "0"
  }
}

Looking at metascraper-image package, I'm confused as to why this line isn't picking up the og:image tag.

toImage(($: any) => $('meta[property="og:image"]').attr('content')),
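One way to check whether the fetched HTML contains the tag at all, before suspecting the rule, is to search the raw string. The helper below is a hypothetical debugging aid with no dependencies; it uses a regular expression as a shortcut and is not how metascraper-image itself reads the tag.

```javascript
// Hypothetical debugging helper: returns the content attribute of the first
// <meta property="og:image"> tag found in an HTML string, or null if absent.
const findOgImage = html => {
  // Collect every <meta ...> tag, then inspect each one's attributes.
  const tags = html.match(/<meta\b[^>]*>/gi) || []
  for (const tag of tags) {
    if (!/\bproperty\s*=\s*["']og:image["']/i.test(tag)) continue
    const content = tag.match(/\bcontent\s*=\s*(?:"([^"]*)"|'([^']*)')/i)
    if (content) return content[1] ?? content[2]
  }
  return null
}
```

If this returns `null` for the HTML that was actually fetched, the tag never reached metascraper, which would point at how the content is retrieved rather than at the selector.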

Jun 30 '22 13:06 bhayward93

Have there been any updates on this?

Dec 08 '22 14:12 bhayward93

Hello everyone,

The PR is prepared https://github.com/microlinkhq/metascraper/pull/608

I'd appreciate any feedback; otherwise it will be merged shortly 🙂

Dec 31 '22 13:12 Kikobeats

Hello everyone,

The PR is prepared #608

I'd appreciate any feedback; otherwise it will be merged shortly 🙂

Thanks for taking a look at it @Kikobeats - testing it out I'm noticing improvements for things like avatars on profiles, but am still having issues with many links such as the one above (https://twitter.com/Twitter/status/1483427748500717573/). Can you confirm that for you, this link is parsed correctly?

This also happens for some images. I saved the output HTML to a file to debug, and it appears that the page fails to load correctly the majority of the time: it is missing at least some essential metadata such as the og:image tags, though they do sometimes appear. Can you confirm that this is not an issue you can replicate? Regardless, I don't believe the issue lies directly with metascraper, but I wonder whether, in your opinion, it could relate to html-get or perhaps Browserless.

Jan 20 '23 15:01 bhayward93

@Kikobeats, I have got more links working - including the one above and some others that I was having issues with.

I've found that after turning off pre-rendering for Twitter and setting a googlebot user agent, it now works. I'd be happy to put forward a PR adding Twitter to the auto-domains file in your html-get package, if that's something you would want. For now, I'm just going to fork it. Note, though, that it does not work without setting the user agent, which I don't think you'd want to set at a package level.
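A rough sketch of what those two settings might look like as html-get options. The exact user-agent string is an assumption (the comment only says "googlebot"), the helper name `twitterHtmlGetOptions` is hypothetical, and the `prerender` and `headers` option names should be checked against the html-get documentation for the version in use.

```javascript
// Assumed user-agent string; the comment above only says "googlebot".
const GOOGLEBOT_USER_AGENT =
  'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'

// Hypothetical helper: options to pass to html-get for Twitter URLs.
// `getBrowserless` is the same function used in the earlier example.
const twitterHtmlGetOptions = getBrowserless => ({
  getBrowserless,
  prerender: false, // skip the headless browser and fetch the HTML directly
  headers: { 'user-agent': GOOGLEBOT_USER_AGENT }
})

// Usage, assuming `getHTML` is require('html-get'):
// const { html, url } = await getHTML(tweetUrl, twitterHtmlGetOptions(getBrowserless))
```

Keeping the user agent in the caller's options, rather than inside the package, matches the concern raised above about not setting it at a package level.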

One thing to note about the metascraper-twitter package is that I had to place it below metascraper-image; otherwise, for some tweets where metascraper-image was correctly grabbing the image, it grabs a generic Twitter bird instead.
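The ordering effect can be illustrated with a simplified model in which the first rule that returns a value wins. This is not metascraper's code: `resolveFirst`, `genericImage` and `twitterImage` are hypothetical stand-ins, and the bird URL is a placeholder.

```javascript
// Simplified model of first-match rule resolution (not metascraper's code):
// rules are tried in order and the first non-empty value is returned.
const resolveFirst = (rules, input) => {
  for (const rule of rules) {
    const value = rule(input)
    if (value !== null && value !== undefined && value !== '') return value
  }
  return null
}

// Hypothetical stand-ins for the image rules of the two packages.
const genericImage = ({ ogImage }) => ogImage || null
const twitterImage = () => 'https://example.com/twitter-bird.png' // placeholder

const page = {
  ogImage: 'https://pbs.twimg.com/media/FUWAUW7XoAAxuP_.jpg:large'
}

// Generic rule first: the tweet's own image is kept.
const imageFirst = resolveFirst([genericImage, twitterImage], page)

// Twitter rule first: its fallback shadows the image the generic rule found.
const twitterFirst = resolveFirst([twitterImage, genericImage], page)
```

In this model `imageFirst` is the tweet's media URL while `twitterFirst` is the placeholder, which mirrors the behaviour described above when metascraper-twitter is loaded before metascraper-image.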

Jan 20 '23 19:01 bhayward93
