
Retrieving contacts/social media accounts for a given URL

Open hbakhtiyor opened this issue 7 years ago • 4 comments

I built a quick version; the following list is not yet implemented:

  • Extract data from parsed structured data (see the sketch after this list), e.g.
<script type="application/ld+json">
{
    "@context": "http://schema.org",
    "@type": "Organization",
    "name": "Let's Validate",
    "url": "https://www.letsvalidate.com/",
    "logo": "https://www.letsvalidate.com/img/logo.png",
    "email": "[email protected]",
    "description": "Site launch checklist checker",
    "sameAs": [
        "https://www.facebook.com/letsvalidate",
        "https://m.me/letsvalidate",
        "https://twitter.com/letsvalidate",
        "https://plus.google.com/+letsvalidate",
        "https://vk.com/letsvalidate"
    ]
}
</script>
  • Extract data from parsed Twitter card metadata, e.g.
<meta name="twitter:site" content="@letsvalidate">
  • Prerender web apps before extracting data.
  • Crawl deeper, e.g. the contact page
  • Maybe use the Google Knowledge Graph too?
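
A minimal sketch of the first two extractions, assuming Python with BeautifulSoup (the function name is just illustrative):

import json
from bs4 import BeautifulSoup

def extract_socials(html):
    soup = BeautifulSoup(html, "html.parser")
    socials = []
    # JSON-LD: collect sameAs links from any structured-data block
    for tag in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(tag.string or "")
        except ValueError:
            continue
        if isinstance(data, dict):
            socials.extend(data.get("sameAs", []))
    # Twitter card: the site handle, if declared
    meta = soup.find("meta", attrs={"name": "twitter:site"})
    if meta and meta.get("content"):
        socials.append("https://twitter.com/" + meta["content"].lstrip("@"))
    return socials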

the endpoint:

https://api.letsvalidate.com/v1/contacts?url=docker.com&prettify=true

result:

{
  "url": "https://www.docker.com/",
  "originalUrl": "http://docker.com",
  "contacts": {
    "email": null,
    "fax": null,
    "tel": null,
    "socials": [
      {
        "domain": "twitter.com",
        "id": null,
        "name": "docker",
        "confidence": 100,
        "url": "http://twitter.com/docker"
      },
      {
        "domain": "youtube.com",
        "id": null,
        "name": "dockerrun",
        "confidence": 100,
        "url": "http://www.youtube.com/user/dockerrun"
      },
      {
        "domain": "facebook.com",
        "id": null,
        "name": "docker.run",
        "confidence": 100,
        "url": "https://www.facebook.com/docker.run"
      }
    ]
  }
}
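
Calling it with requests, for reference:

import requests

resp = requests.get(
    "https://api.letsvalidate.com/v1/contacts",
    params={"url": "docker.com", "prettify": "true"},
)
print(resp.json()["contacts"]["socials"])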

@jhabdas What do you think: is it worth implementing, or is such an API already available?

hbakhtiyor avatar May 30 '17 11:05 hbakhtiyor

Taking a look

ghost avatar May 30 '17 12:05 ghost

Here are some specific thoughts on the approach. Please keep in mind these are more of a knee-jerk reaction than anything, and contain some bias, as I like to build simple, easy-to-maintain apps which require little maintenance (so I can build other cool stuff).

First off: I'm not aware of an existing API that pulls this kind of data, but I'd be surprised if some don't already exist, made available as microservices which could be ingested for aggregation. That said, I don't see any harm in rolling your own, as it'll be easier to maintain that way and you won't have to rely on a third party which could fail and/or require maintenance.

Prerender web apps before extracting data.

For a first pass I'd skip prerendering unless you've already got an easy way to scrape (headless Chromium?) and focus on getting the structured-data parsing logic right. Some initial questions that come to mind: which of the structured-data types takes precedence when multiple are present, and which should win in the case of a tie, incomplete data, or data with a later associated date (if applicable).
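
One way to pin that down early is a simple precedence table; a hedged sketch (the ordering here is an assumption, not a recommendation):

PRECEDENCE = ["json-ld", "microdata", "rdfa"]  # assumed priority, highest first

def merge_by_precedence(extracted):
    # extracted: dict mapping syntax name -> dict of fields from that syntax
    merged = {}
    for syntax in reversed(PRECEDENCE):  # apply lowest priority first
        merged.update({k: v for k, v in extracted.get(syntax, {}).items() if v})
    return merged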

Maybe use the Google Knowledge Graph too?

I'm not familiar with this, but Google knows a lot. Still, it may be better to pull data from multiple sources to help ensure data independence and richness.
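
If you do go that route, Google exposes a Knowledge Graph Search API; an untested sketch (api_key is a placeholder):

import requests

def kg_lookup(query, api_key):
    resp = requests.get(
        "https://kgsearch.googleapis.com/v1/entities:search",
        params={"query": query, "key": api_key, "limit": 1},
    )
    return resp.json().get("itemListElement", [])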

Crawl deeper, e.g. the contact page

If you do this, probably just look at /about and /contact, or build a small list of likely paths. I'm not sure there's a semantic way to identify the location of this page. Not sure how web feeds (RSS/Atom) would help here either, but they may be useful in making determinations about site structure.
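
Something as naive as probing a short list of likely paths might be enough for a first pass (the path list is a guess):

import requests

CANDIDATE_PATHS = ["/contact", "/contact-us", "/about", "/about-us"]

def find_contact_pages(base_url):
    found = []
    for path in CANDIDATE_PATHS:
        resp = requests.get(base_url.rstrip("/") + path)
        if resp.status_code == 200:
            found.append(resp.url)
    return found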


The Jekyll SEO Tag gem has unit tests you could look at to see what it looks for when it produces its metadata. WordPress could be another place to look, since I believe most of the sites on the Web today are actually WordPress and not anything else.

If building this, I'd try to lean into specs as much as possible and return null for anything which doesn't conform to a chosen specification. For structured and social data those specs basically boil down to schema.org (three types of meta: JSON-LD, microdata, and RDFa), the Twitter developer docs, and http://ogp.me/.
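
For the schema.org side, something like the extruct library covers all three encodings (plus Open Graph) in one call; a rough sketch:

import extruct

def extract_structured(html, url):
    return extruct.extract(
        html,
        base_url=url,
        syntaxes=["json-ld", "microdata", "rdfa", "opengraph"],
    )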

While scraping, you may find some value in Portia to help define the implementation logic visually, so you don't end up pulling your hair out trying to get the scraping nailed down: https://github.com/scrapinghub/portia

EDIT: Sorry, since you're pulling from meta tags it's probably best to skip Portia and build the tests starting with https://github.com/scrapinghub/scrapy or similar, if that makes sense in the environment and toolset currently being used.

EDIT 2: Probably better not to use a fork of Scrapy. 😝 https://github.com/scrapy/scrapy
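
For what it's worth, the meta extraction above fits in a few lines of Scrapy (the spider and field names are hypothetical):

import scrapy

class ContactsSpider(scrapy.Spider):
    name = "contacts"
    start_urls = ["https://www.docker.com/"]

    def parse(self, response):
        yield {
            "twitter_site": response.css(
                'meta[name="twitter:site"]::attr(content)').get(),
            "json_ld": response.css(
                'script[type="application/ld+json"]::text').getall(),
        }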

Not sure if that's helpful. Just some thoughts.

ghost avatar May 30 '17 12:05 ghost

One more thing. IIRC https://scrapinghub.com has a list of existing services (somewhere) where people have already defined their own scrapers which collect data. You might be able to take the blue pill and just combine a few of these with some relatively simple heuristics for the API output, getting a level of fault tolerance not possible using a single third party.

EDIT: Scratch that. Terrible idea. But the existing scrapes may be extremely insightful to help build out the algo for the API.

ghost avatar May 30 '17 13:05 ghost

Wow, thanks a lot for your advice and for taking the time.

I'm using headless Chrome only for capturing screenshots; for JS rendering I'm considering https://github.com/scrapinghub/splash because it's lightweight.
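
Something like this against Splash's render.html endpoint, assuming an instance on localhost:8050:

import requests

resp = requests.get(
    "http://localhost:8050/render.html",
    params={"url": "https://www.docker.com/", "wait": 2},
)
html = resp.text  # rendered markup, ready for extraction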

How about the idea itself? Would anyone be interested in it?

hbakhtiyor avatar May 30 '17 16:05 hbakhtiyor