Categorize origins

Open rviscomi opened this issue 8 years ago • 9 comments

Group origins by category/vertical, for example news/travel/etc. This will enable category deep dives and comparisons.

DMOZ is no longer operational but a recent data dump is available. We should look for alternate sources.

Slightly related: https://github.com/HTTPArchive/httparchive/issues/75. Alexa is deprecating their top 1M ranking, so finding a rank+category solution would be a bonus.

rviscomi avatar Apr 11 '17 18:04 rviscomi

Rick - I listened to your recent talk at the Performance Meetup in NYC about categorizing URLs. I think this would be a powerful feature! Any thoughts on how this could be moved forward? I would volunteer to help with the effort. Hopefully more folks will feel the same way and we can proceed before too long.

gregorywolf avatar Sep 12 '17 16:09 gregorywolf

Hey Greg, thanks for volunteering! Assigning this to you :)

The next steps for this issue are:

  • [ ] survey the landscape of options: are there any other services similar to DMOZ that are regularly updated? What is the URL coverage, and how does it overlap with the Alexa 500K that we're using? Is there room for growth as we expand URL coverage? How do the options compare on category correctness, granularity, etc.?
  • [ ] plan and integrate the category info with HTTP Archive's data: what changes need to be made to the Dataflow pipeline and BigQuery schema?
  • [ ] analyze the new data and surface interesting reports on the beta site

rviscomi avatar Sep 12 '17 17:09 rviscomi

I was thinking about this the other day and didn't realize there was an issue open. During my searches the only thing I was able to find was archived DMOZ data. Here's the dump I found in case it's useful - https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/OMV93V

paulcalvano avatar Sep 12 '17 17:09 paulcalvano

Sadly, DMOZ is deprecated. I don't think we should hitch our wagon to this particular dataset.

igrigorik avatar Sep 12 '17 18:09 igrigorik

Yeah, the DMOZ dump could be used as a last resort, but it'd be preferable to find a service that's actively maintained.

Ilya also did some work on joining DMOZ data with Alexa URLs here: https://bigquery.cloud.google.com/table/httparchive:urls.20170315?tab=preview. Of the 1M URLs, only ~170K (17%) have topics/categories.
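For anyone evaluating an alternate category source, here is a minimal Python sketch of how a coverage figure like the ~17% above could be computed. It is not the query used for the `urls.20170315` table: the `host_of` and `category_coverage` helpers, the hostname normalization (lowercasing and dropping a leading `www.`), and the sample data are all assumptions made for illustration.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lowercased hostname of a URL, without a leading 'www.'.

    Hypothetical normalization; a real join may need different rules.
    """
    # Bare hosts such as "example.com" need a "//" prefix to parse as a netloc.
    host = urlsplit(url if "//" in url else "//" + url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def category_coverage(urls, categories):
    """Left-join crawled URLs against a {site: category} mapping.

    Returns (joined, coverage): joined maps each URL to its category or None,
    and coverage is the fraction of URLs that received a category.
    """
    lookup = {host_of(site): cat for site, cat in categories.items()}
    joined = {url: lookup.get(host_of(url)) for url in urls}
    covered = sum(1 for cat in joined.values() if cat is not None)
    return joined, (covered / len(joined) if joined else 0.0)


if __name__ == "__main__":
    # Made-up sample data, only to show the shape of the inputs.
    sample_urls = ["http://www.nytimes.com/", "https://example.org/", "http://expedia.com/"]
    sample_categories = {"nytimes.com": "News", "www.expedia.com": "Travel"}
    joined, coverage = category_coverage(sample_urls, sample_categories)
    print(joined)
    print(f"coverage: {coverage:.0%}")
```

Running the same function against each candidate source would give comparable coverage numbers for the survey step.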

rviscomi avatar Sep 12 '17 18:09 rviscomi

Ah, cool. I'll stop uploading that dataset to BigQuery then. Was about to do the same analysis :)

paulcalvano avatar Sep 12 '17 18:09 paulcalvano

Rick - I'll start poking around and see what I can find. Stay tuned.

gregorywolf avatar Sep 13 '17 13:09 gregorywolf

Hey @gregorywolf have you made any progress on this?

rviscomi avatar Nov 03 '17 19:11 rviscomi

Hi Rick -

Unfortunately I have not been able to spend any time on this yet. It seems that work keeps getting in my way :(

I am not sure of the urgency or your expectations. However, I would not be offended if you reassigned this to someone else and used me in a support role. If, on the other hand, the topic is not urgent, I do plan on working on this and moving it forward.

I have to apologize ahead of time since I have not been involved before and really do not know the ropes.

One suggestion: perhaps you and I could have a brief conversation and you could give me some pointers on getting things off the ground.

Let me know how you would like to proceed.

Thanks.

Greg Wolf (908) 578-8013

gregorywolf avatar Nov 03 '17 20:11 gregorywolf