httparchive.org
Categorize origins
Group origins by category/vertical, for example news/travel/etc. This will enable category deep dives and comparisons.
DMOZ is no longer operational but a recent data dump is available. We should look for alternate sources.
Slightly related: https://github.com/HTTPArchive/httparchive/issues/75. Alexa is deprecating their top 1M ranking, so finding a rank+category solution would be a bonus.
Rick - I listened to your recent talk in NYC at Performance Meet Up about the categorizing of URLs. I think this would be a powerful feature!! Any thoughts on how this could get moved forward? I would volunteer to help the effort. Hopefully more folks will feel the same way and we could proceed before too long.
Hey Greg, thanks for volunteering! Assigning this to you :)
The next steps for this issue are:
- [ ] survey the landscape of options: are there any other services similar to DMOZ that are regularly updated? what is the URL coverage and how does it overlap with the Alexa 500K that we're using? is there room for growth as we expand URL coverage? category correctness/granularity/etc...
- [ ] plan and integrate the category info with HTTP Archive's data: what changes need to be made to the Dataflow pipeline and BigQuery schema?
- [ ] analyze the new data and surface interesting reports on the beta site
I was thinking about this the other day and didn't realize there was an issue open. During my searches the only thing I was able to find was archived DMOZ data. Here's the dump I found in case it's useful - https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/OMV93V
Sadly, DMOZ is deprecated. I don't think we should hitch our wagon to this particular dataset.
Yeah the DMOZ dump could be used as a last resort but it'd be preferable to find a service that's actively maintained.
Ilya also did some work on joining DMOZ data with Alexa URLs here: https://bigquery.cloud.google.com/table/httparchive:urls.20170315?tab=preview. Of the 1M URLs, only ~170K (17%) have topics/categories.
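For anyone who wants to sanity-check coverage for another category source, here is a minimal sketch of the same kind of join done locally in Python. It is not the query behind the table above: the `hostname` and `category_coverage` helpers, the hostname-based matching rule, and the sample URLs and categories are all assumptions made up for illustration.

```python
from urllib.parse import urlsplit


def hostname(url: str) -> str:
    """Return the lowercase hostname of a URL, without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def category_coverage(urls, categories):
    """Join URLs to categories by hostname.

    Returns the matched {url: category} mapping and the fraction of
    URLs that received a category.
    """
    matched = {}
    for url in urls:
        category = categories.get(hostname(url))
        if category is not None:
            matched[url] = category
    coverage = len(matched) / len(urls) if urls else 0.0
    return matched, coverage


# Hypothetical sample data; a real run would load the crawl's URL list
# and a category dump instead.
urls = [
    "http://www.example.com/",
    "https://news.example.org/",
    "http://example.net/",
]
categories = {"example.com": "Shopping", "news.example.org": "News"}

matched, coverage = category_coverage(urls, categories)
print(f"{len(matched)} of {len(urls)} URLs categorized ({coverage:.0%})")
```

Matching on the exact hostname is the simplest rule; a source that lists only registrable domains would need a looser match, which would raise coverage at some cost in category correctness.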
Ah, cool. I'll stop uploading that dataset to bigquery then. Was about to do the same analysis :)
Rick - I'll start poking around and see what I can find. Stay tuned.
Hey @gregorywolf have you made any progress on this?
Hi Rick -
Unfortunately, I have not been able to spend any time on this yet. It seems that work keeps getting in my way :(
I am not sure of the urgency or your expectations. However, I would not be offended if you reassigned this to someone else and used me in a support role. If, on the other hand, the topic is not urgent, I do plan on working on this and moving it forward.
I have to apologize ahead of time since I have not been involved before and really do not know the ropes.
A suggestion is perhaps you and I could have a brief conversation and you could give me some pointers on getting things off the ground.
Let me know how you would like to proceed.
Thanks.
Greg Wolf (908) 578-8013
On November 3, 2017 at 3:03:27 PM, Rick Viscomi wrote:
> Hey @gregorywolf have you made any progress on this?