How big can we go

Open phillord opened this issue 10 years ago • 10 comments

I was wondering how many rules you would be willing to accept in an .htaccess?

The reason I ask is that we want to use w3id.org to redirect to a secondary service, which in turn redirects further. One way to speed the process up would be to put our forward rules directly into w3id, say once every year or so. We currently have around 10,000 URLs. I am guessing this is too many, but it would be good to know for sure.

phillord avatar May 11 '15 14:05 phillord

I'm not sure how well Apache handles such use cases. That sounds like a pain to maintain in any case. Do these URLs follow some regular pattern such that you could use a general rule with pattern matching and substitution? Also note that multiple levels of redirects are just going to slow everything down. If you have some example rules, perhaps we could help with a good solution.

davidlehn avatar May 11 '15 16:05 davidlehn

Nor I.

In the general case, we just redirect with a single rule and then redirect further onwards. Yes, it involves to and fro, but networks are fast these days, and I am not too worried about that.

The URLs we redirect to are not patternisable, I am afraid.

I'm just thinking through ideas here. I've never tried large-scale redirects and have no idea how well apache would handle it.

phillord avatar May 11 '15 17:05 phillord

Hi Phil,

I would suggest using another PURL server for this, either Callimachus [1] or PURLz [2]. It would be much easier to manage 10K URLs in a database instead of a text file.

Please note that Callimachus 1.5 will be released around 1 June 2015, and will be used in PURLz by the end of the summer. An Alpha of Callimachus 1.5 is available now, but the PURL UI isn’t all there yet.

Regards,

Dave

http://about.me/david_wood

[1] http://callimachusproject.org/
[2] http://purlz.org/

perma-id avatar May 12 '15 17:05 perma-id

I have a hunch that Apache would handle it just fine, but it would be good to get some data behind that theory.

msporny avatar May 12 '15 20:05 msporny

I think so also. 10,000 redirect rules is not that much in memory. And running a web server is easier than running Callimachus.

Ah, well, I will test it out and let you know.

phillord avatar May 12 '15 21:05 phillord

My assumptions:

  1. Apache will lazy-load redirect rules.
  2. After being loaded into memory, the added rules won't affect performance outside of the path they're a part of.

If 1 doesn't hold, then we're eating a couple of hundred KB more memory (which isn't terrible).

If 2 doesn't hold, then we're going to have to find another solution. If adding 10K redirect entries slows down site re-direct performance by 20ms or more per HTTP request (regardless of which path is being loaded), we'll have a problem on our hands. Also note the potential for a DDoS by just loading URLs in the directory w/ 10K entries (assuming performance isn't great).

In any case, worth looking into. Leaving it to you, @phillord, to give us some performance data proving that this isn't going to hurt site performance.

msporny avatar May 12 '15 21:05 msporny

To be clear, I also agree that Apache would handle this just fine.

Regards,

Dave

http://about.me/david_wood

prototypo avatar May 12 '15 21:05 prototypo

Okay, so I have done some poking around -- it looks like the solution is a RewriteMap, running over DBM, which should give constant time dispatch -- I guess that would scale to much larger than is ever likely to be needed.

Unfortunately, a RewriteMap cannot be defined from .htaccess; the directive has to go in the server configuration.
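
A minimal sketch of what such a setup might look like, for reference. The map name `redirects`, the file paths, and the `/example/` URL prefix are all hypothetical; only the `RewriteMap` directive, the `dbm:` map type, and the `httxt2dbm` tool come from Apache itself.

```apache
# Goes in the server or virtual-host configuration, not in .htaccess.
# The text source holds one "key target-URL" pair per line and would be
# compiled to DBM with something like:
#   httxt2dbm -i /etc/apache2/maps/redirects.txt -o /etc/apache2/maps/redirects.map
RewriteEngine On
RewriteMap redirects "dbm:/etc/apache2/maps/redirects.map"

# Only redirect when the captured key is present in the map;
# a missing key yields an empty string, so the condition fails.
RewriteCond "${redirects:$1}" "!^$"
RewriteRule "^/example/(.+)$" "${redirects:$1}" [R=302,L]
```

Each request then costs one hashed lookup rather than a scan over thousands of individual rules, which is where the constant time dispatch would come from.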

phillord avatar May 12 '15 21:05 phillord

@phillord I'm curious, can you share more about your use case and a few lines of example data?

Our use cases so far have been basic redirects and pattern replacement redirects. We originally went with a git+apache setup because it was quick and easy. The idea was that if the service needed to scale to more advanced use cases, we would migrate to a database system with accounts, a self-service management UI, an API, and so on. Apache has certainly been easier so far! I imagine if the DBM/httxt2dbm method is the ideal solution for this case, we could look at how to integrate some automatic compilation step from the txt file format and add the needed rules into the server config.

So far it's been easy enough for one of the admins to review pull requests. Reviewing a 10k line change is a bit different. :-) I'm not sure if that's an issue for this or not.

davidlehn avatar May 12 '15 22:05 davidlehn

Sure. We've written a tool called greycite (http://greycite.knowledgeblog.org) which is aimed at helping support publication of scientific material on the web.

It does a number of things. It scrapes metadata from URLs, and returns this in a series of forms, either through the website or through content negotiation. It also submits the URL to archive.org. And, finally, it gives a permanent identifier to that URL. Currently, we are using PURLs for this, but want to move to w3id. So, an academic can just publish on the web, and someone else gets to worry about archiving.

Now, of course, we can use a generic redirect, backed by greycite. So, for instance....

RewriteRule ^(.+)$ http://greycite.knowledgeblog.org/?uid=$1 [R=302,L]

This 302s to greycite. It's got a database backend which in turn 302s to the original resource, or to the archive.org version if the original has gone.

All well and good. But there are two questions. First, can we get rid of the double redirect? And, second, what is the exit strategy? Eventually, we will retire, or get a new job, or die. And greycite will fade away. So, the permalinks become not very permanent. Not that we are going to do this soon, but it's good to have a plan.

Now RewriteMaps are potentially a solution to both of these. For URLs which have gone 404 (or which we decide have died for other reasons), we just redirect permanently to archive.org. And, when greycite retires, we could close it to new entries and upload all the existing redirects to w3id. The advanced functionality of greycite would stop, but redirects would work for as long as w3id does.
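
As a rough illustration of that split, the rules might look something like the following. The map name `greycite_archived` and its file path are hypothetical, and the sketch assumes the map stores, for each dead uid, the archive.org URL to send it to.

```apache
# In the server config (RewriteMap cannot be declared in .htaccess):
#   RewriteMap greycite_archived "dbm:/etc/apache2/maps/greycite-archived.map"

# In the directory's .htaccess:
RewriteEngine On

# uid present in the map: redirect permanently to its stored archive.org URL.
RewriteCond "${greycite_archived:$1}" "!^$"
RewriteRule "^(.+)$" "${greycite_archived:$1}" [R=301,L]

# Anything else still goes through greycite, as in the generic rule above.
RewriteRule ^(.+)$ http://greycite.knowledgeblog.org/?uid=$1 [R=302,L]
```

Once greycite is retired, the final rule would be dropped and the map would hold every uid, so no request leaves w3id for a second hop.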

As I say, I am exploring possibilities and trying to ask the question, how would we preserve knowledge for the future, which seems to fit with w3id's intention.

phillord avatar May 13 '15 10:05 phillord