
Denial of Service via template cache and disk space

Open hannob opened this issue 6 years ago • 29 comments

It is possible to fill up the disk space available to an s9y installation. The only precondition is that URL rewriting is enabled.

s9y will generate a cache entry for every URL including 404 URLs in templates_c. So by simply calling many invalid paths on a host with an s9y installation one can fill up the disk space. The speed of this attack depends on how much blog content there is, as the 404 page contains the blog front page.

With a simple curl-based attack (code below) I can fill up around 3 megabytes per second. This attack is not optimized; by parallelizing and maybe using HTTP/2 one could probably do this much faster. But even this unoptimized version already allows filling up the disk with ~ 10 gigabytes per hour.

There should be some mechanism to avoid caching 404 pages (or alternatively they should all access the same cache entry).

Attack script:

#!/bin/bash

target="$1"

for i in $(seq 1 1000); do
	# reset the batch each round so the argument list stays bounded
	urls=""
	for j in $(seq 1 1000); do
		urls="$urls $target/$i-$j"
	done
	# $urls is deliberately unquoted so curl gets each URL as a separate argument
	curl -sk $urls > /dev/null
done

(I hereby allow redistributing this script licensed as CC0, in case anyone cares)

hannob avatar Sep 17 '19 13:09 hannob

I'm wondering if the cache is actually needed at all.

I have done some simple benchmarks (using a recursive "wget -l 1 -r --domain [host]") with an existing cache, with no existing cache (cleaning the templates_c dir beforehand) and with the cache disabled. I don't see any significant difference at all. It always takes somewhere between 9 and 10 seconds; if there is any speed advantage from the cache, it's hardly measurable.

This is certainly not the most sophisticated benchmark. But it makes me wonder whether the cache is based on outdated assumptions (after all, PHP has gained a lot of performance in recent years) and whether it's best to just get rid of it?

hannob avatar Nov 04 '19 12:11 hannob

Hi @hannob. I'm just realizing I'm not sure which cache you mean. There are actually at least two: the caching layer we added, simple_cache, and the PHP files Smarty produces. Thinking about it, you probably mean simple_cache?

Some time ago I tested how useful the caching layers s9y had back then were. https://www.onli-blogging.de/1476/Serendipity-mit-Cache-beschleunigen-Tester-gesucht.html is the writeup and links to the forum thread. In https://docs.google.com/spreadsheets/d/1eISXJ5SENPb8Vyz8KXOf_T4fbFkDOrmQBTL0NkZZMfs/edit#gid=0 I wrote down "siege -d10 -c100 -t1M" as the benchmark command; maybe you could test with that again? It was more about handling a peak of concurrent requests, not about a single visitor requesting many pages one after the other.

onli avatar Nov 04 '19 12:11 onli

I think I'm talking about the simple_cache, not the smarty cache (can that be disabled at all?).

I did some tests with siege, see below. As far as I can tell if there is any difference it's within the measuring uncertainty.

I'm aware that web performance testing can be tricky business, but for now it really looks to me like the difference is hardly measurable, if it exists at all. Maybe PHP has just gotten so fast in the meantime that caching is no longer useful?

Without cache from home Internet:

Transactions:		        3803 hits
Availability:		       98.86 %
Elapsed time:		       59.39 secs
Data transferred:	       14.81 MB
Response time:		        1.24 secs
Transaction rate:	       64.03 trans/sec
Throughput:		        0.25 MB/sec
Concurrency:		       79.64
Successful transactions:        3865
Failed transactions:	          44
Longest transaction:	       10.50
Shortest transaction:	        0.17

Transactions:		        1840 hits
Availability:		       61.13 %
Elapsed time:		       59.81 secs
Data transferred:	        8.47 MB
Response time:		        1.54 secs
Transaction rate:	       30.76 trans/sec
Throughput:		        0.14 MB/sec
Concurrency:		       47.22
Successful transactions:        1935
Failed transactions:	        1170
Longest transaction:	       13.77
Shortest transaction:	        0.16

With Cache from home Internet:

Transactions:		         712 hits
Availability:		       24.27 %
Elapsed time:		       45.22 secs
Data transferred:	        8.34 MB
Response time:		        1.05 secs
Transaction rate:	       15.75 trans/sec
Throughput:		        0.18 MB/sec
Concurrency:		       16.48
Successful transactions:         828
Failed transactions:	        2222
Longest transaction:	        9.55
Shortest transaction:	        0.15


Transactions:		        3816 hits
Availability:		       99.61 %
Elapsed time:		       59.84 secs
Data transferred:	       15.16 MB
Response time:		        1.35 secs
Transaction rate:	       63.77 trans/sec
Throughput:		        0.25 MB/sec
Concurrency:		       86.21
Successful transactions:        3851
Failed transactions:	          15
Longest transaction:	       13.76
Shortest transaction:	        0.17

From server, without cache:

Transactions:                  23951 hits
Availability:                 100.00 %
Elapsed time:                  59.11 secs
Data transferred:              86.08 MB
Response time:                  0.09 secs
Transaction rate:             405.19 trans/sec
Throughput:                     1.46 MB/sec
Concurrency:                   37.76
Successful transactions:       23962
Failed transactions:               0
Longest transaction:            1.28
Shortest transaction:           0.04

Transactions:                  24468 hits
Availability:                 100.00 %
Elapsed time:                  59.91 secs
Data transferred:              88.03 MB
Response time:                  0.09 secs
Transaction rate:             408.41 trans/sec
Throughput:                     1.47 MB/sec
Concurrency:                   38.56
Successful transactions:       24484
Failed transactions:               0
Longest transaction:            5.13
Shortest transaction:           0.04



From server, with cache:

Transactions:                  25054 hits
Availability:                 100.00 %
Elapsed time:                  59.73 secs
Data transferred:              89.89 MB
Response time:                  0.09 secs
Transaction rate:             419.45 trans/sec
Throughput:                     1.50 MB/sec
Concurrency:                   37.77
Successful transactions:       25069
Failed transactions:               0
Longest transaction:            5.14
Shortest transaction:           0.04

Transactions:                  23959 hits
Availability:                 100.00 %
Elapsed time:                  59.89 secs
Data transferred:              86.19 MB
Response time:                  0.10 secs
Transaction rate:             400.05 trans/sec
Throughput:                     1.44 MB/sec
Concurrency:                   39.12
Successful transactions:       23971
Failed transactions:               0
Longest transaction:            5.13
Shortest transaction:           0.04

hannob avatar Nov 10 '19 17:11 hannob

Caching is required for very good reasons. You could easily bog down a highly visited blog with no caching, to the point that either the server goes down or the blog gets suspended (if running on a shared server). Having a cache system means fewer database connections and less processing.

However, the simple cache implemented by Serendipity is not enough by itself for high-traffic websites. It should be used in conjunction with a true reverse-cache solution like Squid, Varnish, mod_cache, or Cloudflare. Every reverse-cache mechanism for Serendipity needs "simple cache" to work, since they require static HTML files with a proper Last-Modified header.

You shouldn't serve any HTTP errors with Serendipity. You can use the .htaccess file to set up a different page for HTTP errors like 403, 404, 500, etc., and skip s9y altogether for them.
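For the static-file case, that could look like the following .htaccess sketch using Apache's ErrorDocument directive (the /errors/ paths are placeholder assumptions, not something s9y ships):

```apache
# Serve static error pages directly, bypassing s9y for these errors.
# The /errors/*.html paths are placeholders; create those files yourself.
ErrorDocument 403 /errors/403.html
ErrorDocument 404 /errors/404.html
ErrorDocument 500 /errors/500.html
```

Note that this only covers requests Apache itself rejects; URLs that mod_rewrite passes on to s9y's index.php are still answered (and cached) by s9y.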

However, you should be aware that the behaviour you describe may not only apply to HTTP errors or DoS attacks, but also to tags, categories and any other type of content. I haven't checked out recent versions of Serendipity, but the last one I used did a poor job of validating existing or non-existing content. A blog could easily fall into an endless existent/non-existent URL loop just by having bad links in regular content. For example: add a link to a tag on every blog post but without the leading slash, like "tag/Food" instead of "/tag/Food". A spider like Google will then follow every URL created, to the point that it crawls hundreds of thousands of duplicate content pages like "/post/123/tag/Food", "/post/124/tag/Food", "/post/125/tag/Food"... and so on. So if you have 500 posts and 1,000 tags, you could potentially create 1,000 invalid tag URLs for each public blog post (that's 1,000 tags x 500 posts = 500,000 bad URLs spiders would attempt to crawl). Big numbers just to make a point.

Again... not sure if these problems have been fixed in recent versions of s9y, but if they haven't, they can be prevented with .htaccess entries forbidding such bad URL patterns in bulk.

And I would recommend keeping caching ON.

Important afterthought: I just remembered there are two kinds of HTTP errors one should be aware of with Serendipity: those for static files, which can easily be handled with .htaccess, and those for .php files and dynamic content, which s9y catches by default. In that case one could still use .htaccess to prevent a content loop by explicitly allowing known URL patterns and forbidding everything else (or by just forbidding known bad URL patterns). The cache system could also be modified not to cache any HTTP errors and to redirect to the proper error pages.

mdnava avatar Feb 21 '20 00:02 mdnava

@mdnava several notes:

To be clear: Disabling error page handling by serendipity does not resolve the DoS in a meaningful way. You can still fill up the cache by calling articles with a variety of urls, because usually s9y accepts the article ID plus anything as a valid URL. (i.e. you can call [blogurl]/5-[randomname].html while changing [randomname] constantly.)
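The same idea as the 404 script, sketched for valid article IDs (the host "blog.example", the article ID 5 and the title pattern are placeholder assumptions):

```shell
#!/bin/bash
# Hedged sketch of the variant described above: the article ID (5) stays
# fixed and valid, only the title part of the URL varies, so each request
# should produce a new cache entry. "blog.example" and the ID are placeholders.
base="https://blog.example"
urls=""
for i in $(seq 1 5); do
	urls="$urls $base/5-random-title-$i.html"
done
# hand the whole batch to curl in one call, e.g.:
#   curl -sk -o /dev/null $urls
echo "generated:$urls"
```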

Your remarks about caching: enabling or disabling the cache does not seem to change the caching headers at all. (The default headers sent by s9y all seem to imply that web caches should be disabled, which may be an issue that should also be addressed, but it is unrelated to the issue discussed here, as are any recommendations about other caching mechanisms.)

hannob avatar Feb 22 '20 08:02 hannob

@hannob

To be clear: Disabling error page handling by serendipity does not resolve the DoS in a meaningful way. You can still fill up the cache by calling articles with a variety of urls, because usually s9y accepts the article ID plus anything as a valid URL. (i.e. you can call [blogurl]/5-[randomname].html while changing [randomname] constantly.)

I'm just trying to help. If you read my response again you'll notice I warned that the problem is not limited to error pages, and not even limited to intentional DoS attacks. This is because s9y doesn't validate that the requested content actually exists before caching the response. It always returns "HTTP 200 OK" by default. This behaviour was hardcoded last time I checked (but can be modified). And I say "not limited" to DoS attacks because even a bad link structure could cause web crawlers to produce a massive amount of duplicate and/or non-existent content. That means, of course, a lot of unnecessary cached pages and a lot of unnecessary server requests, but it also means that the massive amount of duplicate or invalid content could cause SEO problems and search engine penalties.

That being said, the problem can be minimized with smart .htaccess rules, at least until there's a more permanent solution (which I wouldn't expect anytime soon, since the problem has been a core part of s9y since its conception). The idea is to use .htaccess to allow only known URL patterns like the one you described and forbid anything else. If you don't use tags, for instance, disallow those URLs. I can tell you from experience this prevents many unwanted errors and behaviours.

Your remarks about caching: enabling or disabling the cache does not seem to change the caching headers at all. (The default headers sent by s9y all seem to imply that web caches should be disabled, which may be an issue that should also be addressed, but it is unrelated to the issue discussed here, as are any recommendations about other caching mechanisms.)

My "remarks" about caching are not unrelated to the issue under discussion, since you yourself proposed the idea "to get rid of it?" (with a question mark). That particular remark was what prompted me to write a response. Caching is a necessary feature unless you plan to keep a very small website forever.

Now... about HTTP headers: that's a tricky subject. It can be very simple or very complex depending on the requirements, and things might have changed since the last time I used Serendipity. But as I recall, the caching feature did modify HTTP headers unless you enabled the option to "force clients to maintain a fresh copy"; in that case, all client web caches would be disabled as you mentioned (although the server-side cache would still reduce DB usage). One can modify the code to send customized headers if needed. The simple cache feature is not perfect, nor does it produce the perfect HTTP headers for client caching and/or reverse proxy caching, but I think it's still the only available solution. If there were an alternative for s9y we probably wouldn't even be having this discussion.

Post note: I do agree that these issues with caching, HTTP headers and URL rewriting should be addressed in the software itself (these are real BUGS). I just mean to point out that removing simple cache altogether without an alternative is not a smart solution; and true potential problems and DoS attacks can be minimized (and in most cases prevented) with .htaccess rules and/or code fixes.

mdnava avatar Feb 23 '20 07:02 mdnava

I am currently being hit by this (in 2.3.5, btw) every other day. It's kind of funny, since I haven't experienced "disk full" issues due to inode exhaustion in 20 years. However, two questions:

  • Shouldn't there be some job cleaning up the templates_c directory regularly? I currently have half a million files hanging around in the directory, written in the last 14 days. For how long does a cache entry make sense before it should be removed?
  • What would be the "smart .htaccess" that @mdnava keeps mentioning? Are there examples? Why are they not part of the distribution?

That being said, I'd love to have a short-term workaround for this, for example the information that it might be OK to zap files older than, say, five days from templates_c on a daily basis.

Greetings Marc

Zugschlus avatar Sep 25 '20 19:09 Zugschlus

You can simply delete everything in the templates_c directory, it will be regenerated when needed.

As I was unable to measure any performance benefit from the cache my recommendation for a short-term workaround is to disable the caching plugin.

hannob avatar Sep 25 '20 20:09 hannob

That is "Enable Caching" in Settings => Configuration => General Settings?

Zugschlus avatar Sep 27 '20 05:09 Zugschlus

@Zugschlus

  • Shouldn't there be some job cleaning up the templates_c directory regularly? I currently have half a million files hanging around in the directory, written in the last 14 days. For how long does a cache entry make sense before it should be removed?

Cache files are supposed to have an expiration date, and a cache system is supposed to remove expired files automatically. Last time I used s9y the cache was implemented with the "serendipity_event_cachesimple" plugin and it worked fairly well, but as I understand it they have now bundled a caching system. I'm not sure if it's the same caching system at the core or not, but the bug I've been talking about IS NOT CAUSED BY CACHING ITSELF, but by a lack of proper validation in s9y. For example, I can have a valid URL like "domain.com/posts/10567/"; but then all URLs like the following will also be valid, and identical, in a Serendipity blog:

"domain.com/posts/10567/sadas"
"domain.com/posts/10567/1"
"domain.com/posts/10567/2"
"domain.com/posts/10567/3"

This is a problem because anyone can link, maliciously or by mistake, to those bad URLs, and they will be cached internally and followed by search engines (thus creating an infinite loop of duplicate content). The key here is that Serendipity doesn't answer with a proper "404 NOT FOUND" when such invalid URLs are requested. It always returns status "200 OK". You can find these kinds of bad URLs by checking access logs or Google Webmaster Tools. But if you have something like 50 to 500 posts and your "templates_c" directory has half a million files, most likely you're having this very problem.

If the "templates_c" directory is not being properly cleaned up you can just remove it; it will be recreated instantly when the blog is accessed. You can even create a daily or weekly cronjob for that task. Not ideal, but it could prevent your "disk full" problem unless you're the victim of a DoS attack.
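The cronjob idea can be sketched like this (the path and the 5-day threshold are assumptions, not s9y defaults):

```shell
#!/bin/sh
# Hedged sketch: prune cache files last modified more than 5 days ago.
prune_cache() {
	find "$1" -type f -mtime +5 -delete
}

# Example crontab entry (placeholder path, adjust to the installation):
#   15 3 * * * find /var/www/blog/templates_c -type f -mtime +5 -delete
```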

  • What would be the "smart .htaccess" that @mdnava keeps mentioning? Are there examples? Why are they not part of the distribution?

The .htaccess rules I mentioned to prevent these errors SHOULD NOT be a part of the software. They're only a temporary measure. The real solution is for Serendipity to validate content URLs properly. If I request "domain.com/posts/10567/" it should return "200 OK" and show the content, but if I request "domain.com/posts/10567/sadas" or any other invalid URL it should return a proper "404 NOT FOUND" error. This would prevent invalid cache files and duplicate content on search engines.

As for an example .htaccess (please note this is a different URL scheme than the one used by default):

# Redirect URLs like "/posts/10567/sadas" to "/posts/10567/"
RewriteRule   ^posts/([0-9]+)/(.*)$   https://domain.com/posts/$1/ [R=301,L,QSA]

# Forbid URLs like "/posts/10567/sadas" (no redirection)
RedirectMatch gone /posts/([0-9]+)/(.*)/
RedirectMatch gone /posts/([0-9]+)/(.*)/(.*)/
RedirectMatch gone /posts/([0-9]+)/(.*)/(.*)/(.*)/

# Forbid invalid URLs (these are only examples of common Serendipity URL errors)
RedirectMatch gone ^(.*)www(.*)$
RedirectMatch gone ^(.*)function.(.*)$
RedirectMatch gone ^(.*)/index/(.*)$

The main idea of such entries in the .htaccess file is to forbid content we know shouldn't exist, while allowing valid content. Again, this task should be done at the core of the application, and this workaround shouldn't be the default solution.

On the other hand, the caching system should properly validate expiration dates and remove stale/expired content. As I mentioned before, I haven't tested the new cache system, so I'm not sure if it does the job.

@hannob

As I was unable to measure any performance benefit from the cache my recommendation for a short-term workaround is to disable the caching plugin.

If you guys have low-traffic websites you could consider disabling caching altogether, but I wouldn't recommend it. If your site is on an SSD drive you most likely won't notice any performance difference. However, caching is ESSENTIAL for any mid- to high-traffic website, or even if you plan on growing your traffic in the future. Without proper caching many bad things can happen: performance degrades the more visits you get, the blog can easily get your account suspended, or it could even take down an entire dedicated server. Without proper caching your blog will make tens (or even hundreds) of database connections/queries for each user's visit (the more plugins, the more queries), and every webhosting service has a hard limit for those. So at the very least you could be serving database errors instead of content without even knowing.

Also, if you plan to use a reverse cache software like Varnish, Nginx, Squid, Memcached or even Apache mod_cache; you NEED the internal s9y caching to be ON, because reverse proxies don't cache dynamic PHP content. The Serendipity internal caching creates static files with HTTP expiration headers that allow content to be properly cached.

Happy blogging!!

mdnava avatar Sep 27 '20 15:09 mdnava

Ok, now I am totally confused and at a total loss about what to do short of putting my blog on a dedicated file system so that it's at least only the blog going down when my inode table is full.

Zugschlus avatar Sep 27 '20 16:09 Zugschlus

@Zugschlus Easiest solution for now is to go into configuration and to disable the cache. It's under "Konfiguration -> Generelle Einstellungen", at the very bottom.

I'm currently looking into returning 404s when the title is not met (but configured as part of the permalink), but I'm not sure the internal structure of s9y allows that without breaking too much.

onli avatar Sep 27 '20 16:09 onli

Ok, now I am totally confused and at a total loss about what to do short of putting my blog on a dedicated file system so that it's at least only the blog going down when my inode table is full.

@Zugschlus If you remove the "templates_c" directory (it's created automatically) and add .htaccess rules to forbid invalid content you might just fix your problem, at least until there's a more permanent solution in the application. However, this problem has been around since the very beginning (over 15 years ago), so I wouldn't expect a fix anytime soon. Bear in mind that you will need a pretty good idea of which bad URLs are causing the problem, by checking access logs or Google Webmaster Tools.

Ask yourself: how many posts/categories/tags does the blog have? Does it make sense to have more cached files than the number of posts/categories/tags? (In short: no!)
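That sanity check can be scripted; a hedged sketch (the templates_c path in the usage comment is a placeholder):

```shell
#!/bin/sh
# Hedged helper: count the cache files so the number can be compared against
# posts + categories + tags. The path in the usage comment is a placeholder.
count_cache_files() {
	find "$1" -type f | wc -l
}

# Usage: count_cache_files /var/www/blog/templates_c
```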

@onli @Zugschlus Again, I would not recommend disabling caching. It's not really the problem. If you guys disable caching you will get a false sense of security. First, you might be serving database errors instead of content without even knowing. Second, disabling caching will not solve the duplicate content problem (if it exists). If there were half a million cache files in your "templates_c" directory, that could mean there's a duplicate content problem of nearly that size.

mdnava avatar Sep 27 '20 16:09 mdnava

@mdnava

Again, I would not recommend disabling caching. It's not really the problem.

It is the problem here. The file system breaks down because of too many entries -> the problem is the caching.

It just happens that limiting the output to a single valid URL would probably also help with that specific caching/filesystem problem.

If there were half a million cache files in your "templace_c" directory that could mean that there's a duplicate content problem of near that size.

Actually, not exactly. Have a look at the head of https://www.onli-blogging.de/1973/Deus-Ex-Das-2020-Review.html and https://www.onli-blogging.de/1973/Deus-Ex-Das-2020-Reviewwithlongertitle.html. Both of them are set to the same canonical URL. No proper search engine would categorize this as duplicate content.

First, you might be serving database errors instead of content without even knowing.

Unlikely. The moment s9y sees an id it serves the content. If it finds nothing it will serve a 404.

Does it makes sense to have more cached files than the number of posts/categories/tags (in short, no!)

Yes! You might have plugins active that lead to a larger number of valid permutations.

onli avatar Sep 27 '20 16:09 onli

@Zugschlus Easiest solution for now is to go into configuration and to disable the cache. It's under "Konfiguration -> Generelle Einstellungen", at the very bottom.

Done. Can I now clean out the templates_c/simple_cache directory? Shouldn't s9y do this by itself when disabling the cache?

If you remove the "templates_c" directory (it's created automatically) and add .htaccess rules to forbid invalid content you might just fix your problem, at least until there's a more permanent solution in the application. However, this problem has been around since the very beginning (over 15 years ago), so I wouldn't expect a fix anytime soon. Bear in mind that you will need a pretty good idea of which bad URLs are causing the problem, by checking access logs or Google Webmaster Tools.

I am just blogging; I don't have enough knowledge of s9y to write an .htaccess rule that fits my s9y configuration and will correctly forbid invalid content. That seems to be highly configuration-dependent.

  • Ask yourself: how many posts/categories/tags does the blog have? Does it make sense to have more cached files than the number of posts/categories/tags? (In short: no!)

I cannot judge that, since I do not know what is cached in the simple_cache directory. My blog has a little more than a thousand articles, about 20 categories and way too many tags (probably a couple of hundred). Nothing of this justifies half a million cache files.

I have been blogging pretty regularly from 2004 to 2011 and never saw this issue in the early versions of s9y. I revived the blog on refreshed infrastructure earlier this year (current OS, current PHP, PostgreSQL instead of MySQL), so we might be seeing an issue caused by the major changes in the database. Changing the database engine on a database that wasn't used and updated for years is probably asking for big trouble.

Again, I would not recommend disabling caching. It's not really the problem. If you guys disable caching you will get a false sense of security. First, you might be serving database errors instead of content without even knowing. Second, disabling caching will not solve the duplicate content problem (if it exists). If there were half a million cache files in your "templates_c" directory, that could mean there's a duplicate content problem of nearly that size.

Elaborate on the duplicate content problem.

Greetings Marc

Zugschlus avatar Sep 27 '20 16:09 Zugschlus

Done. Can I now clean out the templates_c/simple_cache directory? Shouldn't s9y do this by itself when disabling the cache?

Yes. What is needed will be regenerated. Hopefully a bit less of it (there are still other parts of s9y that will create files there).

onli avatar Sep 27 '20 17:09 onli

@mdnava I tried out a patch that would limit the issue. https://github.com/s9y/Serendipity/commit/09d670475a565e4f36b32cead550e63d731580cb, will only work if the configured entry permalink pattern includes %title% at the second position (as is the default). If you have the setup for that you could try it out!

Please do not test this in production (so not a solution for @Zugschlus).

onli avatar Sep 27 '20 17:09 onli

@mdnava I tried out a patch that would limit the issue (commit 09d670475a565e4f36b32cead550e63d731580cb). It will only work if the configured entry permalink pattern includes %title% at the second position (as is the default). If you have the setup for that you could try it out!

That should at least be made configurable (at least via serendipity_config_local.inc.php). It's a feature, not a bug, that links remain valid if you change the title of the post (i.e. fixing typos). I don't think s9y should error out with a 404 if the post ID is valid but the title is not.

If that's a problem with the cache, we should try to fix it there (even more so as the cache may be disabled, AFAICS even without really large performance hits).

(Sorry, still no time to get more involved in development again ...)


th-h avatar Sep 27 '20 17:09 th-h

Agree with @th-h - this is really important.

The point with URLs is that if you change your title (even if it's just a typo) you want your old links still to work. If you want to avoid duplicate URLs then you need to make sure old URLs get a redirect.

As for the cache, I think everyone's still missing the elephant in the room: does the cache even do anything useful? I think we can all agree that a cache is only helpful if it is faster than not having a cache (no matter if it's a low- or high-traffic site). I have provided some numbers indicating that it does not. Maybe my measuring method isn't good enough, but no one has provided other results yet.

hannob avatar Sep 27 '20 17:09 hannob

As for the cache, I think everyone's still missing the elephant in the room: does the cache even do anything useful? I think we can all agree that a cache is only helpful if it is faster than not having a cache (no matter if it's a low- or high-traffic site). I have provided some numbers indicating that it does not. Maybe my measuring method isn't good enough, but no one has provided other results yet.

I think that @mdnava's reasoning that turning off the cache will cause more load on the storage and database backend has a HUGE point, and it's quite obvious. It would be good to have some data points though.

Greetings Marc

Zugschlus avatar Sep 27 '20 17:09 Zugschlus

@onli @Zugschlus

It is the problem here. The file system breaks down because of too many entries -> the problem is the caching. It just happens that limiting the output to a single valid URL would probably also help with that specific caching/filesystem problem.

Let's assume there are 500 posts, 100 plugins, 100 categories and 1,000 tags... The numbers still don't add up. Half a million cache files means one of two things: either the cache system is in fact the cause of the problem by not removing expired content, or the application is not properly validating requests and returns everything that appears to have an ID in the URL as valid content. In the latter case the caching system does create those files, but it wouldn't be the root of the problem.

But I have to insist: the root of the problem is most likely not the cache, even if those half a million files are created by it. Caching is not (and should not be) in charge of validating whether content is valid (that would defeat its purpose and make it slower). You could have it disabled and you probably wouldn't notice there's a problem with the site structure you still have to deal with. Then you can blame the user for a bad link structure, but the truth is that the application should validate correct and incorrect URIs, not by searching for an ID within the post URL but by correctly parsing the full permalink. I don't mean to be disrespectful or anything, but perhaps consider checking out how WordPress validates permalinks.

If there were half a million cache files in your "templates_c" directory, that could mean there's a duplicate content problem of nearly that size.

Actually, not exactly. Have a look at the head of https://www.onli-blogging.de/1973/Deus-Ex-Das-2020-Review.html and https://www.onli-blogging.de/1973/Deus-Ex-Das-2020-Reviewwithlongertitle.html. Both of them are set to the same canonical URL. No proper search engine would categorize this as duplicate content.

It doesn't change the fact that half a million cached files could still be duplicate content, whether the "canonical" HTML header is set or not. And bear in mind that any user with this problem who doesn't have that header on every page will be heavily penalized by search engines. That's a real problem.

For Serendipity all these URLs are valid and the same:

https://www.onli-blogging.de/1973/Deus-Ex-Das-2020-Review.html
https://www.onli-blogging.de/1973/Deus-Ex-Das-2020-ReviewAA.html
https://www.onli-blogging.de/1973/Deus-Ex-Das-2020-ReviewA2.html
https://www.onli-blogging.de/1973/Deus-Ex-Das-2020-ReviewA3.html
https://www.onli-blogging.de/1973/Deus-Ex-Das-2020-ReviewA4.html
https://www.onli-blogging.de/1973/Deus-Ex-Das-2020-ReviewA5.html
https://www.onli-blogging.de/1973/Deus-Ex-Das-2020-ReviewA6.html
https://www.onli-blogging.de/1973/Deus-Ex-Das-2020-ReviewA7.html
https://www.onli-blogging.de/1973/Deus-Ex-Das-2020-ReviewA8.html
https://www.onli-blogging.de/1973/Deus-Ex-Das-2020-ReviewA9.html
https://www.onli-blogging.de/1973/Deus-Ex-Das-2020-ReviewAXXXXXXXXXXXXXXXX
https://www.onli-blogging.de/1973/AD-INFINITUM.html

Now, let's assume we've changed the blog URL scheme to something like "/posts/01/". Then a user, by their own mistake, posts a single link in one entry like "tag/Whatever" instead of "/tag/Whatever" (note that it's a user link, not a template-generated link). Assuming you have 10 categories, you will get these URLs created by Serendipity:

http://domain.com/posts/01/
http://domain.com/posts/01/tag/Whatever
http://domain.com/posts/01/tag/Whatever/posts/01/
http://domain.com/posts/01/tag/Whatever/posts/01/Whatever
http://domain.com/Categories/Cat1/tag/Whatever
http://domain.com/Categories/Cat1/tag/Whatever/posts/01/
http://domain.com/Categories/Cat2/tag/Whatever/posts/01/Whatever
http://domain.com/Categories/Cat3/tag/Whatever
http://domain.com/Categories/Cat3/tag/Whatever/posts/01/
http://domain.com/Categories/Cat3/tag/Whatever/posts/01/Whatever
http://domain.com/Categories/Cat4/tag/Whatever
http://domain.com/Categories/Cat4/tag/Whatever/posts/01/
http://domain.com/Categories/Cat4/tag/Whatever/posts/01/Whatever
http://domain.com/Categories/Cat5/tag/Whatever
http://domain.com/Categories/Cat5/tag/Whatever/posts/01/
http://domain.com/Categories/Cat5/tag/Whatever/posts/01/Whatever
http://domain.com/Categories/Cat6/tag/Whatever
http://domain.com/Categories/Cat6/tag/Whatever/posts/01/
http://domain.com/Categories/Cat6/tag/Whatever/posts/01/Whatever
http://domain.com/Categories/Cat7/tag/Whatever
http://domain.com/Categories/Cat7/tag/Whatever/posts/01/
http://domain.com/Categories/Cat7/tag/Whatever/posts/01/Whatever
http://domain.com/Categories/Cat8/tag/Whatever
http://domain.com/Categories/Cat8/tag/Whatever/posts/01/
http://domain.com/Categories/Cat8/tag/Whatever/posts/01/Whatever
http://domain.com/Categories/Cat9/tag/Whatever
http://domain.com/Categories/Cat9/tag/Whatever/posts/01/
http://domain.com/Categories/Cat9/tag/Whatever/posts/01/Whatever
http://domain.com/Categories/Cat10/tag/Whatever
http://domain.com/Categories/Cat10/tag/Whatever/posts/01/
http://domain.com/Categories/Cat10/tag/Whatever/posts/01/Whatever
[... ANOTHER URI AD INFINITUM NIGHTMARE ...]
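The mechanism behind this explosion is ordinary relative-URL resolution: a crawler resolves the broken link against the page it found it on, and since the resulting page serves the same content (broken link included), the path keeps growing instead of converging. A small demonstration with Python's standard `urljoin` (the domain and link are the hypothetical ones from the example above):

```python
from urllib.parse import urljoin

bad_link = "tag/Whatever"   # user-authored link missing the leading "/"

page = "http://domain.com/posts/01/"
trail = []
for _ in range(3):
    # Each hop resolves the relative link against the previous "page",
    # appending another "tag/Whatever" segment to the path.
    page = urljoin(page.rstrip("/") + "/", bad_link)
    trail.append(page)
# trail[-1] is now .../posts/01/tag/Whatever/tag/Whatever/tag/Whatever
```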

This can go on and on, up to half a million duplicate entries (or more if allowed). And whether you have caching enabled or not, and whether you have the "canonical" header set up or not, all those URLs will be followed by search engines unless you explicitly forbid them in .htaccess or with a "nofollow" directive. This all means an effectively infinite duplicate content problem, hundreds of thousands of needless database connections, higher memory usage and higher CPU usage. These are only examples I have seen; there are many ways in which the validation issue can cause all sorts of problems in a blog.
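One cheap, application-level guard against this kind of self-referential growth (a sketch under my own assumptions, not something s9y implements) is to reject request paths in which the same pair of adjacent segments repeats, before any routing or caching happens:

```python
def looks_runaway(path, max_repeats=1):
    """Reject paths where the same adjacent-segment pair occurs more than
    max_repeats times -- a heuristic for relative-link loops."""
    segs = [s for s in path.split("/") if s]
    pairs = list(zip(segs, segs[1:]))
    return any(pairs.count(p) > max_repeats for p in pairs)

assert not looks_runaway("/posts/01/tag/Whatever")
assert looks_runaway("/posts/01/tag/Whatever/tag/Whatever")
```

A heuristic like this would serve a flat 404 (or 301 to the front page) for runaway paths without touching the database or writing a cache file.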

> > First, you might be serving database errors instead of content without even knowing.
>
> Unlikely. The moment s9y sees an id it serves the content. If it finds nothing it will serve a 404.

This is very likely; I've experienced it. If you reach any limit on your service the user gets an error, and you won't know it unless you've set up some sort of notification (as I did long ago). If the database server is down for any reason the blog will simply not work and a database connection error will be shown. A mid- to high-traffic site without a cache could easily take down the database server, and even the rest of the server, if there's no protection like the limits implemented by CloudLinux.

Please note that I've been using Serendipity since 2006 on a fairly high-traffic website. The only way I could make it work was to put a reverse proxy (first Squid, then Varnish) in front of it, in conjunction with the caching plugin (with some modifications). Without caching, the blog could take down a server without limits protection in minutes: in my case the CPU load would go "red", over 60, and the machine would die, while caching kept the load below 1 (a huge improvement) with much faster load times. Then I implemented CloudLinux. The blog was no longer a problem for the server, but without caching the LVE limits would still block the site several times an hour.

> > Does it make sense to have more cached files than the number of posts/categories/tags? (In short: no!)
>
> Yes! You might have plugins active that lead to a larger permutation of valid entries.

Half a million? That's a bug, not a feature, my friend.
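For reference, the opening report already suggests the shape of a fix: either don't cache 404s at all, or make every alias and every 404 share one cache entry. A minimal sketch of the latter (Python for illustration; the ID-extraction regex and key scheme are assumptions, not s9y's actual cache code):

```python
import hashlib
import re

def cache_key(request_path, permalinks):
    """Key the template cache on the canonical permalink, not on the raw
    request URI, so alias spellings share one entry and all unknown
    paths share a single 404 entry."""
    m = re.search(r"/(\d+)/", request_path)
    post_id = int(m.group(1)) if m else None
    canonical = permalinks.get(post_id, "__shared_404__")
    return hashlib.sha1(canonical.encode()).hexdigest()

# Hypothetical permalink table for demonstration.
perms = {1973: "/1973/Deus-Ex-Das-2020-Review.html"}
```

With a scheme like this, the curl attack from the top of this issue (`$target/$i-$j`) writes at most one extra cache file, because every invalid path hashes to the same shared 404 key.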

mdnava avatar Sep 27 '20 17:09 mdnava