Cache folder grows large due to URLs coming from The Events Calendar empty dates
Before submitting an issue please check that you’ve completed the following steps: Yes - Made sure you’re on the latest version Yes - Used the search feature to ensure that the bug hasn’t been reported before
Describe the bug
The Events Calendar plugin allows browsing empty dates as URLs, for example:
https://example.com/calendar/2021-05-31
https://example.com/calendar/2019-03
https://example.com/calendar/2019-08-10
https://example.com/calendar/2016-12-10
etc
These URLs are added to the wpr_rocket_cache table.
If a bot visits these URLs, for example, we might end up with thousands of empty dates to preload. This will overload the database, and the cache path where the calendar is published (/calendar/ in this case) will keep growing.
The cache folder can grow very large, mostly with cached URLs that are of little use. This also increases the overall processing time for preload, used CSS generation, etc.
Excluding /calendar/20(.*) fixed the issue in a particular case.
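To make the effect of that exclusion concrete, here is a minimal sketch of how such a pattern could be checked against the example URLs above. The helper name `matches_exclusion` is hypothetical, and anchoring the pattern against the URL path is an assumption about how exclusions are applied, not WP Rocket's confirmed behavior.

```php
<?php
// Hypothetical helper, for illustration only.
// Assumption: each exclusion pattern is matched against the full URL path.
function matches_exclusion( string $url, array $patterns ): bool {
    $path = (string) parse_url( $url, PHP_URL_PATH );

    foreach ( $patterns as $pattern ) {
        // '#' is used as the delimiter because the patterns contain '/'.
        if ( 1 === preg_match( '#^' . $pattern . '$#', $path ) ) {
            return true;
        }
    }

    return false;
}

$patterns = [ '/calendar/20(.*)' ];

var_dump( matches_exclusion( 'https://example.com/calendar/2021-05-31', $patterns ) ); // true
var_dump( matches_exclusion( 'https://example.com/calendar/2019-03', $patterns ) );    // true
var_dump( matches_exclusion( 'https://example.com/calendar/', $patterns ) );           // false
var_dump( matches_exclusion( 'https://example.com/calendar/1999-01', $patterns ) );    // false
```

Note that the pattern only covers years starting with 20, so a date archive such as /calendar/1999-01 would still be preloaded under this sketch.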
To Reproduce
Steps to reproduce the behavior:
- Install The Events Calendar
- Enable Preload
- Start browsing random empty calendar dates
- See the cache folder and the wpr_rocket_cache table grow
Expected behavior
We should exclude the empty calendar dates from being preloaded.
Screenshots
https://i.imgur.com/2NdJoli.png
Additional context
Ticket: https://secure.helpscout.net/conversation/2036275504/374580?folderId=2683093
Slack thread: https://wp-media.slack.com/archives/C43T1AYMQ/p1666015442151769
Backlog Grooming (for WP Media dev team use only)
- [ ] Reproduce the problem
- [ ] Identify the root cause
- [ ] Scope a solution
- [ ] Estimate the effort
We'll need to create a 3rd-party compatibility here: find the URL structure for this kind of archive and exclude those URLs from the preload.
To be sure we're not excluding too much, we can target /calendar/20(.*)
Scope a solution
For this we will create a third-party class that detects whether the plugin is activated using the constant TRIBE_EVENTS_FILE.
If it is, the class will register a callback on the filter rocket_preload_exclude_urls with the following logic:
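// Callback for rocket_preload_exclude_urls: $excluded is the list of excluded URL patterns.
public function exclude_from_preload_calendars( $excluded ) {
    // Matches date archives such as /calendar/2021-05-31 or /calendar/2019-03.
    $excluded[] = '/calendar/20(.*)';
    return $excluded;
}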
Then we will have to create tests linked to it.
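As a rough illustration of the scoping notes above, the detection and the callback could be combined as follows. This is a sketch under assumptions: the class name `TheEventsCalendarSketch` is hypothetical, and returning a map of filter name to method name is only a guess at how the class would be wired up. Only the TRIBE_EVENTS_FILE constant, the rocket_preload_exclude_urls filter, and the callback body come from the notes.

```php
<?php
// Sketch only: hypothetical class, not WP Rocket's actual implementation.
class TheEventsCalendarSketch {
    /**
     * Returns the filters to hook into, or an empty array
     * when The Events Calendar is not active.
     */
    public static function get_subscribed_events(): array {
        // TRIBE_EVENTS_FILE is the constant named in the scoping notes.
        if ( ! defined( 'TRIBE_EVENTS_FILE' ) ) {
            return [];
        }

        return [
            'rocket_preload_exclude_urls' => 'exclude_from_preload_calendars',
        ];
    }

    /**
     * Appends the calendar date archive pattern to the preload exclusions.
     */
    public function exclude_from_preload_calendars( array $excluded ): array {
        $excluded[] = '/calendar/20(.*)';

        return $excluded;
    }
}
```

A test for this class would mainly need to cover two cases: nothing is registered when the constant is undefined, and the pattern is appended without dropping existing exclusions.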
Estimate the effort
Effort XS
During QA for 3.13.4, I found that the eventsSlug option we check to get the events slug isn't set until The Events Calendar plugin settings are saved from the UI:
https://github.com/wp-media/wp-rocket/blob/329278e5de7483f311784e65cb400926f7a9f1c2/inc/ThirdParty/Plugins/TheEventsCalendar.php#L38
So, instead of the correct slug, we are using the fallback "event", which prevents us from excluding the events' URLs from preloading.
According to @piotrbak, this is acceptable, and if there are reports of issues we can revisit this one.
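The fallback behavior described in this comment can be sketched roughly as follows. Both function names are hypothetical, the `$saved_options` array merely stands in for the plugin's stored settings, and the fallback value `event` is taken from the comment above rather than from the linked source.

```php
<?php
// Hypothetical stand-in for the option lookup; $saved_options mimics stored plugin settings.
function get_events_slug( array $saved_options, string $fallback ): string {
    $slug = $saved_options['eventsSlug'] ?? '';

    return '' !== $slug ? $slug : $fallback;
}

// Builds the preload exclusion pattern for a given events slug.
function build_calendar_exclusion( string $slug ): string {
    return '/' . trim( $slug, '/' ) . '/20(.*)';
}

// Settings never saved from the UI: the option is missing, so the fallback is used.
echo build_calendar_exclusion( get_events_slug( [], 'event' ) ), PHP_EOL;

// Settings saved with a custom slug: the pattern targets the real path.
echo build_calendar_exclusion( get_events_slug( [ 'eventsSlug' => 'calendar' ], 'event' ) ), PHP_EOL;
```

In the first case the generated pattern targets /event/ instead of the path the site actually uses, so the calendar URLs are not excluded.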
This issue is still happening: categories and tags are still being cached. Here is a new case: https://secure.helpscout.net/conversation/2483254973/469582?folderId=2952229 WP Rocket version: 3.15.7
Tribe Events has some functions and hooks we could maybe use?
https://docs.theeventscalendar.com/reference/functions/tribe_is_event https://docs.theeventscalendar.com/reference/functions/tribe_is_event_category/
We need to revisit this one to make sure that we're matching initial expectations
What are the missing expectations? The initial issue talks about the calendar type only, which is currently handled by the third party.
I currently have one site with a 2.35 GB cache and a second site with 16.5 GB because of this. Any update to the issue would be appreciated.
@Smexhy Can you elaborate a bit on the issue you are observing? The expected behavior from this GH issue is to exclude URLs like .../calendar/20... from preload, which, we believe, WP Rocket currently does. Do you see something different?
@piotrbak While the original issue targeted URLs like .../calendar/20..., it looks like some other URLs are being reported (see @mifrero's HelpScout conversation, and maybe @Smexhy's feedback). If we need to exclude more, this probably calls for a dedicated issue, or at least a newly defined expected behavior.
I had to be very quick with the solution since I was running out of space, but I noticed that the cache/wp-rocket/events folder contained folders with dates ranging from around 1900 to 2100. I documented it poorly, but it seems this issue is related to my problem: huge cache folders caused by preloading every date that can be browsed in the event calendar plugin. I temporarily disabled WP Rocket because of that. I'm not sure whether this was supposed to be fixed already, or whether the current fix is to exclude the events cache entirely in the plugin’s settings. I can let it run again and document it in more detail, but the core problem reported here is how I found this issue through a Google search in the first place.
I just had this issue a week ago (it turned out the client wasn't using The Events Calendar plugin anymore, so my final fix was disabling it), and found that the problem existed primarily in these site paths:
/events/category/{CATEGORY_NAME}/
where each month going back to the oldest event in the system had a folder (2024-06, 2024-05, etc)
and
/events/category/{CATEGORY_NAME}/day/
where each day going back to the oldest event in the system had a folder (2024-06-26, 2024-06-25, 2024-06-24, etc), regardless of whether there were actual events on those days or months. The client only had about 30 events over a period of roughly 2 years.
With multiple categories, this created thousands of pages with preload-generated cache files, eating up multiple GB of space.
Before disabling the calendar plugin, I ended up adding the following exclusion URLs to the Preload tab.
/events/category/(.*)
…and for good measure just in case…
/events/month/(.*)
/events/today/(.*)
With these excluded from preload, the pages are only cached when someone actually visits them, since it's the Preload process that generates all the unneeded pages, not normal caching.
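The same exclusions could presumably be added in code through the rocket_preload_exclude_urls filter mentioned earlier in this thread, instead of the Preload tab. The following is a minimal sketch: the function name is hypothetical, and the /events/ slug matches this particular site rather than every install.

```php
<?php
// Hypothetical callback; the patterns are the ones listed in the comment above.
function exclude_events_archives_from_preload( array $excluded ): array {
    $excluded[] = '/events/category/(.*)';
    $excluded[] = '/events/month/(.*)';
    $excluded[] = '/events/today/(.*)';

    // Drop duplicates in case a pattern was already added in the settings.
    return array_values( array_unique( $excluded ) );
}

// Only registers inside WordPress, where add_filter() is available.
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'rocket_preload_exclude_urls', 'exclude_events_archives_from_preload' );
}
```

As with the settings-based approach, this only keeps the URLs out of preload; pages visited by real users would still be cached normally.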
Hope this is useful.
I also had an /events/tags folder that was much bigger than I expected. I have a very small site with only a few events in the calendar, so I was very surprised that the folders were that big.
This problem is now much worse because of cloud-based Remove Unused CSS (RUCSS).
It does not just fill up the cache, it also DDoSes the site from the cloud, with requests like this:
https://<example.com>/events/month/2045-06/?nowprocket=1&no_optimize=1&wpr_imagedimensions=1" "WP-Rocket/SaaS Mozilla/5.0 (Linux; Android 13; Pixel 5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Mobile Safari/537.36"
with the accompanying JS/CSS all loaded too for optimization. The problem is, this is an empty link to an empty calendar date in 2045! The RUCSS spider is just inventing dates by following programmatic calendar links from the tribe events calendar.
This is the most popular calendar plugin, powering probably 90% of calendars on WordPress. It's a major problem (and a cloud resource expense) for RUCSS to be DDoSing sites with these calendar links. The requests come in fast enough to pin a 4-CPU server, which is expensive and confusing for end users who will never figure it out, and expensive and wasteful for WP Rocket's own cloud infrastructure.
So this has grown from a cache filling up the disk into sites DDoSing themselves.
Please find a way to exclude future events calendar URLs from RUCSS!
The only workarounds I've found are disabling RUCSS entirely site-wide or excluding the events pages from the cache, and neither is a good solution. Downloading the helper plugin to reduce the frequency doesn't solve the problem either. We don't need RUCSS running on imaginary, non-existent, empty event pages decades in the future!
If this should be a new issue I can open one, but it's maybe easily fixed here?
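One possible way to recognize such far-future date URLs is sketched below, purely as an illustration of the request above. The helper name, the 12-month horizon, and the assumption that the date is the last path segment are all hypothetical; nothing here is existing WP Rocket or The Events Calendar behavior.

```php
<?php
// Hypothetical helper: flags calendar paths whose date lies too far in the future.
// Assumption: the date (YYYY-MM or YYYY-MM-DD) is the last segment of the path.
function is_far_future_calendar_path( string $path, DateTimeImmutable $now, int $max_months_ahead = 12 ): bool {
    if ( 1 !== preg_match( '#/(\d{4})-(\d{2})(?:-\d{2})?/?$#', $path, $m ) ) {
        return false; // No date segment, so nothing to flag.
    }

    // Whole-month distance between the URL's date and the current month.
    $months_ahead = ( (int) $m[1] - (int) $now->format( 'Y' ) ) * 12
        + ( (int) $m[2] - (int) $now->format( 'n' ) );

    return $months_ahead > $max_months_ahead;
}

$now = new DateTimeImmutable( '2024-06-15' );

var_dump( is_far_future_calendar_path( '/events/month/2045-06/', $now ) ); // true
var_dump( is_far_future_calendar_path( '/events/month/2024-08/', $now ) ); // false
```

A check like this would still let past and near-future dates through, so it addresses only the invented future dates, not empty dates in general.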