Hundreds of errors in the TYPO3 log since updating PHP from 8.2 to 8.3
Bug Report
Current Behavior
Each day, hundreds of errors appear in the TYPO3 log since updating PHP from 8.2 to 8.3:
Core: Error handler (BE): PHP Warning: unserialize(): Error at offset 0 of 15 bytes in /html/typo3/typo3conf/ext/crawler/Classes/Converter/JsonCompatibilityConverter.php line 53
Expected behavior/output
No errors in the log.
Steps to reproduce
Not running the console command crawler:processQueue.
Environment
- Crawler version(s): 12.0.7
- TYPO3 version(s): 12.4.25
- PHP version(s): 8.3.15
- Is your TYPO3 installation set up with Composer (Composer Mode): no
Possible Solution
Additional context
Thanks. Will have a look at it.
PRs are welcome
@Typo3AndMore Could you do me a favor and try to empty your queue, then see if it still happens? Technically, it should never reach the unserializer, as only new records are JSON encoded.
Could this be the same "issue" / solution? https://github.com/tomasnorre/crawler/pull/1124#issuecomment-2633143714
@tomasnorre Thank you for your investigation. Two days ago, I emptied the table tx_crawler_queue. This afternoon, I again had over 200 items in the log with error level "warning" regarding errors at offset 0 of 15 bytes in the unserializer.
I checked the system setting for [SYS][exceptionalErrors], it has the value 4096, which corresponds to E_RECOVERABLE_ERROR; the [SYS][errorHandlerErrors] is set to 30466, which corresponds to E_WARNING | E_USER_ERROR | E_USER_WARNING | E_USER_NOTICE | E_RECOVERABLE_ERROR | E_DEPRECATED | E_USER_DEPRECATED
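For reference, a minimal plain-PHP sketch (added here for illustration, not part of the thread) confirming how the two reported values decompose into PHP's error-level constants:

```php
<?php
// Sanity check of the reported bitmasks, runnable with plain PHP.

// [SYS][exceptionalErrors] = 4096
var_dump(4096 === E_RECOVERABLE_ERROR);   // bool(true)

// [SYS][errorHandlerErrors] = 30466
$errorHandlerErrors = E_WARNING | E_USER_ERROR | E_USER_WARNING | E_USER_NOTICE
    | E_RECOVERABLE_ERROR | E_DEPRECATED | E_USER_DEPRECATED;
var_dump(30466 === $errorHandlerErrors);  // bool(true)

// E_WARNING (2) is only in errorHandlerErrors, not in exceptionalErrors, so in TYPO3
// the unserialize() warning is logged by the error handler rather than thrown as an exception.
var_dump((bool)(4096 & E_WARNING));       // bool(false)
```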
best regards
I don't understand why, given what is written in the comment of the file:
To ensure that older crawler entries, which have already been stored as serialized data still works, we have added this converter that can be used for the reading part. The writing part will be done in json from now on.
That doesn't sound like it's written in JSON, because then it should return on line 49: https://github.com/tomasnorre/crawler/blob/main/Classes/Converter/JsonCompatibilityConverter.php#L49.
Would it be possible for you to post a queue entry, without host, client or any GDPR-violating data, so that I can see what is written to the database in your setup?
I'm especially interested in the JSON string stored in the tx_crawler_queue.parameters field.
This might help me write a test case that can fix the problem.
Thanks in advance.
This is an example from the crawler devbox: https://3v4l.org/WKnKd
It shows that it writes JSON data to the database, and that it will return at the first return statement, as it's an array.
Your data must be somehow "invalid" or unexpected, so the data from the database would help me a lot. Thanks.
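To make the read path concrete, here is a minimal sketch of the JSON-first / serialize-fallback pattern described in the file comment; convertQueueValue() and all details are illustrative, not the extension's actual implementation:

```php
<?php
// Illustrative sketch only: try JSON first (new records), fall back to
// unserialize() for legacy records that were written with serialize().
function convertQueueValue(string $stored): array|bool
{
    $decoded = json_decode($stored, true);
    if (is_array($decoded)) {
        // New records are JSON encoded and returned here (the "first return").
        return $decoded;
    }

    // Legacy records: unserialize() without allowing objects.
    // If $stored is neither valid JSON nor a serialized string (e.g. malformed JSON),
    // this call emits the "unserialize(): Error at offset 0 of N bytes" warning seen in the log.
    $legacy = unserialize($stored, ['allowed_classes' => false]);

    return is_array($legacy) ? $legacy : false;
}
```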
While trying to reproduce this, I saw a similar warning in the TYPO3 frontend cache; https://forge.typo3.org/issues/106107 might be related.
I cannot reproduce this, so until there is further information on how to reproduce it, I'll mark this as on hold.
Hello Tomas,
since clearing the crawler queue, the error has occurred in the log 226 times every day, in the same time slot.
I have now checked the database table "tx_crawler_queue" and found that in the "result data" column there are entries with "{"content":""404 Not Found""}" and "{"content":""403 Forbidden""}" - exactly 226 times.
That seems to be the reason. I have no idea why these entries are in the database table.
I hope there is a way to ignore this data.
Thank you for your support and best regards.
That's not valid JSON, so that's why it's continuing to the unserializer. The question is how it is ending up in the DB. It looks like URLs that are not found, or where access is forbidden, are being crawled.
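A concrete illustration of why such a value slips past json_decode() and reaches unserialize() (plain PHP, using the string quoted above):

```php
<?php
// The value reported from the result data column, with its doubled quotes:
$stored = '{"content":""404 Not Found""}';

var_dump(json_decode($stored, true));               // NULL
var_dump(json_last_error() === JSON_ERROR_SYNTAX);  // bool(true)

// Falling back to unserialize() then emits
// "PHP Warning: unserialize(): Error at offset 0 of 29 bytes" and returns false.
var_dump(unserialize($stored, ['allowed_classes' => false]));  // bool(false)
```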
Do you have an FE user set in your crawler configuration? https://docs.typo3.org/p/tomasnorre/crawler/12.0/en-us/Configuration/ConfigurationRecords/Index.html
This might help with the 403 Forbidden at least.
Hello Tomas,
yes, we have 4 FE user groups and the "all" group.
There are 5 times the 404 error. These are caused by the URL "https://www.schuhmann.de/aktuelles-service/aktuelle-steuerinfos/newsdetails.html" which doesn't exist in this shortened form. It only exists when followed by the name of the news article.
The rest are the 403 errors because of the FE user groups.
Best regards
Then I don't think your crawler is configured correctly; there shouldn't be any 404s or 403s at all in the logs.
I know it's not the culprit here, but it's still a reason for many unneeded log entries.
Hello Tomas,
I analyzed the 403 entries in the database table "tx_crawler_queue". These entries are created in the following cases:
In the pagetree, there is a page "internal" with several subpages. Each subpage can belong to one or more FE user groups. For example:
- internal
  - page 1 (accessible to group A and B)
  - page 2 (accessible to group A and B)
  - page 3 (accessible to group A and B)
    - page 3.1 (same)
    - page 3.2 (same)
  - page 4 (accessible only to group A)
The configuration is set up as follows:
- for group A, the crawler runs with the FE user group A only
- for group B, the crawler runs with the FE user group B only
This results in the following behavior:
- with pages 1 to 3, including subpages, everything works fine
- when the crawler processes page 4 using the configuration for group A, everything works fine
- when the crawler processes page 4 using the configuration for group B, a 403 entry is created
Is there a way to prevent these 403 entries through configuration?
For my "public" group, I excluded the page "internal" and its subpages in the crawler configuration record. This significantly reduces the number of 403 entries. However, I can't do this with my other groups because it's hard to exclude every single relevant page manually. The content is too complex and it changes too often.
Thank you so much
Best regards
What does your configuration look like? Do you only make requests for a specific group with the correct user, or do you also make requests with non-authenticated users against some of these pages?
The crawler will mimic what you ask it to do.
So if you have selected all pages including subpages, but no access group is added, then some will throw a 403 as not authenticated.
So only call the pages that don't need authentication with non-authenticated users, or none at all. The pages that require authentication should be called with their respective authentication groups/users.
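A tiny model of that rule, purely illustrative (the $pageAccess mapping and simulateCrawl() are made up, not crawler code): a 403 row appears whenever a page is requested with an FE group that is not in the page's access list.

```php
<?php
// Hypothetical page/group mapping taken from the example above.
$pageAccess = [
    'page 1' => ['A', 'B'],
    'page 4' => ['A'],
];

// Returns the status the frontend would answer with for a crawl as the given group.
function simulateCrawl(array $pageAccess, string $page, string $crawlGroup): string
{
    return in_array($crawlGroup, $pageAccess[$page], true)
        ? '200 OK'
        : '403 Forbidden'; // this response body ends up in the queue's result data
}

echo simulateCrawl($pageAccess, 'page 4', 'A'), PHP_EOL; // 200 OK
echo simulateCrawl($pageAccess, 'page 4', 'B'), PHP_EOL; // 403 Forbidden
```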
The configuration is wrong:
I checked the system setting for [SYS][exceptionalErrors], it has the value 4096, which corresponds to E_RECOVERABLE_ERROR
This setting must include E_WARNING.
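In practice that would look roughly like the following excerpt; the file location is assumed for a non-Composer TYPO3 12 install, and error levels listed in exceptionalErrors are turned into exceptions by TYPO3 instead of being logged by the error handler:

```php
<?php
// Excerpt of typo3conf/system/settings.php (location assumed for a legacy install).
return [
    'SYS' => [
        // 4096 | 2 = 4098: keep E_RECOVERABLE_ERROR and add E_WARNING.
        'exceptionalErrors' => E_RECOVERABLE_ERROR | E_WARNING,
        // errorHandlerErrors left unchanged (30466).
    ],
];
```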
The configuration change is not necessary anymore with #1167.
This is closed according to this comment: https://github.com/tomasnorre/crawler/issues/1123#issuecomment-3248356030