
Hundreds of errors in the TYPO3 log since updating PHP from 8.2 to 8.3

Open Typo3AndMore opened this issue 9 months ago • 15 comments

Bug Report

Current Behavior: Each day, hundreds of errors appear in the TYPO3 log since updating PHP from 8.2 to 8.3:

Core: Error handler (BE): PHP Warning: unserialize(): Error at offset 0 of 15 bytes in /html/typo3/typo3conf/ext/crawler/Classes/Converter/JsonCompatibilityConverter.php line 53

Expected behavior/output: No errors in the log.

Steps to reproduce: Not running the console command crawler:processQueue.

Environment

  • Crawler version(s): 12.0.7
  • TYPO3 version(s): 12.4.25
  • PHP version(s): 8.3.15
  • Is your TYPO3 installation set up with Composer (Composer Mode): no

Possible Solution

Additional context

Typo3AndMore avatar Feb 03 '25 10:02 Typo3AndMore

Hi there, thank you for taking your time to create your first issue. Please give us a bit of time to review it.

github-actions[bot] avatar Feb 03 '25 10:02 github-actions[bot]

Thanks. Will have a look at it.

PRs are welcome

tomasnorre avatar Feb 03 '25 16:02 tomasnorre

@Typo3AndMore Could you do me a favor and try to empty your queue to see if it still happens? Technically it should never reach the unserializer, as only new records are JSON-encoded.

tomasnorre avatar Feb 03 '25 17:02 tomasnorre

Could this be the same "issue" / solution? https://github.com/tomasnorre/crawler/pull/1124#issuecomment-2633143714

tomasnorre avatar Feb 04 '25 18:02 tomasnorre

@tomasnorre Thank you for your investigation. Two days ago, I emptied the table tx_crawler_queue. This afternoon, I again had over 200 items in the log with error level "warning" regarding errors at offset 0 of 15 bytes in the unserializer.

I checked the system setting [SYS][exceptionalErrors]; it has the value 4096, which corresponds to E_RECOVERABLE_ERROR. [SYS][errorHandlerErrors] is set to 30466, which corresponds to E_WARNING | E_USER_ERROR | E_USER_WARNING | E_USER_NOTICE | E_RECOVERABLE_ERROR | E_DEPRECATED | E_USER_DEPRECATED.
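
For reference, that bitmask arithmetic checks out; a quick PHP sketch to verify it:

```php
<?php
// PHP's predefined error-level constants used in the mask:
// E_WARNING = 2, E_USER_ERROR = 256, E_USER_WARNING = 512,
// E_USER_NOTICE = 1024, E_RECOVERABLE_ERROR = 4096,
// E_DEPRECATED = 8192, E_USER_DEPRECATED = 16384
$mask = E_WARNING | E_USER_ERROR | E_USER_WARNING | E_USER_NOTICE
    | E_RECOVERABLE_ERROR | E_DEPRECATED | E_USER_DEPRECATED;

echo $mask, PHP_EOL;                 // 30466
var_dump((bool)($mask & E_WARNING)); // true: warnings reach the error handler
```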

best regards

Typo3AndMore avatar Feb 05 '25 15:02 Typo3AndMore

I don't understand why, because the comment in the file says:

> To ensure that older crawler entries, which have already been stored as serialized data still works, we have added this converter that can be used for the reading part. The writing part will be done in json from now on.

It doesn't sound like your data is written as JSON, because then it would return at line 49: https://github.com/tomasnorre/crawler/blob/main/Classes/Converter/JsonCompatibilityConverter.php#L49.
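
Roughly, the converter does the following (a simplified sketch based on the linked file, not its exact contents):

```php
<?php
// Simplified sketch of JsonCompatibilityConverter::convert():
// new queue entries are written as JSON; unserialize() is only a
// fallback for legacy, PHP-serialized entries.
function convert(string $dataString): array|bool
{
    $decoded = json_decode($dataString, true);
    if (is_array($decoded)) {
        return $decoded; // JSON-encoded entry: return early (line 49)
    }

    // Legacy path: this unserialize() call emits the
    // "Error at offset ..." PHP warning when the string is
    // neither valid JSON nor valid serialized data.
    $unserialized = unserialize($dataString, ['allowed_classes' => false]);

    return is_array($unserialized) ? $unserialized : false;
}
```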

Would it be possible for you to post a queue entry, without host, client, or any GDPR-violating data, so that I can see what is written to the database in your setup?

I'm especially interested in the JSON string stored in the tx_crawler_queue.parameters field.

This might help me write a test case that can fix the problem.

Thanks in advance.

tomasnorre avatar Feb 05 '25 21:02 tomasnorre

This is an example from the crawler devbox. https://3v4l.org/WKnKd

It shows that it writes JSON data to the database, and that the converter returns early, as the decoded value is an array at the first return.

Your data must be somehow "invalid" or unexpected, so the data from the database would help me a lot. Thanks.
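
A minimal sketch of that happy path (the payload keys below are hypothetical, purely for illustration):

```php
<?php
// A well-formed tx_crawler_queue.parameters value (hypothetical payload;
// real entries carry the crawler's own keys):
$raw = '{"url":"https://example.org/page","feUserGroups":[1,2]}';

// json_decode() yields an array, so the converter returns at the first
// return and unserialize() is never reached; no warning is logged.
$decoded = json_decode($raw, true);
var_dump(is_array($decoded)); // bool(true)
```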

tomasnorre avatar Feb 05 '25 21:02 tomasnorre

As I tried to reproduce this, I saw some similar warnings in the TYPO3 frontend cache; https://forge.typo3.org/issues/106107 might be related.

tomasnorre avatar Feb 07 '25 06:02 tomasnorre

I cannot reproduce this, so until there is further information on how to reproduce it, I'll mark this as on hold.

tomasnorre avatar Feb 11 '25 15:02 tomasnorre

Hello Tomas,

since clearing the crawler queue, the error occurs in the log 226 times every day, in the same time slot.

I have now checked the database table tx_crawler_queue and found that the result_data column contains entries with {"content":""404 Not Found""} and {"content":""403 Forbidden""}, exactly 226 times.

That seems to be the reason. I have no idea why these entries are in the database table.

I hope there is a way to ignore this data.

Thank you for your support and best regards.

Typo3AndMore avatar Feb 11 '25 16:02 Typo3AndMore

That's not valid JSON, which is why it continues to the unserializer. The question is how it ends up in the DB. It looks like URLs that return Not Found or Access Forbidden are being crawled.
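
To illustrate why such a value falls through (the doubled inner quotes make it invalid JSON):

```php
<?php
// The value found in tx_crawler_queue.result_data:
$raw = '{"content":""404 Not Found""}';

// json_decode() rejects it ...
var_dump(json_decode($raw, true)); // NULL
var_dump(json_last_error_msg());   // "Syntax error"

// ... so the converter falls through to unserialize(), which emits the
// "Error at offset 0" warning because the string is not serialized data.
var_dump(unserialize($raw, ['allowed_classes' => false])); // bool(false), plus the warning
```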

Do you have an FE user set in your crawler configuration? https://docs.typo3.org/p/tomasnorre/crawler/12.0/en-us/Configuration/ConfigurationRecords/Index.html

This might help with the 403 Forbidden at least.

tomasnorre avatar Feb 11 '25 16:02 tomasnorre

Hello Tomas,

Yes, we have 4 FE user groups and the "all" group.

The 404 error occurs 5 times. It is caused by the URL "https://www.schuhmann.de/aktuelles-service/aktuelle-steuerinfos/newsdetails.html", which doesn't exist in this shortened form; it only exists when followed by the name of a news article.

The rest are the 403 errors because of the FE user groups.

Best regards

Typo3AndMore avatar Feb 12 '25 13:02 Typo3AndMore

> Yes, we have 4 FE user groups and the "all" group.
>
> The 404 error occurs 5 times. It is caused by the URL "https://www.schuhmann.de/aktuelles-service/aktuelle-steuerinfos/newsdetails.html", which doesn't exist in this shortened form; it only exists when followed by the name of a news article.
>
> The rest are the 403 errors because of the FE user groups.

Then I don't think your crawler is configured correctly; there shouldn't be any 404 or 403 at all in the logs.

I know it's not the culprit here, but it's still a source of many unneeded log entries.

tomasnorre avatar Feb 12 '25 14:02 tomasnorre

Hello Tomas,

I analyzed the 403 entries in the database table "tx_crawler_queue". These entries are created in the following cases:

In the pagetree, there is a page "internal" with several subpages. Each subpage can belong to one or more FE user groups. For example:

  • internal
    • -> page 1 (accessible to group A and B)
    • -> page 2 (accessible to group A and B)
    • -> page 3 (accessible to group A and B)
      • -> page 3.1 (same)
      • -> page 3.2 (same)
    • -> page 4 (accessible only to group A)

The configuration is set up as follows:

  • for group A, the crawler runs with the FE user group A only
  • for group B, the crawler runs with the FE user group B only

This results in the following behavior:

  • with the pages 1 to 3 including subpages everything works fine
  • when the crawler processes page 4 using the configuration for group A, everything works fine
  • when the crawler processes page 4 using the configuration for group B, a 403-entry is created

How can these 403 entries be prevented through configuration?

For my "public" group, I excluded the page "internal" and its subpages in the crawler configuration record. This significantly reduces the number of 403 entries. However, I can't do this with my other groups because it's hard to exclude every single relevant page manually. The content is too complex and it changes too often.

Thank you so much

Best regards

Typo3AndMore avatar Mar 28 '25 09:03 Typo3AndMore

What does your configuration look like? Do you only make requests for a specific group with the correct user, or do you also make requests with unauthenticated users against protected pages?

The crawler will mimic what you ask it to do.

So if you have selected all pages including subpages, but added no access group, then some pages will throw a 403 as not authenticated.

So only call the pages that don't need authentication with non-authenticated users. The pages that require authentication should be called with their respective authentication groups/users.
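
A hedged sketch of that separation using the extension's Page TSconfig (the option names should be verified against the documentation linked above; the page IDs and group UIDs are hypothetical):

```
# Page TSconfig on the site root (hypothetical IDs and UIDs)

# Public pages: crawl without any frontend user group
tx_crawler.crawlerCfg.paramSets.public = &L=0
tx_crawler.crawlerCfg.paramSets.public {
    pidsOnly = 1,2,3
}

# Pages restricted to group A (uid 1): crawl them only as that group
tx_crawler.crawlerCfg.paramSets.groupA = &L=0
tx_crawler.crawlerCfg.paramSets.groupA {
    pidsOnly = 10,11,12,13
    userGroups = 1
}
```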

tomasnorre avatar Mar 31 '25 07:03 tomasnorre

The configuration is wrong:

> I checked the system setting [SYS][exceptionalErrors]; it has the value 4096, which corresponds to E_RECOVERABLE_ERROR.

This setting must include E_WARNING.
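
A minimal sketch of that change (assuming TYPO3 v12 in classic mode; the same value can also be set via Admin Tools > Settings):

```php
<?php
// Excerpt for typo3conf/system/settings.php, which holds the full
// configuration array; merge this line into your existing 'SYS' section.
return [
    'SYS' => [
        // 4096 | 2 = 4098: E_RECOVERABLE_ERROR plus E_WARNING, so TYPO3's
        // error handler turns the unserialize() warning into a catchable
        // exception instead of writing a log entry.
        'exceptionalErrors' => E_RECOVERABLE_ERROR | E_WARNING,
    ],
];
```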

cweiske avatar Aug 20 '25 12:08 cweiske

The configuration change is not necessary anymore with #1167.

cweiske avatar Sep 03 '25 09:09 cweiske

This is closed according to the comment: https://github.com/tomasnorre/crawler/issues/1123#issuecomment-3248356030

tomasnorre avatar Sep 03 '25 13:09 tomasnorre