crawler icon indicating copy to clipboard operation
crawler copied to clipboard

Indexing wasnt forced anymore since middleware order has changed

Open rengaw83 opened this issue 4 years ago • 23 comments

Indexing was'nt forced anymore.

TYPO3 9.5.25 Crawler: 9.2.2

The urls are crawled, but the indexing wasn't forced.

In have debugged a lot, and in my case the problem is the current middleware registration order. In #642 the order was changed, to fix issues with StaticFileCache extension. See https://github.com/AOEpeople/crawler/commit/0f7cb6abac4d8d6e228b0385665dc02f17cdfdcb#diff-38337bad08776b0fe94f73f1cf471f8d184ec3f9d2e9254f7c2a275b83501e34

The middleware order (without staticfilecache installed) now looks like this:

    typo3/cms-frontend/timetracker
    typo3/cms-core/normalized-params-attribute
    typo3/cms-frontend/preprocessing
    typo3/cms-frontend/eid
    typo3/cms-frontend/tsfe
    typo3/cms-frontend/authentication
    typo3/cms-frontend/site
    typo3/cms-frontend/preview-simulator
    aoe/crawler/authentication
    typo3/cms-frontend/base-redirect-resolver
    typo3/cms-frontend/static-route-resolver
    typo3/cms-frontend/page-resolver
    typo3/cms-frontend/maintenance-mode
    typo3/cms-frontend/page-argument-validator
    typo3/cms-frontend/prepare-tsfe-rendering
    typo3/cms-frontend/shortcut-and-mountpoint-redirect
    typo3/cms-frontend/content-length-headers
    aoe/crawler/initialization
    typo3/cms-frontend/output-compression

The aoe/crawler/initialization middleware is loaded after the typo3/cms-frontend/prepare-tsfe-rendering now. The crawler can't fill $GLOBALS['TSFE']->applicationData['tx_crawler'] before the TSFE checks the headerNoCache and so indexed_search cant disable cache output. so no page is generated and no content can be indexed.

Here typo3 9 checks if the page should be generated or the cache will be used https://github.com/TYPO3/TYPO3.CMS/blob/9.5/typo3/sysext/frontend/Classes/Controller/TypoScriptFrontendController.php#L2498-L2519

Here indexed search checks if the crawler is enabled and a indexing is in progress https://github.com/TYPO3/TYPO3.CMS/blob/9.5/typo3/sysext/indexed_search/Classes/Hook/TypoScriptFrontendHook.php#L25-L40

I have reverted https://github.com/AOEpeople/crawler/commit/0f7cb6abac4d8d6e228b0385665dc02f17cdfdcb and the indexing works like expected again.

The crawler/initialization can't be the last in middleware chain, it has to be before the tsfe rendering! I do not use the staticfilecache, so i can't fix issues with the staticfilecache and create a pr.

Maybe this is related to #679, but i have created a new issue, becouse i do not have staticfilecache like bh-teufels has.

rengaw83 avatar Mar 18 '21 08:03 rengaw83

Thanks for you detailed information, I'll see if I can find a solution for this. It sounds like it's related to the issue as you mentioned, and the information about the revert is also very helpful.

Will get back to you when I have more information on this.

tomasnorre avatar Mar 18 '21 14:03 tomasnorre

Have tested the same crawler setup in a TYPO3 10.4.14 Installation. Same problem there, no indexing is forced. After reverting the changes at the middleware order, indexing works like expected in TYPO3 10 too.

rengaw83 avatar Mar 18 '21 15:03 rengaw83

Super. That's really valuable feedback.

I'll need to dig more into Middleware, it's unfortunately something that I don't know well enough yet.

tomasnorre avatar Mar 18 '21 18:03 tomasnorre

Thank you, no problem, do not rush. I have corrected the crawler middleware order in my own extension, so for me there is no hurry.

rengaw83 avatar Mar 19 '21 09:03 rengaw83

I'm afraid we have a chicken/egg problem here.

The https://github.com/AOEpeople/crawler/commit/0f7cb6abac4d8d6e228b0385665dc02f17cdfdcb moved the aoe/crawler/initialization to after the typo3/cms-frontend/tsfe (close to the very end), and this was needed to have the Static File Cache Working.

So moving this back in that position will be a regression, that introduces the bug from #642 again.

Any hints on this would be welcome.

Do you @CDRO or @bmack have an idea on how to solve this?

tomasnorre avatar Apr 10 '21 20:04 tomasnorre

Could we split up the middlewares into one called initialization (as we have it now) and a pre-compression-initialization and extract the parts that are needed to where they need to be to both accomodate indexing (of course, this is the primary need) and static file cache (which in my opinion is secondary to the crawler and the indexing business but very important, nontheless)?

CDRO avatar Apr 12 '21 06:04 CDRO

I honestly don't know. I'm still missing some basics knowledge on the middleware concept, but in general I don't see a problem in splitting it if it would solve the problem. But still not sure if it will solve the problem or introduce a new one.

Can be because of conceptual missing information from my site.

tomasnorre avatar Apr 12 '21 06:04 tomasnorre

We could discuss this over slack tomorrow afternoon, if you have some time, we can still summarize our talking points here.

CDRO avatar Apr 12 '21 06:04 CDRO

Btw... it is possible to define restrictions in EXT:crawler for middlewares that do not exist in an installation (as far as I can remember): "Load after TSFE initialization, but before EXT:staticfilecache middleware XYZ") staticfilecache has very "lax" middleware orderings so we can push it in the right order (don't know from just reading this issue where EXT: staticfilecache should be loaded and where the crawler middleware should be loaded).

bmack avatar Apr 12 '21 07:04 bmack

Do you have a link to the documentation @bmack ?

tomasnorre avatar Apr 12 '21 08:04 tomasnorre

So, we discussed it with @tomasnorre and found what we need to change.

Tomas prepares an according PR.

What we did is split the initialization to have a part that ensures that the content is always rendered, before we return the result status to the crawler runner.

CDRO avatar Apr 13 '21 15:04 CDRO

@rengaw83 Could you please check if this PR fixes the problem for you? https://github.com/AOEpeople/crawler/pull/735

tomasnorre avatar Apr 13 '21 18:04 tomasnorre

hi @tomasnorre, i have testet the branch but it does not work.

The position of the initialization have not changed and is still in the wrong position. The initialization middleware has to be placed before the typo3/cms-frontend/prepare-tsfe-rendering.

The $GLOBALS['TSFE']->applicationData['tx_crawler'] stuff, which is set by the initialization middleware, has to be set before the prepare-tsfe-rendering middleware. Otherwise TYPO3 will provide cached content and indexed_search does not index cached content.

At the typo3/cms-frontend/prepare-tsfe-rendering middleware the nocache checks are performed and the indexed_search TSFE hook to disable the caching kicks in.

indexed_search needs the crawler initialisation to disable the cache for the request at this point.

Its the same problem like #610.

rengaw83 avatar Apr 15 '21 09:04 rengaw83

@rengaw83 Thanks for getting back to me. We thought splitting it into the ContentFinisher the problem would be solved. I'll go back to the drawing board.

Don't want to break anything in the StaticFileCache with a fix in the Crawler.

tomasnorre avatar Apr 15 '21 10:04 tomasnorre

I understand, it's a bit tricky with the different dependencies, i guess. I don't know what's the problem with staticfilecache, have never used it. But the thing is, currently the crawler does not work if the staticfilecache extension is not installed.

I have added a RequestMiddlewares.php in my project extension and take care of the correct middleware position by myself.
With these restrictions, everything works as expected:


return [
    'frontend' => [
        'aoe/crawler/initialization' => [
            'target' => \AOE\Crawler\Middleware\CrawlerInitialization::class,
            'after' => [
                'typo3/cms-frontend/tsfe',
            ],
            'before' => [
                'typo3/cms-frontend/prepare-tsfe-rendering',
            ],
        ],
    ],
];

rengaw83 avatar Apr 15 '21 18:04 rengaw83

Thanks will have a look at it.

tomasnorre avatar Apr 15 '21 19:04 tomasnorre

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Dec 21 '22 17:12 stale[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Feb 20 '23 02:02 stale[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Apr 22 '23 15:04 stale[bot]

Since commit 8a9b896721f479d7babc4bb5920536e8124a1797 (#837) the "real" response is returned by the CrawlerInitialization middleware. The information about the crawling/indexing process is transferred via a HTTP header now.

The reason for the patch 0f7cb6abac4d8d6e228b0385665dc02f17cdfdcb (#642) that broke the middleware order is gone, and we can restore the proper loading order now.

cweiske avatar May 25 '23 15:05 cweiske

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jul 25 '23 05:07 stale[bot]

Stalebot I hate you

Regards/Mit freundlichen Grüßen Christian Weiske

cweiske avatar Jul 26 '23 08:07 cweiske

I'm considering disabling it.

tomasnorre avatar Jul 27 '23 09:07 tomasnorre