joomla-cms icon indicating copy to clipboard operation
joomla-cms copied to clipboard

[4.0] Prevent finding no stemmer in finder when language is "*"

Open bembelimen opened this issue 4 years ago • 25 comments

Pull Request for Issue #30372 .

Summary of Changes

Try to find the correct language when "default star (*)" is given in finder stemmmer.

Testing Instructions

See: https://github.com/joomla/joomla-cms/issues/30372

Actual result BEFORE applying this Pull Request

Current language not found when "*"

Expected result AFTER applying this Pull Request

Current language found.

bembelimen avatar Dec 28 '20 04:12 bembelimen

@Hackwar could you please check here and give your feedback?

bembelimen avatar Dec 28 '20 11:12 bembelimen

There are loads of languages which have no stemmer, thus we can't assume a stemmer is present. If the language is *, we shouldn't even try finding a stemmer in the constructor of the Language class. Generally however we are handling this correctly from my perspective, since we are catching the exception correctly. If xdebug is too eager here, I don't really think this is our fault... Anyway, the correct and easiest solution here is to simply exclude * from getting a stemmer and that's it.

Hackwar avatar Dec 28 '20 12:12 Hackwar

To extend a bit: stemming is a process that takes processing power and I can see several situations where you don't want to have a stemmer at all. * is the equivalent of "I don't want a stemmer"

Hackwar avatar Dec 28 '20 13:12 Hackwar

It was not the intention to prevent the exception but to find a stemmer if possible.

But I have no feelings if this improvement should be merge or not.

bembelimen avatar Dec 28 '20 13:12 bembelimen

Anyway, the correct and easiest solution here is to simply exclude * from getting a stemmer and that's it.

@bembelimen Could you change this PR so it follows @Hackwar 's suggestion? To me it seems to be the better way, too.

richard67 avatar Feb 01 '21 13:02 richard67

That would mean that a default install will never use the stemmer even though the stemmer is available for english, wouldnt it?

brianteeman avatar Feb 01 '21 14:02 brianteeman

That would mean that a default install will never use the stemmer even though the stemmer is available for english, wouldnt it?

Hmm, that's true ... not what's desired.

richard67 avatar Feb 01 '21 14:02 richard67

That is intended behavior. The stemmer is resource intensive and I want people to make a conscious decision to enable this, also to select the right one/configure the search in its entirety for their system instead of simply enabling it and not looking into it further.

Hackwar avatar Feb 01 '21 16:02 Hackwar

That is intended behavior. The stemmer is resource intensive and I want people to make a conscious decision to enable this, also to select the right one/configure the search in its entirety for their system instead of simply enabling it and not looking into it further.

And where do they do that?

* is the equivalent of "I don't want a stemmer"

* is used to mean either required or everything - I have never seen it used to mean nothing

brianteeman avatar Feb 01 '21 16:02 brianteeman

That is intended behavior. The stemmer is resource intensive and I want people to make a conscious decision to enable this, also to select the right one/configure the search in its entirety for their system instead of simply enabling it and not looking into it further.

And where do they do that?

The option is in the component configuration.

* is the equivalent of "I don't want a stemmer"

* is used to mean either required or everything - I have never seen it used to mean nothing

That is why that isn't visible to the user. I did not decide that * is the internal identifier for items without a language. *means I don't know the language (unless set in the component configuration) and thus I wouldn't know which stemmer to use. Since the wrong stemmer would be worse than no stemmer, people should make a conscious decision what they need.

Hackwar avatar Feb 01 '21 22:02 Hackwar

That is intended behavior. The stemmer is resource intensive and I want people to make a conscious decision to enable this, also to select the right one/configure the search in its entirety for their system instead of simply enabling it and not looking into it further.

And where do they do that?

The option is in the component configuration.

I asked as I dont see anything regarding a stemmer in the component configuration for Joomla 4.

I can see it in Joomla 3 but not 4

image

brianteeman avatar Feb 01 '21 22:02 brianteeman

It is the "Default Language" option.

Hackwar avatar Feb 01 '21 23:02 Hackwar

Oh that is so obvious - not

How on earth is anyone to know then that by selecting a default language there is (as you claim) a performance hit.

That list is showing the languages on the site. Nothing to do with if those languages have a stemmer available

brianteeman avatar Feb 02 '21 00:02 brianteeman

And as the j3 field no longer exists you're breaking functionality on upgrade without any documentation

brianteeman avatar Feb 02 '21 00:02 brianteeman

Can we fix this with better text in the default language select?

Maybe

None -> None (Stemmer disabled) Default Site Languge (Stemmer enabled)

rdeutz avatar Mar 23 '21 20:03 rdeutz

From what @hackwar has said then no because the default site language might not have a stemmer available.

brianteeman avatar Mar 23 '21 20:03 brianteeman

Plus there is nothing in the option to indicate that it is anything to do with a stemmer at all

image

brianteeman avatar Mar 23 '21 20:03 brianteeman

Because it's not only the stemmer. This setting selects which language to use for indexing the content. That selection both influences how the text is tokenised as well as the stemmer to use. For example chinese texts don't have spaces between words, so we can't simply do a $words = explode(' ', $text); to split up the text into words. It also influences which list of stopwords to use and maybe in the future this could also select a synonym/thesaurus system. If it were just about the stemmer, it would be called "Select the stemmer".

Hackwar avatar Mar 23 '21 21:03 Hackwar

WOW - it's no wonder we get nowhere when you keep changing your mind on what information you will share about your code.

This is the first time in this entire thread that you have said it was about something more than a stemmer. Every comment you have made here has only referred to this as being a stemmer. @bembelimen obviously thought it was just about a stemmer as he wrote in the PR. You didnt correct him then. I thought it was about a stemmer and you didnt correct me then.

Only now three months later you wake up and start telling everyone its more than a stemmer.

As I said before and now even more re-enforced by your latest revelation this is all an undocumented backwards compatibility break that is being hidden by you from everyone.

brianteeman avatar Mar 23 '21 22:03 brianteeman

Since this feature was broken in J3, there is no b/c to be broken. Besides the fact that our b/c claims don't actually extend to components and a major version actually does allow us to break b/c.

I've described the functionality in the original PR here: #20391

I did not go into an elaborate rant about how the complete indexing system of Smart Search works, because I tend to not write 6-page-long essays to shift blame or try to make others feel stupid. I was asked to give feedback to this PR and concentrated on the context of this PR, which was the code to generate a stemmer object.

All the relevant code in the Joomla\Component\Finder\Administrator\Indexer\Language class has been there since Joomla 2.5, except it was spread out over several classes and actually did not work, especially for languages other than english. It didn't bother you until now.

Hackwar avatar Mar 23 '21 23:03 Hackwar

Breaking b/c is fine it just has to be documented

brianteeman avatar Mar 24 '21 10:03 brianteeman

So the missing point is that we need to document this somewhere, can @Hackwar or @bembelimen do that please, Thanks.

rdeutz avatar Apr 06 '21 18:04 rdeutz

I dont understand why you have removed the release blocker. Its not resolved

brianteeman avatar Apr 06 '21 19:04 brianteeman

This pull request has automatically rebased to 4.2-dev.

HLeithner avatar Jun 27 '22 13:06 HLeithner

This pull requests has been automatically converted to the PSR-12 coding standard.

joomla-bot avatar Jun 27 '22 21:06 joomla-bot

Closing because this change is not a solution for the initial problem. Reopened issue #30372

rdeutz avatar Oct 21 '22 11:10 rdeutz