php-readability

Unexpected title cleaning

Open Simounet opened this issue 3 years ago • 2 comments

Hi there, I don't understand why we strip the part of the title that comes before the : character. It could be legit content. https://github.com/j0k3r/php-readability/blob/9a490fac078b0f773c9848af1c6d76336a073a8d/src/Readability.php#L850

Simounet, Apr 11 '21 12:04

I agree it could be legit content, but I guess that in most cases the text before : is the website name.

It has been there since the beginning: https://bitbucket.org/fivefilters/php-readability/src/5112edb387b53931ab9324b890fa581c0e951d2d/Readability.php#lines-260

j0k3r, Apr 12 '21 08:04
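
To make the trade-off concrete, here is a minimal sketch of the kind of heuristic being discussed. It is written in Python for illustration only and is not the library's PHP code; the helper name `strip_before_colon` and the sample titles are hypothetical.

```python
def strip_before_colon(title: str) -> str:
    """Hypothetical helper: keep only the text after the last ': '."""
    head, sep, tail = title.rpartition(": ")
    # No ': ' in the title: rpartition returns ('', '', title), so keep it whole.
    if not sep:
        return title.strip()
    return tail.strip()


# Case the heuristic is meant for: the prefix is a site name.
site_prefixed = strip_before_colon("Example News: Council approves new budget")

# Case raised in this issue: the prefix is part of the headline and gets lost.
legit_prefix = strip_before_colon("PHP 8: what changed in the type system")

print(site_prefixed)
print(legit_prefix)
```

The second call shows the concern: a rule that only looks at the colon cannot tell a site name from meaningful content.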

I understand, but do you know many sites using this pattern? I don't. If we follow this rule, we should do the same with - and |. Sometimes the site name is at the beginning, sometimes at the end, so it is hard to tell. I think we should remove this condition, or at least make it possible to bypass it.

Simounet, Apr 12 '21 09:04
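
The bypass suggested above could look roughly like the sketch below. This is an illustration in Python, not a patch for php-readability: the function name `clean_title`, the `strip_site_name` flag, the separator list, and the "keep the longest segment" rule are all assumptions made for the example.

```python
# Assumed separator set, based on the characters named in this thread.
SEPARATORS = (" | ", " - ", ": ")


def clean_title(title: str, strip_site_name: bool = True) -> str:
    """Hypothetical title cleaner with an opt-out flag."""
    title = title.strip()
    if not strip_site_name:
        # Bypass requested: return the title untouched.
        return title
    for sep in SEPARATORS:
        if sep in title:
            parts = [part.strip() for part in title.split(sep)]
            # Assumption: the longest segment is the headline,
            # whether the site name sits at the start or the end.
            return max(parts, key=len)
    return title


print(clean_title("Example News | Council approves new budget"))
print(clean_title("Council approves new budget - Example News"))
print(clean_title("PHP 8: what changed", strip_site_name=False))
```

Even with the same treatment for all three separators the result is still a guess, which is why an explicit opt-out is useful for callers who prefer the raw title.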