
HTML galley indexing can't be disabled

Open jnugent opened this issue 4 years ago • 7 comments

Describe the bug In OJS's config file, external applications can be mapped to MIME types for search indexing, and these mappings can be commented out to disable indexing for, say, PDF files. However, HTML galley indexing cannot be disabled. On journals with very large search index tables, this means the second step of uploading a galley can take a long time, and the indexer also seems to have become more error-prone. We have a very math-heavy journal whose math markup seems to get parsed in odd ways since the move to 3.3, occasionally resulting in errors like this:

PHP message: PHP Fatal error: Uncaught PDOException: SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry '\xe2\x88\x91i\xe2\x88\x88' for key 'submission_search_keyword_text'
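For reference, the escaped bytes in that duplicate key decode to the UTF-8 string "∑i∈" (N-ARY SUMMATION, "i", ELEMENT OF), which suggests the indexer is storing fragments of math markup as keywords. A quick check (illustrative only, in Python; OJS itself is PHP):

```python
# The bytes from the duplicate-key error message, decoded as UTF-8.
raw = b'\xe2\x88\x91i\xe2\x88\x88'
decoded = raw.decode('utf-8')
print(decoded)                            # ∑i∈
print([hex(ord(ch)) for ch in decoded])   # ['0x2211', '0x69', '0x2208']
```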

And the only way that I could get this journal working again was to remove the call to:

$articleSearchIndex->submissionFilesChanged($submission);

In PKPManageFileApiHandler.inc.php in the saveMetadata method. This also greatly sped up the upload process.

Seen in OJS 3.3.0.7 and did not seem to be a problem in 3.2.1.

jnugent avatar Jul 20 '21 16:07 jnugent

@jnugent, the slow indexing problem will be solved with https://github.com/pkp/pkp-lib/issues/4622 (for release in 3.4).

I'm not sure what would have caused the duplicate entry problem, but it looks to me like a combination of the HTML not being encoded as expected by OJS (maybe not UTF-8?) in combination with a collation problem. Simplified, the keyword indexing stuff works like this:

$result = DB::select('SELECT keyword_id FROM keywords WHERE keyword = ?', [$keyword]);
if ($row = $result->next()) {
    // Found an existing keyword.
    $keywordId = $row->keyword_id;
} else {
    // Did not find an existing keyword -- insert one.
    DB::insert('INSERT INTO keywords (keyword) VALUES (?)', [$keyword]);
    $keywordId = DB::lastInsertId();
}

If you get a duplicate entry, it's because the SELECT did not return a result, but the INSERT caused a collision. I don't think this has changed since 3.2.x.
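(As an aside, one way to make a select-then-insert pattern like this robust against collisions is an insert-or-ignore upsert. A minimal sketch of the pattern using Python's stdlib sqlite3 — OJS runs PHP/Laravel against MySQL, so this is only an illustration of the idea, not the actual code:)

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE keywords (keyword_id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "keyword TEXT UNIQUE)"
)

def find_or_insert(keyword):
    # INSERT OR IGNORE never raises on a duplicate; the following SELECT
    # then returns the row regardless of who inserted it first.
    con.execute("INSERT OR IGNORE INTO keywords (keyword) VALUES (?)", (keyword,))
    return con.execute(
        "SELECT keyword_id FROM keywords WHERE keyword = ?", (keyword,)
    ).fetchone()[0]

print(find_or_insert("entropy") == find_or_insert("entropy"))  # True
```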

asmecher avatar Jul 20 '21 17:07 asmecher

Hi @asmecher thanks for the note re 3.4!

This database is entirely UTF8, and uses the utf8_general_ci collation for everything. The file in question is also utf8:

jason@MalHavoc-Linux:~/Downloads$ file art.html 
art.html: HTML document, UTF-8 Unicode text, with very long lines, with CRLF line terminators

The installation itself has been around for a long time, with 4,300 submissions and 14,000 submission files. This problem also cropped up after the upgrade, when I attempted to rebuild the search index: the index cleared fine, but the rebuild failed with the same constraint violation. Removing the call to submissionFilesChanged from the rebuild fixed the problem, because the submission metadata indexes just fine.

While we're talking about the search for this install, we're also seeing single characters stored in the index. These are not ASCII characters; they are UTF-8 characters representing Greek symbols in math formulae. But min_word_length is set to 3, so they should not be getting indexed at all, unless the string length calculation is incorrect in some way.
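(One hypothesis: if the minimum word length were checked in bytes rather than characters, multi-byte UTF-8 symbols would slip past it. In PHP, strlen() counts bytes while mb_strlen() counts characters. A Python illustration of the discrepancy:)

```python
word = "\u03b1\u2208"             # "α∈": Greek alpha + ELEMENT OF, 2 characters
print(len(word))                  # 2 characters
# alpha is 2 bytes in UTF-8, ELEMENT OF is 3 bytes:
print(len(word.encode('utf-8')))  # 5 bytes -> would pass a byte-based length check of 3
```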

This wasn't an issue with the 3.2.1 version of the installation, which ran from Sept 2019 until last week.

jnugent avatar Jul 20 '21 17:07 jnugent

There have only been syntactical changes to that code since stable-3_2_0, so I'm not sure what would be causing that (and I would also have expected to get other reports by now); could you dig a little further into what's happening during the keyword lookup? See SubmissionSearchDAO::insertKeyword.

asmecher avatar Jul 20 '21 17:07 asmecher

I will do that and report back. Coming back to the title of the issue, though, it'd be nice to be able to disable file indexing completely. In this particular case the search index ends up getting filled with weird nonsensical strings based on partially truncated math formulae and we end up with a search index table with 7 million records. The articles have great submission metadata so the file indexing is just problematic.

jnugent avatar Jul 20 '21 18:07 jnugent

I suspect the math characters are causing the indexing error here because MySQL's 'utf8' character set doesn't include 4-byte characters; utf8mb4 includes the full set. The indexer ignores the unsupported character, possibly producing a word that's already in the index, and when you try to insert that word, it gets rejected as a duplicate. I am working on a pull request to allow all file types the option of routing through the programs defined in config.inc.php to pre-process the data during indexing. Right now, text/html is forcibly routed into a built-in tag stripper; we'd like to pre-process our galley files to strip 4-byte characters and use a third-party program to strip the HTML, passing the result back to the indexer.
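(To illustrate the suspected failure mode — an assumption, sketched in Python rather than PHP: if characters outside the 3-byte range of MySQL's legacy 'utf8' charset are dropped before insertion, two distinct words can collapse into the same unique key:)

```python
def strip_non_bmp(s):
    # MySQL's legacy 'utf8' charset stores at most 3 bytes per character,
    # covering only the Basic Multilingual Plane (code points <= U+FFFF).
    return ''.join(ch for ch in s if ord(ch) <= 0xFFFF)

w1 = "x\U0001D6FCy"  # contains MATHEMATICAL ITALIC SMALL ALPHA (4 bytes in UTF-8)
w2 = "x\U0001D6FDy"  # contains MATHEMATICAL ITALIC SMALL BETA (also 4 bytes)
print(strip_non_bmp(w1), strip_non_bmp(w2))  # xy xy -> unique-key collision
```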

wopsononock avatar Dec 06 '22 21:12 wopsononock

Hi @wopsononock!

Not sure if you're still working on this, but I've included some fixes, including the ones which were discussed here, at this issue: https://github.com/pkp/pkp-lib/issues/8915.

I wasn't planning to backport it to 3.3.0, so I'll leave this issue open for now; let me know if you've got different ideas.

jonasraoni avatar Apr 13 '23 20:04 jonasraoni

@asmecher I've covered the issues here at https://github.com/pkp/pkp-lib/issues/8915. Should I backport it to stable-3_3_0? If not, then I think this can be closed as a won't fix.

jonasraoni avatar Mar 19 '24 14:03 jonasraoni