redmine_didyoumean icon indicating copy to clipboard operation
redmine_didyoumean copied to clipboard

Fuzzy Word Matching

Open ZaLiTHkA opened this issue 10 years ago • 46 comments

Could DYM be extended to use the idea of "approximate string matching" (Wikipedia reference)? Not everybody phrases their sentences the same way.. I honestly believe this plugin would be a million times more useful if words like "faulty" also returned a match for "fault", "saving" matched "saved" or "save", etc etc.

Unfortunately I'm not familiar enough with Ruby to do this myself, but a quick Google search shows many Ruby Gems that provide this type of functionality. So I'm guessing (read: hoping) this isn't an unreasonable request. :)

ZaLiTHkA avatar Jul 08 '14 09:07 ZaLiTHkA

Yes, that's something that we had considered in the past, but never implemented as we did not want to kill the underlying database with too many queries.

Can you please list a few of the Ruby Gems you had found?

abahgat avatar Jul 08 '14 21:07 abahgat

From my limited understanding, I can only imagine a DB hit when the front end gets an updated string to compare (I use the "as you type" method in my setup, which could actually skip key presses like shift, space etc.. but that's not the point of my request here). Do you think this idea would cause more hits?

One that keeps popping up in SO questions is amatch, but after a bit of digging it looks like either fuzzy_match or fuzzy-string-match might be faster.

Another interesting looking one is fuzzily, however the dev recommends using blurrily instead for large datasets.

From the lib weight point of view, it looks like amatch and fuzzily have less dependencies, but I'm not too sure how much difference that would actually make. Keep in mind I don't speak Ruby, so I'm running blind here. :)

ZaLiTHkA avatar Jul 09 '14 06:07 ZaLiTHkA

What about full text search with elasticsearch-rails I could work with sidekiq worker as an async indexer. Indexer could be triggered by hooks in Issue (added via patch). There is redmine_sidekiq plugin so you neeed only provide worker. What you say?

rlisowski avatar Aug 05 '14 19:08 rlisowski

Consider also thinkingsphinx which can fit better that elasticsearch to current features list (filter by project_id, issue_id, status)

rlisowski avatar Aug 05 '14 20:08 rlisowski

:+1:

rlisowski avatar Sep 16 '14 11:09 rlisowski

I just pulled the changes referenced above into my fork of the repo, but I still don't get any fuzzy searching.. When I migrate plugins, I do get a warning to say "Sphinx cannot be found on your system". Tried installing the gem manually (gem install sphinx from Redmine's htdocs root folder), which didn't give any errors, but it also didn't change anything, so I'm not sure how to get around this.

@korin, was that thumbs-up meant to imply that @swiatkiewicz's changes work? Or have you not had a chance to test them yet?

ZaLiTHkA avatar Sep 16 '14 16:09 ZaLiTHkA

It works, you need to install Sphinx. See ThinkingSphinx quickstart guide it's already available in most distros.

To be honest @swiatkiewicz changes allow replace sql like search with sphinx indexer which is faster option. It's only one step from fuzzy matching feature.

rlisowski avatar Sep 16 '14 17:09 rlisowski

Thanks for the link. Unfortunately I need to run my system in a Windows environment, so it looks like I've got some reading to do before I get that part working.

The SQL like search works perfectly and the additional configuration options are helpful, so this is already a nice improvement.

ZaLiTHkA avatar Sep 16 '14 18:09 ZaLiTHkA

I can't get this feature to work :( I have already started sphinx etc, then trying "test" and have results: Bug # 1 – Testowy błąd (Closed w projekcie Projekt testowy) Bug # 11 – test (New w projekcie Projekt testowy) Feature # 19 – Test (New w projekcie) Feature # 21 – testowe zadanie (New w projekcie Zadania w realizacji KD) Feature # 22 – Ticket testowo pokazowy (New w projekcie Zadania w realizacji KD) so I assume it should show something for "testy" (should strip "y" and match like "test"?)

dominch avatar Feb 04 '15 17:02 dominch

show us plugin settings /settings/plugin/redmine_didyoumean

rlisowski avatar Feb 04 '15 17:02 rlisowski

it's in polish but you know all settings. http://i.imgur.com/gvLDYVy.png any ideas?

dominch avatar Feb 04 '15 17:02 dominch

Should work, I have similar settings. Any errors in redmine log file? or in browser development console?

rlisowski avatar Feb 04 '15 18:02 rlisowski

You can also try rebuild thinking sphinx index with rake ts:rebuild.

rlisowski avatar Feb 04 '15 18:02 rlisowski

No errors so far noticed :( I'll try out this with new tickets, BTW: is ts:rebuild better than ts:index ?

dominch avatar Feb 04 '15 18:02 dominch

only small difference with configuration see http://pat.github.io/thinking-sphinx/rake_tasks.html

rlisowski avatar Feb 04 '15 18:02 rlisowski

Ok, so basically ts:rebuild is same as stop+index+start, great :) Still can't get this to work as expected. I tried mysql first, is there a chance it still use it? rake tasks are running ok without any errors, also settings should be ok, but still I can't get right results. Maybe it's something wrong with my sphinx in system? how can I test?

dominch avatar Feb 05 '15 14:02 dominch

Have you tried testing it in rails console? It's working like " Issue.search 'somethig' ", and then you should see info about search engine (sql or sphinx).

2015-02-05 15:57 GMT+01:00 dominch [email protected]:

Ok, so basically ts:rebuild is same as stop+index+start, great :) Still can't get this to work as expected. I tried mysql first, is there a chance it still use it? rake tasks are running ok without any errors, also settings should be ok, but still I can't get right results. Maybe it's something wrong with my sphinx in system? how can I test?

— Reply to this email directly or view it on GitHub https://github.com/abahgat/redmine_didyoumean/issues/56#issuecomment-73059199 .

swiatkiewicz avatar Feb 05 '15 16:02 swiatkiewicz

Trying now:

2.0.0-p594 :001 > Issue.search 'somethig'
  CustomField Load (0.6ms)  SELECT `custom_fields`.* FROM `custom_fields` WHERE `custom_fields`.`type` = 'IssueCustomField' AND `custom_fields`.`searchable` = 1
  Role Load (0.3ms)  SELECT `roles`.* FROM `roles` WHERE `roles`.`builtin` = 2 LIMIT 1
  GroupAnonymous Load (0.6ms)  SELECT `users`.* FROM `users` WHERE `users`.`type` IN ('GroupAnonymous') ORDER BY id LIMIT 1
  Member Load (0.4ms)  SELECT `members`.* FROM `members` INNER JOIN `projects` ON `projects`.`id` = `members`.`project_id` WHERE (projects.status  9) AND (members.user_id = 2 OR (projects.is_public = 1 AND members.user_id = 49))
   (15.0ms)  SELECT COUNT(DISTINCT `issues`.`id`) FROM `issues` LEFT OUTER JOIN `projects` ON `projects`.`id` = `issues`.`project_id` LEFT OUTER JOIN `journals` ON `journals`.`journalized_id` = `issues`.`id` AND (journals.private_notes = 0 OR (1=0)) AND `journals`.`journalized_type` = 'Issue' WHERE (((projects.status  9 AND projects.id IN (SELECT em.project_id FROM enabled_modules em WHERE em.name='issue_tracking')) AND ((projects.is_public = 1 AND ((issues.is_private = 0)))))) AND (((LOWER(subject) LIKE '%somethig%') OR (LOWER(issues.description) LIKE '%somethig%') OR (LOWER(journals.notes) LIKE '%somethig%') OR issues.id IN (SELECT cfs.customized_id FROM custom_values cfs WHERE cfs.customized_type='Issue' AND cfs.customized_id=issues.id AND LOWER(cfs.value) LIKE '%somethig%' AND cfs.custom_field_id IN (2,4) AND ((1=1) AND (issues.tracker_id IN (SELECT tracker_id FROM custom_fields_trackers WHERE custom_field_id = cfs.custom_field_id)) AND (EXISTS (SELECT 1 FROM custom_fields ifa WHERE ifa.is_for_all = 1 AND ifa.id = cfs.custom_field_id) OR issues.project_id IN (SELECT project_id FROM custom_fields_projects WHERE custom_field_id = cfs.custom_field_id))))))
  SQL (23.8ms)  SELECT `issues`.`id` AS t0_r0, `issues`.`tracker_id` AS t0_r1, `issues`.`project_id` AS t0_r2, `issues`.`subject` AS t0_r3, `issues`.`description` AS t0_r4, `issues`.`due_date` AS t0_r5, `issues`.`category_id` AS t0_r6, `issues`.`status_id` AS t0_r7, `issues`.`assigned_to_id` AS t0_r8, `issues`.`priority_id` AS t0_r9, `issues`.`fixed_version_id` AS t0_r10, `issues`.`author_id` AS t0_r11, `issues`.`lock_version` AS t0_r12, `issues`.`created_on` AS t0_r13, `issues`.`updated_on` AS t0_r14, `issues`.`start_date` AS t0_r15, `issues`.`done_ratio` AS t0_r16, `issues`.`estimated_hours` AS t0_r17, `issues`.`parent_id` AS t0_r18, `issues`.`root_id` AS t0_r19, `issues`.`lft` AS t0_r20, `issues`.`rgt` AS t0_r21, `issues`.`is_private` AS t0_r22, `issues`.`ir_position` AS t0_r23, `issues`.`closed_on` AS t0_r24, `issues`.`sprint_id` AS t0_r25, `issues`.`position` AS t0_r26, `projects`.`id` AS t1_r0, `projects`.`name` AS t1_r1, `projects`.`description` AS t1_r2, `projects`.`homepage` AS t1_r3, `projects`.`is_public` AS t1_r4, `projects`.`parent_id` AS t1_r5, `projects`.`created_on` AS t1_r6, `projects`.`updated_on` AS t1_r7, `projects`.`identifier` AS t1_r8, `projects`.`status` AS t1_r9, `projects`.`lft` AS t1_r10, `projects`.`rgt` AS t1_r11, `projects`.`inherit_members` AS t1_r12, `projects`.`default_assignee_id` AS t1_r13, `projects`.`product_backlog_id` AS t1_r14, `journals`.`id` AS t2_r0, `journals`.`journalized_id` AS t2_r1, `journals`.`journalized_type` AS t2_r2, `journals`.`user_id` AS t2_r3, `journals`.`notes` AS t2_r4, `journals`.`created_on` AS t2_r5, `journals`.`private_notes` AS t2_r6 FROM `issues` LEFT OUTER JOIN `projects` ON `projects`.`id` = `issues`.`project_id` LEFT OUTER JOIN `journals` ON `journals`.`journalized_id` = `issues`.`id` AND (journals.private_notes = 0 OR (1=0)) AND `journals`.`journalized_type` = 'Issue' WHERE (((projects.status  9 AND projects.id IN (SELECT em.project_id FROM enabled_modules em WHERE em.name='issue_tracking')) AND ((projects.is_public = 1 AND ((issues.is_private = 0)))))) AND (((LOWER(subject) LIKE '%somethig%') OR (LOWER(issues.description) LIKE '%somethig%') OR (LOWER(journals.notes) LIKE '%somethig%') OR issues.id IN (SELECT cfs.customized_id FROM custom_values cfs WHERE cfs.customized_type='Issue' AND cfs.customized_id=issues.id AND LOWER(cfs.value) LIKE '%somethig%' AND cfs.custom_field_id IN (2,4) AND ((1=1) AND (issues.tracker_id IN (SELECT tracker_id FROM custom_fields_trackers WHERE custom_field_id = cfs.custom_field_id)) AND (EXISTS (SELECT 1 FROM custom_fields ifa WHERE ifa.is_for_all = 1 AND ifa.id = cfs.custom_field_id) OR issues.project_id IN (SELECT project_id FROM custom_fields_projects WHERE custom_field_id = cfs.custom_field_id)))))) ORDER BY issues.id ASC
 => [[], 0]

It seems to be SQL, isn't it? I migrated old redmine to new version and still it's searching without fuzzy words feature. Any ideas what can I check? Plugin settings seems to be saved correctly. Is there any debug available so I can see what engine it's using?

dominch avatar Feb 05 '15 19:02 dominch

Setting.plugin_redmine_didyoumean['search_method'] in rails console, 0 - SQL 1- TS

rlisowski avatar Feb 05 '15 19:02 rlisowski

I'm trying in console:

2.0.0-p594 :006 > Issue.sphinx_search 'test'
  Issue Load (0.8ms)  SELECT `issues`.* FROM `issues` WHERE `issues`.`id` IN (231, 1, 51, 52, 53, 114, 150, 153, 167, 173, 232, 235, 244, 284, 381, 523, 687, 717, 747, 912)
(20 results)

nad same for word 'testy' gives me only one result. So sphinx is working and gives me some results but they are amost same as for sql.

dominch avatar Feb 05 '15 20:02 dominch

That seems to be ok:

2.0.0-p594 :009 > Setting.plugin_redmine_didyoumean['search_method']
 => "1"

And that's:

  def search_class
    case Setting.plugin_redmine_didyoumean['search_method']
    when "0"
      SqlSearch
    when "1"
      ThinkingSphinxSearch
    else
      raise 'There is no search method selected!'
    end
  end

so its sphinx. I tried to modify searching_by_thinking_sphinx.rb and that caused an effect so it's using it for sure. The question is what is wrong with sphinx that results are wrong.

dominch avatar Feb 05 '15 20:02 dominch

@dominch Follow this steps to fix this: Open file: issues_index.rb and replace line 7 with set_property :enable_star => true, next in main project("Redmine") catalog (not in plugin) run 'rake ts:rebuid' and then try to search duplicates.

My case was: 'test', 'tester', 'testowy'. And before this steps I got only 1 results, but should be 3, now after these steps, I got a good result (3).

Check your application log for something like : Sphinx Query (1.4ms) SELECT * FROM issue_core WHERE MATCH('test') AND project_id IN (450) AND sphinx_deleted = 0 AND sphinx_internal_id NOT IN (0) LIMIT 0, 10 Sphinx Found 3 results

swiatkiewicz avatar Feb 06 '15 09:02 swiatkiewicz

How can I turn on debug mode? Trying

script/rails server webrick -e production -d -p 3000

plus:

http://localhost:3000/searchissues?project_id=1&issue_id=&query=testowy

gives me:

Processing by SearchIssuesController#index as HTML
  Parameters: {"project_id"=>"1", "issue_id"=>"", "query"=>"testowy"}
  Current user: dominik.chmaj (id=3)
Completed 200 OK in 485.4ms (Views: 1.0ms | ActiveRecord: 15.7ms)

previous setting for :enable_star was 1, changed that to true but still no effect :(

dominch avatar Feb 06 '15 13:02 dominch

@dominch Open config/production.rb then find line config.logger.lever or log_level, and it should equal to :debug like config.log_level = :debug

In thinkingSphinx is another problem, because if you add new issue or edit exisitng one, then you should run ts:index, to update indexes. You can use unix cron, and run this every five minutes or something.

I'm trying to implement RealTime indexing but it's doesn't work as I expected and it's can take a while.

swiatkiewicz avatar Feb 06 '15 13:02 swiatkiewicz

Ok, debug logs are working and I have:

  Sphinx Query (0.8ms)  SELECT * FROM `issue_core` WHERE MATCH('*testowej*') AND `project_id` IN (1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14, 17, 18, 19, 20, 23, 24, 25, 26) AND `sphinx_deleted` = 0 AND `sphinx_internal_id` NOT IN (0) LIMIT 0, 5
  Sphinx  Found 3 results
  Issue Load (0.6ms)  SELECT `issues`.* FROM `issues` WHERE `issues`.`id` IN (336, 717, 962)

So that's proof - sphinx are working, somehow it does not return much results. Only exact words.

my development.sphinx.conf looks like:

indexer
{
}
searchd
{
  listen = 127.0.0.1:9306:mysql41
  log = /var/data/redmine/log/development.searchd.log
  query_log = /var/data/redmine/log/development.searchd.query.log
  pid_file = /var/data/redmine/log/development.sphinx.pid
  workers = threads
  binlog_path = /var/data/redmine/tmp/binlog/development
}
source issue_core_0
{
  type = mysql
  sql_host = localhost
  sql_user = redmine
  sql_pass = ***
  sql_db = redmine
  sql_query_pre = SET TIME_ZONE = '+0:00'
  sql_query_pre = SET NAMES utf8
  sql_query = SELECT SQL_NO_CACHE `issues`.`id` * 2 + 0 AS `id`, `issues`.`subject` AS `subject`, `issues`.`id` AS `sphinx_internal_id`, 'Issue' AS `sphinx_internal_class`, 0 AS `sphinx_deleted`, `issues`.`id` AS `id`, `issues`.`status_id` AS `status_id`, `issues`.`project_id` AS `project_id` FROM `issues`  WHERE (`issues`.`id` BETWEEN $start AND $end) GROUP BY `issues`.`id`, `issues`.`subject`, `issues`.`id`, `issues`.`id`, `issues`.`status_id`, `issues`.`project_id` ORDER BY NULL
  sql_query_range = SELECT IFNULL(MIN(`issues`.`id`), 1), IFNULL(MAX(`issues`.`id`), 1) FROM `issues`
  sql_attr_uint = sphinx_internal_id
  sql_attr_uint = sphinx_deleted
  sql_attr_uint = id
  sql_attr_uint = status_id
  sql_attr_uint = project_id
  sql_attr_string = sphinx_internal_class
  sql_field_string = subject
  sql_query_info = SELECT `issues`.* FROM `issues`  WHERE (`issues`.`id` = ($id - 0) / 2)
}
index issue_core
{
  type = plain
  path = /var/data/redmine/db/sphinx/development/issue_core
  docinfo = extern
  charset_type = utf-8
  min_infix_len = 2
  enable_star = 1
  source = issue_core_0
}
index issue
{
  type = distributed
  local = issue_core
}

everything seems to work except right results :) This have to be something with tokenization etc. I assume that language is not that important because it should look for any word in any language with described rules. Is that correct?

dominch avatar Feb 06 '15 14:02 dominch

@dominch In my case when I'm looking for word 'test': Project: dym2 Results: dym1

This seems to be ok, right? I think you like get something like this, right?

swiatkiewicz avatar Feb 06 '15 15:02 swiatkiewicz

In description there is:

"faulty" also returned a match for "fault", "saving" matched "saved" or "save"

from plugin settings:

- Thinking Sphinx - firsly search words 1:1 after then substract last character and search again ('Running' will be looking for 'Runner' 'Running' etc.). Substract to min word length which is definded below

So in Your case for word "tester" it should find everything with word "test" (substacted 2 letters)
Of course ideal is with all forms like in example - Running should find Runner and Running.

Right now and for Your example it's loking for test and I have same thing - above You can find: "FROM issue_core WHERE MATCH('testowej') " that is equiwalent of sql "where x like "%test%". Sure it will find "testowy" and even "wytestuj", but that's not very useful. Fulltext search should tokenize words, change them to basic forms, remove stopwords and try to match with query (with same rules).

Try "tester" - it should find # 101497 and of course # 101512 I expect that this word should search for "tester" + "teste" + "test" + "tes", should assign weights etc. That should give much more results.

dominch avatar Feb 06 '15 16:02 dominch

It seems that sphinxsearch does not tokenize words by default. To make it work install it with libstemmer library.

links: https://pat.github.io/thinking-sphinx/advanced_config.html http://sphinxsearch.com/docs/current.html#conf-morphology http://snowball.tartarus.org/download.php

rlisowski avatar Feb 14 '15 23:02 rlisowski

I just changed my config and now it's working. I think that should be placed in readme to be clear :) I needed to add config/thinking_sphinx.yml file with content:

production:
  morphology: stem_en
  mem_limit: 128M
  wordforms: "/var/data/redmine/config/sphinx/wordforms.txt"
  stopwords: "/var/data/redmine/config/sphinx/stopwords.txt"

This added morphology and steammer for my generated files, Also I added wordforms and stopwords to my config, that's easy to get for any language from google.

Now it's working great! :) Thank You for help, I wasn't sure if I need anything more to my configuration.

dominch avatar Feb 18 '15 13:02 dominch

@dominch Hello, I am highly interested in what you found. Could you please describe a little more how you achieved that ?

  • Did you find your wordforms/stopwords or did you generate them (and how) ?
  • when you write config/thinking_sphinx.yml you means redmine_root_path/config/thinking_sphinx.yml?
  • did you do anything else than just put the thinking_sphinx.yml file ?

I would like to use a different language than english/russian and after reading documentation, it appears I have to make more steps to achieve the morphology search.

Androc avatar Feb 18 '15 14:02 Androc