redmine_didyoumean
redmine_didyoumean copied to clipboard
Fuzzy Word Matching
Could DYM be extended to use the idea of "approximate string matching" (Wikipedia reference)? Not everybody phrases their sentences the same way.. I honestly believe this plugin would be a million times more useful if words like "faulty" also returned a match for "fault", "saving" matched "saved" or "save", etc etc.
Unfortunately I'm not familiar enough with Ruby to do this myself, but a quick Google search shows many Ruby Gems that provide this type of functionality. So I'm guessing (read: hoping) this isn't an unreasonable request. :)
Yes, that's something that we had considered in the past, but never implemented as we did not want to kill the underlying database with too many queries.
Can you please list a few of the Ruby Gems you had found?
From my limited understanding, I can only imagine a DB hit when the front end gets an updated string to compare (I use the "as you type" method in my setup, which could actually skip key presses like shift
, space
etc.. but that's not the point of my request here). Do you think this idea would cause more hits?
One that keeps popping up in SO questions is amatch
, but after a bit of digging it looks like either fuzzy_match
or fuzzy-string-match
might be faster.
Another interesting looking one is fuzzily
, however the dev recommends using blurrily
instead for large datasets.
From the lib weight point of view, it looks like amatch
and fuzzily
have less dependencies, but I'm not too sure how much difference that would actually make. Keep in mind I don't speak Ruby, so I'm running blind here. :)
What about full text search with elasticsearch-rails I could work with sidekiq worker as an async indexer. Indexer could be triggered by hooks in Issue (added via patch). There is redmine_sidekiq plugin so you neeed only provide worker. What you say?
Consider also thinkingsphinx which can fit better that elasticsearch to current features list (filter by project_id, issue_id, status)
:+1:
I just pulled the changes referenced above into my fork of the repo, but I still don't get any fuzzy searching.. When I migrate plugins, I do get a warning to say "Sphinx cannot be found on your system". Tried installing the gem manually (gem install sphinx
from Redmine's htdocs
root folder), which didn't give any errors, but it also didn't change anything, so I'm not sure how to get around this.
@korin, was that thumbs-up meant to imply that @swiatkiewicz's changes work? Or have you not had a chance to test them yet?
It works, you need to install Sphinx. See ThinkingSphinx quickstart guide it's already available in most distros.
To be honest @swiatkiewicz changes allow replace sql like search with sphinx indexer which is faster option. It's only one step from fuzzy matching feature.
Thanks for the link. Unfortunately I need to run my system in a Windows environment, so it looks like I've got some reading to do before I get that part working.
The SQL like
search works perfectly and the additional configuration options are helpful, so this is already a nice improvement.
I can't get this feature to work :( I have already started sphinx etc, then trying "test" and have results: Bug # 1 – Testowy błąd (Closed w projekcie Projekt testowy) Bug # 11 – test (New w projekcie Projekt testowy) Feature # 19 – Test (New w projekcie) Feature # 21 – testowe zadanie (New w projekcie Zadania w realizacji KD) Feature # 22 – Ticket testowo pokazowy (New w projekcie Zadania w realizacji KD) so I assume it should show something for "testy" (should strip "y" and match like "test"?)
show us plugin settings /settings/plugin/redmine_didyoumean
it's in polish but you know all settings. http://i.imgur.com/gvLDYVy.png any ideas?
Should work, I have similar settings. Any errors in redmine log file? or in browser development console?
You can also try rebuild thinking sphinx index with rake ts:rebuild.
No errors so far noticed :( I'll try out this with new tickets, BTW: is ts:rebuild better than ts:index ?
only small difference with configuration see http://pat.github.io/thinking-sphinx/rake_tasks.html
Ok, so basically ts:rebuild is same as stop+index+start, great :) Still can't get this to work as expected. I tried mysql first, is there a chance it still use it? rake tasks are running ok without any errors, also settings should be ok, but still I can't get right results. Maybe it's something wrong with my sphinx in system? how can I test?
Have you tried testing it in rails console? It's working like " Issue.search 'somethig' ", and then you should see info about search engine (sql or sphinx).
2015-02-05 15:57 GMT+01:00 dominch [email protected]:
Ok, so basically ts:rebuild is same as stop+index+start, great :) Still can't get this to work as expected. I tried mysql first, is there a chance it still use it? rake tasks are running ok without any errors, also settings should be ok, but still I can't get right results. Maybe it's something wrong with my sphinx in system? how can I test?
— Reply to this email directly or view it on GitHub https://github.com/abahgat/redmine_didyoumean/issues/56#issuecomment-73059199 .
Trying now:
2.0.0-p594 :001 > Issue.search 'somethig' CustomField Load (0.6ms) SELECT `custom_fields`.* FROM `custom_fields` WHERE `custom_fields`.`type` = 'IssueCustomField' AND `custom_fields`.`searchable` = 1 Role Load (0.3ms) SELECT `roles`.* FROM `roles` WHERE `roles`.`builtin` = 2 LIMIT 1 GroupAnonymous Load (0.6ms) SELECT `users`.* FROM `users` WHERE `users`.`type` IN ('GroupAnonymous') ORDER BY id LIMIT 1 Member Load (0.4ms) SELECT `members`.* FROM `members` INNER JOIN `projects` ON `projects`.`id` = `members`.`project_id` WHERE (projects.status 9) AND (members.user_id = 2 OR (projects.is_public = 1 AND members.user_id = 49)) (15.0ms) SELECT COUNT(DISTINCT `issues`.`id`) FROM `issues` LEFT OUTER JOIN `projects` ON `projects`.`id` = `issues`.`project_id` LEFT OUTER JOIN `journals` ON `journals`.`journalized_id` = `issues`.`id` AND (journals.private_notes = 0 OR (1=0)) AND `journals`.`journalized_type` = 'Issue' WHERE (((projects.status 9 AND projects.id IN (SELECT em.project_id FROM enabled_modules em WHERE em.name='issue_tracking')) AND ((projects.is_public = 1 AND ((issues.is_private = 0)))))) AND (((LOWER(subject) LIKE '%somethig%') OR (LOWER(issues.description) LIKE '%somethig%') OR (LOWER(journals.notes) LIKE '%somethig%') OR issues.id IN (SELECT cfs.customized_id FROM custom_values cfs WHERE cfs.customized_type='Issue' AND cfs.customized_id=issues.id AND LOWER(cfs.value) LIKE '%somethig%' AND cfs.custom_field_id IN (2,4) AND ((1=1) AND (issues.tracker_id IN (SELECT tracker_id FROM custom_fields_trackers WHERE custom_field_id = cfs.custom_field_id)) AND (EXISTS (SELECT 1 FROM custom_fields ifa WHERE ifa.is_for_all = 1 AND ifa.id = cfs.custom_field_id) OR issues.project_id IN (SELECT project_id FROM custom_fields_projects WHERE custom_field_id = cfs.custom_field_id)))))) SQL (23.8ms) SELECT `issues`.`id` AS t0_r0, `issues`.`tracker_id` AS t0_r1, `issues`.`project_id` AS t0_r2, `issues`.`subject` AS t0_r3, `issues`.`description` AS t0_r4, `issues`.`due_date` AS t0_r5, `issues`.`category_id` AS t0_r6, `issues`.`status_id` AS t0_r7, `issues`.`assigned_to_id` AS t0_r8, `issues`.`priority_id` AS t0_r9, `issues`.`fixed_version_id` AS t0_r10, `issues`.`author_id` AS t0_r11, `issues`.`lock_version` AS t0_r12, `issues`.`created_on` AS t0_r13, `issues`.`updated_on` AS t0_r14, `issues`.`start_date` AS t0_r15, `issues`.`done_ratio` AS t0_r16, `issues`.`estimated_hours` AS t0_r17, `issues`.`parent_id` AS t0_r18, `issues`.`root_id` AS t0_r19, `issues`.`lft` AS t0_r20, `issues`.`rgt` AS t0_r21, `issues`.`is_private` AS t0_r22, `issues`.`ir_position` AS t0_r23, `issues`.`closed_on` AS t0_r24, `issues`.`sprint_id` AS t0_r25, `issues`.`position` AS t0_r26, `projects`.`id` AS t1_r0, `projects`.`name` AS t1_r1, `projects`.`description` AS t1_r2, `projects`.`homepage` AS t1_r3, `projects`.`is_public` AS t1_r4, `projects`.`parent_id` AS t1_r5, `projects`.`created_on` AS t1_r6, `projects`.`updated_on` AS t1_r7, `projects`.`identifier` AS t1_r8, `projects`.`status` AS t1_r9, `projects`.`lft` AS t1_r10, `projects`.`rgt` AS t1_r11, `projects`.`inherit_members` AS t1_r12, `projects`.`default_assignee_id` AS t1_r13, `projects`.`product_backlog_id` AS t1_r14, `journals`.`id` AS t2_r0, `journals`.`journalized_id` AS t2_r1, `journals`.`journalized_type` AS t2_r2, `journals`.`user_id` AS t2_r3, `journals`.`notes` AS t2_r4, `journals`.`created_on` AS t2_r5, `journals`.`private_notes` AS t2_r6 FROM `issues` LEFT OUTER JOIN `projects` ON `projects`.`id` = `issues`.`project_id` LEFT OUTER JOIN `journals` ON `journals`.`journalized_id` = `issues`.`id` AND (journals.private_notes = 0 OR (1=0)) AND `journals`.`journalized_type` = 'Issue' WHERE (((projects.status 9 AND projects.id IN (SELECT em.project_id FROM enabled_modules em WHERE em.name='issue_tracking')) AND ((projects.is_public = 1 AND ((issues.is_private = 0)))))) AND (((LOWER(subject) LIKE '%somethig%') OR (LOWER(issues.description) LIKE '%somethig%') OR (LOWER(journals.notes) LIKE '%somethig%') OR issues.id IN (SELECT cfs.customized_id FROM custom_values cfs WHERE cfs.customized_type='Issue' AND cfs.customized_id=issues.id AND LOWER(cfs.value) LIKE '%somethig%' AND cfs.custom_field_id IN (2,4) AND ((1=1) AND (issues.tracker_id IN (SELECT tracker_id FROM custom_fields_trackers WHERE custom_field_id = cfs.custom_field_id)) AND (EXISTS (SELECT 1 FROM custom_fields ifa WHERE ifa.is_for_all = 1 AND ifa.id = cfs.custom_field_id) OR issues.project_id IN (SELECT project_id FROM custom_fields_projects WHERE custom_field_id = cfs.custom_field_id)))))) ORDER BY issues.id ASC => [[], 0]
It seems to be SQL, isn't it? I migrated old redmine to new version and still it's searching without fuzzy words feature. Any ideas what can I check? Plugin settings seems to be saved correctly. Is there any debug available so I can see what engine it's using?
Setting.plugin_redmine_didyoumean['search_method'] in rails console, 0 - SQL 1- TS
I'm trying in console:
2.0.0-p594 :006 > Issue.sphinx_search 'test' Issue Load (0.8ms) SELECT `issues`.* FROM `issues` WHERE `issues`.`id` IN (231, 1, 51, 52, 53, 114, 150, 153, 167, 173, 232, 235, 244, 284, 381, 523, 687, 717, 747, 912) (20 results)
nad same for word 'testy' gives me only one result. So sphinx is working and gives me some results but they are amost same as for sql.
That seems to be ok:
2.0.0-p594 :009 > Setting.plugin_redmine_didyoumean['search_method'] => "1"
And that's:
def search_class case Setting.plugin_redmine_didyoumean['search_method'] when "0" SqlSearch when "1" ThinkingSphinxSearch else raise 'There is no search method selected!' end end
so its sphinx. I tried to modify searching_by_thinking_sphinx.rb and that caused an effect so it's using it for sure. The question is what is wrong with sphinx that results are wrong.
@dominch Follow this steps to fix this: Open file: issues_index.rb and replace line 7 with set_property :enable_star => true, next in main project("Redmine") catalog (not in plugin) run 'rake ts:rebuid' and then try to search duplicates.
My case was: 'test', 'tester', 'testowy'. And before this steps I got only 1 results, but should be 3, now after these steps, I got a good result (3).
Check your application log for something like :
Sphinx Query (1.4ms) SELECT * FROM issue_core
WHERE MATCH('test') AND project_id
IN (450) AND sphinx_deleted
= 0 AND sphinx_internal_id
NOT IN (0) LIMIT 0, 10
Sphinx Found 3 results
How can I turn on debug mode? Trying
script/rails server webrick -e production -d -p 3000
plus:
http://localhost:3000/searchissues?project_id=1&issue_id=&query=testowy
gives me:
Processing by SearchIssuesController#index as HTML Parameters: {"project_id"=>"1", "issue_id"=>"", "query"=>"testowy"} Current user: dominik.chmaj (id=3) Completed 200 OK in 485.4ms (Views: 1.0ms | ActiveRecord: 15.7ms)
previous setting for :enable_star was 1, changed that to true but still no effect :(
@dominch Open config/production.rb then find line config.logger.lever or log_level, and it should equal to :debug like config.log_level = :debug
In thinkingSphinx is another problem, because if you add new issue or edit exisitng one, then you should run ts:index, to update indexes. You can use unix cron, and run this every five minutes or something.
I'm trying to implement RealTime indexing but it's doesn't work as I expected and it's can take a while.
Ok, debug logs are working and I have:
Sphinx Query (0.8ms) SELECT * FROM `issue_core` WHERE MATCH('*testowej*') AND `project_id` IN (1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14, 17, 18, 19, 20, 23, 24, 25, 26) AND `sphinx_deleted` = 0 AND `sphinx_internal_id` NOT IN (0) LIMIT 0, 5 Sphinx Found 3 results Issue Load (0.6ms) SELECT `issues`.* FROM `issues` WHERE `issues`.`id` IN (336, 717, 962)
So that's proof - sphinx are working, somehow it does not return much results. Only exact words.
my development.sphinx.conf looks like:
indexer { } searchd { listen = 127.0.0.1:9306:mysql41 log = /var/data/redmine/log/development.searchd.log query_log = /var/data/redmine/log/development.searchd.query.log pid_file = /var/data/redmine/log/development.sphinx.pid workers = threads binlog_path = /var/data/redmine/tmp/binlog/development } source issue_core_0 { type = mysql sql_host = localhost sql_user = redmine sql_pass = *** sql_db = redmine sql_query_pre = SET TIME_ZONE = '+0:00' sql_query_pre = SET NAMES utf8 sql_query = SELECT SQL_NO_CACHE `issues`.`id` * 2 + 0 AS `id`, `issues`.`subject` AS `subject`, `issues`.`id` AS `sphinx_internal_id`, 'Issue' AS `sphinx_internal_class`, 0 AS `sphinx_deleted`, `issues`.`id` AS `id`, `issues`.`status_id` AS `status_id`, `issues`.`project_id` AS `project_id` FROM `issues` WHERE (`issues`.`id` BETWEEN $start AND $end) GROUP BY `issues`.`id`, `issues`.`subject`, `issues`.`id`, `issues`.`id`, `issues`.`status_id`, `issues`.`project_id` ORDER BY NULL sql_query_range = SELECT IFNULL(MIN(`issues`.`id`), 1), IFNULL(MAX(`issues`.`id`), 1) FROM `issues` sql_attr_uint = sphinx_internal_id sql_attr_uint = sphinx_deleted sql_attr_uint = id sql_attr_uint = status_id sql_attr_uint = project_id sql_attr_string = sphinx_internal_class sql_field_string = subject sql_query_info = SELECT `issues`.* FROM `issues` WHERE (`issues`.`id` = ($id - 0) / 2) } index issue_core { type = plain path = /var/data/redmine/db/sphinx/development/issue_core docinfo = extern charset_type = utf-8 min_infix_len = 2 enable_star = 1 source = issue_core_0 } index issue { type = distributed local = issue_core }
everything seems to work except right results :) This have to be something with tokenization etc. I assume that language is not that important because it should look for any word in any language with described rules. Is that correct?
@dominch
In my case when I'm looking for word 'test':
Project:
Results:
This seems to be ok, right? I think you like get something like this, right?
In description there is:
"faulty" also returned a match for "fault", "saving" matched "saved" or "save"
from plugin settings:
- Thinking Sphinx - firsly search words 1:1 after then substract last character and search again ('Running' will be looking for 'Runner' 'Running' etc.). Substract to min word length which is definded below
So in Your case for word "tester" it should find everything with word "test" (substacted 2 letters)
Of course ideal is with all forms like in example - Running should find Runner and Running.
Right now and for Your example it's loking for test and I have same thing - above You can find: "FROM issue_core
WHERE MATCH('testowej') "
that is equiwalent of sql "where x like "%test%". Sure it will find "testowy" and even "wytestuj", but that's not very useful. Fulltext search should tokenize words, change them to basic forms, remove stopwords and try to match with query (with same rules).
Try "tester" - it should find # 101497 and of course # 101512 I expect that this word should search for "tester" + "teste" + "test" + "tes", should assign weights etc. That should give much more results.
It seems that sphinxsearch does not tokenize words by default. To make it work install it with libstemmer library.
links: https://pat.github.io/thinking-sphinx/advanced_config.html http://sphinxsearch.com/docs/current.html#conf-morphology http://snowball.tartarus.org/download.php
I just changed my config and now it's working. I think that should be placed in readme to be clear :) I needed to add config/thinking_sphinx.yml file with content:
production: morphology: stem_en mem_limit: 128M wordforms: "/var/data/redmine/config/sphinx/wordforms.txt" stopwords: "/var/data/redmine/config/sphinx/stopwords.txt"
This added morphology and steammer for my generated files, Also I added wordforms and stopwords to my config, that's easy to get for any language from google.
Now it's working great! :) Thank You for help, I wasn't sure if I need anything more to my configuration.
@dominch Hello, I am highly interested in what you found. Could you please describe a little more how you achieved that ?
- Did you find your wordforms/stopwords or did you generate them (and how) ?
- when you write
config/thinking_sphinx.yml
you meansredmine_root_path/config/thinking_sphinx.yml
? - did you do anything else than just put the
thinking_sphinx.yml
file ?
I would like to use a different language than english/russian and after reading documentation, it appears I have to make more steps to achieve the morphology search.