
Soundex Appears Broken?

Open quantology opened this issue 8 years ago • 12 comments

Using the test case, in python 3.5:

import fuzzy
phrase = 'FancyFree'
print(repr(fuzzy.Soundex(4)(phrase)))

yields: ''

Occasionally instead of yielding an empty string, it yields a unicode error. dmeta and nysiis are working fine in this install, so I don't believe it was an install error.

quantology avatar Nov 05 '17 03:11 quantology

Hi, same for me on python 2.7, please see example below. Thank you in advance for your help.

| => python
Python 2.7.9 (default, Mar  1 2015, 12:57:24) 
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import fuzzy
>>> sdx = fuzzy.Soundex(8)
>>> sdx('Test')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/fuzzy.pyx", line 230, in fuzzy.Soundex.__call__
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)

I had to go back to the previous version, 1.1:

| => python
Python 2.7.9 (default, Mar  1 2015, 12:57:24) 
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import fuzzy
>>> sdx = fuzzy.Soundex(8)
>>> sdx('Test')
'T2300000'

pw717 avatar Nov 13 '17 10:11 pw717

In this job, you can see the tests I added in fa184ba now failing. Annoyingly, they pass when I run the same tests on my mac. So there are apparently some issues with Cython or maybe with the compiler on Linux. I welcome someone to dive deeper and find a solution.

jaraco avatar Nov 14 '17 21:11 jaraco

As you can see, little changed with fuzzy.pyx from 1.1 to 1.2, and it changed slightly from 1.2 to 1.2.2.

jaraco avatar Nov 14 '17 21:11 jaraco

Hi, thank you very much for your answer.

mac

On Mac, as you mentioned (OSX Sierra 10.12.6), it's not OK either: it doesn't show any error, but the return value appears to be wrong:

with 1.2.2

python
Python 2.7.10 (default, Feb 7 2017, 00:08:15)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import fuzzy
>>> fuzzy.Soundex(8)('Test')
u'T23'

We should have this instead:

with 1.1

Python 2.7.10 (default, Feb 7 2017, 00:08:15)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import fuzzy
>>> fuzzy.Soundex(8)('Test')
'T2300000'

Note also that the function in newer versions returns the unicode type rather than str as before.

Linux

On Debian 8.2 (jessie), with both versions 1.2 and 1.2.2, this may interest you:

with 1.1

| => python
Python 2.7.9 (default, Mar 1 2015, 12:57:24)
[GCC 4.9.2] on linux2
>>> import fuzzy
>>> fuzzy.Soundex(8)('Test')
'T2300000'

with 1.2.2

| => python
Python 2.7.9 (default, Mar 1 2015, 12:57:24)
[GCC 4.9.2] on linux2
>>> import fuzzy
>>> fuzzy.Soundex(8)('Test')
u''

Also, see below. Sorry for the repetition, but it may help to look at the third attempt: the return value is still wrong, but it doesn't throw any error!

>>> sdx = fuzzy.Soundex(8)
>>> sdx('Test')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/fuzzy.pyx", line 230, in fuzzy.Soundex.__call__
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb0 in position 0: ordinal not in range(128)
>>> sdx('Test')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/fuzzy.pyx", line 230, in fuzzy.Soundex.__call__
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)
>>> sdx('Test')
u''
>>> sdx('Test')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/fuzzy.pyx", line 230, in fuzzy.Soundex.__call__
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)
>>> sdx('Test')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/fuzzy.pyx", line 230, in fuzzy.Soundex.__call__
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)

and so on...
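To quantify how erratic the results are, a small harness like the following could tally each distinct outcome over many calls. It is only a sketch: collect_outcomes is a hypothetical helper, not part of fuzzy, and it accepts any callable so it can be tried without the package installed.

```python
from collections import Counter


def collect_outcomes(func, arg, runs=100):
    """Call func(arg) repeatedly and count each distinct result or error type.

    Hypothetical helper for illustration; not part of the fuzzy package.
    """
    outcomes = Counter()
    for _ in range(runs):
        try:
            # repr() keeps '' and u'' style results distinguishable in the tally.
            outcomes[repr(func(arg))] += 1
        except Exception as exc:
            # Record only the exception class; the offending byte varies per run.
            outcomes[type(exc).__name__] += 1
    return outcomes


if __name__ == "__main__":
    # Stand-in callable so the sketch runs anywhere.
    print(collect_outcomes(str.upper, "Test", runs=5))
```

With the package installed, collect_outcomes(fuzzy.Soundex(8), 'Test') would be the call matching the session above. A deterministic function should produce a tally with a single key, so more than one key points at the kind of instability shown here.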

pw717 avatar Nov 16 '17 10:11 pw717

I tested the sample code from the documentation with versions 1.0, 1.1, 1.2, 1.2.1 and 1.2.2 on a GoogleCloud Ubuntu 16.04 instance:

import fuzzy
soundex = fuzzy.Soundex(4)
print soundex('fuzzy')
print 'should be: F200'

Versions 1.0 and 1.1 produce the expected result, 'F200'. Versions 1.2 onward produce empty strings.
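For cross-checking the expected values quoted in this thread, here is a minimal pure-Python sketch of classic American Soundex. It is not the code in fuzzy.pyx. The name reference_soundex and the size and pad parameters are hypothetical, chosen only to mirror the padded ('T2300000') and unpadded ('T23') outputs reported above.

```python
# Letter-to-digit table for classic American Soundex.
CODES = {}
for letters, digit in (
    ("BFPV", "1"),
    ("CGJKQSXZ", "2"),
    ("DT", "3"),
    ("L", "4"),
    ("MN", "5"),
    ("R", "6"),
):
    for ch in letters:
        CODES[ch] = digit


def reference_soundex(name, size=4, pad=True):
    """Reference sketch only; hypothetical name, not the fuzzy API."""
    # Keep ASCII letters only, mirroring the library's ASCII-only assumption.
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    out = [letters[0]]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
        prev = code  # vowels (and Y) reset prev, so a repeated code counts again
    result = "".join(out)[:size]
    # pad=True mimics the zero-padded 1.1 output; pad=False mimics 'T23'.
    return result.ljust(size, "0") if pad else result


if __name__ == "__main__":
    print(reference_soundex("fuzzy"))                      # F200
    print(reference_soundex("Test", size=8))               # T2300000
    print(reference_soundex("Test", size=8, pad=False))    # T23
```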

yaakov2 avatar Dec 03 '17 13:12 yaakov2

It seems to me that two weeks ago, version 1.2.2 used to work for us --- but then something changed and the results are wrong. Also, the results are sporadic --- we get different error messages in different runs of the program. For the time being, we go back to version 1.1 -- but it is not clear whether this solves the problem.

I would think that a basic test for the Soundex function should not be marked as "expected to fail": If the test doesn't produce the correct answer, then there is some problem that needs to be corrected (and people should see that when they decide whether to use the package or not).

yaakov2 avatar Dec 03 '17 13:12 yaakov2

Hi, I'm porting my project to Python 3, and it seems that the library's Soundex doesn't work as it should, as @metaperture reported earlier.

Please see examples below:

python
Python 3.6.5 (default, Mar 30 2018, 06:42:10)
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import fuzzy; sdx = fuzzy.Soundex(8)
>>> sdx('fuzzy')
"F2x('fuzzy')\n"
>>> sdx('Jéroboam')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/fuzzy.pyx", line 207, in fuzzy.Soundex.__call__
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 1: ordinal not in range(128)

The "test" string "fuzzy" does not give the expected result, and a string containing an accented character raises an exception.

Thank you in advance for your help.

Regards,

Philippe

pw717 avatar Apr 24 '18 10:04 pw717

I've personally given up and used another implementation (https://pypi.org/project/soundex/)

morvan-s avatar Jul 02 '18 13:07 morvan-s

As I'm not the original author, I have little insight into the project and can give little guidance on what this library should be doing. So it's nice to know of the soundex library, which we can use as a guide to what may or may not be correct.

Thinking about @pw717's comment above, it seems to me that on the mac, the behavior on 1.2 is more correct than that of 1.1, especially considering that the trailing zeros seem like padding, but also because the soundex lib doesn't render them either:

fuzzy master $ rwt soundex
Collecting soundex
  Downloading https://files.pythonhosted.org/packages/f8/8f/37b9711595d007e82f70ae6f41b6ab6a1fda406a8321ccfc458fb5023b5f/soundex-1.1.3.tar.gz
Collecting silpa_common>=0.3 (from soundex)
  Downloading https://files.pythonhosted.org/packages/8d/55/452f5103cb7071d188a818d9e2f12c19c4c8a12124a28aaa212eb6716a4d/silpa_common-0.3.tar.gz
Building wheels for collected packages: soundex, silpa-common
  Running setup.py bdist_wheel for soundex ... done
  Stored in directory: /Users/jaraco/Library/Caches/pip/wheels/b5/bb/e6/9a4b6be56c40aa707509bddaf6d414187461ded9db7a25a41a
  Running setup.py bdist_wheel for silpa-common ... done
  Stored in directory: /Users/jaraco/Library/Caches/pip/wheels/16/4f/ba/604a82bf904740f1a1d3ad88029c0df5c638bd8825a3cb972d
Successfully built soundex silpa-common
Installing collected packages: silpa-common, soundex
Successfully installed silpa-common-0.3 soundex-1.1.3
Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 26 2018, 23:26:24)
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import soundex
>>> ob = soundex.Soundex()
>>> ob.soundex('Test')
'T23'

jaraco avatar Jul 02 '18 14:07 jaraco

There are several issues at play here. Let's set aside for the moment the issue that non-ascii characters are not yet supported (as the encoding for strings is declared to be ascii). I'll file that as a separate issue for clarity.
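As a stopgap for the ASCII limitation, callers could fold accented characters to their ASCII base letters before calling the encoder. The sketch below uses only the standard library. to_ascii is a hypothetical helper, and dropping diacritics this way is an assumption about what is acceptable for a given dataset, not something the library does.

```python
import unicodedata


def to_ascii(text):
    """Strip diacritics by decomposing characters and dropping non-ASCII parts.

    Hypothetical pre-processing helper; not part of the fuzzy package.
    """
    # NFKD splits 'é' into 'e' plus a combining accent, which 'ignore' discards.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


if __name__ == "__main__":
    try:
        "Jéroboam".encode("ascii")
    except UnicodeEncodeError as exc:
        # Same class of error as the traceback reported in this thread.
        print("plain ASCII encoding fails:", exc.reason)

    print(to_ascii("Jéroboam"))  # Jeroboam
```

Characters with no ASCII decomposition are silently dropped by this approach, so it suits Latin-script names better than arbitrary text.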

Excluding that issue, the tests pass on macOS.

What we need is someone to spend some time to understand the Cython code and dig into the details on a system where the tests are failing and devise a fix.

jaraco avatar Jul 02 '18 15:07 jaraco

Soundex also has other bugs:

>>> import fuzzy
>>> soundex = fuzzy.Soundex(4)
>>> soundex('hello')
'H4'
>>> soundex('hi')
"Houndex('hi')\n"

This is on Python 3.7.0 (macOS 10.14) with Fuzzy 1.2.2

supriyo-biswas avatar Oct 07 '18 06:10 supriyo-biswas

I am getting weird errors. Sometimes I get blank strings with a carriage return. Sometimes I get this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/fuzzy.pyx", line 230, in fuzzy.Soundex.__call__
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd6 in position 1: ordinal not in range(128)

CognitiveClouds-Prasad avatar Jan 24 '19 09:01 CognitiveClouds-Prasad