Soundex Appears Broken?
Using the test case in Python 3.5:
phrase = 'FancyFree'
print(repr(fuzzy.Soundex(4)(phrase)))
yields: ''
Occasionally instead of yielding an empty string, it yields a unicode error. dmeta and nysiis are working fine in this install, so I don't believe it was an install error.
Hi, same for me on python 2.7, please see example below. Thank you in advance for your help.
| => python
Python 2.7.9 (default, Mar 1 2015, 12:57:24)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import fuzzy
>>> sdx = fuzzy.Soundex(8)
>>> sdx('Test')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "src/fuzzy.pyx", line 230, in fuzzy.Soundex.__call__
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)
I had to revert to the previous version, 1.1:
| => python
Python 2.7.9 (default, Mar 1 2015, 12:57:24)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import fuzzy
>>> sdx = fuzzy.Soundex(8)
>>> sdx('Test')
'T2300000'
In this job, you can see the tests I added in fa184ba now failing. Annoyingly, they pass when I run the same tests on my mac. So there are apparently some issues with Cython or maybe with the compiler on Linux. I welcome someone to dive deeper and find a solution.
As you can see, little changed with fuzzy.pyx from 1.1 to 1.2, and it changed slightly from 1.2 to 1.2.2.
Hi, thank you very much for your answer.
mac
On Mac (OS X Sierra 10.12.6), which you mentioned, it's not OK either: it doesn't raise any error, but the return value appears to be wrong:
with 1.2.2
python
Python 2.7.10 (default, Feb 7 2017, 00:08:15)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import fuzzy
>>> fuzzy.Soundex(8)('Test')
u'T23'
We should have this instead:
with 1.1
Python 2.7.10 (default, Feb 7 2017, 00:08:15)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import fuzzy
>>> fuzzy.Soundex(8)('Test')
'T2300000'
Note also that in the newer versions the function returns a unicode object rather than a str as before.
Linux
On Linux (Debian 8.2 jessie), with both versions 1.2 and 1.2.2, this may interest you:
with 1.1
| => python
Python 2.7.9 (default, Mar 1 2015, 12:57:24)
[GCC 4.9.2] on linux2
>>> import fuzzy
>>> fuzzy.Soundex(8)('Test')
'T2300000'
with 1.2.2
| => python
Python 2.7.9 (default, Mar 1 2015, 12:57:24)
[GCC 4.9.2] on linux2
>>> import fuzzy
>>> fuzzy.Soundex(8)('Test')
u''
Also, see below. Sorry for the repetition, but it may help to look at the third attempt: the return value is still wrong, but it doesn't throw any error!
>>> sdx = fuzzy.Soundex(8)
>>> sdx('Test')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "src/fuzzy.pyx", line 230, in fuzzy.Soundex.__call__
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb0 in position 0: ordinal not in range(128)
>>> sdx('Test')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "src/fuzzy.pyx", line 230, in fuzzy.Soundex.__call__
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)
>>> sdx('Test')
u''
>>> sdx('Test')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "src/fuzzy.pyx", line 230, in fuzzy.Soundex.__call__
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)
>>> sdx('Test')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "src/fuzzy.pyx", line 230, in fuzzy.Soundex.__call__
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)
and so on...
I tested the sample code from the documentation with versions 1.0, 1.1, 1.2, 1.2.1 and 1.2.2 on a Google Cloud Ubuntu 16.04 instance:
import fuzzy
soundex = fuzzy.Soundex(4)
print soundex('fuzzy')
print 'should be: F200'
Versions 1.0 and 1.1 produce the expected results 'F200'. Versions 1.2 onward produce empty strings.
It seems to me that two weeks ago version 1.2.2 worked for us, but then something changed and the results are now wrong. The results are also sporadic: we get different error messages in different runs of the program. For the time being we are going back to version 1.1, but it is not clear whether this solves the problem.
I would think that a basic test for the Soundex function should not be marked as "expected to fail": If the test doesn't produce the correct answer, then there is some problem that needs to be corrected (and people should see that when they decide whether to use the package or not).
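For cross-checking the expected values quoted in this thread, here is a minimal pure-Python sketch of American Soundex. It is not the Cython code in fuzzy.pyx; `reference_soundex` is a hypothetical name, and details such as the H/W rule and the handling of non-letters are assumptions that may differ from what fuzzy does.

```python
def reference_soundex(name, size=4):
    """Return an American Soundex code, zero-padded or cut off at `size` characters."""
    codes = {}
    for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
        for letter in group:
            codes[letter] = digit

    # Keep only A-Z; anything else (digits, accented letters) is dropped.
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""

    result = letters[0]
    last = codes.get(letters[0], "")
    for c in letters[1:]:
        digit = codes.get(c, "")
        if digit and digit != last:
            result += digit
        if c not in "HW":  # H and W do not separate letters with the same code
            last = digit
        if len(result) == size:
            break
    return result.ljust(size, "0")


print(reference_soundex("fuzzy"))
print(reference_soundex("Test", 8))
```

With this sketch, 'fuzzy' at size 4 and 'Test' at size 8 come out as the values reported for version 1.1 in this thread, which could serve as assertions in a basic regression test.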
Hi, I'm porting my project to Python 3 and it seems that Soundex in this library doesn't work as it should, as @metaperture reported earlier.
Please see examples below:
python
Python 3.6.5 (default, Mar 30 2018, 06:42:10)
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import fuzzy; sdx = fuzzy.Soundex(8)
>>> sdx('fuzzy')
"F2x('fuzzy')\n"
>>> sdx('Jéroboam')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "src/fuzzy.pyx", line 207, in fuzzy.Soundex.__call__
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 1: ordinal not in range(128)
The test string "fuzzy" does not give the expected result, and a string containing an accented character raises an exception.
Thank you in advance for your help.
Regards,
Philippe
I've personally given up and used another implementation (https://pypi.org/project/soundex/)
As I'm not the original author, I have little visibility into the project and can give little guidance on what this library should be doing. So it's nice to know of the soundex library, as we can use it as a guide for what may or may not be correct.
Thinking about @pw717's comment above, it seems to me that on the mac, the behavior on 1.2 is more correct than that of 1.1, especially considering that the trailing zeros seem like padding, but also because the soundex lib doesn't render them either:
fuzzy master $ rwt soundex
Collecting soundex
Downloading https://files.pythonhosted.org/packages/f8/8f/37b9711595d007e82f70ae6f41b6ab6a1fda406a8321ccfc458fb5023b5f/soundex-1.1.3.tar.gz
Collecting silpa_common>=0.3 (from soundex)
Downloading https://files.pythonhosted.org/packages/8d/55/452f5103cb7071d188a818d9e2f12c19c4c8a12124a28aaa212eb6716a4d/silpa_common-0.3.tar.gz
Building wheels for collected packages: soundex, silpa-common
Running setup.py bdist_wheel for soundex ... done
Stored in directory: /Users/jaraco/Library/Caches/pip/wheels/b5/bb/e6/9a4b6be56c40aa707509bddaf6d414187461ded9db7a25a41a
Running setup.py bdist_wheel for silpa-common ... done
Stored in directory: /Users/jaraco/Library/Caches/pip/wheels/16/4f/ba/604a82bf904740f1a1d3ad88029c0df5c638bd8825a3cb972d
Successfully built soundex silpa-common
Installing collected packages: silpa-common, soundex
Successfully installed silpa-common-0.3 soundex-1.1.3
Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 26 2018, 23:26:24)
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import soundex
>>> ob = soundex.Soundex()
>>> ob.soundex('Test')
'T23'
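If the trailing zeros are indeed only padding, the padded and unpadded forms can be compared by normalizing one to the other. A minimal sketch, where `pad_soundex` is a hypothetical helper and not part of either library:

```python
def pad_soundex(code, size):
    """Right-pad a Soundex code with zeros to `size` characters, truncating if longer."""
    return code[:size].ljust(size, "0")


print(pad_soundex("T23", 8))  # T2300000
```

Under that assumption, the 'T23' from soundex 1.1.3 and the 'T2300000' from fuzzy 1.1 describe the same code.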
There are several issues at play here. Let's set aside for the moment the issue that non-ascii characters are not yet supported (as the encoding for strings is declared to be ascii). I'll file that as a separate issue for clarity.
Excluding that issue, the tests pass on macOS.
What we need is someone to spend some time to understand the Cython code and dig into the details on a system where the tests are failing and devise a fix.
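Until non-ASCII input is handled, one possible caller-side workaround is to strip accents before calling Soundex. This is a sketch for Python 3 using only the standard library; `to_ascii` is a hypothetical helper, and characters with no ASCII decomposition (for example 'ø') are silently dropped, which may or may not be acceptable.

```python
import unicodedata


def to_ascii(text):
    """Decompose accented characters, then drop anything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Jéroboam"))  # Jeroboam
```

This only avoids the UnicodeEncodeError shown earlier; it does not address the wrong results for plain ASCII input.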
Soundex also has other bugs:
>>> import fuzzy
>>> soundex = fuzzy.Soundex(4)
>>> soundex('hello')
'H4'
>>> soundex('hi')
"Houndex('hi')\n"
This is on Python 3.7.0 (macOS 10.14) with Fuzzy 1.2.2
I am getting weird errors. Sometimes I get blank strings with a carriage return. Sometimes I get this error:
Traceback (most recent call last):
File "