TextAnalysis.jl
remove_corrupt_utf8() not working
The function remove_corrupt_utf8() does not work under Julia v0.4.6.
The problem is the line zeros(Char, endof(s)+1) where it complains that
zero is not defined for type Char. When using UInt8 instead I could make it
run without error, but please check if this does what it is supposed to do.
function remove_corrupt_utf8(s::AbstractString)
    r = zeros(UInt8, endof(s)+1)
    i = 0
    for chr in s
        i += 1
        r[i] = (chr != 0xfffd) ? chr : ' '
    end
    return utf8(r)
end
Note that on the return statement I got rid of the CharString() too.
If this is ok I can make another pull request.
Cheers, Andre
Sure, thanks. Looks OK. Note that utf8 is deprecated in 0.5, so you'll need to use Compat.UTF8String. I've just fixed all the other deprecations on 0.5.
In 0.5 I had to adapt this further, because the comparison
chr != 0xfffd is deprecated. However, with
UInt8(chr) != 0xfffd an InexactError() is thrown whenever the
character does not fit in a UInt8, so I wrapped it in try-catch.
I was also not sure whether stepping the index with i += 1 was OK before,
so I switched to nextind(s,i).
function remove_corrupt_utf8(s::AbstractString)
    r = zeros(UInt8, endof(s)+1)
    i = 1
    for chr in s
        try
            r[i] = (UInt8(chr) != 0xfffd) ? chr : ' '
        catch
            r[i] = ' '
        end
        i = nextind(s,i)
    end
    return Compat.UTF8String(r)
end
Seems reasonable?
r[i] = (UInt8(chr) != 0xfffd) ? chr : ' '
Not all Unicode characters fit in a UInt8. I think the line above will lose every non-ASCII character in the string.
I'd use something like this:
function remove_corrupt_utf8(s::AbstractString)
    r = IOBuffer()
    i = 1
    for chr in s
        if chr != Char(0xfffd)
            write(r, chr)
        end
    end
    return takebuf_string(r)
end
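For readers on Julia 1.x, where takebuf_string is no longer available, the same IOBuffer approach might look roughly like the sketch below. The function name remove_replacement_char is hypothetical and chosen only for illustration. Like the version above, it drops U+FFFD instead of substituting a space.

```julia
# Sketch only: an assumed Julia 1.x port of the IOBuffer version above.
# `remove_replacement_char` is a hypothetical name, not part of TextAnalysis.jl.
function remove_replacement_char(s::AbstractString)
    io = IOBuffer()
    for chr in s
        # Skip the Unicode replacement character U+FFFD, copy everything else.
        if chr != '\ufffd'
            write(io, chr)
        end
    end
    # String(take!(io)) replaces the old takebuf_string(io).
    return String(take!(io))
end

# Build the input from parts so the escape sequence is unambiguous.
remove_replacement_char(string("a", '\ufffd', "b"))   # returns "ab"
```

Writing each Char to the buffer keeps multi-byte characters intact, which avoids the UInt8 truncation discussed earlier in the thread.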
Are there any tests for this?
Are there any updates/resolutions on this?
This should work on Julia 1.0 and later with an implementation like:
function remove_corrupt_utf8(s::AbstractString)
    return map(x->isvalid(x) ? x : ' ', s)
end
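Regarding the earlier question about tests, a minimal check of the one-liner could look like the following sketch. It assumes Julia 1.x and the standard Test library. The input string is constructed here for illustration and is not taken from the package's test suite.

```julia
using Test

# Same one-liner as proposed above: replace every invalid character with a space.
remove_corrupt_utf8(s::AbstractString) = map(x -> isvalid(x) ? x : ' ', s)

# A stray 0xff byte is never valid in UTF-8, so this string is corrupt.
corrupt = String(UInt8[0x61, 0xff, 0x62])   # 'a', invalid byte, 'b'

@test !isvalid(corrupt)
@test remove_corrupt_utf8(corrupt) == "a b"
# Valid non-ASCII text should pass through unchanged.
@test remove_corrupt_utf8("héllo") == "héllo"
```

Note that isvalid checks encoding validity, so a string that already contains the replacement character U+FFFD (a valid code point) would be left as-is by this version, unlike the earlier variants that compared against 0xfffd.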