TextAnalysis.jl
remove_corrupt_utf8() not working
The function remove_corrupt_utf8() does not work under Julia v0.4.6.
The problem is the line zeros(Char, endof(s)+1), where it complains that zero is not defined for type Char. When using UInt8 instead I could make it run without error, but please check if this does what it is supposed to do.
function remove_corrupt_utf8(s::AbstractString)
    r = zeros(UInt8, endof(s)+1)
    i = 0
    for chr in s
        i += 1
        r[i] = (chr != 0xfffd) ? chr : ' '
    end
    return utf8(r)
end
Note that in the return statement I also got rid of the CharString().
If this is ok I can make another pull request.
Cheers, Andre
Sure, thanks. Looks OK. Note that utf8 is deprecated in 0.5, you'll need to use Compat.UTF8String. I've just fixed all the other deprecations on 0.5.
So in 0.5 I had to adapt further, because chr != 0xfffd is deprecated. However, UInt8(chr) != 0xfffd throws an InexactError() if the character does not fit in UInt8, so I wrapped it in try-catch.
I was also not sure whether the index stepping with i+1 was OK before, so I put in nextind(s,i).
function remove_corrupt_utf8(s::AbstractString)
    r = zeros(UInt8, endof(s)+1)
    i = 1
    for chr in s
        try
            r[i] = (UInt8(chr) != 0xfffd) ? chr : ' '
        catch
            r[i] = ' '
        end
        i = nextind(s,i)
    end
    return Compat.UTF8String(r)
end
Seems reasonable?
r[i] = (UInt8(chr) != 0xfffd) ? chr : ' '
Not all Unicode characters will fit in a UInt8. The line above will lose all non-ASCII characters from the string, I think.
I'd use something like this:
function remove_corrupt_utf8(s::AbstractString)
    r = IOBuffer()
    i = 1
    for chr in s
        if chr != Char(0xfffd)
            write(r, chr)
        end
    end
    return takebuf_string(r)
end
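For readers on Julia 1.x, where takebuf_string no longer exists, the IOBuffer approach above might be ported roughly as follows. This is only a sketch, not the package's code: the name strip_corrupt_utf8 is hypothetical, and it assumes that "corrupt" covers both a literal U+FFFD and any character for which isvalid returns false (which is how invalid bytes show up when iterating a String on 1.x). Like the version above, it drops the offending characters instead of replacing them with a space.

```julia
# Hypothetical Julia 1.x port of the IOBuffer approach (not the package's code).
function strip_corrupt_utf8(s::AbstractString)
    buf = IOBuffer()
    for chr in s
        # Skip malformed characters and the U+FFFD replacement character.
        if isvalid(chr) && chr != '\ufffd'
            write(buf, chr)
        end
    end
    return String(take!(buf))
end

# One invalid byte (0xff) sitting between valid text: 'a', 0xff, 'é'.
s = String(UInt8[0x61, 0xff, 0xc3, 0xa9])
println(strip_corrupt_utf8(s))
```

Writing each Char to the buffer keeps multi-byte characters intact, which is the point of avoiding the UInt8 array.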
Are there any tests for this?
Are there any updates/resolutions on this?
This should work with Julia > 1.0 and an implementation like:
function remove_corrupt_utf8(s::AbstractString)
    return map(x->isvalid(x) ? x : ' ', s)
end
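A minimal check of this version on Julia 1.x, in answer to the earlier question about tests, could look like the sketch below. The test strings are made up for illustration. It also shows one behavioural difference from the earlier versions: a literal U+FFFD is itself a valid character, so this implementation leaves it in place.

```julia
using Test

# Same definition as above, repeated so the snippet runs on its own.
function remove_corrupt_utf8(s::AbstractString)
    return map(x -> isvalid(x) ? x : ' ', s)
end

@testset "remove_corrupt_utf8" begin
    # 0xff never appears in well-formed UTF-8, so it iterates as an invalid Char.
    corrupt = String(UInt8[0x61, 0xff, 0x62])
    @test !isvalid(corrupt)
    @test remove_corrupt_utf8(corrupt) == "a b"

    # Valid non-ASCII text passes through unchanged.
    @test remove_corrupt_utf8("naïve €") == "naïve €"

    # U+FFFD is a valid character and is therefore kept.
    @test remove_corrupt_utf8("a\ufffdb") == "a\ufffdb"
end
```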