TextAnalysis.jl

remove_corrupt_utf8() not working

Open abieler opened this issue 8 years ago • 4 comments

The function remove_corrupt_utf8() does not work under Julia v0.4.6. The problem is the line zeros(Char, endof(s)+1), which fails with a complaint that zero is not defined for type Char. Using UInt8 instead, I could make it run without error, but please check that it still does what it is supposed to do.

function remove_corrupt_utf8(s::AbstractString)
    r = zeros(UInt8, endof(s)+1)
    i = 0
    for chr in s
        i += 1
        r[i] = (chr != 0xfffd) ? chr : ' '
    end
    return utf8(r)
end

Note that I also got rid of the CharString() in the return statement.

If this is OK I can make another pull request.

Cheers, Andre

abieler avatar Sep 02 '16 18:09 abieler

Sure, thanks. Looks OK. Note that utf8 is deprecated in 0.5, you'll need to use Compat.UTF8String. I've just fixed all the other deprecations on 0.5.

aviks avatar Sep 03 '16 23:09 aviks

So in 0.5 I had to adapt further, because chr != 0xfffd is deprecated. However, UInt8(chr) != 0xfffd throws an InexactError() if the character does not fit in a UInt8, so I wrapped it in a try-catch.

I was also not sure whether stepping the index with i+1 was OK before, so I put in nextind(s,i).
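To illustrate why stepping with i+1 and stepping with nextind(s,i) can differ, here is a small sketch, assuming Julia 1.x (where endof has been renamed to lastindex). The sample string and the helper name char_starts are made up for illustration and are not part of TextAnalysis.jl.

```julia
# Sketch for Julia 1.x; the sample string is illustrative only.
s = "aé€"                              # 1-byte, 2-byte, and 3-byte characters

char_count   = length(s)               # number of characters
byte_count   = ncodeunits(s)           # number of UTF-8 code units (bytes)
last_index   = lastindex(s)            # byte index where the last character starts
char_indices = collect(eachindex(s))   # byte indices at which characters start

# Hypothetical helper: walk the string by index.
# nextind skips over continuation bytes, whereas i + 1 would land inside a character.
function char_starts(s::AbstractString)
    idx = Int[]
    i = firstindex(s)
    while i <= lastindex(s)
        push!(idx, i)
        i = nextind(s, i)
    end
    return idx
end
```

The indices returned by nextind are byte positions in s, so they are not necessarily consecutive positions in a separate output buffer.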

function remove_corrupt_utf8(s::AbstractString)
    r = zeros(UInt8, endof(s)+1)
    i = 1
    for chr in s
        try
            r[i] = (UInt8(chr) != 0xfffd) ? chr : ' '
        catch
            r[i] = ' '
        end
        i = nextind(s,i)
    end
    return Compat.UTF8String(r)
end

Seems reasonable?

abieler avatar Oct 08 '16 11:10 abieler

r[i] = (UInt8(chr) != 0xfffd) ? chr : ' '

Not all Unicode characters will fit in a UInt8. The line above will lose all non-ASCII characters from the string, I think.
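A quick way to see the problem, sketched for Julia 1.x (behaviour on 0.5 may differ): the conversion throws for code points above U+00FF, and the comparison against 0xfffd can never be false for a UInt8. The helper name overflows_uint8 and the sample characters are hypothetical, chosen only for this illustration.

```julia
# Sketch for Julia 1.x; helper name and sample characters are illustrative.
# Returns true when converting the character to UInt8 throws InexactError.
function overflows_uint8(c::Char)
    try
        UInt8(c)
        return false
    catch err
        err isa InexactError || rethrow()
        return true
    end
end

overflows_uint8('a')   # U+0061 fits in a UInt8
overflows_uint8('é')   # U+00E9 still fits in a UInt8
overflows_uint8('€')   # U+20AC does not fit

# 0xfffd is a UInt16 literal larger than any UInt8 value,
# so `UInt8(chr) != 0xfffd` holds for every possible byte.
always_true = all(b -> b != 0xfffd, typemin(UInt8):typemax(UInt8))
```

So in the try-catch version, the replacement with ' ' only ever happens through the catch branch, not through the comparison.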

I'd use something like this:

function remove_corrupt_utf8(s::AbstractString)
    r = IOBuffer()
    for chr in s
        if chr != Char(0xfffd)
            write(r, chr)
        end
    end
    return takebuf_string(r)
end
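For readers on Julia 1.x, where takebuf_string no longer exists, a rough port of the same idea could look like the sketch below. It uses String(take!(io)) instead, and the function name drop_replacement_chars is hypothetical, chosen here so it does not clash with the package function. Note that, like the version above, it drops U+FFFD rather than replacing it with a space.

```julia
# Sketch for Julia 1.x; drop_replacement_chars is a hypothetical name.
function drop_replacement_chars(s::AbstractString)
    io = IOBuffer()
    for chr in s
        # Copy every character except U+FFFD (the Unicode replacement character).
        chr != '\ufffd' && write(io, chr)
    end
    # String(take!(io)) is the 1.x replacement for takebuf_string(io).
    return String(take!(io))
end
```

Writing whole Char values to the buffer keeps multi-byte characters intact, which is the point of avoiding the UInt8 conversion.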

Are there any tests for this?

aviks avatar Oct 09 '16 10:10 aviks

Are there any updates/resolutions on this?

mirestrepo avatar Jul 28 '17 20:07 mirestrepo

This should work with Julia > 1.0 and an implementation like:

function remove_corrupt_utf8(s::AbstractString)
    return map(x->isvalid(x) ? x : ' ', s)
end
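A small usage sketch for this version, assuming Julia 1.x. The definition is repeated so the snippet runs on its own, and the sample bytes are made up for illustration. One thing to be aware of: this checks character validity rather than comparing against U+FFFD, so a literal U+FFFD already present in the string is left untouched.

```julia
# Same definition as above, repeated so the snippet is self-contained.
remove_corrupt_utf8(s::AbstractString) = map(x -> isvalid(x) ? x : ' ', s)

# Illustrative input: 'a', a byte that is never valid in UTF-8, then 'b'.
bad = String(UInt8[0x61, 0xff, 0x62])

bad_is_valid     = isvalid(bad)              # the raw string is not valid UTF-8
cleaned          = remove_corrupt_utf8(bad)  # invalid character replaced by a space
cleaned_is_valid = isvalid(cleaned)

# Valid non-ASCII text passes through unchanged.
unchanged = remove_corrupt_utf8("héllo")
```

These checks could serve as a starting point for the tests asked about earlier in the thread.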

rssdev10 avatar Oct 24 '23 23:10 rssdev10