TextAnalysis.jl

StringIndexError when trying to create a StringDocument based on a UTF8 string

Open alexzandros opened this issue 4 years ago • 2 comments

I'm trying to create a StringDocument from a string that contains UTF-8 characters, and all I'm getting is a StringIndexError.

My code is as follows:

str = "Lo que tengamos que hacer, apoyar, enteegar el ❤️ y el alma por nuestro país. Ivan es el Man. 👏👏👏#Duquepresidente https://t.co/Dr1LdTa5yQ"
sd = StringDocument(str)

And I get the following error:

Error showing value of type StringDocument{String}:
ERROR: StringIndexError: invalid index [50], valid nearby indices [48]=>'❤', [51]=>'️'

Followed by a stack trace.
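
For context, the message above is what Base Julia reports whenever a `String` is indexed at a byte offset that falls inside a multi-byte UTF-8 character. The sketch below reproduces that class of error without TextAnalysis and shows the Base functions usually used to step between valid indices. The sample string `s` is an assumption chosen for illustration; it is not the string from this report.

```julia
# Julia String indices are byte offsets, not character counts.
# "\u2764\ufe0f" is the heart emoji followed by a variation selector;
# each of the two code points takes 3 bytes in UTF-8.
s = "a\u2764\ufe0fb"

println(length(s))       # number of characters
println(ncodeunits(s))   # number of bytes

# Only the first byte of each character is a valid index.
println(collect(eachindex(s)))

# Indexing inside a multi-byte character raises StringIndexError.
threw = try
    s[4]
    false
catch err
    err isa StringIndexError
end
println(threw)

# Safe ways to move around: never do arithmetic on indices directly.
println(isvalid(s, 4))   # false: byte 4 is in the middle of a character
println(thisind(s, 4))   # start of the character that contains byte 4
println(nextind(s, 2))   # start of the character after the one at index 2
println(prevind(s, 8))   # start of the character before the one at index 8
```

As a rule of thumb, iterate with `eachindex(s)` or over the characters themselves, and use `nextind`/`prevind` instead of `i + 1`/`i - 1` when moving through a string.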

So, what is the best practice for working with UTF-8 strings?

Thanks in advance.

alexzandros, Apr 26 '21 19:04

Can you paste the stack trace you saw? Looks like a bug on our side.

aviks, May 05 '21 11:05

I also experienced the same issue. The text in question contains "Less likely working with code I don’t like", and the stack trace is:

ERROR: LoadError: StringIndexError: invalid index [38], valid nearby indices [36]=>'’', [39]=>'t'
Stacktrace:
  [1] string_index_err(s::String, i::Int64)
    @ Base ./strings/string.jl:12
  [2] SubString{String}(s::String, i::Int64, j::Int64)
    @ Base ./strings/substring.jl:32
  [3] SubString
    @ ./strings/substring.jl:38 [inlined]
  [4] SubString
    @ ./strings/substring.jl:44 [inlined]
  [5] remove_patterns(s::SubString{String}, rex::Regex)
    @ TextAnalysis ~/.julia/packages/TextAnalysis/B0QxG/src/preprocessing.jl:486
  [6] remove_patterns!
    @ ~/.julia/packages/TextAnalysis/B0QxG/src/preprocessing.jl:508 [inlined]
  [7] remove_patterns!(crps::Corpus{StringDocument{SubString{String}}}, rex::Regex)
    @ TextAnalysis ~/.julia/packages/TextAnalysis/B0QxG/src/preprocessing.jl:534
  [8] prepare!(crps::Corpus{StringDocument{SubString{String}}}, flags::UInt32; skip_patterns::Set{AbstractString}, skip_words::Set{AbstractString})
    @ TextAnalysis ~/.julia/packages/TextAnalysis/B0QxG/src/preprocessing.jl:415
  [9] prepare!
    @ ~/.julia/packages/TextAnalysis/B0QxG/src/preprocessing.jl:406 [inlined]
 [10] summarize(d::StringDocument{String}; ns::Int64)
    @ TextAnalysis ~/.julia/packages/TextAnalysis/B0QxG/src/summarizer.jl:22
 [11] main()...
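
The trace shows the error being raised by the `SubString` constructor, which rejects an end index that lands inside a multi-byte character such as `’`. The sketch below illustrates that failure mode in plain Base Julia. It is not the TextAnalysis implementation: `text_before_unsafe` and `text_before_safe` are hypothetical helpers, and the sample string `t` is an assumption.

```julia
# Hypothetical helpers, not part of TextAnalysis.
# Both try to return the text that precedes the character at byte index `i`.
text_before_unsafe(s, i) = SubString(s, 1, i - 1)          # byte arithmetic
text_before_safe(s, i)   = SubString(s, 1, prevind(s, i))  # character-aware

# "\u2019" is the right single quotation mark, 3 bytes in UTF-8.
t = "don\u2019t"
i = findlast('t', t)     # byte index of the final 't'

# i - 1 points into the middle of the quotation mark, so SubString throws.
threw = try
    text_before_unsafe(t, i)
    false
catch err
    err isa StringIndexError
end
println(threw)

# prevind returns the start of the preceding character, which is a valid index.
println(text_before_safe(t, i))
```

If byte offsets come from an external source (a regex match, a tokenizer), checking them with `isvalid(s, i)` or snapping them with `thisind(s, i)` before slicing avoids this error.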

segunolulana, Jun 21 '22 06:06

Not reproducible with Julia 1.9 and TextAnalysis 0.8.

rssdev10, Oct 27 '23 16:10