SparseArrays.jl icon indicating copy to clipboard operation
SparseArrays.jl copied to clipboard

Sparse array of string types

Open orenbenkiki opened this issue 1 year ago • 17 comments

I tried to create a sparse vector of strings, and it failed due to zero(String) not being defined. I'm using the empty string "" as the zero value. Any chance of enhancing SparseArrays to support this?

orenbenkiki avatar Feb 15 '24 07:02 orenbenkiki

In theory the empty string should be one(String)

julia> one(String)
""

putianyi889 avatar Mar 02 '24 12:03 putianyi889

Interesting...

I think the empty string still makes sense in sparse string vectors though.

orenbenkiki avatar Mar 02 '24 14:03 orenbenkiki

You can always add these at your own risk as I have not idea what they will break.

julia> Base.zero(::String) = ""
julia> Base.iszero(x::String) = length(x) == 0
julia> sparse(["123", "", "234"])
3-element SparseVector{String, Int64} with 2 stored entries:
  [1]  =  "123"
  [3]  =  "234"

IMHO, it might make sense to have an internal zero and iszero so that it's possible to overwrite them for sparse arrays only.

SobhanMP avatar Mar 03 '24 04:03 SobhanMP

If I were writing an application, sure. But I'm writing a library, so changing Base behavior this way is probably not a good idea.

Having SparseArrays.zero and SparseArrays.iszero which by default forward to Base but allow overriding for additional types would be nice.

orenbenkiki avatar Mar 03 '24 10:03 orenbenkiki

I agree.

@ViralBShah @dkarrasch @LilithHafner Does it make sense to you?

SobhanMP avatar Mar 07 '24 03:03 SobhanMP

I'd like to have an overlay structure similar to https://github.com/Jutho/SparseArrayKit.jl/issues/13 . However SparseArrays.jl can't depend on FillArrays.jl and I don't know how to work around nicely.

putianyi889 avatar Mar 07 '24 06:03 putianyi889

Having SparseArrays.zero and SparseArrays.iszero which by default forward to Base but allow overriding for additional types would be nice.

That sounds very sensible. It provides more flexibility at seemingly no cost. I wonder though what applications one has in mind, if one defines zero as something that is not the neutral element w.r.t. addition. Is it just about storing sparse rectangular data in the sense that "most" elements equal a fixed ("trivial") element? Or does some sort of algebra also come into play?

dkarrasch avatar Mar 07 '24 09:03 dkarrasch

First, zero (whether for Base or specialized in SparseArrays) works based on the type of the array element. This makes it impractical for using it for the "most common value" of an array.

Second, SparseArrays code assumes that the zero value is zero w.r.t addition and multiplication, and breaking this assumptions would require modifying it in many places - not to mention, breaking client code which also makes this assumption.

Having CompressedArrays working similarly to SparseArrays and allow for an arbitrary "most common" value per array instance would have been awesome (and would have reduced the need for things like nullable/masked arrays). However, somehow this idea never caught on...

orenbenkiki avatar Mar 07 '24 11:03 orenbenkiki

The problem is of abstractions. The SparseMatrix data type and its numeric types are designed for numerical sparse matrix linear algebra calculations. So extending to non-numeric types is always going to feel unnatural. It would be nice to get as much as we can get for free within this design - I do get that.

For strings, isn't it better to use a different data structure like a dictionary?

ViralBShah avatar Mar 07 '24 14:03 ViralBShah

There is a proposal floating around somewhere to support sparse arrays with custom "zero" values. Implementing that (either here or in a different package) would make a sparse array of strings fit naturally into the abstraction.

LilithHafner avatar Mar 07 '24 15:03 LilithHafner

I don't like this

SparseArrays.zero(::Type{String}) == ""

because SparseArrays assumes zero * x == zero, and so a sparse array of strings will behave oddly under concatenation. A structural empty string concatenated with anything result in the empty string while stored empty strings concatenated with anything (x) will result in that thing (x).

LilithHafner avatar Mar 07 '24 15:03 LilithHafner

@ViralBShah:

It would be nice to get as much as we can get for free within this design - I do get that.

That's the idea.

For strings, isn't it better to use a different data structure like a dictionary?

Depends on the use case (of course). In my case, definitely not. Vectors have properties like length, and can be sliced, and masked, and so on. NamedArrays makes them dictionary-like, if that's needed.

@LilithHafner:

SparseArrays assumes zero * x == zero, and so a sparse array of strings will behave oddly under concatenation.

(As an aside, I would have been happier if string concatenation was considered to be addition rather than multiplication, commutativity be damned. E.g. sum(strings) would have worked for concatenating all the strings in a vector which makes sense. But that's just me.)

In what context does the above matter? Note that you can't add strings so trying to do linear algebra on strings would fail anyway.

There is a proposal floating around somewhere to support sparse arrays with custom "zero" values. Implementing that (either here or in a different package) would make a sparse array of strings fit naturally into the abstraction.

That would be nice.

orenbenkiki avatar Mar 07 '24 15:03 orenbenkiki

In what context does the above matter?

When performing broadcasted .*:

julia> x = sprandstring(10, .1)
10-element SparseVector{String, Int64} with 2 stored entries:
  [8 ]  =  "VlnHeUDF"
  [10]  =  "LumRtFcf"

julia> y = sprandstring(10, .1)
10-element SparseVector{String, Int64} with 1 stored entry:
  [7]  =  "SlumPG6z"

julia> julia> x .* y
10-element SparseVector{String, Int64} with 0 stored entries

That looks like a bug, and it would be hard to avoid without adding special cases to support the zero element not being a multiplicative zero.

(note: this example does not run, it is extrapolated from the results of sprand)

LilithHafner avatar Mar 07 '24 16:03 LilithHafner

I see - good point, this wouldn't be good :-(

(If string concatenation was addition, this wouldn't have been a problem...)

orenbenkiki avatar Mar 07 '24 16:03 orenbenkiki

There's a lot to consider when deciding whether string concatenation is +, *, or something else. When/if I develop a new language or stdlib spec from scratch this sort of thing will be worth considering. Within Julia, though, it's way to late to reconsider. You're right though. This would "just work" if that language design decision were different.

LilithHafner avatar Mar 07 '24 16:03 LilithHafner

Maybe the solution is adding a wrapper that does the job? something like

struct Wrap
    s::Union{Nothing,String}
end
Wrap()=Wrap(nothing)
Base.zero(::Wrap) = Wrap()
Base.iszero(x::Wrap) = x.s === nothing
sparse(Wrap.(["123", nothing, "234"]))

SobhanMP avatar Mar 09 '24 01:03 SobhanMP

Nice workaround, should have thought of that.

It would bubble up to everywhere in the client code (e.g. eltype would return Wrap instead of String) but that might be acceptable overhead.

orenbenkiki avatar Mar 09 '24 08:03 orenbenkiki