explorer icon indicating copy to clipboard operation
explorer copied to clipboard

Combining string columns

Open tomekowal opened this issue 2 years ago • 2 comments

Hi! Thanks for the great library!

Is there a way to combine two string columns e.g. by concatenating them? My use case is this.

  Polars[24 x 3]
  House number string ["15a", "15a", "15b", "15b", "17a", ...]
  Flat number string ["1", "2", "1", "2", "1", ...]
  Some data string ["47,005", "51,340", "22,710", "74,877", "28,398", ...]
>

I'd like to combine House number and flat number to one column to make a chart in liveview. E.g. instead of House number: "15a", Flat number: "1", I'd like to have Address: "15a1".

I believe DataFrame.mutate should help here. One of the examples in the docs is: Explorer.DataFrame.mutate(df, c: &Explorer.Series.add(&1["a"], &1["b"]))

However, no similar functions are working on strings.

tomekowal avatar Jul 21 '22 21:07 tomekowal

Hi @tomekowal :wave:

I think there no easy way to do this today without traversing the series. Even so, I think this is not an expensive task. Here is how I would do:

alias Explorer.DataFrame

df = DataFrame.new(%{"House" => ["15a", "15a", "15b", "15b", "17a"], "Flat" => ["1", "2", "1", "2", "1"]})

houses = df["Houses"]
flats = df["Flat"]

# Here we traverse until the end of the series and concatenate the strings
addresses = for idx <- 0..Explorer.Series.size(houses) - 1, do: houses[idx] <> flats[idx]

new_df = DataFrame.mutate(df, %{"Address" => addresses})

The result will be something like this:

#Explorer.DataFrame<
  Polars[5 x 3]
  Flat string ["1", "2", "1", "2", "1"]
  House string ["15a", "15a", "15b", "15b", "17a"]
  Address string ["15a1", "15a2", "15b1", "15b2", "17a1"]
>

There is also a way to traverse only one series using the Explorer.Series.transform/2 function. This won´t work for you now, but in the future we could accept a function of arity 2 and pass down the index of the row, so it would be easier to perform this kind of task. WDYT of this idea, @cigrainger?

philss avatar Jul 21 '22 22:07 philss

I'd be alright with that but there are some string functions we don't expose from polars, and this is one of them. There are others too like uppercase, lowercase, regex matches, etc. Probably worth me taking a run at exposing some of those. I think they'll be pretty widely supported with sql backend etc as well. https://docs.rs/polars/latest/polars/prelude/trait.Utf8NameSpaceImpl.html#method.concat

cigrainger avatar Jul 21 '22 23:07 cigrainger

We already have some basic operations on strings and we should definitely add this one. The only question is the name. We already have concat for concatenating series. We have some options:

  1. Rename the usage of concat in the codebase to append and use concat for strings
  2. Find a different name to concatenate strings, such as using <> directly, or string_concat or similar

Thoughts @cigrainger?

josevalim avatar Jan 11 '23 10:01 josevalim

Just giving my two cents: I prefer the first option. We also could use <> as an alias, specially in the Explorer.Query context. WDYT?

philss avatar Feb 08 '23 19:02 philss

I talked to @cigrainger and we decided to revert to bind_columns and bind_rows, so we can free concat for string concatenation.

josevalim avatar Feb 20 '23 14:02 josevalim

Oh, we also have Series.concat and we need to find a replacement for it. Perhaps Series.bind for consistency? Alternatively, Series.append.

josevalim avatar Feb 21 '23 09:02 josevalim

Ok, I have a suggestion, what if we introduce Series.join([list_of_series], sep \\ "") as our string concatenation?

Benefits:

  1. We don't have to replace concat
  2. It provides other functionality such as joining with a separator
  3. We can still have <> in queries

josevalim avatar Feb 21 '23 17:02 josevalim

@josevalim I like the idea of Series.join/2! :+1:

philss avatar Feb 21 '23 19:02 philss

Wouldn't it be too similar to DataFrame.join/2 and perform a completely non-related operation? I'm afraid people would instantly associate Series.join/2 with a merge operation.

kimjoaoun avatar Feb 22 '23 19:02 kimjoaoun

@kimjoaoun ah, the overlap with DataFrame.join is a good concern. We could remap DataFrame.join into left_join, right_join, inner_join, but it may not alleviate the concern. :S

josevalim avatar Feb 22 '23 20:02 josevalim

Here is another idea: Series.format(list). The list may be made of regular strings OR series. If the series is not a string, it will be cast to a string. This means we can do stuff like this:

mutate(
  date: format([df["year"], "/", df["month"], "/", df["day"]])
)

It is a more flexible version of join. The goal is to remember string formatting in other languages (although Elixir itself does not really have this feature). In Explorer.Query, we can still support <> or even support interpolation (but I would hold on that front a bit).

Thoughts?

josevalim avatar Feb 22 '23 20:02 josevalim

That's perfect, I think I prefer the Series.format option. It is similar to Rust's format! macro, which I greatly like and it is also clearer to the end user IMO.

kimjoaoun avatar Feb 22 '23 20:02 kimjoaoun

I like the idea of Series.format/1 too!

I think it's slightly more verbose then join([list], "sep"), but I think in the most applications this is better, because it is flexible.

philss avatar Feb 22 '23 21:02 philss

I'm comfortable with Series.format/1. It's more flexible at the cost of being more verbose. I think that's fine.

cigrainger avatar Feb 23 '23 11:02 cigrainger

I thought it would be Series.format/2 --> Series.format([s1, s2], sep \\ " "). A sep argument is something people would expect the function to have.

kimjoaoun avatar Feb 23 '23 17:02 kimjoaoun

I don’t think format needs a separator. I don’t think Python nor Rust nor Erlang have one.

josevalim avatar Feb 23 '23 17:02 josevalim