explorer
explorer copied to clipboard
Combining string columns
Hi! Thanks for the great library!
Is there a way to combine two string columns e.g. by concatenating them? My use case is this.
Polars[24 x 3]
House number string ["15a", "15a", "15b", "15b", "17a", ...]
Flat number string ["1", "2", "1", "2", "1", ...]
Some data string ["47,005", "51,340", "22,710", "74,877", "28,398", ...]
>
I'd like to combine House number and flat number to one column to make a chart in liveview. E.g. instead of House number: "15a", Flat number: "1"
, I'd like to have Address: "15a1"
.
I believe DataFrame.mutate
should help here. One of the examples in the docs is:
Explorer.DataFrame.mutate(df, c: &Explorer.Series.add(&1["a"], &1["b"]))
However, no similar functions are working on strings.
Hi @tomekowal :wave:
I think there no easy way to do this today without traversing the series. Even so, I think this is not an expensive task. Here is how I would do:
alias Explorer.DataFrame
df = DataFrame.new(%{"House" => ["15a", "15a", "15b", "15b", "17a"], "Flat" => ["1", "2", "1", "2", "1"]})
houses = df["Houses"]
flats = df["Flat"]
# Here we traverse until the end of the series and concatenate the strings
addresses = for idx <- 0..Explorer.Series.size(houses) - 1, do: houses[idx] <> flats[idx]
new_df = DataFrame.mutate(df, %{"Address" => addresses})
The result will be something like this:
#Explorer.DataFrame<
Polars[5 x 3]
Flat string ["1", "2", "1", "2", "1"]
House string ["15a", "15a", "15b", "15b", "17a"]
Address string ["15a1", "15a2", "15b1", "15b2", "17a1"]
>
There is also a way to traverse only one series using the Explorer.Series.transform/2
function. This won´t work for you now, but in the future we could accept a function of arity 2 and pass down the index of the row, so it would be easier to perform this kind of task. WDYT of this idea, @cigrainger?
I'd be alright with that but there are some string functions we don't expose from polars, and this is one of them. There are others too like uppercase, lowercase, regex matches, etc. Probably worth me taking a run at exposing some of those. I think they'll be pretty widely supported with sql backend etc as well. https://docs.rs/polars/latest/polars/prelude/trait.Utf8NameSpaceImpl.html#method.concat
We already have some basic operations on strings and we should definitely add this one. The only question is the name. We already have concat
for concatenating series. We have some options:
- Rename the usage of
concat
in the codebase toappend
and useconcat
for strings - Find a different name to concatenate strings, such as using
<>
directly, orstring_concat
or similar
Thoughts @cigrainger?
Just giving my two cents: I prefer the first option. We also could use <>
as an alias, specially in the Explorer.Query
context. WDYT?
I talked to @cigrainger and we decided to revert to bind_columns and bind_rows, so we can free concat
for string concatenation.
Oh, we also have Series.concat
and we need to find a replacement for it. Perhaps Series.bind
for consistency? Alternatively, Series.append
.
Ok, I have a suggestion, what if we introduce Series.join([list_of_series], sep \\ "")
as our string concatenation?
Benefits:
- We don't have to replace
concat
- It provides other functionality such as joining with a separator
- We can still have
<>
in queries
@josevalim I like the idea of Series.join/2
! :+1:
Wouldn't it be too similar to DataFrame.join/2
and perform a completely non-related operation? I'm afraid people would instantly associate Series.join/2
with a merge operation.
@kimjoaoun ah, the overlap with DataFrame.join
is a good concern. We could remap DataFrame.join
into left_join
, right_join
, inner_join
, but it may not alleviate the concern. :S
Here is another idea: Series.format(list)
. The list may be made of regular strings OR series. If the series is not a string, it will be cast to a string. This means we can do stuff like this:
mutate(
date: format([df["year"], "/", df["month"], "/", df["day"]])
)
It is a more flexible version of join
. The goal is to remember string formatting in other languages (although Elixir itself does not really have this feature). In Explorer.Query, we can still support <>
or even support interpolation (but I would hold on that front a bit).
Thoughts?
That's perfect, I think I prefer the Series.format
option. It is similar to Rust's format!
macro, which I greatly like and it is also clearer to the end user IMO.
I like the idea of Series.format/1
too!
I think it's slightly more verbose then join([list], "sep")
, but I think in the most applications this is better, because it is flexible.
I'm comfortable with Series.format/1
. It's more flexible at the cost of being more verbose. I think that's fine.
I thought it would be Series.format/2
--> Series.format([s1, s2], sep \\ " ")
. A sep
argument is something people would expect the function to have.
I don’t think format needs a separator. I don’t think Python nor Rust nor Erlang have one.