ArityException with cookbook sample 4
- [x] I have read through the quick start and installation sections of the README.
Info

| Info | Value |
| --- | --- |
| Operating System | macOS Big Sur 11.2.1 |
| Geni Version | 0.0.38 |
| Spark Java lib Version | org.apache.spark/spark-XXX_2.12 "3.1.0" |
| JDK | openjdk version "1.8.0_282" |
| Spark Version | 3.0.2 |
Problem / Steps to reproduce
Standard `lein new geni …`, the bitnami/spark 3.0.2 Docker image, then the code from geni cookbook chapter 4.
The following code from cookbook example 4 fails with an AnalysisException:
```clojure
(def null-counts
  (-> raw-weather-mar-2012
      (g/agg (->> (g/column-names raw-weather-mar-2012)
                  (map #(vector % (g/null-count %)))
                  (into {})))
      (g/first)))
```
The exception is:
```
Execution error (AnalysisException) at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt/failAnalysis (package.scala:42).
cannot resolve '`Precip. Amount (mm)`' given input columns: [Climate ID, Date/Time (LST), Day, Dew Point Temp (°C), Dew Point Temp Flag, Hmdx, Hmdx Flag, Latitude (y), Longitude (x), Month, Precip. Amount (mm), Precip. Amount Flag, Rel Hum (%), Rel Hum Flag, Station Name, Stn Press (kPa), Stn Press Flag, Temp (°C), Temp Flag, Time (LST), Visibility (km), Visibility Flag, Weather, Wind Chill, Wind Chill Flag, Wind Dir (10s deg), Wind Dir Flag, Wind Spd (km/h), Wind Spd Flag, Year];
'Aggregate [sum(cast(cast(isnull(Latitude (y)#13272) as int) as bigint)) AS Latitude (y)#16363L, sum(cast(cast(isnull(Stn Press Flag#13295) as int) as bigint)) AS Stn Press Flag#16364L, sum(cast(cast(isnull(Temp (°C)#13280) as int) as bigint)) AS Temp (°C)#16365L, sum(cast(cast(isnull(Wind Spd (km/h)#13290) as int) as bigint)) AS Wind Spd (km/h)#16366L, sum(cast(cast(isnull(Rel Hum Flag#13285) as int) as bigint)) AS Rel Hum Flag#16367L, sum(cast(cast(isnull(Date/Time (LST)#13275) as int) as bigint)) AS Date/Time (LST)#16368L, sum(cast(cast(isnull(Visibility Flag#13293) as int) as bigint)) AS Visibility Flag#16369L, sum(cast(cast(isnull(Visibility (km)#13292) as int) as bigint)) AS Visibility (km)#16370L, sum(cast(isnull('Precip. Amount (mm)) as int)) AS Precip. Amount (mm)#16371, sum(cast(cast(isnull(Dew Point Temp Flag#13283) as int) as bigint)) AS Dew Point Temp Flag#16372L, sum(cast(isnull('Precip. Amount Flag) as int)) AS Precip. Amount Flag#16373, sum(cast(cast(isnull(Station Name#13273) as int) as bigint)) AS Station Name#16374L, sum(cast(cast(isnull(Wind Chill Flag#13299) as int) as bigint)) AS Wind Chill Flag#16375L, sum(cast(cast(isnull(Longitude (x)#13271) as int) as bigint)) AS Longitude (x)#16376L, sum(cast(cast(isnull(Time (LST)#13279) as int) as bigint)) AS Time (LST)#16377L, sum(cast(cast(isnull(Dew Point Temp (°C)#13282) as int) as bigint)) AS Dew Point Temp (°C)#16378L, sum(cast(cast(isnull(Rel Hum (%)#13284) as int) as bigint)) AS Rel Hum (%)#16379L, sum(cast(cast(isnull(Wind Dir Flag#13289) as int) as bigint)) AS Wind Dir Flag#16380L, sum(cast(cast(isnull(Climate ID#13274) as int) as bigint)) AS Climate ID#16381L, sum(cast(cast(isnull(Wind Dir (10s deg)#13288) as int) as bigint)) AS Wind Dir (10s deg)#16382L, sum(cast(cast(isnull(Stn Press (kPa)#13294) as int) as bigint)) AS Stn Press (kPa)#16383L, sum(cast(cast(isnull(Year#13276) as int) as bigint)) AS Year#16384L, sum(cast(cast(isnull(Temp Flag#13281) as int) as bigint)) AS Temp Flag#16385L, sum(cast(cast(isnull(Hmdx#13296) as int) as bigint)) AS Hmdx#16386L, ... 6 more fields]
+- Project [Longitude (x)#13211 AS Longitude (x)#13271, Latitude (y)#13212 AS Latitude (y)#13272, Station Name#13213 AS Station Name#13273, Climate ID#13214 AS Climate ID#13274, Date/Time (LST)#13215 AS Date/Time (LST)#13275, Year#13216 AS Year#13276, Month#13217 AS Month#13277, Day#13218 AS Day#13278, Time (LST)#13219 AS Time (LST)#13279, Temp (°C)#13220 AS Temp (°C)#13280, Temp Flag#13221 AS Temp Flag#13281, Dew Point Temp (°C)#13222 AS Dew Point Temp (°C)#13282, Dew Point Temp Flag#13223 AS Dew Point Temp Flag#13283, Rel Hum (%)#13224 AS Rel Hum (%)#13284, Rel Hum Flag#13225 AS Rel Hum Flag#13285, Precip. Amount (mm)#13226 AS Precip. Amount (mm)#13286, Precip. Amount Flag#13227 AS Precip. Amount Flag#13287, Wind Dir (10s deg)#13228 AS Wind Dir (10s deg)#13288, Wind Dir Flag#13229 AS Wind Dir Flag#13289, Wind Spd (km/h)#13230 AS Wind Spd (km/h)#13290, Wind Spd Flag#13231 AS Wind Spd Flag#13291, Visibility (km)#13232 AS Visibility (km)#13292, Visibility Flag#13233 AS Visibility Flag#13293, Stn Press (kPa)#13234 AS Stn Press (kPa)#13294, ... 6 more fields]
+- Relation[Longitude (x)#13211,Latitude (y)#13212,Station Name#13213,Climate ID#13214,Date/Time (LST)#13215,Year#13216,Month#13217,Day#13218,Time (LST)#13219,Temp (°C)#13220,Temp Flag#13221,Dew Point Temp (°C)#13222,Dew Point Temp Flag#13223,Rel Hum (%)#13224,Rel Hum Flag#13225,Precip. Amount (mm)#13226,Precip. Amount Flag#13227,Wind Dir (10s deg)#13228,Wind Dir Flag#13229,Wind Spd (km/h)#13230,Wind Spd Flag#13231,Visibility (km)#13232,Visibility Flag#13233,Stn Press (kPa)#13234,... 6 more fields] csv
```
A manual select with the column named "Precip. Amount (mm)" does not work either. It seems that the names get backticks around them internally.
I tried to rename all columns with:

```clojure
(g/to-df weather-data
         "Longitude (x)" "Latitude (y)" "Station Name" "Climate ID" "Date/Time (LST)" "Year" "Month" "Day"
         "Time (LST)" "Temp (°C)" "Temp Flag" "Dew Point Temp (°C)" "Dew Point Temp Flag" "Rel Hum (%)"
         "Rel Hum Flag" "Precip. Amount (mm)" "Precip. Amount Flag" "Wind Dir (10s deg)" "Wind Dir Flag"
         "Wind Spd (km/h)" "Wind Spd Flag" "Visibility (km)" "Visibility Flag" "Stn Press (kPa)" "Stn Press Flag"
         "Hmdx" "Hmdx Flag" "Wind Chill" "Wind Chill Flag" "Weather")
```
but the problem persists; the backticks are still there.
```clojure
;; Crashes:
(g/select raw-weather-mar-2012 "Precip. Amount (mm)")

;; Works:
(g/select raw-weather-mar-2012 "`Precip. Amount (mm)`")

;; Yields "Precip. Amount (mm)" without backticks:
(g/column-names (g/select raw-weather-mar-2012 "`Precip. Amount (mm)`"))
```
This led me to believe that there is some issue in geni or Spark with these column names.
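Based on this observation, a generic workaround would be to backtick-escape every name before selecting. A minimal sketch; the helper name is illustrative, not part of geni:

```clojure
;; Sketch: backtick-escape all column names before selecting.
;; select-escaped is an illustrative helper, not a geni function.
(defn select-escaped [dataset]
  (apply g/select dataset
         (map #(str "`" % "`") (g/column-names dataset))))
```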
Huh... That's very odd... Thank you for flagging this up! I've added this to my TODO list. I think I'll have time to look into it next week. I hope that's okay!
Hi Anthony,
thank you for your quick reply! I figured I ran into this due to a small glitch in Cookbook Chapter 4, where the initial CSV is read without {:kebab-columns true}. Just adding that "fixed" it and is also consistent with the chapter's sample output. However, it is still interesting why it does not work without the kebab special-character mangling; I believe it should.
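For reference, the fix amounts to passing the option on the read call (a minimal sketch; the path follows the cookbook's data layout):

```clojure
;; Reading with {:kebab-columns true} normalizes names like
;; "Precip. Amount (mm)" to "precip-amount-mm", so the dots never
;; reach Spark's column resolver.
(def raw-weather-mar-2012
  (g/read-csv! "data/cookbook/weather/weather-2012-3.csv"
               {:kebab-columns true}))
```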
I just started with Spark and geni the other day, and I can already say it is one of the best APIs I have used in years, congrats! Such a lot of fun! I will for sure continue to use it. It also seems to be quite fast, even on my local machine.
Hope you find the cause of this quickly, have a good day!
Ciao
...Jochen
Hi Jochen, thank you for the kind words!! Glad you’re enjoying it. Please let me know if you have any feature requests.
On this issue, I think this is related: https://mungingdata.com/pyspark/avoid-dots-periods-column-names/
Basically, Spark doesn't like column names with dots. One thing we can do is to auto-escape them, but I'm not sure that is the best solution, because you lose the Spark correspondence. I'm leaning towards having a 'safe mode' that is on by default in the read functions, which scans the column names for dots and replaces them with underscores 🤔
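A minimal sketch of that idea, as a plain rename pass after reading (the helper name is hypothetical, not geni API):

```clojure
(require '[clojure.string :as str])

;; Hypothetical 'safe mode' pass: replace dots in all column names
;; with underscores. dedot-columns is an illustrative name only.
(defn dedot-columns [dataset]
  (apply g/to-df dataset
         (map #(str/replace % "." "_") (g/column-names dataset))))
```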
Hi Anthony...
thank you for digging into this! Yes, I think you are right; it looks like it is indeed the described issue.
My feeling is to just document this in the geni docs and recommend kebab-case in these cases. It fixes the Cookbook example very well :-).
Ciao
...Jochen
Hi Anthony...
just a quick follow-up on our recent discussion regarding the dot/backtick column-name issue.
I read up on this a little in the Spark SQL docs and found Spark to be really restrictive: regular identifiers allow just letters (a-zA-Z), digits, and underscores. https://spark.apache.org/docs/latest/sql-ref-identifier.html
So your :kebab-columns is doing it perfectly for Clojure, I think. An issue could remain for people exporting Parquet files for use in other languages that like underscores better.
Instead of :kebab-columns true, one could use something like :snake-columns true. Using {:convert-column-names :kebab} or {:convert-column-names :snake} might be a bit more elegant, but, well, that would change the API :-).
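A rough sketch of how such an option could dispatch to a rename function (purely hypothetical, not geni's actual API):

```clojure
(require 'camel-snake-kebab.core)

;; Hypothetical dispatch for a :convert-column-names read option.
(defn column-rename-fn [options]
  (case (:convert-column-names options)
    :kebab camel-snake-kebab.core/->kebab-case
    :snake camel-snake-kebab.core/->snake_case
    identity))
```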
I have appended my REPL code (reusing your data-sources code) in case you would like to play with it!
Have a good day, Anthony!
Ciao
...Jochen
```clojure
(require '[zero-one.geni.core.data-sources :as ds]
         '[zero-one.geni.interop :as interop]
         'camel-snake-kebab.core)

(defn ->normalized-columns
  "Returns a new Dataset with all columns renamed using the passed rename-fn."
  [dataset rename-fn]
  (let [remove-punctuations #'ds/remove-punctuations ; access privates
        deaccent            #'ds/deaccent
        new-columns         (->> dataset
                                 .columns
                                 (map remove-punctuations)
                                 (map deaccent)
                                 (map rename-fn))]
    (.toDF dataset (interop/->scala-seq new-columns))))

(comment
  ;; plain
  (-> (g/read-csv! "data/cookbook/weather/weather-2012-3.csv")
      (g/select "Precip. Amount (mm)"))

  ;; kebab-columns
  (-> (g/read-csv! "data/cookbook/weather/weather-2012-3.csv")
      (->normalized-columns camel-snake-kebab.core/->kebab-case)
      (g/select "precip-amount-mm"))

  ;; snake-columns
  (-> (g/read-csv! "data/cookbook/weather/weather-2012-3.csv")
      (->normalized-columns camel-snake-kebab.core/->snake_case)
      (g/select "precip_amount_mm")))
```