Functions that are in Spark SQL and not in Scala API to be implemented
We can use this issue to create a list of all the functions that are in Spark SQL, but not in the Scala API for whatever reason.
Here's the list that @nvander1 sent me so we can get started. He already implemented approx_percentile, so we're on our way!
- [x] approx_percentile
- [X] cardinality https://github.com/MrPowers/bebe/pull/24
- [ ] character_length https://github.com/MrPowers/bebe/pull/25 - in the Scala / Python API it's called length
- [ ] char_length - not sure if we need this if we have character_length - in the Scala / Python API it's called length
- [ ] chr https://github.com/MrPowers/bebe/pull/26
- [X] cot https://github.com/MrPowers/bebe/pull/28
- [X] count_if https://github.com/MrPowers/bebe/pull/29
- [ ] count_min_sketch
- [ ] cube - in the Scala API it's an alternative to the groupBy method of Dataset
- [ ] current_database - can be obtained from sparkSession.catalog.currentDatabase (see the sketch after this list)
- [ ] date - called to_date
- [ ] date_part - dayofweek, dayofyear, second, timestamp_seconds, etc.
- [ ] day - dayofmonth
- [ ] decimal - cast(DecimalType)
- [ ] div - /
- [ ] double - cast(DoubleType)
- [X] e
- [ ] elt - not needed, can use regular Array indexing to fetch items
- [ ] every - not needed, can use forall
- [ ] extract - it's a router to dayofweek, dayofyear, second, etc., and the element to extract can't be an expression.
- [ ] find_in_set - not needed, can use Scala functions
- [ ] first_value - alias of first
- [ ] float - cast(FloatType)
- [ ] if - not needed, use when
- [X] ifnull
- [ ] in - can use array_contains
- [X] inline
- [ ] inline_outer
- [ ] input_file_block_length
- [ ] input_file_block_start
- [ ] int - cast(IntegerType)
- [ ] isnotnull - method of column in scala API
- [ ] java_method
- [ ] last_value - alias of last
- [ ] lcase - lower
- [X] left
- [X] like - method of column
- [ ] ln - natural logarithm, same as log in the Scala API
- [X] make_date -
- [ ] make_interval - this one was added to Spark 😎
- [X] make_timestamp
- [ ] max_by
- [ ] min_by
- [ ] mod - alias of %
- [ ] named_struct
- [ ] negative - unary minus, i.e. -col
- [ ] now - current_timestamp
- [ ] nullif
- [ ] nvl - coalesce
- [X] nvl2
- [X] octet_length
- [ ] or - column method | and or
- [X] parse_url
- [X] percentile
- [ ] pi
- [ ] position - locate, but maybe create an alternate locate that accepts all parameters as columns
- [ ] positive
- [ ] power - alias of pow
- [ ] printf - format_string
- [ ] random - alias of rand
- [ ] reflect - not implemented in scala API but is an alias of SQL function java_method
- [ ] replace -
- [X] right -
- [ ] rlike - method of column
- [ ] rollup - in the Scala API it's an alternative to the groupBy method of Dataset
- [X] sentences -
- [ ] sha - alias of sha1
- [ ] shiftleft - shiftLeft
- [ ] shiftright - shiftRight
- [ ] shiftrightunsigned - shiftRightUnsigned
- [ ] sign - doesn't seem like a useful function
- [ ] smallint - can use cast()
- [ ] some - array_exists
- [X] space - returns a string of n spaces, can be done with scala / python
- [X] stack - PR: https://github.com/MrPowers/bebe/pull/21
- [ ] std - is it the same as the stddev_pop function?
- [ ] string - cast(StringType)
- [ ] str_to_map - in Scala / Python it's much better to create a literal from a Map object
- [X] substr - substring
- [ ] timestamp - cast(TimestampType)
- [ ] tinyint - can use cast
- [ ] to_unix_timestamp - to_timestamp
- [ ] typeof - scala / python can check the type from the schema
- [ ] ucase - upper
- [X] uuid
- [ ] version - returns the spark version, can be obtained from the spark session
- [X] weekday
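Here's a minimal sketch of what the existing Scala API equivalents look like for a few of the entries above (the casts, the Column methods, and current_database). The DataFrame and column names are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DecimalType, DoubleType, IntegerType}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("42", "1.5", "abc")).toDF("num", "dec", "word")

// decimal / double / int / string: cast() already covers the SQL
// type-conversion functions, so no new Scala functions are needed.
val casted = df.select(
  col("num").cast(IntegerType).as("as_int"),
  col("dec").cast(DoubleType).as("as_double"),
  col("dec").cast(DecimalType(10, 2)).as("as_decimal")
)

// isnotnull / like: already exposed as Column methods.
val filtered = df.filter(col("word").isNotNull && col("word").like("a%"))

// current_database: available from the catalog, no function needed.
val db = spark.catalog.currentDatabase
```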
A few things I'm not sure about: some of these functions are in the Spark API, but maybe not in the org.apache.spark.sql.functions object.
For example:
- character_length: in functions it's called length
- count_if: can be implemented with the Spark API, but it could be a good addition to simplify (see the sketch after this list)
- cube: if it's the function to aggregate, it's here and I don't see how an external function could improve it.
- date: the definition of this function is a cast to date. There is also the to_date function, which has the same logic as the SQL one, and you can also provide the date format to parse, such as yyyy-MM-dd.
- date_part: there are a few functions that make this easier, like dayofweek, dayofyear, second, timestamp_seconds, etc
- day: is called dayofmonth
- printf: it's called format_string, but it forces the first parameter to be a string, not a column
- substring: it exists in the functions package, but the start and end parameters are only exposed as Int and don't accept columns. Is the idea here to create an alternative version that also accepts the start and end as columns?
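On the count_if point above, a minimal sketch of how it can already be emulated with the existing Scala API (the DataFrame, column name, and condition are invented for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// count_if(n % 2 = 0) emulated with count + when: when() yields null for
// non-matching rows and count() ignores nulls, so only matches are counted.
val df = Seq(1, 2, 3, 4, 5).toDF("n")
val evenCount = df.agg(count(when(col("n") % 2 === 0, true)).as("even_count"))
```

A bebe helper could hide this when/count pattern behind a single function.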
@alfonsorr - good questions.
Feel free to update the list and just add something like "wont add" to the functions that shouldn't get implemented. This list was generated by a script @nvander1 wrote to compare the SQL functions and the Scala functions, so there are probably some that snuck in there that we don't need.
Yea, definitely want to define substring(str: Column, pos: Column, len: Column): Column. I don't like the functions that take regular Scala types as arguments.
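A minimal sketch of that signature, assuming the usual approach of wrapping the Catalyst Substring expression in a Column; the name bebe_substring is just a placeholder, not the final API:

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.Substring

// Placeholder name: wraps the Catalyst Substring expression so that pos and
// len can be passed as Columns instead of plain Ints.
def bebe_substring(str: Column, pos: Column, len: Column): Column =
  new Column(Substring(str.expr, pos.expr, len.expr))

// Usage (column names invented): bebe_substring(col("word"), col("start"), col("len"))
```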
Feel free to go ahead and add the "wont add" annotation to any functions you think we don't need.
I've checked all the methods and indicated the ones that are already implemented or not useful in the Spark / Python API.