Javier Arturo Porras Luraschi
See https://stanford.edu/~rezab/papers/linalg.pdf. It wouldn't be much work to create a `sparkmatrix` extension with support for converting DataFrames into [IndexedRowMatrix](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix), [CoordinateMatrix](https://stackoverflow.com/questions/50946523/converting-from-org-apache-spark-sql-dataset-to-coordinatematrix) and [BlockMatrix](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.distributed.BlockMatrix). The mapping would be done in Scala, similar to...
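For the R-facing side, a minimal sketch of what such an extension could look like, assuming a hypothetical compiled Scala helper class `sparkmatrix.Converters` bundled with the extension (the class and method names are assumptions, not existing API):

```r
library(sparklyr)

# Hypothetical wrapper: convert a Spark DataFrame of numeric columns into a
# CoordinateMatrix by delegating to a Scala helper compiled into the extension.
# "sparkmatrix.Converters" and "toCoordinateMatrix" are assumptions; they would
# have to be implemented in the extension's Scala sources.
sdf_to_coordinate_matrix <- function(x) {
  sc <- spark_connection(x)
  invoke_static(sc, "sparkmatrix.Converters", "toCoordinateMatrix", spark_dataframe(x))
}

# Usage sketch:
# sc  <- spark_connect(master = "local")
# mat <- sdf_to_coordinate_matrix(sdf_copy_to(sc, iris[, 1:4]))
```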
While `sparklyr::spark_log()` provides access to the driver logs, it does not provide access to executor logs, which customers have reported interest in having integrated into `sparklyr` and ideally also in...
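As a rough workaround sketch on YARN (assuming the `yarn` CLI is available on the machine running R and log aggregation is enabled), executor logs can be pulled by application id. This is not an existing sparklyr helper, just an illustration:

```r
library(sparklyr)

sc <- spark_connect(master = "yarn")

# Spark application id, retrieved from the underlying SparkContext.
app_id <- invoke(spark_context(sc), "applicationId")

# Fetch aggregated container logs (driver and executors) through the YARN CLI.
executor_logs <- system2("yarn", c("logs", "-applicationId", app_id), stdout = TRUE)
```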
A user reports that on `yarn`, using `sparkR`, one can specify `spark.driver.memory`:

```r
sparkR.session(
  master = "yarn",
  appName = "app-name",
  sparkHome = "/usr/lib/spark",
  sparkConfig = list(
    spark.driver.memory = "10g",
    spark.sql.shuffle.partitions = "5000",
    spark.driver.maxResultSize = "5000",
    spark.dynamicAllocation.enabled = "true"
  )
)
```

`pyspark` does seem to...
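For comparison, a sketch of how the same settings might be attempted from sparklyr via `spark_config()`. Note that `spark.driver.memory` has to take effect before the driver JVM is launched, so it is typically passed to `spark-submit` through the `sparklyr.shell.*` prefix; exact behavior can depend on the deploy mode:

```r
library(sparklyr)

conf <- spark_config()
conf$spark.sql.shuffle.partitions <- "5000"
conf$spark.dynamicAllocation.enabled <- "true"
# Driver memory must be set before the driver JVM starts, so pass it to
# spark-submit (--driver-memory) rather than as a runtime conf.
conf$`sparklyr.shell.driver-memory` <- "10g"

sc <- spark_connect(master = "yarn", config = conf)
```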
Got feedback from a user about not being able to easily list/see temp tables. We could consider a simple way to list tables, but more importantly, adding support in the IDE...
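For reference, a quick sketch of workarounds that already list registered tables from R; neither is as discoverable as proper IDE support would be:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
sdf_copy_to(sc, mtcars, name = "mtcars_tmp")

# dplyr-level listing of tables registered with the connection.
src_tbls(sc)

# Or go through the DBI interface and ask the session catalog directly.
DBI::dbGetQuery(sc, "SHOW TABLES")
```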
``` Foreach sink - Runs arbitrary computation on the records in the output. See later in the section for more details. ``` https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks
See https://github.com/HenrikBengtsson/future/issues/331
Support for Spark pipelines with H2O models:

- https://databricks.com/session/productionizing-h2o-models-with-apache-spark
- https://blog.rstudio.com/2018/05/14/sparklyr-0-8/
As of Spark 2.4.0, out of these sources, Kafka, Kinesis and Flume are available in the Python API.

https://spark.apache.org/docs/2.2.0/streaming-kinesis-integration.html
https://spark.apache.org/docs/2.2.0/streaming-flume-integration.html
This test is currently disabled due to https://issues.apache.org/jira/browse/ARROW-3615:

```r
test_that("'sdf_bind_rows' handles column type upcasting (#804)", {
  # Need support for NaN ARROW-3615
  skip_on_arrow()
```
It might be worth considering adding a `spark_options_csv(sep = NULL, encoding = NULL, etc.)` helper function to properly document all the options users can pass to `spark_read_csv()`; see https://spark.apache.org/docs/2.0.2/api/java/org/apache/spark/sql/DataFrameReader.html. So one could...
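A minimal sketch of what such a helper might look like (the function and its argument set are hypothetical; the option names come from the documented `DataFrameReader.csv()` options):

```r
library(sparklyr)

# Hypothetical helper: build a named list of documented CSV reader options,
# dropping the ones left as NULL, for use with spark_read_csv(options = ...).
spark_options_csv <- function(sep = NULL, encoding = NULL, quote = NULL,
                              escape = NULL, header = NULL, nullValue = NULL) {
  opts <- list(sep = sep, encoding = encoding, quote = quote,
               escape = escape, header = header, nullValue = nullValue)
  Filter(Negate(is.null), opts)
}

# Usage sketch (path is illustrative):
# spark_read_csv(sc, "flights", "hdfs://path/to/flights.csv",
#                options = spark_options_csv(sep = ";", encoding = "UTF-8"))
```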