Javier Arturo Porras Luraschi
See https://stanford.edu/~rezab/papers/linalg.pdf. It wouldn't be much work to create a `sparkmatrix` extension with support for converting DataFrames into [IndexedRowMatrix](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix), [CoordinateMatrix](https://stackoverflow.com/questions/50946523/converting-from-org-apache-spark-sql-dataset-to-coordinatematrix) and [BlockMatrix](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.distributed.BlockMatrix). The mapping would be done in Scala, similar to...
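For the R-facing side, a minimal sketch of what such an extension could look like, assuming a hypothetical compiled Scala helper class `sparkmatrix.Converters` bundled with the extension (the class and method names are assumptions, not existing API):

```r
library(sparklyr)

# Hypothetical wrapper: convert a Spark DataFrame of numeric columns into a
# CoordinateMatrix by delegating to a Scala helper compiled into the extension.
# "sparkmatrix.Converters" and "toCoordinateMatrix" are assumptions; they would
# have to be implemented in the extension's Scala sources.
sdf_to_coordinate_matrix <- function(x) {
  sc <- spark_connection(x)
  invoke_static(sc, "sparkmatrix.Converters", "toCoordinateMatrix", spark_dataframe(x))
}

# Usage sketch:
# sc  <- spark_connect(master = "local")
# mat <- sdf_to_coordinate_matrix(sdf_copy_to(sc, iris[, 1:4]))
```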
While `sparklyr::spark_log()` provides access to the driver logs, it does not provide access to executor logs, which customers have reported interest in having integrated into `sparklyr` and ideally also in...
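As a rough workaround sketch on YARN (assuming the `yarn` CLI is available on the machine running R and log aggregation is enabled), executor logs can be pulled by application id. This is not an existing sparklyr helper, just an illustration:

```r
library(sparklyr)

sc <- spark_connect(master = "yarn")

# Spark application id, retrieved from the underlying SparkContext.
app_id <- invoke(spark_context(sc), "applicationId")

# Fetch aggregated container logs (driver and executors) through the YARN CLI.
executor_logs <- system2("yarn", c("logs", "-applicationId", app_id), stdout = TRUE)
```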
A user reports that on `yarn`, using `sparkR`, one can specify `spark.driver.memory`:

```r
sparkR.session(
  master = "yarn",
  appName = "app-name",
  sparkHome = "/usr/lib/spark",
  sparkConfig = list(
    spark.driver.memory = "10g",
    spark.sql.shuffle.partitions = "5000",
    spark.driver.maxResultSize = "5000",
    spark.dynamicAllocation.enabled = "true"
  )
)
```

`pyspark` does seem to...
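For comparison, a sketch of how the same settings might be attempted from sparklyr via `spark_config()`. Note that `spark.driver.memory` has to take effect before the driver JVM is launched, so it is typically passed to `spark-submit` through the `sparklyr.shell.*` prefix; exact behavior can depend on the deploy mode:

```r
library(sparklyr)

conf <- spark_config()
conf$spark.sql.shuffle.partitions <- "5000"
conf$spark.dynamicAllocation.enabled <- "true"
# Driver memory must be set before the driver JVM starts, so pass it to
# spark-submit (--driver-memory) rather than as a runtime conf.
conf$`sparklyr.shell.driver-memory` <- "10g"

sc <- spark_connect(master = "yarn", config = conf)
```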
Got feedback from a user about not being able to easily list/see temp tables. We could consider a simple way to list tables, but more importantly, adding support in the IDE...
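For reference, a quick sketch of workarounds that already list registered tables from R; neither is as discoverable as proper IDE support would be:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
sdf_copy_to(sc, mtcars, name = "mtcars_tmp")

# dplyr-level listing of tables registered with the connection.
src_tbls(sc)

# Or go through the DBI interface and ask the session catalog directly.
DBI::dbGetQuery(sc, "SHOW TABLES")
```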
``` Foreach sink - Runs arbitrary computation on the records in the output. See later in the section for more details. ``` https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks
See https://github.com/HenrikBengtsson/future/issues/331
Support for Spark pipelines with H2O models:

- https://databricks.com/session/productionizing-h2o-models-with-apache-spark
- https://blog.rstudio.com/2018/05/14/sparklyr-0-8/
As of Spark 2.4.0, out of these sources, Kafka, Kinesis and Flume are available in the Python API.

https://spark.apache.org/docs/2.2.0/streaming-kinesis-integration.html
https://spark.apache.org/docs/2.2.0/streaming-flume-integration.html
This test is currently disabled due to https://issues.apache.org/jira/browse/ARROW-3615:

```r
test_that("'sdf_bind_rows' handles column type upcasting (#804)", {
  # Need support for NaN ARROW-3615
  skip_on_arrow()
```
It might be worth considering adding a `spark_options_csv(sep = NULL, encoding = NULL, etc.)` helper function to properly document all the options users can pass to `spark_read_csv()`; see https://spark.apache.org/docs/2.0.2/api/java/org/apache/spark/sql/DataFrameReader.html. So one could...
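A minimal sketch of what such a helper might look like (the function and its argument set are hypothetical; the option names come from the documented `DataFrameReader.csv()` options):

```r
library(sparklyr)

# Hypothetical helper: build a named list of documented CSV reader options,
# dropping the ones left as NULL, for use with spark_read_csv(options = ...).
spark_options_csv <- function(sep = NULL, encoding = NULL, quote = NULL,
                              escape = NULL, header = NULL, nullValue = NULL) {
  opts <- list(sep = sep, encoding = encoding, quote = quote,
               escape = escape, header = header, nullValue = nullValue)
  Filter(Negate(is.null), opts)
}

# Usage sketch (path is illustrative):
# spark_read_csv(sc, "flights", "hdfs://path/to/flights.csv",
#                options = spark_options_csv(sep = ";", encoding = "UTF-8"))
```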