spark-sql-perf
Spark-sql-perf tutorial
Hi All,
I am new to Spark and Scala. I have the source code for Spark SQL Performance Tests and dsdgen. Can anyone tell me how to proceed next? I have finished the build by running bin/run --help. When I try to execute bin/run --benchmark DatasetPerformance it gives me an error, but before that it would be really great if someone could tell me how to get started. I understand the README is still under development. Is there a manual I can follow?
Can you paste the errors you get from running bin/run --benchmark DatasetPerformance?
This is the default test suite/benchmark class; once you are able to compile and run it, you will see static output.
The build is incomplete. It gives me the entire log as error messages, so I am not able to figure out what is going wrong in the build. Execution gets stuck after a certain step. Please find the log attached: spark-sql-perf-build-log.txt
I don't see any error.
Let the program run to completion; this is not the complete log.
Hi All, I am getting the following error: java.lang.NoSuchMethodError: org.apache.spark.sql.SQLContext.createDataFrame(Lorg/apache/spark/rdd/RDD;Lorg/apache/spark/sql/types/StructType;)Lorg/apache/spark/sql/Dataset;
I am using Spark 1.6.1 and Scala 2.11.8. Do I need to change the Scala version to get it to work?
NoSuchMethodError usually means that you have an incompatibility between libraries. I think the default Scala for Spark 1.6.1 is 2.10 (you can try that).
I tried with both 2.10.4 and 2.10.5. I am still facing the same issue.
Hi
I am facing the issues below when trying to run this code. Could anyone advise on these so I can proceed?
- Running bin/run --benchmark DatasetPerformance gets stuck for hours, as in the log spark-sql-perf-build-log.txt attached by npaluskar above.
- I am also facing the NoSuchMethodError issue with Scala 2.10.4 and Spark 1.6.1. Please let us know the resolution, if any.
- If I use the Spark 2.0.0 preview, I am able to generate data and create external tables, but I get stuck at the val tpcds = new TPCDS (sqlContext = sqlContext) statement due to a Scala compiler crash, as mentioned in https://github.com/databricks/spark-sql-perf/issues/70
- Running bin/run --benchmark DatasetPerformance gets stuck for hours --> This happened to me when I ran the command a second time. I am not sure why, but it happens every time the command is run a second time; my first run was successful. So you might want to restart the session and try again.
- The NoSuchMethodError with Scala 2.10.4 and Spark 1.6.1 --> I am still trying to figure it out.
- The Scala compiler crash at val tpcds = new TPCDS (sqlContext = sqlContext) with the Spark 2.0.0 preview (#70) --> I am not aware of this, as I am still stuck at step 2.
can you verify your TPCDS.scala class:
https://github.com/databricks/spark-sql-perf/blob/v0.4.3/src/main/scala/com/databricks/spark/sql/perf/tpcds/TPCDS.scala
Are you using Spark 2.0?
Yes, TPCDS.scala is the same for me. I am using Spark 1.6.1.
Yes chawla, I am using the same file as you mentioned, and it is Spark 2.0.0 I am using.
There are more APIs in Spark 2.0 (especially for spark-sql-perf).
From your spark-sql-perf-master directory try sbt; it should give you a command prompt. Then type compile, and then run --benchmark DatasetPerformance:
spark-sql-perf-master:> sbt
> compile        ([warn] ... [success])
> run --benchmark DatasetPerformance
Or alternately, from the spark-sql-perf-master directory, try ./bin/run --benchmark DatasetPerformance
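Spelled out as a shell session (a sketch only; the directory name spark-sql-perf-master matches the posts above, and both options require the repository checkout and sbt to be installed):

```shell
# Option 1: interactive sbt console
cd spark-sql-perf-master
sbt
# then, at the sbt prompt:
#   compile                               # expect some [warn] lines, then [success]
#   run --benchmark DatasetPerformance

# Option 2: one-shot wrapper script
cd spark-sql-perf-master
./bin/run --benchmark DatasetPerformance
```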
Yes. I used sbt to compile and created a jar for spark-sql-perf-master, then used it to launch the Spark shell with:
bin/spark-shell --jars /home/cloudera/spark-sql-perf-master/target/scala-2.10/spark-sql-perf_2.10-0.4.8-SNAPSHOT.jar
./bin/run --benchmark DatasetPerformance ran well this time, as suggested by nachiket, and I ran the commands below for the experiment:
import com.databricks.spark.sql.perf.tpcds.Tables
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val tables = new Tables(sqlContext, "/home/cloudera/tpcds-kit-master/tools/", 1)
tables.genData("hdfs://192.168.126.130:8020/tmp/temp2", "parquet", false, false, false, false, false)
tables.createExternalTables("hdfs://192.168.126.128:8020/tmp/temp2", "parquet", "sparkperf", false)
// Setup TPC-DS experiment
import com.databricks.spark.sql.perf.tpcds.TPCDS
val tpcds = new TPCDS (sqlContext = sqlContext)
The last command crashed the compiler, causing the Spark 2.0.0 shell to restart.
Hi Nachiket, I tried with the Spark 2.0.0 preview and Scala 2.11.8 (I changed build.sbt in the spark-sql-perf code and compiled it) and the commands ran fine. Thanks.
Hi, I have tried spark-sql-perf with Spark 2.0 as above, and it fails at val tpcds = new TPCDS (sqlContext = sqlContext); this command crashes the compiler and causes the Spark 2.0.0 shell to restart. Then I tried to compile the jar with Scala 2.11.8, changing scalaVersion := "2.10.4" to "2.11.8" in build.sbt, but it fails at libraryDependencies += "com.typesafe" %% "scalalogging-slf4j" % "1.1.0": the package cannot be found. Can anyone give a solution?
Hi Galvin,
Try using the code from tag v0.4.3 rather than from branch master; it will work fine. At the same time, comment out the dbc_user_name related settings in build.sbt to avoid errors. The latest branch contains ML code as well.
Thanks.
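For reference, the version lines in build.sbt that keep coming up in this thread look roughly like this (a sketch only, based on the settings quoted in the posts above; the exact keys can differ between tags, so check the build.sbt of the tag you check out):

```scala
// build.sbt fragment -- keep the Scala and Spark versions in agreement:
//   Spark 1.6.x -> Scala 2.10.x (e.g. tag v0.3.2)
//   Spark 2.0.x -> Scala 2.11.x (e.g. tag v0.4.3 with scalaVersion bumped)
scalaVersion := "2.11.8"
sparkVersion := "2.0.0"

// Comment out the dbc_user_name / Databricks-related settings if you have
// no Databricks credentials, as suggested above.
```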
Thanks for your answer. I have checked out v0.4.3 and commented out the dbc related lines; now it fails at compile time:
[info] Compiling 20 Scala sources to /data/ygmz/sparksqlperf/spark-sql-perf/target/scala-2.10/classes...
[warn] /data/ygmz/sparksqlperf/spark-sql-perf/src/main/scala/com/databricks/spark/sql/perf/CpuProfile.scala:107: non-variable type argument String in type pattern Seq[String] is unchecked since it is eliminated by erasure
[warn] case Row(stackLines: Seq[String], count: Long) => stackLines.map(toStackElement) -> count :: Nil
[warn] ^
[error] /data/ygmz/sparksqlperf/spark-sql-perf/src/main/scala/com/databricks/spark/sql/perf/DatasetPerformance.scala:102: object creation impossible, since:
[error] it has 2 unimplemented members.
[error] /** As seen from anonymous class $anon, the missing signatures are as follows.
[error] * For convenience, these are usable as stub implementations.
[error] */
[error] def bufferEncoder: org.apache.spark.sql.Encoder[com.databricks.spark.sql.perf.SumAndCount] = ???
[error] def outputEncoder: org.apache.spark.sql.Encoder[Double] = ???
[error] val average = new Aggregator[Long, SumAndCount, Double] {
[error] ^
[warn] one warning found
[error] one error found
[error] (compile:compileIncremental) Compilation failed
[error] Total time: 328 s, completed 2016-7-7 11:16:06
How do I get past this?
I have executed the tpcds1_4 queries with 92/99 passing, and I wrote a guide for using spark-sql-perf. Anyone who runs into problems can follow it; here's the link: https://galvinyang.github.io/2016/07/09/spark-sql-perf%20test/
Hi all:
I tried to generate TPC-DS data with spark-sql-perf in parallel, but Spark throws exceptions like the one below: ... scala> tables.genData("hdfs://ocdpCluster/tpcds", "parquet", true, true, false, true, false) Pre-clustering with partitioning columns with query SELECT cs_sold_date_sk,cs_sold_time_sk,cs_ship_date_sk,cs_bill_customer_sk,cs_bill_cdemo_sk,cs_bill_hdemo_sk,cs_bill_addr_sk,cs_ship_customer_sk,cs_ship_cdemo_sk,cs_ship_hdemo_sk,cs_ship_addr_sk,cs_call_center_sk,cs_catalog_page_sk,cs_ship_mode_sk,cs_warehouse_sk,cs_item_sk,cs_promo_sk,cs_order_number,cs_quantity,cs_wholesale_cost,cs_list_price,cs_sales_price,cs_ext_discount_amt,cs_ext_sales_price,cs_ext_wholesale_cost,cs_ext_list_price,cs_ext_tax,cs_coupon_amt,cs_ext_ship_cost,cs_net_paid,cs_net_paid_inc_tax,cs_net_paid_inc_ship,cs_net_paid_inc_ship_tax,cs_net_profit FROM catalog_sales_text
DISTRIBUTE BY
cs_sold_date_sk
.
Generating table catalog_sales in database to hdfs://ocdpCluster/tpcds/catalog_sales with save mode Overwrite.
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
java.io.FileNotFoundException: Path is not a file: /tpcds/catalog_sales/cs_sold_date_sk=2450815
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:75)
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:652)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
...
How can I resolve this?
Thanks
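For readers puzzling over the positional booleans in calls like the one above, here is the same call with each flag named. The flag names below are my reading of the genData signature in Tables.scala around tags v0.3.x/v0.4.x and should be treated as assumptions; the order has changed between versions, so verify against Tables.scala in your own checkout:

```scala
// Sketch of the genData call, with each positional flag annotated.
// Flag names are assumptions -- check Tables.scala in your checkout.
import com.databricks.spark.sql.perf.tpcds.Tables

val tables = new Tables(sqlContext, "/path/to/tpcds-kit/tools", 1)  // dsdgen dir (hypothetical path), scale factor
tables.genData(
  "hdfs://ocdpCluster/tpcds",  // location
  "parquet",                   // format
  true,                        // overwrite existing data
  true,                        // partition the fact tables
  false,                       // use double instead of decimal types
  true,                        // cluster by partition columns (the DISTRIBUTE BY in the log above)
  false)                       // filter out null partition values
```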
I am using spark-sql-perf-0.4.3. I got this error when generating data: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
Hi @GalvinYang, I saw your blog, which was very helpful for understanding the spark-sql-perf tool. Now I have a question and need your help: if I use Spark 1.6.2 for the TPC-DS benchmark, does that mean I can't use tags/v0.4.3, since that code is based on Spark 2.0.0? So I would have to use an older version (e.g. tags/v0.3.2, also setting scalaVersion := "2.10.4" with sparkVersion := "1.6.2" in build.sbt) to compile the spark-sql-perf jar and launch spark-shell to test? Thanks in advance!
Hi Zhou, sorry for the late reply. I tried it with Spark 2.0 before because we needed to verify the SQL support in Spark 2.0. If you want to test it with Spark 1.6.x, you can try your method; if it doesn't work, try different versions. After all, I don't think it will be necessary to test on Spark 1.6.x, since many people have done that before, as you can find on Google.
Hi @GalvinYang, thanks a lot for your reply and blog! Now I can compile the spark-sql-perf jar with tags/v0.3.2 after following the experiences in your blog. Your blog is very helpful for us :)
Hi @GalvinYang, thanks a ton for your blog. It has been super helpful, especially for someone starting from scratch. But I am having trouble retrieving results when I follow the README file. tpcds.createResultsTable() gives me a createResultsTable is not a member of com.databricks.spark.sql.perf.tpcds.TPCDS error. sqlContext.table("sqlPerformance") gives me org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'sqlperformance' not found in database 'sparktest'. When I try to get the results from a particular run using sqlContext.table("sqlPerformance").filter("timestamp = 1476844414082"), I get the same NoSuchTableException. This doesn't make sense because, at the very end of the experiment run, I got Results written to table: 'sqlPerformance' at /spark/sql/performance/timestamp=1476844414082. Do you have any idea how to solve this? Thanks in advance!
Hi experts, I am now using spark-sql-perf to generate 1 TB of TPC-DS data with partitionTables enabled, like tables.genData("hdfs://ip:8020/tpctest", "parquet", true, true, false, false, false). But I found that some of the big tables (e.g. store_sales) are slow to complete. I observed that all the data is first written to /tpcds_1t/store_sales/_temporary/0 and then moved to /tpcds_1t/store_sales on HDFS, and these 'moves' on HDFS take a lot of time to complete. Has anyone come across the same issue? How can I resolve it?
Thanks in advance !
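The slow step described above is the job-commit phase moving every file out of _temporary. One mitigation worth trying (an assumption on my part, not something verified on this exact workload) is the version-2 output committer, which has each task write its files directly into the destination directory:

```scala
// FileOutputCommitter algorithm version 2 (MAPREDUCE-4815) commits task
// output straight to the final directory, skipping the job-level rename
// out of _temporary. Set it on the Hadoop configuration before genData:
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")
tables.genData("hdfs://ip:8020/tpctest", "parquet", true, true, false, false, false)
```

Note that v2 trades atomicity of the job commit for speed: partially written output is visible if the job fails midway.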
@GalvinYang hi: I am facing the issue below when I try to run this code. For the command tables.createExternalTables("file:///home/tpctest/", "parquet", "mydata", false) I get:
java.lang.RuntimeException: [1.1] failure: ``with'' expected but identifier CREATE found
CREATE DATABASE IF NOT EXISTS mydata
^
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)
at org.apache.spark.sql.catalyst.DefaultParserDialect.parse(ParserDialect.scala:67)
at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211)
at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211)
at org.apache.spark.sql.execution.SparkSQLParser$$anonfun$org$apache$spark$sql$execution$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:114)
at org.apache.spark.sql.execution.SparkSQLParser$$anonfun$org$apache$spark$sql$execution$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:113)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
..............
I am using spark-sql-perf-0.2.4, Scala 2.10.5, and Spark 1.6.1. The command tables.createTemporaryTables("file:///home/wl/tpctest/", "parquet") has no problem, and tpcds.createResultsTable() fails in the same way as tables.createExternalTables(). Can you help me resolve this problem?
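A likely cause (an assumption on my part, not confirmed in this thread) is that the parse failure comes from the plain SQLContext: in Spark 1.6.x its parser does not understand DDL such as CREATE DATABASE, while HiveContext does. A minimal sketch, assuming a Spark 1.6 shell built with Hive support and a hypothetical dsdgen path:

```scala
// Construct the Tables helper with a HiveContext so that DDL statements
// like CREATE DATABASE go through the Hive parser instead of failing in
// DefaultParserDialect.
import org.apache.spark.sql.hive.HiveContext
import com.databricks.spark.sql.perf.tpcds.Tables

val hiveContext = new HiveContext(sc)
val tables = new Tables(hiveContext, "/home/tpctest/tools/", 1)  // dsdgen dir (hypothetical), scale factor
tables.createExternalTables("file:///home/tpctest/", "parquet", "mydata", false)
```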
Hello everyone,
Need some help to run the benchmark. While executing the query below, I am getting the attached exception in the Spark shell. Please help me resolve it.
val experiment = tpcds.runExperiment(tpcds.interactiveQueries)
Results written to table: 'sqlPerformance' at /spark/sql/performance/timestamp=1489665992654
17/03/16 17:37:07 ERROR FileOutputCommitter: Mkdirs failed to create file:/spark/sql/performance/timestamp=1489665992654/_temporary/0
17/03/16 17:37:07 WARN TaskSetManager: Stage 171 contains a task of very large size (330 KB). The maximum recommended task size is 100 KB.
17/03/16 17:37:07 WARN TaskSetManager: Lost task 0.0 in stage 171.0 (TID 5124, 10.6.45.231, executor 0): java.io.IOException: Mkdirs failed to create file:/spark/sql/performance/timestamp=1489665992654/_temporary/0/_temporary/attempt_20170316173707_0171_m_000000_0 (exists=false, cwd=file:/home/taniya/spark/spark-2.1.0-bin-hadoop2.7/work/app-20170316172533-0001/0)
execution.docx
Attached is the full log.
The issue is resolved. The error was due to a permission issue.
Thanks, Tania
@GalvinYang thanks for your blog. It helped me a lot to get the test running! @reshragh I am also facing a similar issue viewing the results. Is it resolved for you?
While retrieving results using tpcds.createResultsTable(), it gives me a createResultsTable is not a member of com.databricks.spark.sql.perf.tpcds.TPCDS error. I figured out from the source code that there is no such method as createResultsTable in TPCDS.scala.
sqlContext.table("sqlPerformance") gives me org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'sqlperformance' not found in database 'xyz', even though I got Results written to table: 'sqlPerformance' at /spark/sql/performance/timestamp=1489749887680.
I tried to view the results from the console by importing the JSON:
val df = spark.read.json("/spark/sql/performance/timestamp=1489749887680/part-00000-8d5f1472-0846-4ec5-81e1-358a7a271840.json")
df.show()
+--------------------+---------+--------------------+------+-------------+
|       configuration|iteration|             results|  tags|    timestamp|
+--------------------+---------+--------------------+------+-------------+
|[8,[file:/home/ta...|        1|[[5.54E-4,Wrapped...|[true]|1489749887680|
|[8,[file:/home/ta...|        2|[[5.55E-4,Wrapped...|[true]|1489749887680|
|[8,[file:/home/ta...|        3|[[6.49E-4,Wrapped...|[true]|1489749887680|
+--------------------+---------+--------------------+------+-------------+
But I am not able to interpret the results from here. Is there any other way to retrieve them? Any help is highly appreciated.
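Since df.show() truncates the nested results column, one way to dig in is to explode it. The top-level column names come from the schema shown above (configuration, iteration, results, tags, timestamp); the nested field names after result. (name, executionTime) are guesses on my part, so run printSchema() first and adjust. A sketch, assuming a Spark 2.x shell:

```scala
import org.apache.spark.sql.functions.explode
import spark.implicits._

val df = spark.read.json("/spark/sql/performance/timestamp=1489749887680/")
df.printSchema()  // confirm the nested structure of `results` first

// One row per (iteration, query result); the field names after `result.`
// are assumptions -- rename to whatever printSchema() reports.
val perQuery = df
  .select($"iteration", explode($"results").as("result"))
  .select($"iteration", $"result.name", $"result.executionTime")
perQuery.show(truncate = false)
```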
Thanks in advance!
Hi @GalvinYang,
thanks for your blog. Is it also available in English, or does any other blog like it exist?
Thanks in advance