
Spark-sql-perf tutorial

Open npaluskar opened this issue 8 years ago • 29 comments

Hi All,

I am new to Spark and Scala. I have the source code for Spark SQL Performance Tests and dsdgen. Can anyone tell me how to proceed? I have finished building by running bin/run --help. When I try to execute bin/run --benchmark DatasetPerformance it gives me an error, but before getting into that, it would be great if someone could tell me how to start with this. I understand the README is still under development. Is there any manual I can follow?

npaluskar avatar Jun 16 '16 23:06 npaluskar

Can you paste the errors you get when running bin/run --benchmark DatasetPerformance?

This is the default test suite/benchmark class; once you are able to compile and run it, you will see static output.

hchawla1 avatar Jun 17 '16 11:06 hchawla1

The build is incomplete. It gives me the entire log as error messages, so I am not able to figure out what is going wrong in the build. Execution gets stuck after a certain step. PFA the log: spark-sql-perf-build-log.txt

npaluskar avatar Jun 17 '16 16:06 npaluskar

I don't see any error.

Let the program run to completion. This is not the complete log.

hchawla1 avatar Jun 17 '16 16:06 hchawla1

Hi All, I am getting the following error: java.lang.NoSuchMethodError: org.apache.spark.sql.SQLContext.createDataFrame(Lorg/apache/spark/rdd/RDD;Lorg/apache/spark/sql/types/StructType;)Lorg/apache/spark/sql/Dataset;

I am using Spark 1.6.1 and Scala 2.11.8. Do I need to change the Scala version to get it to work?

npaluskar avatar Jun 20 '16 21:06 npaluskar

NoSuchMethodError usually means that you have an incompatibility between libraries. I think the default Scala for Spark 1.6.1 is 2.10 (you can try that).

hchawla1 avatar Jun 21 '16 12:06 hchawla1
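
For reference, a minimal build.sbt sketch (illustrative values only, not the project's actual file) pinning a matching pair of versions as suggested above:

// Illustrative version pins; merge into the existing build.sbt rather
// than replacing it, and keep the Scala and Spark versions in lockstep.
scalaVersion := "2.10.5"

val sparkVersion = "1.6.1"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql"  % sparkVersion % "provided"
)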

I tried with both 2.10.4 and 2.10.5. I am still facing the same issue.

npaluskar avatar Jun 21 '16 17:06 npaluskar

Hi

I am facing the issues below when trying to run this code. Could anyone advise on these issues so I can proceed?

  1. For the command bin/run --benchmark DatasetPerformance, it gets stuck for hours, as in the log spark-sql-perf-build-log.txt attached by npaluskar above.
  2. I am also facing the NoSuchMethodError issue with Scala 2.10.4 and Spark 1.6.1. Please let us know the resolution, if any.
  3. If I use the Spark 2.0.0 preview version, I am able to generate data and create external tables, but I get stuck at the val tpcds = new TPCDS(sqlContext = sqlContext) statement due to a Scala crash, as mentioned in https://github.com/databricks/spark-sql-perf/issues/70

sridharpothamsetti avatar Jun 23 '16 09:06 sridharpothamsetti

  1. For the command bin/run --benchmark DatasetPerformance, it gets stuck for hours, as in the log attached above. --> This happened to me when I ran the command a second time. I am not sure why, but it happens every time the command is run a second time; my first run was successful. So you might want to restart the session and try again.
  2. The NoSuchMethodError issue with Scala 2.10.4 and Spark 1.6.1. --> I am still trying to figure it out.
  3. Getting stuck at val tpcds = new TPCDS(sqlContext = sqlContext) with the Spark 2.0.0 preview, due to a Scala crash, as mentioned in #70. --> I am not aware of this, as I am still stuck at step 2.

npaluskar avatar Jun 23 '16 16:06 npaluskar

Can you verify your TPCDS.scala class:

https://github.com/databricks/spark-sql-perf/blob/v0.4.3/src/main/scala/com/databricks/spark/sql/perf/tpcds/TPCDS.scala

Are you using Spark 2.0?

hchawla1 avatar Jun 23 '16 17:06 hchawla1

Yes, TPCDS.scala is the same for me. I am using Spark 1.6.1.

npaluskar avatar Jun 23 '16 17:06 npaluskar

Yes, chawla. I am using the same file you mentioned, and I am using Spark 2.0.0.

sridharpothamsetti avatar Jun 23 '16 17:06 sridharpothamsetti

There are more APIs in Spark 2.0 (especially for spark-sql-perf)...

From your spark-sql-perf-master directory, run sbt; it should give you a command prompt. Type compile, and once it succeeds, type run --benchmark DatasetPerformance:

spark-sql-perf-master:> sbt
> compile
[warn] ...
[success] ...
> run --benchmark DatasetPerformance

Alternately, from the spark-sql-perf-master directory, try ./bin/run --benchmark DatasetPerformance

hchawla1 avatar Jun 23 '16 17:06 hchawla1

Yes. I used sbt to compile and create a jar file for spark-sql-perf-master, and used it to launch the spark shell with the command bin/spark-shell --jars /home/cloudera/spark-sql-perf-master/target/scala-2.10/spark-sql-perf_2.10-0.4.8-SNAPSHOT.jar

./bin/run --benchmark DatasetPerformance ran well this time, as suggested by nachiket, and I ran the commands below for the experiment:

import com.databricks.spark.sql.perf.tpcds.Tables
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val tables = new Tables(sqlContext, "/home/cloudera/tpcds-kit-master/tools/", 1)
tables.genData("hdfs://192.168.126.130:8020/tmp/temp2", "parquet", false, false, false, false, false)
tables.createExternalTables("hdfs://192.168.126.128:8020/tmp/temp2", "parquet", "sparkperf", false)
// Setup TPC-DS experiment
import com.databricks.spark.sql.perf.tpcds.TPCDS
val tpcds = new TPCDS (sqlContext = sqlContext)

The last command crashed the compiler, causing the Spark shell 2.0.0 to restart.

sridharpothamsetti avatar Jun 23 '16 17:06 sridharpothamsetti

Hi Nachiket, I tried with the Spark 2.0.0 preview and Scala 2.11.8 (changed the build.sbt in the spark-sql-perf code and compiled it), and the commands ran fine. Thanks.

sridharpothamsetti avatar Jun 24 '16 18:06 sridharpothamsetti

Hi, I have tried spark-sql-perf with Spark 2.0 as above, and it fails at val tpcds = new TPCDS(sqlContext = sqlContext); the command crashes the compiler and causes the Spark shell 2.0.0 to restart. I then wanted to compile the jar with Scala 2.11.8, changing scalaVersion := "2.10.4" to "2.11.8" in build.sbt, but it fails at libraryDependencies += "com.typesafe" %% "scalalogging-slf4j" % "1.1.0" because the package cannot be found. Can anyone give a solution?

GalvinYang avatar Jul 07 '16 02:07 GalvinYang

Hi Galvin,

Try using the code from tag v0.4.3 rather than from branch master; it will work fine. At the same time, comment out the dbc_user_name-related lines in build.sbt to avoid errors. The latest branch also contains ML code.

Thanks.

sridharpothamsetti avatar Jul 07 '16 03:07 sridharpothamsetti
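
If you do need to build against Scala 2.11, one change that is likely required (an assumption based on the library's rename, not something confirmed in this thread) is swapping the logging dependency in build.sbt, since the old artifact was only published for Scala 2.10:

// Hypothetical build.sbt edit for a Scala 2.11 build; verify the version
// against Maven Central before relying on it.
// Old (Scala 2.10 only):
//   libraryDependencies += "com.typesafe" %% "scalalogging-slf4j" % "1.1.0"
// Renamed successor, published for Scala 2.11:
libraryDependencies += "com.typesafe.scala-logging" %% "scala-logging-slf4j" % "2.1.2"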

Thanks for your answer. I have checked out v0.4.3 and commented out the dbc-related lines, but compilation then failed:

[info] Compiling 20 Scala sources to /data/ygmz/sparksqlperf/spark-sql-perf/target/scala-2.10/classes...
[warn] /data/ygmz/sparksqlperf/spark-sql-perf/src/main/scala/com/databricks/spark/sql/perf/CpuProfile.scala:107: non-variable type argument String in type pattern Seq[String] is unchecked since it is eliminated by erasure
[warn]         case Row(stackLines: Seq[String], count: Long) => stackLines.map(toStackElement) -> count :: Nil
[warn]                              ^
[error] /data/ygmz/sparksqlperf/spark-sql-perf/src/main/scala/com/databricks/spark/sql/perf/DatasetPerformance.scala:102: object creation impossible, since:
[error] it has 2 unimplemented members.
[error] /** As seen from anonymous class $anon, the missing signatures are as follows.
[error]  *  For convenience, these are usable as stub implementations.
[error]  */
[error]   def bufferEncoder: org.apache.spark.sql.Encoder[com.databricks.spark.sql.perf.SumAndCount] = ???
[error]   def outputEncoder: org.apache.spark.sql.Encoder[Double] = ???
[error]   val average = new Aggregator[Long, SumAndCount, Double] {
[error]                     ^
[warn] one warning found
[error] one error found
[error] (compile:compileIncremental) Compilation failed
[error] Total time: 328 s, completed 2016-7-7 11:16:06

How do I get past this?

GalvinYang avatar Jul 07 '16 03:07 GalvinYang
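
For anyone hitting this compile error: in Spark 2.0, org.apache.spark.sql.expressions.Aggregator gained two abstract members, bufferEncoder and outputEncoder, which the pre-2.0 aggregator in DatasetPerformance.scala does not implement. Below is a minimal sketch of what a Spark 2.0-compatible version could look like; the type parameters match the error message, but the fields of SumAndCount are assumed, not copied from the repo:

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Assumed buffer shape; check the actual definition in the repo.
case class SumAndCount(var sum: Long, var count: Int)

val average = new Aggregator[Long, SumAndCount, Double] {
  override def zero: SumAndCount = SumAndCount(0, 0)
  override def reduce(b: SumAndCount, a: Long): SumAndCount = {
    b.sum += a; b.count += 1; b
  }
  override def merge(b1: SumAndCount, b2: SumAndCount): SumAndCount = {
    b1.sum += b2.sum; b1.count += b2.count; b1
  }
  override def finish(r: SumAndCount): Double = r.sum.toDouble / r.count
  // The two members the compiler reports as missing under Spark 2.0:
  override def bufferEncoder: Encoder[SumAndCount] = Encoders.product[SumAndCount]
  override def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}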

I have executed the tpcds1_4 queries with 92/99 passing, and I have written up instructions for using spark-sql-perf. Anyone who runs into problems can follow them; here's the link: https://galvinyang.github.io/2016/07/09/spark-sql-perf%20test/

GalvinYang avatar Jul 09 '16 02:07 GalvinYang

Hi all:

I am trying to generate TPC-DS data with spark-sql-perf in parallel, but Spark throws exceptions like the below:

...
scala> tables.genData("hdfs://ocdpCluster/tpcds", "parquet", true, true, false, true, false)
Pre-clustering with partitioning columns with query SELECT cs_sold_date_sk,cs_sold_time_sk,cs_ship_date_sk,cs_bill_customer_sk,cs_bill_cdemo_sk,cs_bill_hdemo_sk,cs_bill_addr_sk,cs_ship_customer_sk,cs_ship_cdemo_sk,cs_ship_hdemo_sk,cs_ship_addr_sk,cs_call_center_sk,cs_catalog_page_sk,cs_ship_mode_sk,cs_warehouse_sk,cs_item_sk,cs_promo_sk,cs_order_number,cs_quantity,cs_wholesale_cost,cs_list_price,cs_sales_price,cs_ext_discount_amt,cs_ext_sales_price,cs_ext_wholesale_cost,cs_ext_list_price,cs_ext_tax,cs_coupon_amt,cs_ext_ship_cost,cs_net_paid,cs_net_paid_inc_tax,cs_net_paid_inc_ship,cs_net_paid_inc_ship_tax,cs_net_profit FROM catalog_sales_text
DISTRIBUTE BY cs_sold_date_sk
Generating table catalog_sales in database to hdfs://ocdpCluster/tpcds/catalog_sales with save mode Overwrite.
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
java.io.FileNotFoundException: Path is not a file: /tpcds/catalog_sales/cs_sold_date_sk=2450815
  at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:75)
  at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828)
  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)
  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
  at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:652)
  at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
  at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
  at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:422)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
  ...

How can I resolve this?

Thanks

baikai avatar Jul 12 '16 08:07 baikai

I use spark-sql-perf-0.4.3. I got an error when generating data: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD

lordk911 avatar Jul 27 '16 10:07 lordk911
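
An editorial note on the error above (an assumption from general Spark experience, not a fix confirmed in this thread): this SerializationProxy message typically appears when executors deserialize classes compiled against a different Scala/Spark version than they are running, or when the benchmark jar is on the driver classpath only. Matching the jar's Scala and Spark versions to the cluster, and shipping the jar to executors via --jars as done earlier in this thread, are the usual first things to check (the path below is a placeholder):

./bin/spark-shell --jars /path/to/spark-sql-perf_2.10-0.4.3.jar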

Hi @GalvinYang, I saw your blog, which was very helpful for understanding the spark-sql-perf tool. Now I have a question. If I use Spark 1.6.2 for the TPC-DS benchmark, does that mean I can't use tags/v0.4.3, since that code is based on Spark 2.0.0, and I have to use an older version (e.g., tags/v0.3.2, also setting scalaVersion := "2.10.4" with sparkVersion := "1.6.2" in build.sbt) to compile and get a spark-sql-perf jar to launch spark-shell for testing? Thanks in advance!

jameszhouyi avatar Sep 29 '16 06:09 jameszhouyi

Hi Zhou, sorry for the late reply. I tried it with Spark 2.0 before because we needed to verify the SQL support in Spark 2.0. If you want to test with Spark 1.6.x, you can try your method; if it doesn't work, try different versions. After all, I don't think it's necessary to test on Spark 1.6.x, since many people have done so before, as you can find on Google.

GalvinYang avatar Oct 08 '16 09:10 GalvinYang

Hi @GalvinYang, thanks a lot for your reply and blog! Now I can compile the spark-sql-perf jar with tags/v0.3.2 after following the experience in your blog. Your blog is very helpful for us :)

jameszhouyi avatar Oct 08 '16 13:10 jameszhouyi

Hi @GalvinYang Thanks a ton for your blog. It has been super helpful, especially for someone starting from scratch. But I am having trouble retrieving results if I follow the README file. tpcds.createResultsTable() gives me a createResultsTable is not a member of com.databricks.spark.sql.perf.tpcds.TPCDS error. sqlContext.table("sqlPerformance") gives me org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'sqlperformance' not found in database 'sparktest'. When I try to get results from a particular run using sqlContext.table("sqlPerformance").filter("timestamp = 1476844414082"), I get the same NoSuchTableException. This doesn't make sense because, at the very end of the experiment run, I got Results written to table: 'sqlPerformance' at /spark/sql/performance/timestamp=1476844414082. Do you have any idea how to solve this? Thanks in advance!

reshragh avatar Oct 19 '16 15:10 reshragh

Hi experts, I am using spark-sql-perf to generate 1 TB of TPC-DS data with partitioned tables enabled, i.e. tables.genData("hdfs://ip:8020/tpctest", "parquet", true, true, false, false, false). But I found that some of the big tables (e.g., store_sales) are slow to complete. I observed that all data is first written to /tpcds_1t/store_sales/_temporary/0 and then moved to /tpcds_1t/store_sales on HDFS, and these 'moves' take a long time to complete. Has anyone come across the same issue? How can it be resolved?

Thanks in advance !

jameszhouyi avatar Oct 20 '16 03:10 jameszhouyi
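
An editorial aside on the slow move out of _temporary (a commonly cited mitigation, assumed rather than confirmed in this thread): the job-end rename is performed by the v1 file output committer, whereas the v2 committer moves each task's output directly into the destination directory as tasks commit, avoiding the large single rename on HDFS. It can be enabled from the shell before calling genData:

// Hedged sketch: switch to the v2 output committer (Hadoop 2.7+)
// so task output lands in place instead of under _temporary.
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")

Note that v2 trades the rename cost for weaker failure atomicity during the job.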

@GalvinYang hi: I am facing the issue below when I try to run this code. For the command tables.createExternalTables("file:///home/tpctest/", "parquet", "mydata", false) I get:

java.lang.RuntimeException: [1.1] failure: ``with'' expected but identifier CREATE found

CREATE DATABASE IF NOT EXISTS mydata
^
  at scala.sys.package$.error(package.scala:27)
  at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)
  at org.apache.spark.sql.catalyst.DefaultParserDialect.parse(ParserDialect.scala:67)
  at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211)
  at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211)
  at org.apache.spark.sql.execution.SparkSQLParser$$anonfun$org$apache$spark$sql$execution$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:114)
  at org.apache.spark.sql.execution.SparkSQLParser$$anonfun$org$apache$spark$sql$execution$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:113)
  at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
  at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
  at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
  at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
  at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
  ..............

I am using spark-sql-perf-0.2.4, Scala 2.10.5, and Spark 1.6.1. The command tables.createTemporaryTables("file:///home/wl/tpctest/", "parquet") has no problem, and tpcds.createResultsTable() fails in the same way as tables.createExternalTables(). Can you help me resolve this problem?

wangli86 avatar Nov 25 '16 01:11 wangli86
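
An editorial note on the parse error above (an assumption based on the stack trace, which shows Spark's DefaultParserDialect rejecting the statement, not a fix confirmed in this thread): in Spark 1.6, a plain SQLContext cannot parse DDL such as CREATE DATABASE, while a HiveContext can. A sketch of constructing the Tables helper with a HiveContext instead (the dsdgen path is a placeholder):

import org.apache.spark.sql.hive.HiveContext
import com.databricks.spark.sql.perf.tpcds.Tables

val hiveContext = new HiveContext(sc)  // requires Spark built with Hive support
val tables = new Tables(hiveContext, "/path/to/tpcds-kit/tools", 1)
tables.createExternalTables("file:///home/tpctest/", "parquet", "mydata", false)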

Hello everyone,

I need some help running the benchmark. While executing the query below, I get the attached exception in the spark shell. Please help me resolve this.

val experiment = tpcds.runExperiment(tpcds.interactiveQueries)

Results written to table: 'sqlPerformance' at /spark/sql/performance/timestamp=1489665992654
17/03/16 17:37:07 ERROR FileOutputCommitter: Mkdirs failed to create file:/spark/sql/performance/timestamp=1489665992654/_temporary/0
17/03/16 17:37:07 WARN TaskSetManager: Stage 171 contains a task of very large size (330 KB). The maximum recommended task size is 100 KB.
17/03/16 17:37:07 WARN TaskSetManager: Lost task 0.0 in stage 171.0 (TID 5124, 10.6.45.231, executor 0): java.io.IOException: Mkdirs failed to create file:/spark/sql/performance/timestamp=1489665992654/_temporary/0/_temporary/attempt_20170316173707_0171_m_000000_0 (exists=false, cwd=file:/home/taniya/spark/spark-2.1.0-bin-hadoop2.7/work/app-20170316172533-0001/0)
execution.docx

Attached is the full log.


**** The issue is resolved. The error was due to a permissions problem.

Thanks, Tania

ktania avatar Mar 16 '17 12:03 ktania

@GalvinYang Thanks for your blog. It helped me a lot in getting the test running! @reshragh I am also facing a similar issue viewing the results. Is it resolved for you?

While retrieving results using tpcds.createResultsTable(), I get a createResultsTable is not a member of com.databricks.spark.sql.perf.tpcds.TPCDS error. I figured out from the source code that there is no such method as createResultsTable in TPCDS.scala.

sqlContext.table("sqlPerformance") gives me org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'sqlperformance' not found in database 'xyz'. even though I got Results written to table: 'sqlPerformance' at /spark/sql/performance/timestamp=1489749887680.

I tried to view the results from the console by importing the JSON:

val df = spark.read.json("/spark/sql/performance/timestamp=1489749887680/part-00000-8d5f1472-0846-4ec5-81e1-358a7a271840.json")

df.show()

+--------------------+---------+--------------------+------+-------------+
|       configuration|iteration|             results|  tags|    timestamp|
+--------------------+---------+--------------------+------+-------------+
|[8,[file:/home/ta...|        1|[[5.54E-4,Wrapped...|[true]|1489749887680|
|[8,[file:/home/ta...|        2|[[5.55E-4,Wrapped...|[true]|1489749887680|
|[8,[file:/home/ta...|        3|[[6.49E-4,Wrapped...|[true]|1489749887680|
+--------------------+---------+--------------------+------+-------------+

But I am not able to interpret the results from here. Is there any other way to retrieve the results? Any help is highly appreciated.

Thanks in advance!

ktania avatar Mar 17 '17 12:03 ktania
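
An editorial sketch for making the nested results column readable (the column names are assumed from the JSON output shown above and from spark-sql-perf's result schema; verify against your own files):

import org.apache.spark.sql.functions._

// Read the whole timestamp directory rather than a single part file.
val df = spark.read.json("/spark/sql/performance/timestamp=1489749887680/")

// Each row holds an array of per-query results; explode it so the query
// name and its execution time appear one per line.
df.select(col("iteration"), explode(col("results")).as("result"))
  .select(col("iteration"), col("result.name"), col("result.executionTime"))
  .show(truncate = false)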

Hi @GalvinYang,

thanks for your blog. Is it also available in English, or does any other blog like this exist?

Thanks in advance

dreamerHarshit avatar Sep 06 '17 18:09 dreamerHarshit