spark-sql-perf
Supports new tpc-ds dsdgen tool
Hello,
I am using:
- the latest version of spark-sql-perf, compiled for Spark 1.4.0
- the latest dsdgen from the TPC-DS website
- Spark 1.4.0
When I call genData() to generate data for the store_sales table, the application crashes with the following stack trace (see next post).
The parameters used for genData are: HDFS folder, 'parquet', true, false, false, false, false.
Any suggestions?
org.apache.spark.SparkException: Task failed while writing rows.
    at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:161)
    at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:132)
    at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:132)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
    at org.apache.spark.sql.catalyst.expressions.GenericRow.apply(rows.scala:77)
    at org.apache.spark.sql.catalyst.expressions.BoundReference.eval(BoundAttribute.scala:37)
    at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:435)
    at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:435)
    at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:118)
    at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:68)
    at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:52)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
    at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:153)
    ... 8 more
On the workers I have the following:
DBGEN2 Population Generator (Version 1.4.0)
Copyright Transaction Processing Performance Council (TPC) 2001 - 2015
15/09/23 12:39:38 ERROR InsertIntoHadoopFsRelation: Aborting task.
java.lang.ArrayIndexOutOfBoundsException: 0
    at org.apache.spark.sql.catalyst.expressions.GenericRow.apply(rows.scala:77)
    at org.apache.spark.sql.catalyst.expressions.BoundReference.eval(BoundAttribute.scala:37)
    at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:435)
    at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:435)
    at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:118)
    at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:68)
    at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:52)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
    at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:153)
    at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:132)
    at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:132)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
@Nosfe What is the version of dsdgen (when did you download it)? Can you try https://github.com/gregrahn/tpcds-kit?
@yhuai
Hi, first of all thanks for the reply. The dsdgen version is 1.4.0; I downloaded it yesterday.
No exceptions are raised if I use the version in the link. This might mean that the new version of dsdgen has new fields or just a different schema. Are there any plans to support it?
When I have time I will try to fix the problem myself and open a pull request, but I have no ETA at the moment.
Cheers
@Nosfe Thank you for trying the repo.
I have not tried the new tool. It would be good to investigate what has changed. Since it is a standard benchmark, I guess the table schemas have not changed. Maybe the problem is that the text format has changed or that the new tool takes different parameters. I'd like to take a look (and also update our tool to generate all tables), but I do not have an ETA on that either. How about we change the title of this issue to something like "Supports new tpc-ds dsdgen tool"? Then we can share what we find here.
Sounds good. Already made the change in the title.
@yhuai I have taken a look at the new dsdgen and compared it with the one from the repo that you linked. There are a couple of big changes.
- The -filter option no longer exists: they changed "FILTER" to "_FILTER". I believe using -f can fix that.
- There is no more STDOUT output. This means that we have to let dsdgen write the data to disk, then read it back and convert it to the appropriate format.
I will keep you updated and try to fix the code in my spare time.
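To make the second point concrete, here is a minimal sketch of the "write to disk, then read back" flow. It is written in Python only to keep it short and is not the spark-sql-perf implementation. The helper names generate_to_dir and read_dat_rows are hypothetical, the generator command is supplied by the caller (no dsdgen flags are assumed here), and the sketch assumes the generator writes pipe-delimited `.dat` files whose lines end with a trailing delimiter.

```python
import subprocess
from pathlib import Path
from typing import Iterator, List, Sequence


def generate_to_dir(command: Sequence[str], out_dir: Path) -> List[Path]:
    """Run a generator that writes its files into out_dir, then list the .dat files.

    `command` is whatever invocation makes the generator write into out_dir;
    it is passed in by the caller because the exact flags are an assumption.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(list(command), check=True)  # raises if the generator fails
    return sorted(out_dir.glob("*.dat"))


def read_dat_rows(path: Path, delimiter: str = "|") -> Iterator[List[str]]:
    """Yield one list of string fields per non-empty line of a delimited file.

    Assumes each line ends with a trailing delimiter, so the empty string
    produced after the last delimiter is dropped. Empty lines are skipped
    rather than turned into rows with zero usable columns.
    """
    with path.open(encoding="latin-1") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if not line:
                continue
            fields = line.split(delimiter)
            if fields and fields[-1] == "":
                fields.pop()  # artifact of the trailing delimiter
            yield fields
```

Each yielded row is a list of strings, with empty strings standing in for missing values; casting to the table schema and writing out the target format would happen in a later step.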