metorikku
Read Hive data problem
I use metorikku to read data from Hive and then write it to HDFS in Parquet format, but the output is always empty. I can't figure out what is wrong; can someone give me some advice? Thanks.

The job conf:

```yaml
metrics:
  - test_metric.yml
output:
  file:
    dir: /tmp
```

The test_metric conf:

```yaml
steps:
  - dataFrameName: df1
    sql: SELECT * FROM employee
output:
  - dataFrameName: df1
    outputType: parquet
    outputOptions:
      saveMode: overwrite
      path: df1.parquet
```
Can you share your spark-submit command as well?
Thank you! My Spark and Hive are in the same cluster, and my other Spark programs can read Hive tables directly, so I submit metorikku without any Hive metastore connection config. I tried two spark-submit commands; the results are the same:

```bash
spark-submit --class com.yotpo.metorikku.Metorikku metorikku.jar -c test_job.yaml

spark-submit --conf spark.sql.catalogImplementation=hive --class com.yotpo.metorikku.Metorikku metorikku.jar -c test_job.yaml
```
If you run `spark-sql -e "select * from employee"`, do you see data?

Also, are you seeing an empty Parquet file, or is nothing being written at all?
`spark-sql -e "select * from employee"` prints some data. When I submit metorikku.jar, there is no output file.
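For reference, a quick way to check where the `employee` table's data actually lives (HDFS vs. the local filesystem) is plain Spark SQL, not anything metorikku-specific:

```bash
# Prints the table metadata, including the Location of the underlying data files.
spark-sql -e "DESCRIBE FORMATTED employee"
```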
I'm wondering if maybe it's writing to the local FS instead of HDFS. Can you add the following to your job config:

```yaml
showPreviewLines: 10
```

Can you see the employee table output in STDOUT?
I added showPreviewLines: 42 and showQuery: true, but stdout does not print the SQL or the select output. stdout:

```
19/05/31 11:24:45 INFO Client: Application report for application_1559031778312_14158 (state: RUNNING)
19/05/31 11:24:45 INFO Client:
         client token: N/A
         diagnostics: N/A
         ApplicationMaster host: 10.202.116.71
         ApplicationMaster RPC port: 0
         queue: root.default
         start time: 1559273077996
         final status: UNDEFINED
         tracking URL: http://10.202.77.200:54315/proxy/application_1559031778312_14158/
         user: hive
19/05/31 11:24:45 INFO YarnClientSchedulerBackend: Application application_1559031778312_14158 has started running.
19/05/31 11:24:45 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 57244.
19/05/31 11:24:45 INFO NettyBlockTransferService: Server created on 10.202.77.200:57244
19/05/31 11:24:45 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
19/05/31 11:24:45 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.202.77.200, 57244, None)
19/05/31 11:24:45 INFO BlockManagerMasterEndpoint: Registering block manager 10.202.77.200:57244 with 366.3 MB RAM, BlockManagerId(driver, 10.202.77.200, 57244, None)
19/05/31 11:24:45 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.202.77.200, 57244, None)
19/05/31 11:24:45 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.202.77.200, 57244, None)
19/05/31 11:24:45 INFO EventLoggingListener: Logging events to hdfs://test-cluster-log/sparkHistory/application_1559031778312_14158
19/05/31 11:24:50 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.202.116.78:51674) with ID 1
19/05/31 11:24:50 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.202.116.73:60676) with ID 2
19/05/31 11:24:50 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
19/05/31 11:24:50 INFO BlockManagerMasterEndpoint: Registering block manager CNSZ22PL0529:34413 with 2004.6 MB RAM, BlockManagerId(2, CNSZ22PL0529, 34413, None)
19/05/31 11:24:50 INFO SharedState: loading hive config file: file:/app/spark/conf/hive-site.xml
19/05/31 11:24:50 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/DATA1/home/hive/01379241/spark-warehouse/').
19/05/31 11:24:50 INFO SharedState: Warehouse path is 'file:/DATA1/home/hive/01379241/spark-warehouse/'.
19/05/31 11:24:51 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
19/05/31 11:24:51 INFO StreamingQueryMetricsListener$: Initialize stream listener
```
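As an aside, one line in this log stands out: the warehouse path resolves to a local `file:` URI rather than HDFS. A hedged sketch of forcing the Hive catalog and an explicit warehouse location (the `--conf` keys are standard Spark SQL settings; the HDFS warehouse path below is a placeholder):

```bash
# Sketch only: spark.sql.catalogImplementation and spark.sql.warehouse.dir are
# standard Spark settings; the warehouse path is a hypothetical placeholder.
spark-submit \
  --class com.yotpo.metorikku.Metorikku \
  --conf spark.sql.catalogImplementation=hive \
  --conf spark.sql.warehouse.dir=hdfs:///user/hive/warehouse \
  metorikku.jar -c test_job.yaml
```

Note that for a table already registered in the metastore, the warehouse dir typically only matters when creating new managed tables, so the more important question is whether the metastore is actually being reached.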
Is this the entire output from the spark-submit? If so, it looks like it's not running any steps... malformed YAML? Can you paste the job/metric YAML here with backticks so I can see if maybe it has incorrect formatting?
Job and metric config:

test_job.yml:

```yaml
metrics:
  - test_metric.yml
output:
  file:
    dir: /tmp
explain: true
showPreviewLines: 42
showQuery: true
```

test_metric.yml:

```yaml
steps:
  - dataFrameName: df1
    sql:
      SELECT * FROM employee
output:
  - dataFrameName: df1
    outputType: parquet
    outputOptions:
      saveMode: overwrite
      path: df1.parquet
```
Sorry for the late reply... I think `outputType: parquet` should be `outputType: Parquet`.
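If that's the issue, the output section of test_metric.yml above would become (same values as in the thread, only the capitalization of the output type changed):

```yaml
output:
  - dataFrameName: df1
    outputType: Parquet   # capitalized, per the suggestion above
    outputOptions:
      saveMode: overwrite
      path: df1.parquet
```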
Please check whether the files are created under the configured path, `df1.parquet`. For me it generated files inside that directory; previously I thought it was a single file.
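A quick way to check, on both HDFS and the local filesystem. The combined location is an assumption based on the job's `dir: /tmp` and the metric's `path: df1.parquet`, so adjust it to your setup:

```bash
# Look for a df1.parquet directory containing part-*.parquet files.
hdfs dfs -ls /tmp/df1.parquet   # HDFS
ls -l /tmp/df1.parquet          # local filesystem
```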
@hongtaox did you ever figure out the solution? I'm facing the same issue; Spark and Hive are in the same cluster. I think the issue is not having the `inputs` section in the job configuration file, but like you said, "authentication" shouldn't be required if the program is run on "localhost".
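For anyone landing here, this is roughly what a job file with an explicit `inputs` section looks like. The input name and path below are hypothetical and the exact key layout should be verified against the metorikku examples; for a Hive table registered in the metastore, an input entry is normally not needed, since the SQL step can reference the table directly:

```yaml
metrics:
  - test_metric.yml
# Hypothetical file input; a metastore-backed Hive table usually does not need this.
inputs:
  employee:
    file:
      path: /path/to/employee.parquet
output:
  file:
    dir: /tmp
```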