
read hive data problem

hongtaox opened this issue 5 years ago · 12 comments

I use metorikku to read data from Hive and then write it to HDFS in Parquet format, but the output is always empty. I cannot figure out what is wrong; can someone give me some advice? Thanks.

The job conf:

metrics:
  - test_metric.yml
output:
    file:
        dir: /tmp

test_metric conf:

steps:
- dataFrameName: df1
  sql:
    SELECT * FROM employee

output:
- dataFrameName: df1
  outputType: parquet
  outputOptions:
    saveMode: overwrite
    path: df1.parquet

hongtaox avatar May 30 '19 10:05 hongtaox

Can you share your spark submit as well?

lyogev avatar May 30 '19 14:05 lyogev

Thank you! My Spark and Hive are in the same cluster, and my other Spark programs can read Hive tables directly, so I submit metorikku without a Hive metastore connection config. I tried two spark-submit variants; the results are the same.

  1. spark-submit --class com.yotpo.metorikku.Metorikku metorikku.jar -c test_job.yaml
  2. spark-submit --conf spark.sql.catalogImplementation=hive --class com.yotpo.metorikku.Metorikku metorikku.jar -c test_job.yaml
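
For reference, a sketch of a third variant that is sometimes needed when the driver does not pick up the Hive client config on its own. This is not from the thread: the metastore URI is a placeholder, and the hive-site.xml path is taken from the log output below.

```shell
# Sketch only: explicitly ship hive-site.xml and point Spark at the
# Hive metastore; replace the thrift URI with your metastore host.
spark-submit \
  --files /app/spark/conf/hive-site.xml \
  --conf spark.sql.catalogImplementation=hive \
  --conf spark.hadoop.hive.metastore.uris=thrift://<metastore-host>:9083 \
  --class com.yotpo.metorikku.Metorikku \
  metorikku.jar -c test_job.yaml
```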

hongtaox avatar May 30 '19 16:05 hongtaox

If you run spark-sql -e "select * from employee", do you see data?

lyogev avatar May 30 '19 16:05 lyogev

Also, are you seeing an empty parquet, or is nothing being written at all?

lyogev avatar May 30 '19 16:05 lyogev

spark-sql -e "select * from employee" prints some data; when submitting metorikku.jar, there is no output file.

hongtaox avatar May 31 '19 01:05 hongtaox

I'm wondering if maybe it's writing to the local FS instead of HDFS. Can you add the following to your job config: showPreviewLines: 10 — and can you see the employee table output in STDOUT?
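
One quick way to test the local-FS theory is to look for the output directory in both filesystems; the /tmp path below comes from the job config in this thread.

```shell
hdfs dfs -ls /tmp/df1.parquet   # did it land on HDFS?
ls -la /tmp/df1.parquet         # or on the driver's local filesystem?
```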

lyogev avatar May 31 '19 03:05 lyogev

I added showPreviewLines: 42 and showQuery: true, but stdout does not print the SQL or the SELECT output. stdout:

19/05/31 11:24:45 INFO Client: Application report for application_1559031778312_14158 (state: RUNNING)
19/05/31 11:24:45 INFO Client:
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: 10.202.116.71
     ApplicationMaster RPC port: 0
     queue: root.default
     start time: 1559273077996
     final status: UNDEFINED
     tracking URL: http://10.202.77.200:54315/proxy/application_1559031778312_14158/
     user: hive
19/05/31 11:24:45 INFO YarnClientSchedulerBackend: Application application_1559031778312_14158 has started running.
19/05/31 11:24:45 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 57244.
19/05/31 11:24:45 INFO NettyBlockTransferService: Server created on 10.202.77.200:57244
19/05/31 11:24:45 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
19/05/31 11:24:45 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.202.77.200, 57244, None)
19/05/31 11:24:45 INFO BlockManagerMasterEndpoint: Registering block manager 10.202.77.200:57244 with 366.3 MB RAM, BlockManagerId(driver, 10.202.77.200, 57244, None)
19/05/31 11:24:45 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.202.77.200, 57244, None)
19/05/31 11:24:45 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.202.77.200, 57244, None)
19/05/31 11:24:45 INFO EventLoggingListener: Logging events to hdfs://test-cluster-log/sparkHistory/application_1559031778312_14158
19/05/31 11:24:50 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.202.116.78:51674) with ID 1
19/05/31 11:24:50 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.202.116.73:60676) with ID 2
19/05/31 11:24:50 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
19/05/31 11:24:50 INFO BlockManagerMasterEndpoint: Registering block manager CNSZ22PL0529:34413 with 2004.6 MB RAM, BlockManagerId(2, CNSZ22PL0529, 34413, None)
19/05/31 11:24:50 INFO SharedState: loading hive config file: file:/app/spark/conf/hive-site.xml
19/05/31 11:24:50 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/DATA1/home/hive/01379241/spark-warehouse/').
19/05/31 11:24:50 INFO SharedState: Warehouse path is 'file:/DATA1/home/hive/01379241/spark-warehouse/'.
19/05/31 11:24:51 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
19/05/31 11:24:51 INFO StreamingQueryMetricsListener$: Initialize stream listener

hongtaox avatar May 31 '19 03:05 hongtaox

Is this the entire output from the spark-submit? If so, it looks like it's not running any steps; maybe the YAML is malformed. Can you paste the job/metric YAML here with backticks so I can check whether the formatting is off?

lyogev avatar May 31 '19 03:05 lyogev

job and metrics config:

test_job.yml
metrics:
  - test_metric.yml
output:
    file:
        dir: /tmp

explain: true
showPreviewLines: 42
showQuery: true

test_metric.yml
steps:
- dataFrameName: df1
  sql:
    SELECT * FROM employee

output:
- dataFrameName: df1
  outputType: parquet
  outputOptions:
    saveMode: overwrite
    path: df1.parquet

hongtaox avatar May 31 '19 04:05 hongtaox

Sorry for the late reply... I think outputType: parquet should be outputType: Parquet
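
With that suggestion applied, the metric's output section would read:

```yaml
output:
- dataFrameName: df1
  outputType: Parquet
  outputOptions:
    saveMode: overwrite
    path: df1.parquet
```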

lyogev avatar Jun 05 '19 14:06 lyogev

Please check whether files are created inside the directory path: df1.parquet. For me it generated files inside that directory; previously I thought it was a file.

sahumilan0 avatar Jul 01 '19 16:07 sahumilan0

@hongtaox did you ever figure out the solution? I'm facing the same issue; Spark and Hive are in the same cluster. I think the issue is not having the inputs section in the job configuration file, but like you said, "authentication" shouldn't be required if the program is run on "localhost".
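
For anyone landing here, a minimal sketch of what an explicit inputs section in a metorikku job file can look like, assuming a file-based input; the path is a placeholder, and a Hive table read through the metastore should not need this at all.

```yaml
inputs:
  employee:
    file:
      path: hdfs:///path/to/employee.parquet
```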

jiax83 avatar Apr 28 '20 01:04 jiax83