zingg
`exportModel` encounters `NullPointerException`
Describe the bug
Cannot generate a CSV of the model because of a NullPointerException. The generateDocs phase works just fine. From the documentation: https://docs.zingg.ai/zingg/stepbystep/createtrainingdata/exportlabeleddata
To Reproduce
Steps to reproduce the behavior:
Run:
(.venv) spark@496208741a60:/workspaces/foo-zingg-entity-resolution $ ~/zingg-0.4.0/scripts/zingg.sh --phase exportModel --conf /workspaces/foo-zingg-entity-resolution/datasets/trader/conf_no_bdid.json --location tmp --properties-file /workspaces/foo-zingg-entity-resolution/zingg.conf
Expected behavior
Should be able to export a CSV of the model.
Screenshots
24/04/12 15:35:03 INFO ClientOptions: --phase
24/04/12 15:35:03 INFO ClientOptions: exportModel
24/04/12 15:35:03 INFO ClientOptions: --conf
24/04/12 15:35:03 INFO ClientOptions: /workspaces/foo-zingg-entity-resolution/datasets/trader/conf_no_bdid.json
24/04/12 15:35:03 INFO ClientOptions: --location
24/04/12 15:35:03 INFO ClientOptions: tmp
24/04/12 15:35:03 INFO ClientOptions: --email
24/04/12 15:35:03 INFO ClientOptions: [email protected]
24/04/12 15:35:03 INFO ClientOptions: --license
24/04/12 15:35:03 INFO ClientOptions: zinggLicense.txt
24/04/12 15:35:03 WARN ArgumentsUtil: Config Argument is /workspaces/foo-zingg-entity-resolution/datasets/trader/conf_no_bdid.json
24/04/12 15:35:03 WARN ArgumentsUtil: phase is exportModel
24/04/12 15:35:03 INFO Client:
24/04/12 15:35:03 INFO Client: **************************************************************************
24/04/12 15:35:03 INFO Client: * ** Note about analytics collection by Zingg AI ** *
24/04/12 15:35:03 INFO Client: * *
24/04/12 15:35:03 INFO Client: * Please note that Zingg captures a few metrics about application's *
24/04/12 15:35:03 INFO Client: * runtime parameters. However, no user's personal data or application *
24/04/12 15:35:03 INFO Client: * data is captured. If you want to switch off this feature, please *
24/04/12 15:35:03 INFO Client: * set the flag collectMetrics to false in config. For details, please *
24/04/12 15:35:03 INFO Client: * refer to the Zingg docs (https://docs.zingg.ai/docs/security.html) *
24/04/12 15:35:03 INFO Client: **************************************************************************
24/04/12 15:35:03 INFO Client:
java.lang.NullPointerException
at java.base/java.lang.Class.forName0(Native Method)
at java.base/java.lang.Class.forName(Unknown Source)
at zingg.spark.core.executor.SparkZFactory.get(SparkZFactory.java:40)
at zingg.common.client.Client.setZingg(Client.java:68)
at zingg.common.client.Client.<init>(Client.java:46)
at zingg.spark.client.SparkClient.<init>(SparkClient.java:29)
at zingg.spark.client.SparkClient.getClient(SparkClient.java:68)
at zingg.common.client.Client.mainMethod(Client.java:185)
at zingg.spark.client.SparkClient.main(SparkClient.java:76)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.base/java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1029)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:194)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:217)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1120)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1129)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Desktop (please complete the following information):
- OS:
(.venv) spark@496208741a60:/workspaces/foo-zingg-entity-resolution $ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
Smartphone (please complete the following information): N/A
Additional context
{
"fieldDefinition": [
{
"fieldName": "data_source",
"fields": "data_source",
"dataType": "string",
"matchType": "DONT_USE"
},
// other fields...
],
"output": [
{
"name": "output",
"format": "csv",
"props": {
"location": "/tmp/zinggOutput",
"delimiter": ",",
"header": true
}
}
],
"data": [{
"name": "salesforce",
"format": "jdbc",
"props": {
"url": "jdbc:redshift://my-redshift-server:5439/my-redshift-db",
"dbtable": "my_schema.my_table",
"driver": "com.amazon.redshift.jdbc42.Driver",
"user": "test",
"password": "password123"
}
}],
"labelDataSampleSize" : 0.15,
"numPartitions": 50,
"modelId": 101,
"zinggDir": "/workspaces/foo-zingg-entity-resolution/models"
}
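Given the config above, Zingg writes the labeled training pairs under zinggDir/modelId/trainingData/marked, which matters for the workaround below. A minimal sketch of computing that path from the config (the zinggDir and modelId values are taken from the JSON above; this only locates the data, it does not fix exportModel):

```python
import json
from pathlib import Path

# Values copied from the relevant keys of the config above.
conf = json.loads(
    '{"modelId": 101, "zinggDir": "/workspaces/foo-zingg-entity-resolution/models"}'
)

# Labeled training pairs live in Parquet format under trainingData/marked.
marked = Path(conf["zinggDir"]) / str(conf["modelId"]) / "trainingData" / "marked"
print(marked)
```

This prints /workspaces/foo-zingg-entity-resolution/models/101/trainingData/marked, the directory the PySpark workaround reads from.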
Thanks for reporting this. If you are stuck, you can try reading the model folder at zinggDir/modelId/trainingData/marked using PySpark; this location will have your labeled data in Parquet format.
Will be handled alongside the SparkConnect change; putting on hold for now.
For anyone who just wants to get their training data:
from pathlib import Path
from pyspark.sql import SparkSession

MODEL_PATH: str = "{your model folder}/{your model ID}"
OUTPUT_PATH: str = "output.csv"

spark: SparkSession = SparkSession.builder.getOrCreate()

# Zingg stores the labeled pairs as Parquet under trainingData/marked
df = spark.read.parquet(str((Path(MODEL_PATH) / "trainingData/marked").absolute()))
print(df.toPandas())

# Save to CSV
df.toPandas().to_csv(Path(OUTPUT_PATH), header=True, index=False)
Same NullPointerException on zingg:0.4.0 from the Docker image.
Thanks havardox.
I'm running Zingg from Docker and am new to Spark. How can I export the model from Docker?
Can you try running pyspark inside the Docker container and using the commands shared above by @havardox?
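One way to do that, as a sketch only (the image name, tag, and mount paths below are assumptions; adjust them to your setup), is to bind-mount the host directory containing zinggDir into the container and start pyspark there:

```shell
# Assumption: your Zingg image is zingg/zingg:0.4.0 and your models live
# in /workspaces/foo-zingg-entity-resolution/models on the host.
docker run -it \
  -v /workspaces/foo-zingg-entity-resolution/models:/models \
  zingg/zingg:0.4.0 \
  pyspark

# Then, inside the pyspark shell:
#   df = spark.read.parquet("/models/101/trainingData/marked")
#   df.toPandas().to_csv("/models/output.csv", header=True, index=False)
# The CSV lands in the mounted directory, so it is visible on the host.
```

Because the output path is inside the bind mount, the exported CSV survives after the container exits.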