Using a text generator results in an error
Expected Behavior
Should work without error
Current Behavior
Getting the following error
File "./dependencies.zip/dbldatagen/text_generators.py", line 881, in pandasGenerateText
results = self.generateText(rows, rows.size)
File "./dependencies.zip/dbldatagen/text_generators.py", line 768, in generateText
para_stats = np.clip(para_stats_raw, self._minValues, self._maxValues, out=stats_array)
File "/usr/local/lib64/python3.9/site-packages/numpy/_core/fromnumeric.py", line 2247, in clip
return _wrapfunc(a, 'clip', a_min, a_max, out=out, **kwargs)
File "/usr/local/lib64/python3.9/site-packages/numpy/_core/fromnumeric.py", line 66, in _wrapfunc
return _wrapit(obj, method, *args, **kwds)
File "/usr/local/lib64/python3.9/site-packages/numpy/_core/fromnumeric.py", line 46, in _wrapit
result = getattr(arr, method)(*args, **kwds)
File "/usr/local/lib64/python3.9/site-packages/numpy/_core/_methods.py", line 108, in _clip
return um.clip(a, min, max, out=out, **kwargs)
numpy._core._exceptions._UFuncOutputCastingError: Cannot cast ufunc 'clip' output from dtype('float64') to dtype('uint8') with casting rule 'same_kind'
Steps to Reproduce (for bugs)
Install dbldatagen using pip install dbldatagen
Generate a custom dataset with a text generator column
.withColumn("essay", text=dg.ILText(paragraphs=(1, 4), sentences=(2, 6)), random=True)
Context
When I try to create a regular dataset with a text column, it throws this error. Other types of columns work fine. I think AWS EMR Serverless uses a newer version of numpy by default, and that version is not compatible with dbldatagen.
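The last line of the traceback can be reproduced outside dbldatagen. The sketch below is not the library's code: the array values, the bounds, and the helper name `clip_to_uint8` are made up for illustration. It assumes the failing call amounts to clipping a float64 array into a uint8 output buffer, which newer numpy releases reject under the `same_kind` casting rule, and it shows an explicit cast as one possible workaround.

```python
import numpy as np


def clip_to_uint8(values, lo, hi):
    """Hypothetical helper: clip in float64 first, then cast explicitly."""
    clipped = np.clip(values, lo, hi)   # result keeps the float64 dtype
    return clipped.astype(np.uint8)     # explicit, intentional narrowing cast


# Illustrative float64 data, similar in kind to rounded random samples.
raw = np.round(np.array([[0.2, 3.7], [9.1, 5.0]]))
out = np.empty(raw.shape, dtype=np.uint8)

try:
    # Writing a float64 result into a uint8 buffer: the pattern from the traceback.
    np.clip(raw, 1, 6, out=out)
    raised = False
except TypeError as exc:
    # numpy's ufunc casting errors are subclasses of TypeError.
    raised = True
    print(type(exc).__name__, exc)

print("direct clip into uint8 buffer raised:", raised)

# Workaround sketch: no out= buffer, cast after clipping.
safe = clip_to_uint8(raw, 1, 6)
print(safe, safe.dtype)
```

Whether the direct call raises or only warns depends on the installed numpy version, so the sketch reports the outcome instead of assuming it. The explicit cast behaves the same either way.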
Your Environment
- dbldatagen version used: 0.4.0
- Databricks Runtime version: AWS EMR Serverless
- Cloud environment used: AWS
Is this on a Databricks runtime environment? If so, the versions of NumPy and pandas used are determined by the Databricks runtime.
Which version of the Databricks runtime was being used?
We validated this on Databricks Serverless on AWS. We are not sure whether that was the runtime you were using, since you indicated EMR.
In any event, this should work on Databricks AWS environments. The data generator is only tested and verified against Databricks environments and open source Spark environments (to facilitate offline development of Databricks solutions).
While we don't specifically block other environments, the intent is for this to be used in conjunction with Databricks environments, and the license states this.
In any case, if this issue occurred in a Databricks environment, can you supply further details?
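One way to gather those details is to print the installed package versions from the failing environment. This is a minimal sketch using only the standard library. The function name `collect_versions` and the package list are assumptions for illustration, not part of dbldatagen.

```python
import sys
from importlib import metadata


def collect_versions(packages):
    """Return a dict of package name -> installed version (hypothetical helper)."""
    report = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in this environment.
            report[name] = "not installed"
    return report


# Packages that appear relevant to this issue; adjust as needed.
report = collect_versions(["dbldatagen", "numpy", "pandas", "pyspark"])
for name, version in report.items():
    print(f"{name}: {version}")
```

Running this in a notebook cell or job on the affected cluster gives the versions actually in use there, which may differ from a local install.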
Not reproduced
I have this problem. Today I installed the latest version: Databricks Runtime 16.2, Spark 3.5.2, numpy 1.26.4.
After I downgraded the runtime to 15.4 with numpy 1.23.5, the error is gone.