dbldatagen: using a text generator results in an error

Expected Behavior

Should work without error

Current Behavior

The following error is raised:

  File "./dependencies.zip/dbldatagen/text_generators.py", line 881, in pandasGenerateText
    results = self.generateText(rows, rows.size)
  File "./dependencies.zip/dbldatagen/text_generators.py", line 768, in generateText
    para_stats = np.clip(para_stats_raw, self._minValues, self._maxValues, out=stats_array)
  File "/usr/local/lib64/python3.9/site-packages/numpy/_core/fromnumeric.py", line 2247, in clip
    return _wrapfunc(a, 'clip', a_min, a_max, out=out, **kwargs)
  File "/usr/local/lib64/python3.9/site-packages/numpy/_core/fromnumeric.py", line 66, in _wrapfunc
    return _wrapit(obj, method, *args, **kwds)
  File "/usr/local/lib64/python3.9/site-packages/numpy/_core/fromnumeric.py", line 46, in _wrapit
    result = getattr(arr, method)(*args, **kwds)
  File "/usr/local/lib64/python3.9/site-packages/numpy/_core/_methods.py", line 108, in _clip
    return um.clip(a, min, max, out=out, **kwargs)
numpy._core._exceptions._UFuncOutputCastingError: Cannot cast ufunc 'clip' output from dtype('float64') to dtype('uint8') with casting rule 'same_kind'
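
A minimal sketch of what the last frames of the traceback appear to be doing, assuming only NumPy and none of dbldatagen's internals: a float64 array is clipped into a preallocated uint8 `out` buffer. The function name, array names and values are made up for illustration; only the two dtypes are taken from the error message.

```python
import numpy as np


def try_clip_into_uint8():
    """Clip float64 data into a uint8 output buffer and report what happens.

    Returns (raised, message). Whether the call raises depends on the
    installed NumPy version, so no outcome is assumed here.
    """
    raw = np.array([0.5, 2.7, 9.1])             # float64, stand-in for para_stats_raw
    out = np.empty(raw.shape, dtype=np.uint8)   # stand-in for stats_array
    try:
        np.clip(raw, 1, 4, out=out)
    except TypeError as exc:  # NumPy's ufunc casting errors derive from TypeError
        return True, str(exc)
    return False, ""


raised, message = try_clip_into_uint8()
print("numpy", np.__version__, "raised:", raised)
print(message)
```

Running this in the failing environment and in a working one should show whether the NumPy version alone accounts for the difference.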

Steps to Reproduce (for bugs)

Install dbldatagen using pip install dbldatagen

Generate a custom dataset with a text generator column

 .withColumn("essay", text=dg.ILText(paragraphs=(1, 4), sentences=(2, 6)), random=True)

Context

Creating a regular dataset with a text column throws this error. Other column types work fine. I think AWS EMR Serverless uses a newer version of NumPy by default, which is not compatible with dbldatagen.

Your Environment

dbldatagen version used: 0.4.0
Databricks Runtime version: AWS EMR Serverless
Cloud environment used: AWS

Sep 05 '24 09:09 anoopnarang

Is this on a Databricks runtime environment? If so, the versions of NumPy and Pandas used are determined by the Databricks runtime.

Which version of the Databricks runtime was being used?

Sep 19 '24 22:09 ronanstokes-db

We validated this on Databricks Serverless on AWS, though we are not sure that was the runtime you were using, since you indicated EMR.

In any event, this should work on Databricks AWS environments. The data generator is only tested and verified against Databricks environments and open source Spark environments (to facilitate offline development of Databricks solutions).

While we don't specifically block other environments, the intent is for this to be used in conjunction with Databricks environments, and the license states this.

Anyway, if this issue occurred in a Databricks environment, can you supply further details?

Dec 02 '24 22:12 ronanstokes-db

Not reproduced

Dec 02 '24 22:12 ronanstokes-db

I have this problem. I installed the latest version today: Databricks Runtime 16.2, Spark 3.5.2

numpy 1.26.4

Apr 08 '25 15:04 Palkers76

I downgraded the runtime to 15.4 and numpy to 1.23.5, and the error is gone.

Apr 08 '25 16:04 Palkers76
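
If downgrading is not an option, one possible way around the cast is to make the float-to-integer conversion explicit. The sketch below uses plain NumPy with made-up values and is not taken from dbldatagen's code; applying it would mean patching the `np.clip` call shown in the traceback locally. My understanding is that older NumPy releases only warned in this situation and fell back to an unsafe cast, which would fit the downgrade making the error go away, but treat that as an assumption.

```python
import numpy as np

raw = np.array([0.5, 2.7, 9.1])   # float64 values, made up for illustration

# Option 1: clip in floating point, then cast explicitly.
clipped_then_cast = np.clip(raw, 1, 4).astype(np.uint8)

# Option 2: keep the preallocated uint8 buffer, but opt in to the unsafe cast.
out = np.empty(raw.shape, dtype=np.uint8)
np.clip(raw, 1, 4, out=out, casting="unsafe")

print(clipped_then_cast, out)   # both drop the fractional part
```

Both options truncate toward zero, so values that relied on rounding would need an explicit `np.rint` before the cast.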