Using text generator results in an error

Open anoopnarang opened this issue 1 year ago • 1 comments

Expected Behavior

Should work without error

Current Behavior

Getting the following error

  File "./dependencies.zip/dbldatagen/text_generators.py", line 881, in pandasGenerateText
    results = self.generateText(rows, rows.size)
  File "./dependencies.zip/dbldatagen/text_generators.py", line 768, in generateText
    para_stats = np.clip(para_stats_raw, self._minValues, self._maxValues, out=stats_array)
  File "/usr/local/lib64/python3.9/site-packages/numpy/_core/fromnumeric.py", line 2247, in clip
    return _wrapfunc(a, 'clip', a_min, a_max, out=out, **kwargs)
  File "/usr/local/lib64/python3.9/site-packages/numpy/_core/fromnumeric.py", line 66, in _wrapfunc
    return _wrapit(obj, method, *args, **kwds)
  File "/usr/local/lib64/python3.9/site-packages/numpy/_core/fromnumeric.py", line 46, in _wrapit
    result = getattr(arr, method)(*args, **kwds)
  File "/usr/local/lib64/python3.9/site-packages/numpy/_core/_methods.py", line 108, in _clip
    return um.clip(a, min, max, out=out, **kwargs)
numpy._core._exceptions._UFuncOutputCastingError: Cannot cast ufunc 'clip' output from dtype('float64') to dtype('uint8') with casting rule 'same_kind'
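
The last line of the traceback can be explored in isolation with plain NumPy. The sketch below is not dbldatagen's code: `raw` and `out` are hypothetical stand-ins chosen only to match the dtypes named in the error message (a float64 clip result written into a uint8 `out=` buffer). Recent NumPy releases reject that cast under the default `same_kind` rule. Older releases appear to have tolerated it with only a deprecation warning, which would fit the version reports in this thread, but that is an assumption rather than something verified here.

```python
import numpy as np

# Hypothetical stand-ins for the arrays in the traceback (not dbldatagen's data).
raw = np.array([0.0, 2.0, 7.0])      # float64 values to be clipped
out = np.zeros(3, dtype=np.uint8)    # uint8 buffer passed as `out=`

try:
    # float64 -> uint8 is not a "same_kind" cast, so recent NumPy raises here.
    np.clip(raw, 1, 4, out=out)
    cast_failed = False
except TypeError as exc:
    # NumPy's ufunc casting errors derive from TypeError.
    cast_failed = True
    print(f"clip with out= failed: {exc}")

# One possible workaround: clip into a new array, then cast explicitly.
clipped = np.clip(raw, 1, 4).astype(np.uint8)
print(cast_failed, clipped)
```

Clipping first and casting afterwards makes the narrowing conversion explicit, so the result does not depend on which casting rule the installed NumPy applies to `out=`.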

Steps to Reproduce (for bugs)

Install dbldatagen using pip install dbldatagen

Generate a custom dataset with a text generator column

 .withColumn("essay", text=dg.ILText(paragraphs=(1, 4), sentences=(2, 6)), random=True)

Context

Creating a regular dataset with a text column throws this error. Other types of columns work fine. I think AWS EMR Serverless uses a newer version of numpy by default, which is not compatible with dbldatagen.

Your Environment

  • dbldatagen version used: 0.4.0
  • Databricks Runtime version: Aws EMR serverless
  • Cloud environment used: Aws

anoopnarang avatar Sep 05 '24 09:09 anoopnarang

Is this on a Databricks runtime environment? If so, the versions of NumPy and Pandas used are determined by the Databricks runtime.

Which version of the Databricks runtime was being used?

ronanstokes-db avatar Sep 19 '24 22:09 ronanstokes-db

We validated this on Databricks Serverless on AWS - not sure if that was the runtime that you were using as you indicated EMR.

In any event, this should work on Databricks AWS environments. The data generator is only tested and verified against Databricks environments and open source Spark environments (to facilitate offline development of Databricks solutions).

While we don't specifically block other environments, the intent is for this to be used in conjunction with Databricks environments, and the license states this.

Anyway, if this issue occurred in a Databricks environment, can you supply further details?
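
For anyone gathering those details, a small helper like the one below could collect the relevant package versions in one place. It is a sketch only: the function name `collect_versions` and the default package list are illustrative choices, not part of dbldatagen. It relies on the standard library's `importlib.metadata`, so packages shipped without installation metadata (for example inside a zip of dependencies, as in the traceback) may be reported as not installed even when they are importable.

```python
from importlib import metadata


def collect_versions(packages=("dbldatagen", "numpy", "pandas", "pyspark")):
    """Return a dict mapping each package name to its installed version.

    Packages without discoverable metadata are reported as 'not installed'.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


# Print one line per package, ready to paste into an issue comment.
for package, version in collect_versions().items():
    print(f"{package}: {version}")
```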

ronanstokes-db avatar Dec 02 '24 22:12 ronanstokes-db

Not reproduced

ronanstokes-db avatar Dec 02 '24 22:12 ronanstokes-db

I have this problem. I installed the latest version today: Databricks Runtime 16.2, Spark 3.5.2

numpy 1.26.4

Palkers76 avatar Apr 08 '25 15:04 Palkers76

I downgraded the runtime to 15.4 and numpy to 1.23.5, and the error is gone.

Palkers76 avatar Apr 08 '25 16:04 Palkers76