
T5 Transformer Error

Open tzhan0909 opened this issue 1 year ago • 3 comments

Description

Hi, I am new to Spark NLP and have run into an issue I hope you can help me resolve. I am trying to use the T5 transformer to summarize a long document. I downloaded the offline model "t5_base_en_2.7.1_2.4_1610133506835" and was able to load it successfully, but when I tried to transform the document I got an error: java.util.NoSuchElementException: Param batchSize does not exist.

I have also downloaded and tried "t5_base_en_2.7.1_2.4_1610133506835" but still got the same error. My Spark NLP version is 4.1.0, my Apache Spark version is 3.2.1, and the DBR is 10.3/10.5.

I have spent a few days trying to figure this out and thought it might be easier to ask. Thanks for your help.
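For context, here is a minimal sketch of the kind of pipeline that hits this error. The DBFS path, column names, and `sample_df` are illustrative, not taken from the original report:

```python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import T5Transformer
from pyspark.ml import Pipeline

# Turn the raw text column into Spark NLP's document annotation
document_assembler = (DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document"))

# Load the manually downloaded/extracted model folder (illustrative path)
t5 = (T5Transformer.load("dbfs:/models/t5_base_en_2.7.1_2.4_1610133506835")
    .setTask("summarize:")
    .setInputCols(["document"])
    .setOutputCol("summaries"))

pipeline = Pipeline(stages=[document_assembler, t5])

# The NoSuchElementException in the traceback below is raised here, while
# PySpark copies the Python-side default params (including batchSize) to
# the JVM stage
result = pipeline.fit(sample_df).transform(sample_df)
```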


```
Py4JJavaError                             Traceback (most recent call last)
 in
----> 1 pipeline_model = pipeline.fit(sample_df).transform(sample_df)

/databricks/python_shell/dbruntime/MLWorkloadsInstrumentation/_pyspark.py in patched_method(self, *args, **kwargs)
     28         call_succeeded = False
     29         try:
---> 30             result = original_method(self, *args, **kwargs)
     31             call_succeeded = True
     32             return result

/databricks/spark/python/pyspark/ml/base.py in transform(self, dataset, params)
    215                 return self.copy(params)._transform(dataset)
    216             else:
--> 217                 return self._transform(dataset)
    218         else:
    219             raise TypeError("Params must be a param map but got %s." % type(params))

/databricks/spark/python/pyspark/ml/pipeline.py in _transform(self, dataset)
    276     def _transform(self, dataset):
    277         for t in self.stages:
--> 278             dataset = t.transform(dataset)
    279         return dataset
    280

/databricks/python_shell/dbruntime/MLWorkloadsInstrumentation/_pyspark.py in patched_method(self, *args, **kwargs)
     28         call_succeeded = False
     29         try:
---> 30             result = original_method(self, *args, **kwargs)
     31             call_succeeded = True
     32             return result

/databricks/spark/python/pyspark/ml/base.py in transform(self, dataset, params)
    215                 return self.copy(params)._transform(dataset)
    216             else:
--> 217                 return self._transform(dataset)
    218         else:
    219             raise TypeError("Params must be a param map but got %s." % type(params))

/databricks/spark/python/pyspark/ml/wrapper.py in _transform(self, dataset)
    347
    348     def _transform(self, dataset):
--> 349         self._transfer_params_to_java()
    350         return DataFrame(self._java_obj.transform(dataset._jdf), dataset.sql_ctx)
    351

/databricks/spark/python/pyspark/ml/wrapper.py in _transfer_params_to_java(self)
    144                 self._java_obj.set(pair)
    145             if self.hasDefault(param):
--> 146                 pair = self._make_java_param_pair(param, self._defaultParamMap[param])
    147                 pair_defaults.append(pair)
    148         if len(pair_defaults) > 0:

/databricks/spark/python/pyspark/ml/wrapper.py in _make_java_param_pair(self, param, value)
    130         sc = SparkContext._active_spark_context
    131         param = self._resolveParam(param)
--> 132         java_param = self._java_obj.getParam(param.name)
    133         java_value = _py2java(sc, value)
    134         return java_param.w(java_value)

/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1302
   1303         answer = self.gateway_client.send_command(command)
-> 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)
   1306

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    115     def deco(*a, **kw):
    116         try:
--> 117             return f(*a, **kw)
    118         except py4j.protocol.Py4JJavaError as e:
    119             converted = convert_exception(e.java_exception)

/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    324             value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325             if answer[1] == REFERENCE_TYPE:
--> 326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
    328                     format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling o8476.getParam.
: java.util.NoSuchElementException: Param batchSize does not exist.
	at org.apache.spark.ml.param.Params.$anonfun$getParam$2(params.scala:705)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.ml.param.Params.getParam(params.scala:705)
	at org.apache.spark.ml.param.Params.getParam$(params.scala:703)
	at org.apache.spark.ml.PipelineStage.getParam(Pipeline.scala:41)
	at sun.reflect.GeneratedMethodAccessor286.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
	at py4j.Gateway.invoke(Gateway.java:295)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:251)
	at java.lang.Thread.run(Thread.java:748)
```

Your Environment

  • Spark NLP version sparknlp.version(): 4.1.0
  • Apache Spark version spark.version: 3.2.1
  • Java version java -version:
  • Setup and installation (Pypi, Conda, Maven, etc.):
  • Operating System and version: Databricks
  • Link to your project (if any):

tzhan0909 avatar Sep 17 '22 03:09 tzhan0909

Hi,

I don't recommend downloading models/pipelines offline without knowing the exact version compatibility. (The error you are facing happens when there is more than one version of that model.) I highly recommend using T5Transformer.pretrained("name of the model"), which will find and download the most recent compatible model for you. Then you can look in your home directory under cache_pretrained to find the full name of that model, and you can download or reuse that downloaded model offline.
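For example, a sketch of that approach (the column names are illustrative; t5_base is one of the public English models):

```python
from sparknlp.annotator import T5Transformer

# Resolves, downloads, and caches the most recent model that is
# compatible with the installed Spark NLP / Apache Spark versions
t5 = (T5Transformer.pretrained("t5_base", lang="en")
    .setTask("summarize:")
    .setInputCols(["document"])
    .setOutputCol("summaries"))

# Afterwards, the extracted model folder (with its full versioned name)
# sits under ~/cache_pretrained/ and can be copied out for offline reuse
```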

maziyarpanahi avatar Sep 17 '22 06:09 maziyarpanahi

Thank you! The challenge I have is that my environment does not allow online downloads from outside the AWS bucket, so I have to use the offline version...

tzhan0909 avatar Sep 17 '22 06:09 tzhan0909

@tzhan0909 you are welcome. Spark NLP is fully capable of running 100% offline. You can do something like this to find the compatible model(s) for your installed Spark NLP / PySpark libraries (obviously you can use the already downloaded/extracted model and copy it somewhere else, but this helps in case you only need the full name):

https://colab.research.google.com/drive/1o7DxloOpC67oOAW8aj-XBvPbAnMqkOwI?usp=sharing

It's pretty easy: you can do this on a machine with Internet access, and then the offline model will work without any issue.
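A sketch of that workflow (the DBFS path and column names are illustrative; ResourceDownloader is Spark NLP's helper for listing public models):

```python
from sparknlp.pretrained import ResourceDownloader
from sparknlp.annotator import T5Transformer

# On a machine with Internet access: list the public T5 models together
# with the library versions they are compatible with
ResourceDownloader.showPublicModels("T5Transformer", "en")

# On the air-gapped cluster: load the copied model folder directly;
# no download is attempted at this point
t5_offline = (T5Transformer.load("dbfs:/models/t5_base_en")
    .setTask("summarize:")
    .setInputCols(["document"])
    .setOutputCol("summaries"))
```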

maziyarpanahi avatar Sep 19 '22 07:09 maziyarpanahi

This issue is stale because it has been open 180 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.

github-actions[bot] avatar Mar 19 '23 00:03 github-actions[bot]