spark-nlp
T5 Transformer Error
Description
Hi, I am new to Spark NLP and have run into an issue I hope you can help me resolve. I am trying to use the T5 transformer to summarize a long document. I downloaded the offline model `t5_base_en_2.7.1_2.4_1610133506835` and was able to load it successfully, but when I tried to transform the document I got an error: `java.util.NoSuchElementException: Param batchSize does not exist`.
I have also re-downloaded and tried `t5_base_en_2.7.1_2.4_1610133506835` but still got the same error. My Spark NLP version is 4.1.0, Apache Spark version is 3.2.1, DBR: 10.3/10.5.
I have spent a few days trying to figure this out and thought it might be easier to ask. Thanks for your help!
```
Py4JJavaError                             Traceback (most recent call last)
/databricks/python_shell/dbruntime/MLWorkloadsInstrumentation/_pyspark.py in patched_method(self, *args, **kwargs)
     28         call_succeeded = False
     29         try:
---> 30             result = original_method(self, *args, **kwargs)
     31             call_succeeded = True
     32             return result

/databricks/spark/python/pyspark/ml/base.py in transform(self, dataset, params)
    215                 return self.copy(params)._transform(dataset)
    216             else:
--> 217                 return self._transform(dataset)
    218         else:
    219             raise TypeError("Params must be a param map but got %s." % type(params))

/databricks/spark/python/pyspark/ml/pipeline.py in _transform(self, dataset)
    276     def _transform(self, dataset):
    277         for t in self.stages:
--> 278             dataset = t.transform(dataset)
    279         return dataset
    280

/databricks/python_shell/dbruntime/MLWorkloadsInstrumentation/_pyspark.py in patched_method(self, *args, **kwargs)
     28         call_succeeded = False
     29         try:
---> 30             result = original_method(self, *args, **kwargs)
     31             call_succeeded = True
     32             return result

/databricks/spark/python/pyspark/ml/base.py in transform(self, dataset, params)
    215                 return self.copy(params)._transform(dataset)
    216             else:
--> 217                 return self._transform(dataset)
    218         else:
    219             raise TypeError("Params must be a param map but got %s." % type(params))

/databricks/spark/python/pyspark/ml/wrapper.py in _transform(self, dataset)
    347
    348     def _transform(self, dataset):
--> 349         self._transfer_params_to_java()
    350         return DataFrame(self._java_obj.transform(dataset._jdf), dataset.sql_ctx)
    351

/databricks/spark/python/pyspark/ml/wrapper.py in _transfer_params_to_java(self)
    144                 self._java_obj.set(pair)
    145             if self.hasDefault(param):
--> 146                 pair = self._make_java_param_pair(param, self._defaultParamMap[param])
    147                 pair_defaults.append(pair)
    148         if len(pair_defaults) > 0:

/databricks/spark/python/pyspark/ml/wrapper.py in _make_java_param_pair(self, param, value)
    130         sc = SparkContext._active_spark_context
    131         param = self._resolveParam(param)
--> 132         java_param = self._java_obj.getParam(param.name)
    133         java_value = _py2java(sc, value)
    134         return java_param.w(java_value)

/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1302
   1303         answer = self.gateway_client.send_command(command)
-> 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)
   1306

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    115     def deco(*a, **kw):
    116         try:
--> 117             return f(*a, **kw)
    118         except py4j.protocol.Py4JJavaError as e:
    119             converted = convert_exception(e.java_exception)

/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    324             value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325             if answer[1] == REFERENCE_TYPE:
--> 326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
    328                     format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling o8476.getParam.
: java.util.NoSuchElementException: Param batchSize does not exist.
	at org.apache.spark.ml.param.Params.$anonfun$getParam$2(params.scala:705)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.ml.param.Params.getParam(params.scala:705)
	at org.apache.spark.ml.param.Params.getParam$(params.scala:703)
	at org.apache.spark.ml.PipelineStage.getParam(Pipeline.scala:41)
	at sun.reflect.GeneratedMethodAccessor286.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
	at py4j.Gateway.invoke(Gateway.java:295)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:251)
	at java.lang.Thread.run(Thread.java:748)
```
Your Environment
- Spark NLP version (`sparknlp.version()`): 4.1.0
- Apache Spark version (`spark.version`): 3.2.1
- Java version (`java -version`):
- Setup and installation (Pypi, Conda, Maven, etc.):
- Operating System and version: Databricks
- Link to your project (if any):
Hi,
I don't recommend downloading models/pipelines offline without knowing the exact version compatibility. (The error you are facing happens when there is more than one version of that model.) I highly recommend using `T5Transformer.pretrained("name of the model")`; it will find and download the most recent compatible model for you. You can then look in the `cache_pretrained` directory under your home directory for the full name of that model, and download or reuse that downloaded model offline.
Thank you! The challenge I have is that my environment does not allow online downloads from outside our AWS bucket, so I have to use the offline version...
@tzhan0909 you are welcome. Spark NLP is fully compatible with 100% offline use. You can do something like this to find the compatible model(s) for your installed Spark NLP / PySpark libraries (obviously you can use the already downloaded/extracted model and copy it somewhere else, but this helps in case you only need the full name):
https://colab.research.google.com/drive/1o7DxloOpC67oOAW8aj-XBvPbAnMqkOwI?usp=sharing
It's pretty easy: run it once somewhere with Internet access, and the resulting offline model will then work without any issue.
This issue is stale because it has been open 180 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.