spark-nlp
MultiDateMatcher only returning 1 element
Is there an existing issue for this?
- [X] I have searched the existing issues and did not find a match.
Who can help?
No response
What are you working on?
Finding dates in a string.
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Start a Spark session with Spark NLP loaded
spark = sparknlp.start()

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Relative dates are resolved against the anchor date 2020/01/11
date = MultiDateMatcher() \
    .setInputCols("document") \
    .setOutputCol("date") \
    .setAnchorDateYear(2020) \
    .setAnchorDateMonth(1) \
    .setAnchorDateDay(11) \
    .setOutputFormat("yyyy/MM/dd")

pipeline = Pipeline().setStages([
    documentAssembler,
    date
])

# Single row containing two explicit dates
data = spark.createDataFrame([["Nov 29 2023, Dec 1 2024"]]) \
    .toDF("text")

result = pipeline.fit(data).transform(data)
result.selectExpr("explode(date) as dates").show(truncate=False)
Current Behavior
Currently, when I pass ["Nov 29 2023, Dec 1 2024"] to MultiDateMatcher, it returns only the first date (2023/11/29) instead of both dates.
+-----------------------------------------------+
|dates                                          |
+-----------------------------------------------+
|{date, 10, 20, 2023/11/29, {sentence -> 0}, []}|
+-----------------------------------------------+
Expected Behavior
Both dates should be returned: 2023/11/29 and 2024/12/01.
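To make the expected result concrete, here is a minimal plain-Python sketch that extracts every date from the same input string and formats it the way `setOutputFormat("yyyy/MM/dd")` would. It is not how MultiDateMatcher works internally; the regex, the `expected_dates` helper, and the assumption that dates look like "Mon D YYYY" are all illustrative only.

```python
import re
from datetime import datetime

# Assumed input shape: abbreviated English month, day, four-digit year ("Nov 29 2023").
DATE_PATTERN = re.compile(r"\b([A-Z][a-z]{2}) (\d{1,2}) (\d{4})\b")


def expected_dates(text):
    """Return every 'Mon D YYYY' date found in text, formatted as yyyy/MM/dd."""
    found = []
    for match in DATE_PATTERN.finditer(text):
        try:
            # %b parses abbreviated month names (assumes an English/C locale).
            parsed = datetime.strptime(" ".join(match.groups()), "%b %d %Y")
        except ValueError:
            continue  # skip tokens that are not a valid month or day
        found.append(parsed.strftime("%Y/%m/%d"))
    return found


print(expected_dates("Nov 29 2023, Dec 1 2024"))
# prints ['2023/11/29', '2024/12/01']
```

The annotator output would be expected to contain one annotation per entry in this list, rather than only the first.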
Steps To Reproduce
https://colab.research.google.com/drive/1xGE1MqqcsjOL9kyOoOwkiqnMa4LabETK?usp=sharing
I copied the example code from the documentation and substituted the dates (Nov 29 2023, Dec 1 2024) as the input text.
Spark NLP version and Apache Spark
Spark NLP 5.1.4, Apache Spark 3.5.0
Type of Spark Application
Python Application
Java Version
openjdk version "11.0.21" 2023-10-17
OpenJDK Runtime Environment (build 11.0.21+9-post-Ubuntu-0ubuntu122.04)
OpenJDK 64-Bit Server VM (build 11.0.21+9-post-Ubuntu-0ubuntu122.04, mixed mode, sharing)
Java Home Directory
N/A
Setup and installation
Google Colab
Operating System and Version
Google Colab (Ubuntu Linux)
Link to your project (if available)
https://colab.research.google.com/drive/1xGE1MqqcsjOL9kyOoOwkiqnMa4LabETK?usp=sharing
Additional Information
https://sparknlp.org/api/com/johnsnowlabs/nlp/annotators/MultiDateMatcher$.html
This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 5 days