MultiDateMatcher only returning 1 element

Open TommyDong1998 opened this issue 1 year ago • 1 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues and did not find a match.

Who can help?

No response

What are you working on?

Finding dates in a string.

    import sparknlp
    from sparknlp.base import *
    from sparknlp.annotator import *
    from pyspark.ml import Pipeline

    documentAssembler = DocumentAssembler() \
        .setInputCol("text") \
        .setOutputCol("document")

    date = MultiDateMatcher() \
        .setInputCols("document") \
        .setOutputCol("date") \
        .setAnchorDateYear(2020) \
        .setAnchorDateMonth(1) \
        .setAnchorDateDay(11) \
        .setOutputFormat("yyyy/MM/dd")

    pipeline = Pipeline().setStages([
        documentAssembler,
        date
    ])

    data = spark.createDataFrame([["Nov 29 2023, Dec 1 2024"]]) \
        .toDF("text")

    result = pipeline.fit(data).transform(data)
    result.selectExpr("explode(date) as dates").show(truncate=False)

Current Behavior

When I pass the text "Nov 29 2023, Dec 1 2024" to MultiDateMatcher, it returns only the first date (2023/11/29) instead of both dates.

    +-----------------------------------------------+
    |dates                                          |
    +-----------------------------------------------+
    |{date, 10, 20, 2023/11/29, {sentence -> 0}, []}|
    +-----------------------------------------------+

Expected Behavior

Get both dates
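
To make the expected result concrete, here is a minimal plain-Python sketch of what "both dates" would look like in the yyyy/MM/dd output format. It is only an illustration: `find_dates`, `DATE_PATTERN`, and `MONTHS` are hypothetical names, the regex handles only the "Mon D YYYY" shape used in this report, and none of this reflects how MultiDateMatcher works internally.

```python
import re
from datetime import date

# Hypothetical helper for illustration only; not part of Spark NLP.
MONTHS = {
    name: number
    for number, name in enumerate(
        ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
        start=1,
    )
}

# Matches dates written as "Mon D YYYY", e.g. "Nov 29 2023" or "Dec 1 2024".
DATE_PATTERN = re.compile(r"\b(" + "|".join(MONTHS) + r") (\d{1,2}) (\d{4})\b")


def find_dates(text):
    """Return every 'Mon D YYYY' date found in text, formatted as yyyy/MM/dd."""
    found = []
    for month_name, day, year in DATE_PATTERN.findall(text):
        parsed = date(int(year), MONTHS[month_name], int(day))
        found.append(parsed.strftime("%Y/%m/%d"))
    return found


print(find_dates("Nov 29 2023, Dec 1 2024"))
# prints ['2023/11/29', '2024/12/01']
```

The expectation in this report is that the `date` output column carries one annotation per entry of that list, rather than only the first.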

Steps To Reproduce

https://colab.research.google.com/drive/1xGE1MqqcsjOL9kyOoOwkiqnMa4LabETK?usp=sharing

I copied the example code from the documentation and replaced the input text with the dates (Nov 29 2023, Dec 1 2024).

    import sparknlp
    from sparknlp.base import *
    from sparknlp.annotator import *
    from pyspark.ml import Pipeline

    documentAssembler = DocumentAssembler() \
        .setInputCol("text") \
        .setOutputCol("document")

    date = MultiDateMatcher() \
        .setInputCols("document") \
        .setOutputCol("date") \
        .setAnchorDateYear(2020) \
        .setAnchorDateMonth(1) \
        .setAnchorDateDay(11) \
        .setOutputFormat("yyyy/MM/dd")

    pipeline = Pipeline().setStages([
        documentAssembler,
        date
    ])

    data = spark.createDataFrame([["Nov 29 2023, Dec 1 2024"]]) \
        .toDF("text")

    result = pipeline.fit(data).transform(data)
    result.selectExpr("explode(date) as dates").show(truncate=False)

Spark NLP version and Apache Spark

Spark NLP 5.1.4, Apache Spark 3.5.0

Type of Spark Application

Python Application

Java Version

    openjdk version "11.0.21" 2023-10-17
    OpenJDK Runtime Environment (build 11.0.21+9-post-Ubuntu-0ubuntu122.04)
    OpenJDK 64-Bit Server VM (build 11.0.21+9-post-Ubuntu-0ubuntu122.04, mixed mode, sharing)

Java Home Directory

N/A

Setup and installation

Google Colab

Operating System and Version

Google Colab (Ubuntu Linux)

Link to your project (if available)

https://colab.research.google.com/drive/1xGE1MqqcsjOL9kyOoOwkiqnMa4LabETK?usp=sharing

Additional Information

https://sparknlp.org/api/com/johnsnowlabs/nlp/annotators/MultiDateMatcher$.html

TommyDong1998, Dec 08 '23 08:12

This issue is stale because it has been open 180 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.

github-actions[bot], Jun 17 '24 00:06