Wrong Documentation example

Open zdposter opened this issue 2 years ago • 2 comments

Hello, I am trying to get POS tagging working by following the example at https://nlp.johnsnowlabs.com/2021/03/23/pos_snk_sk.html

import sparknlp

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")

sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")

pos = PerceptronModel.pretrained("pos_snk", "sk")\
.setInputCols(["document", "token"])\
.setOutputCol("pos")

pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
posTagger
])

example = spark.createDataFrame([['Potom Maju nežne pohladila po hlávke a vraví : Spoznáš krásny veľký svet , Maja , hrejivé slniečko a nádherné lúky plné kvetov .']], ["text"])
result = pipeline.fit(example).transform(example)

It fails with the error: NameError: name 'posTagger' is not defined

When I rename pos to posTagger (posTagger = PerceptronModel.pretrained("pos_snk", "sk")\), I receive another error:

IllegalArgumentException: requirement failed: Wrong or missing inputCols annotators in REGEX_TOKENIZER_6249f09e1595.

Current inputCols: sentence. Dataset's columns:
(column_name=text,is_nlp_annotator=false)
(column_name=document,is_nlp_annotator=true,type=document).
Make sure such annotators exist in your pipeline, with the right output names and that they have following annotator types: document

As far as I can see, the same problem affects the Python examples for POS models in other languages.

zdposter, Sep 10 '22 14:09

Thanks for reporting this. The Tokenizer() annotator is missing from the example code. I'll ask someone to go through those models and add the missing Tokenizer.

@ahmedlone127 Could you please have a look at models here (POS especially) and add the missing Tokenizer to both Python and Scala examples: https://github.com/JohnSnowLabs/spark-nlp/tree/master/docs/_posts/dcecchini
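
For anyone reviewing those examples, the missing stage can be spotted without running Spark. The following is only a rough sketch in plain Python: `find_missing_inputs` is a hypothetical helper, not part of Spark NLP, and it compares column names only, whereas Spark NLP itself validates annotator types. The stage tuples are hand-written from the example above.

```python
# Hypothetical helper, NOT a Spark NLP API: reports every input column
# that no earlier stage (and no raw dataset column) provides.
def find_missing_inputs(stages, dataset_columns):
    available = set(dataset_columns)
    missing = []
    for name, inputs, output in stages:
        for col in inputs:
            if col not in available:
                missing.append((name, col))
        # After a stage runs, its output column is available downstream.
        available.add(output)
    return missing


# (stage name, input columns, output column), transcribed by hand.
broken = [
    ("DocumentAssembler", ["text"], "document"),
    ("SentenceDetector", ["document"], "sentence"),
    ("PerceptronModel", ["document", "token"], "pos"),
]

# Same pipeline with the Tokenizer stage added before the POS tagger.
fixed = broken[:2] + [
    ("Tokenizer", ["sentence"], "token"),
    ("PerceptronModel", ["document", "token"], "pos"),
]

print(find_missing_inputs(broken, ["text"]))  # [('PerceptronModel', 'token')]
print(find_missing_inputs(fixed, ["text"]))   # []
```

The first call flags the `token` column because nothing produces it; with the Tokenizer stage in place, the list comes back empty.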

maziyarpanahi, Sep 10 '22 14:09

Thank you for the quick response. I confirm that this modified example works:

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

spark = sparknlp.start()

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols("sentence")\
    .setOutputCol("token")

posTagger = PerceptronModel.pretrained("pos_snk", "sk")\
    .setInputCols(["document", "token"])\
    .setOutputCol("pos")

pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, posTagger])

example = spark.createDataFrame([[
    'Potom Maju nežne pohladila po hlávke a vraví : Spoznáš krásny veľký svet , Maja , hrejivé slniečko a nádherné lúky plné kvetov .'
]], ["text"])
result = pipeline.fit(example).transform(example)
result.select("token.result", "pos.result").show(truncate=80)

zdposter, Sep 11 '22 11:09