entity_extract() giving inconsistent results

Open aterhorst opened this issue 3 years ago • 0 comments

Hello! I am new to spacyr.

I plan to use spacyr to perform named entity recognition across several news articles. My goal is to build signed networks from named entities. However, spacyr is not recognising common entities as I expected:

library(quanteda)
library(spacyr)

text <- data.frame(doc_id = c(1:5),
                   sentence = c("Brightmark LLC, the global waste solutions provider, and Florida Keys National Marine Sanctuary (FKNMS), today announced a new plastic recycling partnership that will reduce landfill waste and amplify concerns about ocean plastics.",
                                "Brightmark is launching a nationwide site search for U.S. locations suitable for its next set of advanced recycling facilities, which will convert hundreds of thousands of tons of post-consumer plastics into new products, including fuels, wax, and other products.",
                                "Brightmark will be constructing the facility in partnership with the NSW government, as part of its commitment to drive economic growth and prosperity in regional NSW.",
                                "Macon-Bibb County, the Macon-Bibb County Industrial Authority, and Brightmark have mutually agreed to end discussions around building a plastic recycling plant in Macon",
                                "Global petrochemical company SK Global Chemical and waste solutions provider Brightmark have signed a memorandum of understanding to create a partnership that aims to take the lead in the circular economy of plastic by construction of a commercial scale plastics renewal plant in South Korea"))

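# Build a quanteda corpus; the doc_id column supplies the document names
corpus <- corpus(text, text_field = "sentence")

# Start spaCy with the small English model
spacy_initialize(model = "en_core_web_sm")

# Tokenise and tag each document (entity tags are included by default)
parsed <- spacy_parse(corpus)

# Collapse the token-level entity tags into one row per named entity
entity <- entity_extract(parsed)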

I expect the company "Brightmark" to be recognised in all 5 sentences. However, this is what I get:

entity

  doc_id sentence_id                                 entity entity_type
1      1           1 Florida_Keys_National_Marine_Sanctuary         ORG
2      1           1                                  FKNMS         ORG
3      2           1                                   U.S.         GPE
4      3           1                                    NSW         ORG
5      4           1                    Macon_-_Bibb_County         ORG
6      4           1                             Brightmark         ORG
7      4           1                                  Macon         GPE
8      5           1                     SK_Global_Chemical         ORG
9      5           1                            South_Korea         GPE

"Brightmark" appears as an ORG entity only in the 4th sentence. The "NSW government" does not appear at all (although "NSW" on its own is recognised as an organisation).

I am still figuring out spaCy and spacyr. Perhaps someone can advise me why this is happening and what steps I should take to remedy it. Perhaps my example sentences are too short, and I should train a model on complete articles rather than sentences. I want to extract entities at the sentence level because I am using the sentimentr package to compute sentiment at the sentence level. The idea is to use the sentiment scores to sign relations between two entities appearing in the same sentence.
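
To make the intended downstream step concrete, here is a minimal base R sketch of how I would turn the entity table into unsigned entity pairs, one row per pair of entities sharing a sentence. The data frame is copied by hand from the output above so that the snippet runs without spaCy, and `pair_entities()` is a hypothetical helper of my own, not a spacyr function. Signing the pairs would be a later step (for example, merging a sentence-level sentiment score on `doc_id` and `sentence_id`), which is not shown here.

```r
# Entity table copied from the entity_extract() output shown above.
entity <- data.frame(
  doc_id = as.character(c(1, 1, 2, 3, 4, 4, 4, 5, 5)),
  sentence_id = 1,
  entity = c("Florida_Keys_National_Marine_Sanctuary", "FKNMS", "U.S.", "NSW",
             "Macon_-_Bibb_County", "Brightmark", "Macon",
             "SK_Global_Chemical", "South_Korea"),
  entity_type = c("ORG", "ORG", "GPE", "ORG", "ORG", "ORG", "GPE", "ORG", "GPE"),
  stringsAsFactors = FALSE
)

# pair_entities() is a hypothetical helper (not part of spacyr):
# it lists every unordered pair of distinct entities found in the same sentence.
pair_entities <- function(ents) {
  # Group entity names by document and sentence, e.g. "4.1" = doc 4, sentence 1
  groups <- split(ents$entity, list(ents$doc_id, ents$sentence_id), drop = TRUE)
  edges <- lapply(names(groups), function(key) {
    e <- unique(groups[[key]])
    if (length(e) < 2) return(NULL)  # a lone entity cannot form a pair
    p <- t(combn(e, 2))              # one row per unordered pair
    data.frame(sentence = key, from = p[, 1], to = p[, 2],
               stringsAsFactors = FALSE)
  })
  do.call(rbind, Filter(Negate(is.null), edges))
}

edges <- pair_entities(entity)
print(edges)
```

With the current output, sentences 2 and 3 contribute no pairs at all because only one entity was recognised in each, which is why the missing "Brightmark" entities matter for the network.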

aterhorst avatar Sep 27 '22 02:09 aterhorst