spacymoji
💙 Emoji handling and meta data for spaCy with custom extension attributes
spacymoji: emoji for spaCy
spaCy extension and pipeline component for adding emoji metadata to Doc
objects. Detects emoji consisting of one or more Unicode characters, and can
optionally merge multi-char emoji (combined pictures, emoji with skin tone
modifiers) into one token. Human-readable emoji descriptions are added as a
custom attribute, and an optional lookup table can be provided for your own
descriptions. The extension sets the custom Doc, Token and Span attributes
._.is_emoji, ._.emoji_desc, ._.has_emoji and ._.emoji. You can read more about
custom pipeline components and extension attributes in the spaCy documentation.
Emoji are matched using spaCy's PhraseMatcher, and looked up in the data
table provided by the emoji package.
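To make "emoji consisting of one or more Unicode characters" concrete, here is a small standard-library sketch (it does not use spacymoji or spaCy) that lists the code points behind a thumbs-up emoji with a dark skin tone modifier. The helper name describe_code_points is made up for this illustration.

```python
import unicodedata

# One visible emoji, two code points: "thumbs up" plus a dark skin tone modifier.
thumbs_up_dark = "\U0001F44D\U0001F3FF"


def describe_code_points(text):
    """Return (hex code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


parts = describe_code_points(thumbs_up_dark)
for code_point, name in parts:
    print(code_point, name)
```

A sequence like this is what gets merged into a single token when span merging is enabled.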
⏳ Installation
spacymoji requires spacy v3.0.0 or higher. For spaCy v2.x, install spacymoji==2.0.0.
pip install spacymoji
☝️ Usage
Import the component and add it anywhere in your pipeline using the string
name of the "emoji" component factory:
import spacy
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("emoji", first=True)
doc = nlp("This is a test 😻 👍🏿")
assert doc._.has_emoji is True
assert doc[2:5]._.has_emoji is True
assert doc[0]._.is_emoji is False
assert doc[4]._.is_emoji is True
assert doc[5]._.emoji_desc == "thumbs up dark skin tone"
assert len(doc._.emoji) == 2
assert doc._.emoji[1] == ("👍🏿", 5, "thumbs up dark skin tone")
spacymoji only cares about the token text, so you can use it on a blank
Language instance (it should work for all available languages!), or in a
trained pipeline you have loaded. If your pipeline includes a tagger, parser
and entity recognizer, make sure to add the emoji component with first=True,
so the spans are merged right after tokenization, and before the document is
parsed. If your text contains a lot of emoji, this might even give you a nice
boost in parser accuracy.
Available attributes
The extension sets attributes on the Doc, Span and Token. You can change the
attribute names (and other parameters of the Emoji component) by passing them
via the config parameter in the nlp.add_pipe(...) method. For more details on
custom components and attributes, see the processing pipelines documentation.
Attribute | Type | Description |
---|---|---|
Token._.is_emoji | bool | Whether the token is an emoji. |
Token._.emoji_desc | str | A human-readable description of the emoji. |
Doc._.has_emoji | bool | Whether the document contains emoji. |
Doc._.emoji | List[Tuple[str, int, str]] | (emoji, index, description) tuples of the document's emoji. |
Span._.has_emoji | bool | Whether the span contains emoji. |
Span._.emoji | List[Tuple[str, int, str]] | (emoji, index, description) tuples of the span's emoji. |
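Because Doc._.emoji and Span._.emoji are plain lists of (emoji, index, description) tuples, they can be post-processed with ordinary Python. The sketch below works on a hand-written list of that shape rather than a real Doc, so it runs without spaCy. The helper descriptions_by_index is hypothetical, and the first description in the sample is a placeholder, not the text spacymoji would return.

```python
from typing import Dict, List, Tuple

# Shape documented for Doc._.emoji and Span._.emoji: (emoji, index, description)
EmojiEntry = Tuple[str, int, str]


def descriptions_by_index(entries: List[EmojiEntry]) -> Dict[int, str]:
    """Map each emoji's token index to its description (hypothetical helper)."""
    return {index: description for _emoji, index, description in entries}


# Hand-written stand-in for doc._.emoji; the first description is a placeholder.
sample_entries: List[EmojiEntry] = [
    ("\U0001F63B", 4, "placeholder description"),
    ("\U0001F44D\U0001F3FF", 5, "thumbs up dark skin tone"),
]

by_index = descriptions_by_index(sample_entries)
print(by_index[5])
```

With a real pipeline, the same helper could be called as descriptions_by_index(doc._.emoji).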
Settings
You can configure the emoji factory by setting any of the following parameters
in the config dictionary:
Setting | Type | Description |
---|---|---|
attrs | Tuple[str, str, str, str] | Attributes to set on the ._ property. Defaults to ('has_emoji', 'is_emoji', 'emoji_desc', 'emoji'). |
pattern_id | str | ID of match pattern, defaults to 'EMOJI'. Can be changed to avoid ID conflicts. |
merge_spans | bool | Merge spans containing multi-character emoji, defaults to True. Will only merge combined emoji resulting in one icon, not sequences. |
lookup | Dict[str, str] | Optional lookup table that maps emoji strings to custom descriptions, e.g. translations or other annotations. |
# Custom attribute names plus a lookup table with a custom description
emoji_config = {"attrs": ("has_e", "is_e", "e_desc", "e"), "lookup": {"👨‍🎤": "David Bowie"}}
nlp.add_pipe("emoji", first=True, config=emoji_config)
doc = nlp("We can be 👨‍🎤 heroes")
assert doc[3]._.is_e
assert doc[3]._.e_desc == "David Bowie"
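If you assemble the settings in code, for example to plug in a table of translated descriptions, a small builder keeps the dictionary consistent. This is only a sketch: build_emoji_config is a hypothetical helper, not part of spacymoji, and the German description is an illustrative value. Only the keys merge_spans, pattern_id and lookup come from the settings table.

```python
from typing import Dict, Optional


def build_emoji_config(
    lookup: Optional[Dict[str, str]] = None,
    merge_spans: bool = True,
    pattern_id: str = "EMOJI",
) -> dict:
    """Assemble a config dict for the "emoji" factory (hypothetical helper)."""
    config = {"merge_spans": merge_spans, "pattern_id": pattern_id}
    if lookup:
        # Copy so later changes to the caller's table don't leak into the config.
        config["lookup"] = dict(lookup)
    return config


# Illustrative lookup table with a translated description (assumed value).
translations = {"\U0001F44D": "Daumen hoch"}
emoji_config = build_emoji_config(lookup=translations, pattern_id="CUSTOM_EMOJI")
print(emoji_config)
```

The resulting dictionary would then be passed as nlp.add_pipe("emoji", first=True, config=emoji_config).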
If you're training a pipeline, you can define the component config in your config.cfg:
[nlp]
pipeline = ["emoji", "ner"]
# ...
[components.emoji]
factory = "emoji"
merge_spans = false