spacy-experimental
🧪 Cutting-edge experimental spaCy components and features
This package includes experimental components and features for spaCy v3.x, such as model architectures, pipeline components and utilities.
Installation
Install with pip:
python -m pip install -U pip setuptools wheel
python -m pip install spacy-experimental
Using spacy-experimental
Components and features may be modified or removed in any release, so always specify the exact version as a package requirement if you're experimenting with a particular component, e.g.:
spacy-experimental==0.147.0
Then you can add the experimental components to your config or import from spacy_experimental:
[components.experimental_char_ner_tokenizer]
factory = "experimental_char_ner_tokenizer"
Components
Trainable character-based tokenizers
Two trainable tokenizers represent tokenization as a sequence tagging problem over individual characters and use the existing spaCy tagger and NER architectures to perform the tagging.
In the spaCy pipeline, a simple "pretokenizer" is applied as the pipeline
tokenizer to split each doc into individual characters and the trainable
tokenizer is a pipeline component that retokenizes the doc. The pretokenizer
needs to be configured manually in the config or with spacy.blank():
import spacy

nlp = spacy.blank(
    "en",
    config={
        "nlp": {
            "tokenizer": {"@tokenizers": "spacy-experimental.char_pretokenizer.v1"}
        }
    },
)
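With this pretokenizer, a doc initially contains one token per character, and the trainable tokenizer component later retokenizes it into words. A quick illustration (the printed output is indicative):
doc = nlp.make_doc("Hi!")
print([token.text for token in doc])  # one token per character, e.g. ['H', 'i', '!']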
In the process of retokenizing, the tagger-based tokenizer currently resets any existing tag annotation and the NER-based tokenizer resets any existing entity annotation.
Character-based tagger tokenizer
In the tagger version experimental_char_tagger_tokenizer, the tagging problem is represented internally with character-level tags for token start (T), token internal (I), and outside a token (O). This representation comes from Elephant: Sequence Labeling for Word and Sentence Segmentation (Evang et al., 2013).
This is a sentence.
TIIIOTIOTOTIIIIIIIT
With the option annotate_sents, S replaces T for the first token in each sentence and the component predicts both token and sentence boundaries.
This is a sentence.
SIIIOTIOTOTIIIIIIIT
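To make the representation concrete, here is a small illustrative sketch (not part of the package) that derives these character-level tags from token offsets:
def char_tags(text, token_spans, sent_starts=(), annotate_sents=False):
    # T = token start, I = token internal, O = outside a token.
    # With annotate_sents, S replaces T for the first token of each sentence.
    tags = ["O"] * len(text)
    for start, end in token_spans:
        tags[start] = "T"
        for i in range(start + 1, end):
            tags[i] = "I"
    if annotate_sents:
        for start in sent_starts:
            tags[start] = "S"
    return "".join(tags)

spans = [(0, 4), (5, 7), (8, 9), (10, 18), (18, 19)]  # This, is, a, sentence, .
print(char_tags("This is a sentence.", spans))
# TIIIOTIOTOTIIIIIIIT
print(char_tags("This is a sentence.", spans, sent_starts=[0], annotate_sents=True))
# SIIIOTIOTOTIIIIIIIT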
A config excerpt for experimental_char_tagger_tokenizer:
[nlp]
pipeline = ["experimental_char_tagger_tokenizer"]
tokenizer = {"@tokenizers":"spacy-experimental.char_pretokenizer.v1"}
[components]
[components.experimental_char_tagger_tokenizer]
factory = "experimental_char_tagger_tokenizer"
annotate_sents = true
scorer = {"@scorers":"spacy-experimental.tokenizer_senter_scorer.v1"}
[components.experimental_char_tagger_tokenizer.model]
@architectures = "spacy.Tagger.v1"
nO = null
[components.experimental_char_tagger_tokenizer.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"
[components.experimental_char_tagger_tokenizer.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 128
attrs = ["ORTH","LOWER","IS_DIGIT","IS_ALPHA","IS_SPACE","IS_PUNCT"]
rows = [1000,500,50,50,50,50]
include_static_vectors = false
[components.experimental_char_tagger_tokenizer.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 128
depth = 4
window_size = 4
maxout_pieces = 2
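As with other trainable components, a pipeline built from such a config is trained with the standard spaCy CLI, for example (paths are placeholders and the excerpt is assumed to be part of a complete training config):
python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy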
Character-based NER tokenizer
In the NER version, each character in a token is part of an entity:
T B-TOKEN
h I-TOKEN
i I-TOKEN
s I-TOKEN
O
i B-TOKEN
s I-TOKEN
O
a B-TOKEN
O
s B-TOKEN
e I-TOKEN
n I-TOKEN
t I-TOKEN
e I-TOKEN
n I-TOKEN
c I-TOKEN
e I-TOKEN
. B-TOKEN
A config excerpt for experimental_char_ner_tokenizer:
[nlp]
pipeline = ["experimental_char_ner_tokenizer"]
tokenizer = {"@tokenizers":"spacy-experimental.char_pretokenizer.v1"}
[components]
[components.experimental_char_ner_tokenizer]
factory = "experimental_char_ner_tokenizer"
scorer = {"@scorers":"spacy-experimental.tokenizer_scorer.v1"}
[components.experimental_char_ner_tokenizer.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null
[components.experimental_char_ner_tokenizer.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"
[components.experimental_char_ner_tokenizer.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 128
attrs = ["ORTH","LOWER","IS_DIGIT","IS_ALPHA","IS_SPACE","IS_PUNCT"]
rows = [1000,500,50,50,50,50]
include_static_vectors = false
[components.experimental_char_ner_tokenizer.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 128
depth = 4
window_size = 4
maxout_pieces = 2
The NER version does not currently support sentence boundaries, but it would be easy to extend using a B-SENT entity type.
Biaffine parser
A biaffine dependency parser, similar to that proposed in Deep Biaffine Attention for Neural Dependency Parsing (Dozat & Manning, 2016). The parser consists of two parts: an edge predicter and an edge labeler. For example:
[components.experimental_arc_predicter]
factory = "experimental_arc_predicter"
[components.experimental_arc_labeler]
factory = "experimental_arc_labeler"
The arc predicter requires that a previous component (such as senter) sets sentence boundaries during training. Therefore, such a component must be added to annotating_components:
[training]
annotating_components = ["senter"]
The biaffine parser sample project provides an example biaffine parser pipeline.
Span Finder
The SpanFinder is a new experimental component that identifies span boundaries by tagging potential start and end tokens. It is a machine-learning approach to suggesting higher-precision span candidates.
SpanFinder uses the following parameters:
- threshold: Probability threshold for predicted spans.
- predicted_key: Name of the SpanGroup the predicted spans are saved to.
- training_key: Name of the SpanGroup the training spans are read from.
- max_length: Maximum length of the predicted spans. No limit when set to 0. Defaults to 0.
- min_length: Minimum length of the predicted spans. No limit when set to 0. Defaults to 0.
Here is a config excerpt for the SpanFinder together with a SpanCategorizer:
[nlp]
lang = "en"
pipeline = ["tok2vec","span_finder","spancat"]
batch_size = 128
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
[components]
[components.tok2vec]
factory = "tok2vec"
[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v1"
[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["ORTH", "SHAPE"]
rows = [5000, 2500]
include_static_vectors = false
[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3
[components.span_finder]
factory = "experimental_span_finder"
threshold = 0.35
predicted_key = "span_candidates"
training_key = ${vars.spans_key}
min_length = 0
max_length = 0
[components.span_finder.scorer]
@scorers = "spacy-experimental.span_finder_scorer.v1"
predicted_key = ${components.span_finder.predicted_key}
training_key = ${vars.spans_key}
[components.span_finder.model]
@architectures = "spacy-experimental.SpanFinder.v1"
[components.span_finder.model.scorer]
@layers = "spacy.LinearLogistic.v1"
nO=2
[components.span_finder.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
[components.spancat]
factory = "spancat"
max_positive = null
spans_key = ${vars.spans_key}
threshold = 0.5
[components.spancat.model]
@architectures = "spacy.SpanCategorizer.v1"
[components.spancat.model.reducer]
@layers = "spacy.mean_max_reducer.v1"
hidden_size = 128
[components.spancat.model.scorer]
@layers = "spacy.LinearLogistic.v1"
nO = null
nI = null
[components.spancat.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
[components.spancat.suggester]
@misc = "spacy-experimental.span_finder_suggester.v1"
predicted_key = ${components.span_finder.predicted_key}
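With a pipeline trained from such a config, the SpanFinder's candidates and the SpanCategorizer's final spans can be read from doc.spans. A usage sketch (the keys correspond to predicted_key and spans_key above; the value of vars.spans_key is assumed here to be "sc"):
doc = nlp("Some text to analyze.")
# Candidate spans suggested by the span finder (key = predicted_key):
for span in doc.spans["span_candidates"]:
    print("candidate:", span.text)
# Labeled spans produced by the span categorizer (key = spans_key):
for span in doc.spans["sc"]:
    print(span.label_, span.text)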
This package includes a spaCy project which shows how to train and use the SpanFinder together with the SpanCategorizer.
Architectures
None currently.
Other
Tokenizers
- spacy-experimental.char_pretokenizer.v1: Tokenize a text into individual characters.
Scorers
- spacy-experimental.tokenizer_scorer.v1: Score tokenization.
- spacy-experimental.tokenizer_senter_scorer.v1: Score tokenization and sentence segmentation.
Misc
Suggester functions for spancat:
Subtree suggester: Uses dependency annotation to suggest tokens with their syntactic descendants.
- spacy-experimental.subtree_suggester.v1
- spacy-experimental.ngram_subtree_suggester.v1
Chunk suggester: Suggests noun chunks using the noun chunk iterator, which requires POS and dependency annotation.
- spacy-experimental.chunk_suggester.v1
- spacy-experimental.ngram_chunk_suggester.v1
Sentence suggester: Uses sentence boundaries to suggest sentence spans.
- spacy-experimental.sentence_suggester.v1
- spacy-experimental.ngram_sentence_suggester.v1
The package also contains a merge_suggesters function which can be used to combine suggestions from multiple suggesters.
Here are two config excerpts for using the subtree suggester with and without the ngram functionality:
[components.spancat.suggester]
@misc = "spacy-experimental.subtree_suggester.v1"
[components.spancat.suggester]
@misc = "spacy-experimental.ngram_subtree_suggester.v1"
sizes = [1, 2, 3]
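The other suggesters are configured the same way, for example the sentence suggester:
[components.spancat.suggester]
@misc = "spacy-experimental.sentence_suggester.v1"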
Note that all the suggester functions are registered in @misc.
Bug reports and issues
Please report bugs in the spaCy issue tracker or open a new thread on the discussion board for other issues.
Older documentation
See the READMEs in earlier tagged versions for details about components in earlier releases.