contextualSpellCheck [BUG] Sentence context greater than 512 character

Open xei opened this issue 2 years ago • 1 comments

I tried to correct spelling mistakes in a large text.

import spacy
import contextualSpellCheck

spacy_nlp = spacy.load(
    'en_core_web_sm',
    # disable=['ner']
    disable=['parser', 'ner'] # disable extra components for efficiency
)
contextualSpellCheck.add_to_pipe(spacy_nlp)

corpus_spacy = [spacy_nlp(doc) for doc in corpus_raw]

At first, I faced this error: ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: nlp.add_pipe('sentencizer'). Alternatively, add the dependency parser or sentence recognizer, or set sentence boundaries by setting doc[i].is_sent_start.

So, I added the sentencizer component to the pipeline.

import spacy
import contextualSpellCheck

spacy_nlp = spacy.load(
    'en_core_web_sm',
    # disable=['ner']
    disable=['parser', 'ner'] # disable extra components for efficiency
)
spacy_nlp.add_pipe('sentencizer')
contextualSpellCheck.add_to_pipe(spacy_nlp)

corpus_spacy = [spacy_nlp(doc) for doc in corpus_raw]

This time I faced this error: RuntimeError: The expanded size of the tensor (837) must match the existing size (512) at non-singleton dimension 1. Target sizes: [1, 837]. Tensor sizes: [1, 512]

I guess this is due to the limitations of BERT. However, I believe that there should be a way to catch this error and bypass the spell check.
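
A minimal sketch of what such a bypass could look like on the caller's side, assuming the failure always surfaces as a RuntimeError as in the traceback above. The helper name process_with_fallback and the fallback argument are hypothetical and not part of contextualSpellCheck or spaCy:

```python
from typing import Any, Callable, Optional


def process_with_fallback(
    text: str,
    primary: Callable[[str], Any],
    fallback: Optional[Callable[[str], Any]] = None,
) -> Any:
    """Run `primary` on `text`; on RuntimeError use `fallback`, else return None."""
    try:
        return primary(text)
    except RuntimeError as err:
        # The tensor-size mismatch reported above is raised as a RuntimeError.
        print(f"spell check skipped ({len(text)} chars): {err}")
        return fallback(text) if fallback is not None else None
```

Here primary would be the pipeline with the spell checker added (spacy_nlp above), and fallback could be a second pipeline loaded without it, so that oversized documents are still tokenized but not spell checked.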

Aug 13 '21 13:08 xei

Thanks @xei for reporting this issue. BERT has a limit of 512 tokens (not characters), and the model currently used for inference was trained with a maximum of 512 tokens REF.

Also, I am not sure what corpus_raw looks like. But 512 tokens should be enough for most cases, as the spell check only uses the current sentence as context, not the entire corpus.

For example:

>>> import spacy
>>> import contextualSpellCheck
>>> spacy_nlp = spacy.load(
    'en_core_web_sm',
    # disable=['ner']
    disable=['parser', 'ner'] # disable extra components for efficiency
)
>>> spacy_nlp.add_pipe('sentencizer')
<spacy.pipeline.sentencizer.Sentencizer object at 0x7fb10c509f40>
>>> contextualSpellCheck.add_to_pipe(spacy_nlp)
>>> corpus_raw="""The train from the west that bore Bert Bryant to New York was two
hours late, for all the way from Clinton, Ohio, where Bert lived, the
snow had been from four inches to a foot in depth. Consequently he had
missed the one o’clock train for Mt. Pleasant and had spent an hour
with his face glued to a waiting-room window watching the bustle and
confusion of New York. Now, at four o’clock, he was seated in a sleigh,
his suit-case between his feet, winding up the long, snowy road to Mt.
Pleasant Academy. In the front seat was the fur-clad driver and beside
him was Bert’s small trunk.

It was very cold and fast growing dark. It seemed to Bert that they
had been driving for miles and miles, and he wanted to ask the driver
how much farther they had to go. But the man in the old bearskin coat
was cross and taciturn, and so Bert buried his hands still deeper in
his pockets and wondered whether his nose and ears were getting white.
And just when he had decided that they were the sleigh left the main
road with a sudden lurch, that almost toppled the trunk off, and turned
through a gate and up a curving drive lined with snow-laden evergreens.
Then the academy came into view, a rambling, comfortable-looking
building with many cheerfully lighted windows looking out in welcome.
At one of the windows two faces appeared in response to the warning
of the sleigh bells and peered curiously down. The sleigh pulled up
in front of a broad stone step and Bert clambered out, bag in hand.
The driver lifted the trunk, opened the big oak door without ceremony,
deposited his burden just inside and growled: “Fifty cents.”"""
>>> doc = spacy_nlp(corpus_raw)
>>> doc._.suggestions_spellCheck
{Bert: 'Bert', Bryant: 'back', York: 'York', Clinton: 'Canton', Ohio: 'Ohio', Bert: 'he', bustle: 'noise', York: 'York', sleigh: 'seat', snowy: 'dusty', Bert: 'Ben', Bert: 'Bond', bearskin: 'black', taciturn: 'stern', Bert: 'he', sleigh: 'pair', lurch: 'turn', toppled: 'ripped', evergreens: 'trees', rambling: 'big', cheerfully: 'carefully', lighted: 'painted', sleigh: 'church', sleigh: 'coach', Bert: 'Ben', clambered: 'climbed'}

As you can see above, the entire text moved through the spaCy pipeline without any error. The sample text is taken from The Project Gutenberg eBook of The Junior Trophy, by Ralph Henry Barbour REF.
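
Since the context is a single sentence, the error most likely comes from one very long sentence rather than from the size of the corpus. A rough sketch for locating such sentences is shown here. The function names are hypothetical, and whitespace_count is only a crude stand-in: the real limit applies to the model's subword tokens, whose count is usually higher than the word count.

```python
from typing import Callable, Iterable, List, Tuple


def find_long_sentences(
    sentences: Iterable[str],
    count_tokens: Callable[[str], int],
    limit: int = 512,
) -> List[Tuple[int, int]]:
    """Return (index, token_count) for every sentence whose count exceeds `limit`."""
    too_long = []
    for i, sent in enumerate(sentences):
        n = count_tokens(sent)
        if n > limit:
            too_long.append((i, n))
    return too_long


def whitespace_count(sentence: str) -> int:
    # Assumption: approximates token count by whitespace-separated words.
    # A real check would use the tokenizer of the underlying BERT model.
    return len(sentence.split())
```

With spaCy, the sentences could come from [s.text for s in doc.sents] on a pipeline that runs only the sentencizer. Text with little sentence-final punctuation (lists, tables, logs) is a likely source of very long sentences, because the sentencizer splits on punctuation.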

One more thing I wanted to point out: contextualSpellCheck requires both the parser and ner components, as mentioned here:

We require NER to identify if a token is a PERSON also require parser because we use Token.sent for context

Please let me know if you have any questions. I think your suggestion is great, and I will try to think of a solution that either splits a large sentence (> max_position_embeddings) or bypasses the spell check altogether. If you would like to contribute this feature, feel free to create a PR!
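
The splitting option could be sketched roughly as follows. This is not how contextualSpellCheck works today; split_tokens is a hypothetical helper, and the reserved argument assumes two special tokens ([CLS] and [SEP]) are added around each chunk.

```python
from typing import List, Sequence


def split_tokens(
    tokens: Sequence[str],
    max_len: int = 512,
    reserved: int = 2,
) -> List[List[str]]:
    """Split `tokens` into consecutive chunks of at most `max_len - reserved` items."""
    size = max_len - reserved
    if size <= 0:
        raise ValueError("max_len must be larger than reserved")
    # Step through the sequence in fixed-size windows; the last chunk may be shorter.
    return [list(tokens[i:i + size]) for i in range(0, len(tokens), size)]
```

For the 837-token sentence from the report, this would give one chunk of 510 tokens and one of 327. Cutting at a fixed position loses context across the boundary, so suggestions for words near the cut may be worse than for the rest of the sentence.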

Aug 15 '21 07:08 R1j1t
