
Update PySBD component to support spaCy v3

Open nipunsadvilkar opened this issue 3 years ago • 4 comments

PySBD component using Language.factory

nipunsadvilkar avatar Jun 29 '22 10:06 nipunsadvilkar

Codecov Report

Merging #114 (e07808a) into master (5905f13) will decrease coverage by 0.08%. The diff coverage is 50.00%.

@@            Coverage Diff             @@
##           master     #114      +/-   ##
==========================================
- Coverage   98.43%   98.35%   -0.09%     
==========================================
  Files          38       39       +1     
  Lines        1150     1153       +3     
==========================================
+ Hits         1132     1134       +2     
- Misses         18       19       +1     
Flag Coverage Δ
unittests 98.35% <50.00%> (-0.09%) :arrow_down:

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
pysbd/utils.py 73.33% <42.85%> (-2.53%) :arrow_down:
pysbd/about.py 100.00% <100.00%> (ø)
pysbd/__init__.py 100.00% <0.00%> (ø)


codecov-commenter avatar Jun 29 '22 10:06 codecov-commenter

Are you still working on this? Otherwise I could have a look.

davidberenstein1957 avatar Oct 22 '22 13:10 davidberenstein1957

Hey @davidberenstein1957, sure, you can take a look at it. I'm not sure what the best approach would be, though: I want to keep pysbd lightweight, but supporting pysbd with spaCy v3 requires Language.factory, which would force me to add spaCy as a dependency.

Let me know if you happen to work on the recommendations suggested by @rmitsch above.

nipunsadvilkar avatar Oct 26 '22 20:10 nipunsadvilkar

Here is an option that updates the factory method without making spaCy a hard requirement of pysbd.

from typing import Any

import pysbd

try:
    from spacy.language import Language
    langfac = Language.factory
except ImportError:
    # spaCy is optional: fall back to a no-op decorator factory that
    # returns the decorated class unchanged, so it stays importable.
    def langfac(*args: Any, **kwargs: Any):
        def decorator(obj: Any):
            return obj
        return decorator


@langfac(name="pysbd", default_config={"language": "en"})
class PySBDFactory:
    """pysbd as a spaCy component through entry points"""

    def __init__(self, nlp, name, language="en"):
        self.nlp = nlp
        self.name = name
        # char_span=True makes the segmenter return spans with character offsets
        self.seg = pysbd.Segmenter(language=language, clean=False,
                                   char_span=True)

    def __call__(self, doc):
        sents_char_spans = self.seg.segment(doc.text_with_ws)
        # Character offsets at which each sentence begins
        start_token_ids = {sent.start for sent in sents_char_spans}
        for token in doc:
            # A token starts a sentence if its offset matches a sentence start
            token.is_sent_start = token.idx in start_token_ids
        return doc

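To make the offset matching in `__call__` easier to follow, here is a minimal standalone sketch that needs neither spaCy nor pysbd. `CharSpan` and `sentence_start_flags` are hypothetical names introduced only for illustration; `CharSpan` stands in for the spans the segmenter returns with `char_span=True`, and the token offsets are written by hand to mimic what `token.idx` would report.

```python
from typing import List, NamedTuple


class CharSpan(NamedTuple):
    """Hypothetical stand-in for a sentence span with character offsets."""
    sent: str
    start: int
    end: int


def sentence_start_flags(token_offsets: List[int],
                         spans: List[CharSpan]) -> List[bool]:
    """Return one flag per token: True if the token begins a sentence."""
    starts = {span.start for span in spans}
    return [offset in starts for offset in token_offsets]


text = "Hello world. How are you?"
# Hand-written character offsets of the tokens:
# Hello, world, ., How, are, you, ?
token_offsets = [0, 6, 11, 13, 17, 21, 24]
# Assumed segmentation of the text into two sentences.
spans = [
    CharSpan("Hello world. ", 0, 13),
    CharSpan("How are you?", 13, 25),
]

flags = sentence_start_flags(token_offsets, spans)
print(flags)  # [True, False, False, True, False, False, False]
```

Only the tokens whose offset equals a sentence start (here "Hello" and "How") are flagged, which is the same rule the component applies to `token.is_sent_start`.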

rbroderi avatar Mar 02 '24 03:03 rbroderi