
[solved] Usage example with CUDA

Open Technologicat opened this issue 5 months ago • 0 comments

Thanks for the library. Nice approach.

The documentation is currently missing a complete working usage example, as well as instructions on how to run the scorer on the GPU (with CUDA).

Here's my solution. Feel free to use it in any way you like.

This isn't the shortest possible code, but rather what is IMHO a minimal, clean solution that can be quickly adapted for real-world use.

Note that because dehyphen only analyzes the two lines across a paragraph seam, the paragraph joiner may combine paragraphs when it shouldn't. If you have known-good paragraphs, it is better to dehyphenate each of them separately; see the sketch after the script below.

Tested on Python 3.10, with CUDA.

TL;DR: You need to know how to tell Flair to load its model on the GPU, and how to load, on recent Torch versions, a .pt file that fails to load in weights-only mode. The main() function below shows both. If you instantiate a FlairScorer on a recent Torch without forcing Torch out of weights-only load mode, the load fails, complaining about pickle protocol versions and that the (nowadays default) weights-only mode is not supported for protocol 4. With the code below, Torch will emit a UserWarning (for good reason!), but Flair will load successfully.
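
For quick reference, here are just those two steps, distilled from main() below. The full example additionally wraps the environment override in a context manager, so the insecure load mode stays enabled only while the model loads.

import os

import torch
import flair
import dehyphen

flair.device = torch.device("cuda:0")  # must be set *before* instantiating the model
os.environ["TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD"] = "1"  # let Flair's .pt load on recent Torch (insecure!)
try:
    scorer = dehyphen.FlairScorer(lang="multi")
finally:
    del os.environ["TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD"]  # restore the safer default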

"""Usage example for `dehyphen` package, with CUDA."""

import contextlib
import copy
import os
import textwrap
import threading
from typing import List, Union

import torch

import flair
import dehyphen

# Utility for cleanly loading *only* Flair in "no weights only" mode
_environ_lock = threading.Lock()
@contextlib.contextmanager
def environ_override(**bindings):
    """Context manager: Temporarily override OS environment variable(s).

    When the `with` block exits, the previous state of the environment is restored.

    Thread-safe, but blocks if the lock is already taken - only one set of overrides
    can be active at any one time.
    """
    with _environ_lock:
        # remember old values, if any
        old_bindings = {key: os.environ[key] for key in bindings.keys() if key in os.environ}
        try:
            # apply overrides
            for key, value in bindings.items():
                os.environ[key] = value
            # let the caller do its thing
            yield
        finally:
            # all done - restore old environment
            for key in bindings.keys():
                if key in old_bindings:  # restore old value
                    os.environ[key] = old_bindings[key]
                else:  # this key wasn't there in the previous state, so pop it
                    os.environ.pop(key)

def _join_paragraphs(scorer: dehyphen.FlairScorer, candidate_paragraphs: List[str]) -> List[str]:
    """Internal function; input/output format is as produced by `dehyphen.text_to_format`.

    Essentially, `[[lines, of paragraph, one], [lines, of, paragraph, two], ...]`, where each of the lines is a string.
    """
    if len(candidate_paragraphs) >= 2:
        out = []
        candidate1 = candidate_paragraphs[0]
        j = 1

        # handle blank lines at beginning of input
        while not len(candidate1):  # no lines in this paragraph?
            candidate1 = candidate_paragraphs[j]
            j += 1

            # all of input is blank lines?
            if j == len(candidate_paragraphs):
                out.append(candidate1)
                return out

        while True:
            candidate2 = candidate_paragraphs[j]
            combined = scorer.is_split_paragraph(candidate1, candidate2)  # -> combined paragraph or `None`

            if j == len(candidate_paragraphs) - 1:  # end of text: commit whatever we have left
                if combined is None:  # candidate1 is a complete paragraph (candidate2 starts a new paragraph)
                    out.append(candidate1)
                    out.append(candidate2)
                else:
                    out.append(combined)
                break
            else:  # general case: commit only when a paragraph is completed
                if combined is None:  # candidate1 is a complete paragraph (candidate2 starts a new paragraph)
                    out.append(candidate1)
                    candidate1 = candidate2
                else:  # keep combining
                    candidate1 = combined
                j += 1
    else:
        out = copy.copy(candidate_paragraphs)
    return out

def dehyphenate(scorer: dehyphen.FlairScorer, text: Union[str, List[str]]) -> Union[str, List[str]]:
    """High-level API for dehyphenation.

    Returns `str` (for a single input) or `list` of `str` (for a list of inputs).
    """
    def doit(text: str) -> str:
        # Don't send if the input is a single character, to avoid crashing `dehyphen`.
        if len(text) == 1:
            return text
        data = dehyphen.text_to_format(text)
        data = scorer.dehyphen(data)
        data = _join_paragraphs(scorer, data)
        paragraphs = [dehyphen.format_to_paragraph(lines) for lines in data]
        output_text = "\n\n".join(paragraphs)
        return output_text
    if isinstance(text, list):
        output_text = [doit(item) for item in text]
    else:  # str
        output_text = doit(text)
    return output_text

def main():
    # How to set CPU/GPU mode for Flair (used by `dehyphen`).
    # This needs to be done *before* instantiating the model.
    #   https://github.com/flairNLP/flair/issues/464
    flair.device = torch.device("cuda:0")

    # Flair requires "no weights only load" mode for Torch; but this is insecure,
    # so only enable it temporarily while loading the Flair model.
    #   https://github.com/flairNLP/flair/issues/3263
    #   https://github.com/pytorch/pytorch/blob/main/torch/serialization.py#L1443
    with environ_override(TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD="1"):
        scorer = dehyphen.FlairScorer(lang="multi")

    # Neumann & Gros 2023, https://arxiv.org/abs/2210.00849
    input_text = textwrap.dedent("""
                 The recent observation of neural power-law scaling relations has made a signifi-
                 cant impact in the field of deep learning. A substantial amount of attention has
                 been dedicated as a consequence to the description of scaling laws, although
                 mostly for supervised learning and only to a reduced extent for reinforcement
                 learning frameworks. In this paper we present an extensive study of performance
                 scaling for a cornerstone reinforcement learning algorithm, AlphaZero. On the ba-
                 sis of a relationship between Elo rating, playing strength and power-law scaling,
                 we train AlphaZero agents on the games Connect Four and Pentago and analyze
                 their performance. We find that player strength scales as a power law in neural
                 network parameter count when not bottlenecked by available compute, and as a
                 power of compute when training optimally sized agents. We observe nearly iden-
                 tical scaling exponents for both games. Combining the two observed scaling laws
                 we obtain a power law relating optimal size to compute similar to the ones ob-
                 served for language models. We find that the predicted scaling of optimal neural
                 network size fits our data for both games. We also show that large AlphaZero
                 models are more sample efficient, performing better than smaller models with the
                 same amount of training data.
    """).strip()

    output_text = dehyphenate(scorer, input_text)

    print("=" * 80)
    print(input_text)
    print("-" * 80)
    print(output_text)
    print("-" * 80)
    
if __name__ == "__main__":
    main()
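
As for the note above about known-good paragraphs: here is a minimal sketch of dehyphenating pre-split paragraphs separately, using the dehyphenate helper from the script above. It assumes your paragraphs are delimited by blank lines; adjust the splitting to your actual input format.

def dehyphenate_paragraphs(scorer: dehyphen.FlairScorer, text: str) -> str:
    """Dehyphenate each known-good paragraph separately, so the joiner never merges across paragraph boundaries."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]  # assumption: blank-line-delimited paragraphs
    fixed = dehyphenate(scorer, paragraphs)  # list in, list out
    return "\n\n".join(fixed)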

Technologicat, Jul 30 '25 13:07