
feat(components): integrate Spacy NLP toolset into Langflow component…

Open raphaelchristi opened this issue 1 year ago • 0 comments

SpaCy Components Integration

This PR integrates SpaCy's NLP capabilities into Langflow as a set of components for building text processing and analysis workflows.

🎯 Core Components

Language Model Management

  • SpacyModel
    • Base component for SpaCy language models
    • Supports 20+ languages, including English, German, French, and Spanish
    • Automatic model download and initialization
    • Multiple model sizes (sm, md, lg) per language
    • Configurable entity merging
    • Pipeline component management
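
As a rough illustration of the SpacyModel features listed above (automatic download, entity merging, pipeline management), here is a minimal sketch. The function name `load_spacy_model` and its parameters are hypothetical and not taken from this PR; only `spacy.load`, `spacy.cli.download`, and the built-in `merge_entities` pipe are real spaCy APIs.

```python
import spacy
from spacy.cli import download


def load_spacy_model(name, auto_download=True, merge_entities=False, disable=()):
    """Load a spaCy pipeline, optionally downloading it first (hypothetical helper)."""
    try:
        nlp = spacy.load(name, disable=list(disable))
    except OSError:
        # spacy.load raises OSError when the model package is not installed.
        if not auto_download:
            raise
        download(name)  # runs the same logic as `python -m spacy download <name>`
        nlp = spacy.load(name, disable=list(disable))
    # Optionally merge multi-token entities into single tokens.
    if merge_entities and "merge_entities" not in nlp.pipe_names:
        nlp.add_pipe("merge_entities")
    return nlp
```

A call such as `load_spacy_model("en_core_web_sm", merge_entities=True)` would return a pipeline whose `pipe_names` ends with `merge_entities`. In some environments a freshly downloaded package is only importable after restarting the process.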

Entity Processing

  • EntityRecognizer

    • Named Entity Recognition (NER)
    • Built-in entity types (PERSON, ORG, DATE, etc.)
    • Entity context extraction
    • Sentence-level entity tracking
    • Confidence scoring
    • Detailed entity metadata
  • EntityRuler

    • Pattern-based entity recognition
    • Custom rule definition
    • Regex pattern support
    • Phrase pattern matching
    • Entity pattern priorities
    • Rule-based entity labeling
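
To make the EntityRuler features above concrete, the following sketch uses spaCy's built-in `entity_ruler` pipe directly, with one phrase pattern and one regex token pattern. The `TOOL` label and the sample sentence are illustrative assumptions, and this is not the component code from the PR.

```python
import spacy

# A blank English pipeline is enough: the entity ruler needs no trained model.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # Phrase pattern: exact string match.
    {"label": "TOOL", "pattern": "spaCy"},
    # Token pattern with a regex on the lowercased token text.
    {"label": "TOOL", "pattern": [{"LOWER": {"REGEX": "^lang(flow|chain)$"}}]},
])

doc = nlp("Langflow now ships spaCy components.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```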

Text Analysis

  • DependencyMatcher

    • Syntactic pattern matching
    • Relationship extraction
    • Subject-Verb-Object detection
    • Custom dependency rules
    • Active/Passive voice identification
    • Complex pattern definitions
  • TextCategorizer

    • Single-label classification (textcat)
    • Multi-label classification (textcat_multilabel)
    • Configurable threshold settings
    • Confidence scoring
    • Custom category management
    • Binary and multi-class support
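
The difference between single-label and multi-label classification with a threshold can be sketched in plain Python. spaCy exposes category scores as a label-to-score dictionary in `doc.cats`; the helper `select_labels` below is a hypothetical name, and the PR's own selection logic may differ.

```python
def select_labels(cats, threshold=0.5, multilabel=False):
    """Pick labels from a {label: score} mapping such as spaCy's doc.cats."""
    if not cats:
        return []
    if multilabel:
        # textcat_multilabel: every label at or above the threshold, best first.
        kept = [label for label, score in cats.items() if score >= threshold]
        return sorted(kept, key=lambda label: -cats[label])
    # textcat: labels are mutually exclusive, so keep only the top one.
    best = max(cats, key=cats.get)
    return [best] if cats[best] >= threshold else []


scores = {"SPORTS": 0.1, "TECH": 0.7, "SCIENCE": 0.6}
print(select_labels(scores, multilabel=True))
print(select_labels(scores))
```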

Text Processing

  • Lemmatizer

    • Rule-based and lookup lemmatization
    • Custom abbreviation handling
    • Multiple lemmatization modes
    • Whitespace preservation
    • Part-of-speech aware lemmatization
    • Custom dictionary support
  • Sentencizer

    • Advanced sentence segmentation
    • RAG-optimized chunking
    • Automatic abbreviation detection
    • Custom punctuation rules
    • Quote-aware segmentation
    • Multi-language support
  • Tagger

    • Part-of-speech tagging (POS)
    • Fine-grained tags (TAG)
    • Dependency parsing (DEP)
    • Morphological analysis
    • Custom tag sets
    • Detailed token attributes

🔍 Example Flows

Lemmatizer Flow

Test text:

The researchers were running multiple groundbreaking studies while the automated 
systems continuously processed the incoming data. Children's toys scattered 
across the floor were quickly gathered by the cleaning robots, which had been 
programmed to recognize various objects.

Download Lemmatizer Flow JSON

Dependency Matcher Flow

Pattern example:

[
    {
        "RIGHT_ID": "verb",
        "RIGHT_ATTRS": {"POS": "VERB"}
    },
    {
        "LEFT_ID": "verb",
        "REL_OP": ">",
        "RIGHT_ID": "subject",
        "RIGHT_ATTRS": {"DEP": "nsubj"}
    }
]
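
The pattern above can be tried with spaCy's `DependencyMatcher` directly. In this sketch the parse is supplied by hand so that no trained model is required; the sentence, the `SUBJECT_VERB` key, and the variable names are illustrative assumptions, and a real flow would take the parse from a loaded model.

```python
import spacy
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc

pattern = [
    # Anchor node: any verb.
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"POS": "VERB"}},
    # ">" means the verb is the immediate head of the subject token.
    {
        "LEFT_ID": "verb",
        "REL_OP": ">",
        "RIGHT_ID": "subject",
        "RIGHT_ATTRS": {"DEP": "nsubj"},
    },
]

nlp = spacy.blank("en")
# Hand-annotated parse; heads are absolute token indices (the root points to itself).
doc = Doc(
    nlp.vocab,
    words=["The", "robots", "gathered", "toys"],
    pos=["DET", "NOUN", "VERB", "NOUN"],
    heads=[1, 2, 2, 2],
    deps=["det", "nsubj", "ROOT", "dobj"],
)

matcher = DependencyMatcher(nlp.vocab)
matcher.add("SUBJECT_VERB", [pattern])

pairs = []
for match_id, token_ids in matcher(doc):
    # token_ids follow the order of the pattern nodes: verb first, then subject.
    verb, subject = (doc[i] for i in token_ids)
    pairs.append((subject.text, verb.text))
print(pairs)
```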

Download Dependency Matcher Flow JSON

Sentencizer Flow

Features:

Text Categorizer Flow

Supports:

Tagger Flow

Tag types:

Entity Ruler Flow

Pattern types:

Entity Recognizer Flow

Entity types:

🛠️ Technical Details

Implementation Features

  • Full integration with Langflow's component architecture
  • Comprehensive error handling and validation
  • Efficient batch processing capabilities
  • Dynamic configuration options
  • Extensive type checking
  • Memory-efficient processing

📊 Sample Data

🔗 Related Resources

  • SpaCy Documentation: https://spacy.io/
  • Langflow Components Guide
  • SpaCy Models: https://spacy.io/models
  • Example Datasets

👥 Contributors

  • @raphaelchristi

📃 License

  • MIT License (same as Langflow)

raphaelchristi, Nov 20 '24 15:11