feat(components): integrate Spacy NLP toolset into Langflow component…
spaCy Components Integration
This PR integrates spaCy's NLP capabilities into Langflow as a set of components for text processing and analysis workflows.
🎯 Core Components
Language Model Management
- SpacyModel
  - Base component for spaCy language models
  - Supports 20+ languages, including English, German, French, and Spanish
  - Automatic model download and initialization
  - Multiple model sizes (sm, md, lg) per language
  - Configurable entity merging
  - Pipeline component management
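As a rough illustration of the download-and-load behaviour listed above, the sketch below uses spaCy's public `spacy.load`, `spacy.cli.download`, and `spacy.util.is_package` calls. The `MODEL_PREFIXES` table and the `model_name` and `load_model` helpers are hypothetical names chosen for this example and are not taken from the PR's code; only four languages are listed.

```python
import spacy
from spacy.cli import download
from spacy.util import is_package

# Hypothetical lookup of package-name prefixes; only a few languages are listed.
MODEL_PREFIXES = {
    "en": "en_core_web",
    "de": "de_core_news",
    "fr": "fr_core_news",
    "es": "es_core_news",
}


def model_name(lang: str, size: str = "sm") -> str:
    """Build a spaCy package name such as 'en_core_web_sm'."""
    if size not in {"sm", "md", "lg"}:
        raise ValueError(f"unsupported model size: {size!r}")
    return f"{MODEL_PREFIXES[lang]}_{size}"


def load_model(lang: str, size: str = "sm", merge_entities: bool = False, disable=()):
    """Load a pipeline, downloading the package first if it is not installed."""
    name = model_name(lang, size)
    if not is_package(name):
        download(name)  # requires network access
    nlp = spacy.load(name, disable=list(disable))
    if merge_entities:
        # Built-in spaCy component that merges each named entity into one token.
        nlp.add_pipe("merge_entities")
    return nlp
```

A call such as `load_model("en", "sm", merge_entities=True, disable=["lemmatizer"])` would then return a pipeline with entity merging enabled and the lemmatizer switched off. In some environments a freshly downloaded package only becomes importable after the interpreter restarts.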
Entity Processing
- EntityRecognizer
  - Named Entity Recognition (NER)
  - Built-in entity types (PERSON, ORG, DATE, etc.)
  - Entity context extraction
  - Sentence-level entity tracking
  - Confidence scoring
  - Detailed entity metadata
- EntityRuler
  - Pattern-based entity recognition
  - Custom rule definition
  - Regex pattern support
  - Phrase pattern matching
  - Entity pattern priorities
  - Rule-based entity labeling
Text Analysis
- DependencyMatcher
  - Syntactic pattern matching
  - Relationship extraction
  - Subject-Verb-Object detection
  - Custom dependency rules
  - Active/Passive voice identification
  - Complex pattern definitions
- TextCategorizer
  - Single-label classification (textcat)
  - Multi-label classification (textcat_multilabel)
  - Configurable threshold settings
  - Confidence scoring
  - Custom category management
  - Binary and multi-class support
Text Processing
- Lemmatizer
  - Rule-based and lookup lemmatization
  - Custom abbreviation handling
  - Multiple lemmatization modes
  - Whitespace preservation
  - Part-of-speech aware lemmatization
  - Custom dictionary support
- Sentencizer
  - Advanced sentence segmentation
  - RAG-optimized chunking
  - Automatic abbreviation detection
  - Custom punctuation rules
  - Quote-aware segmentation
  - Multi-language support
- Tagger
  - Part-of-speech tagging (POS)
  - Fine-grained tags (TAG)
  - Dependency parsing (DEP)
  - Morphological analysis
  - Custom tag sets
  - Detailed token attributes
🔍 Example Flows
Lemmatizer Flow
Test text:
The researchers were running multiple groundbreaking studies while the automated
systems continuously processed the incoming data. Children's toys scattered
across the floor were quickly gathered by the cleaning robots, which had been
programmed to recognize various objects.
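To make the lookup mode concrete, here is a minimal sketch that runs spaCy's built-in `lemmatizer` in `lookup` mode over part of the test text. The four-entry table is hand-written for illustration and is an assumption of this example; it stands in for the tables that normally come from `spacy-lookups-data` or a trained pipeline, and it does not reproduce the PR's component.

```python
import spacy
from spacy.lookups import Lookups

nlp = spacy.blank("en")
lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "lookup"})

# Hand-written table for illustration only.
lookups = Lookups()
lookups.add_table("lemma_lookup", {
    "researchers": "researcher",
    "were": "be",
    "running": "run",
    "studies": "study",
})
lemmatizer.initialize(lookups=lookups)

doc = nlp("The researchers were running multiple groundbreaking studies")
# Tokens missing from the table keep their surface form as the lemma.
lemmas = [token.lemma_ for token in doc]
print(lemmas)
```

Lookup mode ignores part of speech. The rule-based mode listed above needs POS tags from a tagger earlier in the pipeline.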
Dependency Matcher Flow
Pattern Example:
[
  {
    "RIGHT_ID": "verb",
    "RIGHT_ATTRS": {"POS": "VERB"}
  },
  {
    "LEFT_ID": "verb",
    "REL_OP": ">",
    "RIGHT_ID": "subject",
    "RIGHT_ATTRS": {"DEP": "nsubj"}
  }
]
Download Dependency Matcher Flow JSON
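The pattern above can be tried directly against spaCy's `DependencyMatcher`. The sketch below is an assumption-laden illustration: the three-word `Doc` is annotated by hand so the code runs without a trained parser, and the match name `VERB_SUBJECT` is made up for this example.

```python
import spacy
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-annotated parse so the sketch runs without a trained model.
doc = Doc(
    nlp.vocab,
    words=["Robots", "gathered", "toys"],
    pos=["NOUN", "VERB", "NOUN"],
    heads=[1, 1, 1],  # absolute index of each token's head
    deps=["nsubj", "ROOT", "dobj"],
)

pattern = [
    # Anchor token: any verb.
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"POS": "VERB"}},
    # ">" means the left node (verb) is the immediate head of the right node.
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "subject",
     "RIGHT_ATTRS": {"DEP": "nsubj"}},
]

matcher = DependencyMatcher(nlp.vocab)
matcher.add("VERB_SUBJECT", [pattern])

pairs = []
for match_id, token_ids in matcher(doc):
    # token_ids follow the order of the pattern entries: verb, then subject.
    verb, subject = (doc[i] for i in token_ids)
    pairs.append((subject.text, verb.text))
print(pairs)
```

Adding a third entry with `"RIGHT_ATTRS": {"DEP": "dobj"}` under the same verb would extend this into a subject-verb-object match.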
Sentencizer Flow
Features:
- Automatic abbreviation detection
- Custom punctuation rules
- Quote-aware segmentation
Download Sentencizer Flow JSON
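For the custom punctuation rules, a minimal sketch with spaCy's built-in rule-based `sentencizer` looks like this. It only demonstrates the `punct_chars` setting. The abbreviation and quote handling listed above belong to the PR's component and are not reproduced here.

```python
import spacy

nlp = spacy.blank("en")
# punct_chars replaces the default set of sentence-final characters.
nlp.add_pipe("sentencizer", config={"punct_chars": [".", "!", "?", ";"]})

doc = nlp("The systems processed the data; the robots gathered the toys. Done!")
sentences = [sent.text for sent in doc.sents]
for sentence in sentences:
    print(sentence)
```

Because `;` is in the list, the first clause becomes its own sentence, which can be useful when shorter chunks are wanted for retrieval.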
Text Categorizer Flow
Supports:
- Single-label classification
- Multi-label classification
- Custom thresholds
Download Text Categorizer Flow JSON
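The difference between the two classification modes comes down to how per-category scores become labels. The sketch below is plain Python over a dictionary shaped like spaCy's `doc.cats`. The `select_labels` helper and the scores are made up for illustration and are not the PR's implementation.

```python
def select_labels(scores: dict, multilabel: bool, threshold: float = 0.5) -> list:
    """Turn per-category scores (shaped like doc.cats) into a list of labels."""
    if not scores:
        return []
    if multilabel:
        # Multi-label: categories are independent, so keep every category
        # whose score reaches the threshold.
        return sorted(label for label, score in scores.items() if score >= threshold)
    # Single-label: categories are mutually exclusive, so keep the best one.
    return [max(scores, key=scores.get)]


# Made-up scores for illustration.
topic_scores = {"SPORTS": 0.72, "POLITICS": 0.61, "TECH": 0.08}
sentiment_scores = {"POSITIVE": 0.9, "NEGATIVE": 0.1}

print(select_labels(topic_scores, multilabel=True))            # ['POLITICS', 'SPORTS']
print(select_labels(topic_scores, multilabel=True, threshold=0.7))  # ['SPORTS']
print(select_labels(sentiment_scores, multilabel=False))       # ['POSITIVE']
```

Raising the threshold trades recall for precision in the multi-label case. It has no effect in the single-label case, where exactly one category is returned.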
Tagger Flow
Tag types:
- POS (Part of Speech)
- TAG (Detailed tags)
- DEP (Dependencies)
Download Tagger Flow JSON
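The three tag types correspond to the spaCy token attributes `pos_`, `tag_`, and `dep_`. The sketch below reads them, plus morphology and head, from a `Doc` that is annotated by hand so that no trained model is needed. With a real pipeline the same attributes would be filled in by the tagger, morphologizer, and parser.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-written annotations standing in for a trained pipeline's output.
doc = Doc(
    nlp.vocab,
    words=["The", "robots", "recognize", "objects"],
    pos=["DET", "NOUN", "VERB", "NOUN"],
    tags=["DT", "NNS", "VBP", "NNS"],
    morphs=["Definite=Def|PronType=Art", "Number=Plur",
            "Tense=Pres|VerbForm=Fin", "Number=Plur"],
    heads=[1, 2, 2, 2],
    deps=["det", "nsubj", "ROOT", "dobj"],
)

rows = [
    {
        "text": token.text,
        "pos": token.pos_,    # coarse part of speech (POS)
        "tag": token.tag_,    # fine-grained tag (TAG)
        "dep": token.dep_,    # dependency label (DEP)
        "morph": str(token.morph),
        "head": token.head.text,
    }
    for token in doc
]
for row in rows:
    print(row)
```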
Entity Ruler Flow
Pattern types:
- Phrase patterns
- Token patterns
- Regex patterns
Download Entity Ruler Flow JSON
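The three pattern types map onto spaCy's built-in `entity_ruler` roughly as sketched below. The labels, the sentence, and the `ROOM` pattern are invented for this example. In spaCy a regex is written as a token pattern whose attribute value is a `REGEX` dictionary.

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # Phrase pattern: an exact string match.
    {"label": "ORG", "pattern": "Langflow"},
    # Token pattern: one dict per token, here matched case-insensitively.
    {"label": "PRODUCT", "pattern": [{"LOWER": "cleaning"}, {"LOWER": "robots"}]},
    # Regex pattern: the token text must match a regular expression.
    {"label": "ROOM", "pattern": [{"TEXT": {"REGEX": r"^[A-Z]\d+$"}}]},
])

doc = nlp("Langflow runs the Cleaning Robots flow in room B12")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

When the ruler is added to a pipeline that already has a statistical `ner` component, its position (`before="ner"` or after) decides which component's entities take precedence.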
Entity Recognizer Flow
Entity types:
- PERSON, ORG, GPE
- DATE, TIME
- MONEY, PERCENT
- Custom entities
Download Entity Recognizer Flow JSON
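Entity metadata of the kind this flow returns can be read from `doc.ents`. In the sketch below the entities are set by hand, since a trained model such as `en_core_web_sm` would be needed to predict them. The `span_for` helper and the sentence are assumptions of this example, and confidence scores are not shown.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # sentence boundaries for ent.sent

text = "Ada Lovelace worked in London. In 1843 she published her notes."
doc = nlp(text)


def span_for(phrase: str, label: str):
    """Hypothetical helper: locate a phrase in the text and label it."""
    start = text.index(phrase)
    return doc.char_span(start, start + len(phrase), label=label)


# Hand-set entities standing in for a trained NER component's predictions.
doc.ents = [
    span_for("Ada Lovelace", "PERSON"),
    span_for("London", "GPE"),
    span_for("1843", "DATE"),
]

records = [
    {
        "text": ent.text,
        "label": ent.label_,
        "start_char": ent.start_char,
        "end_char": ent.end_char,
        "sentence": ent.sent.text,  # sentence-level context
    }
    for ent in doc.ents
]
for record in records:
    print(record)
```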
🛠️ Technical Details
Implementation Features
- Full integration with Langflow's component architecture
- Comprehensive error handling and validation
- Efficient batch processing capabilities
- Dynamic configuration options
- Extensive type checking
- Memory-efficient processing
📊 Sample Data
- Sample PDF for testing
- Test cases included in JSON flows
- Example patterns and rules
- Benchmark datasets
🔗 Related Resources
- spaCy Documentation: https://spacy.io/
- Langflow Components Guide
- spaCy Models: https://spacy.io/models
- Example Datasets
👥 Contributors
- @raphaelchristi
📃 License
- MIT License (same as Langflow)