srx-english
English sentence segmentation rules based on SRX standard.
== srx-english
- https://github.com/apohllo/srx-english
= DESCRIPTION
'srx-english' is a Ruby library containing English sentence and word segmentation rules. The sentence segmentation rules are based on rules defined by Marcin Miłkowski: http://morfologik.blogspot.com/2009/11/talking-about-srx-in-lt-during-ltc.html
= FEATURES/PROBLEMS
- this library is generated by 'srx2ruby', which has some limitations and might not be 100% compliant with the SRX standard.
= INSTALL
Standard rubygems installation:
$ gem install srx-english
= BASIC USAGE
The library defines the SRX::English::SentenceSplitter class, which allows iterating over the matched sentences:
  require 'srx/english/sentence_splitter'

  text =<<-END
  This is e.g. Mr. Smith, who talks slowly...
  And this is another sentence.
  END

  splitter = SRX::English::SentenceSplitter.new(text)
  splitter.each do |sentence|
    puts sentence.gsub(/\n|\r/,"")
  end
  # This is e.g. Mr. Smith, who talks slowly...
  # And this is another sentence.
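Since the splitter yields one string per sentence, the results can be gathered into an array for later processing. The following is a minimal sketch, not part of the library: 'clean_sentences' and 'sample_sentences' are hypothetical names, and the sample array only stands in for a real SRX::English::SentenceSplitter instance so that the snippet runs without the gem installed. It assumes nothing about the splitter beyond the 'each' method shown above.

```ruby
# Hypothetical helper (not part of srx-english): collects sentences from any
# object whose #each yields sentence strings, collapsing line breaks into
# single spaces and dropping empty results.
def clean_sentences(splitter)
  sentences = []
  splitter.each do |sentence|
    cleaned = sentence.gsub(/[\r\n]+/, ' ').strip
    sentences << cleaned unless cleaned.empty?
  end
  sentences
end

# Stand-in for SRX::English::SentenceSplitter.new(text). The exact strings
# the real splitter yields (e.g. trailing newlines) are an assumption here.
def sample_sentences
  [
    "This is e.g. Mr. Smith, who talks slowly...\n",
    "And this is another sentence.\n"
  ]
end

puts clean_sentences(sample_sentences)
```

With the gem installed, the same helper should accept SRX::English::SentenceSplitter.new(text) in place of the sample array.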
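The SRX::English::WordSplitter class iterates over the tokens of a sentence in the same way, yielding each token together with its type and its start and end offsets:

  require 'srx/english/word_splitter'

  sentence = 'My home is my castle.'
  splitter = SRX::English::WordSplitter.new(sentence)
  splitter.each do |word,type,start_offset,end_offset|
    puts "'#{word}' #{type}"
  end
  # 'My' word
  # ' ' other
  # 'home' word
  # ' ' other
  # 'is' word
  # ' ' other
  # 'my' word
  # ' ' other
  # 'castle' word
  # '.' punct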
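Because whitespace and punctuation are reported as tokens too, a common follow-up step is to keep only the tokens of type 'word'. The sketch below is an illustration, not part of the library: 'words_only' and 'sample_tokens' are hypothetical names, and the sample data is hand-built from the output above so that the snippet runs without the gem. The token types are written as symbols and the offsets as start-inclusive, end-exclusive character positions; both are assumptions, and the comparison goes through 'to_s' so that it would work with either symbols or strings.

```ruby
# Hypothetical helper (not part of srx-english): keeps only the tokens whose
# type prints as "word". Accepts any object whose #each yields
# word, type, start_offset, end_offset.
def words_only(tokens)
  words = []
  tokens.each do |word, type, _start_offset, _end_offset|
    words << word if type.to_s == 'word'
  end
  words
end

# Stand-in for SRX::English::WordSplitter.new('My home is my castle.').
# Symbol types and the offset convention are assumptions, not confirmed API.
def sample_tokens
  [
    ['My', :word, 0, 2], [' ', :other, 2, 3], ['home', :word, 3, 7],
    [' ', :other, 7, 8], ['is', :word, 8, 10], [' ', :other, 10, 11],
    ['my', :word, 11, 13], [' ', :other, 13, 14], ['castle', :word, 14, 20],
    ['.', :punct, 20, 21]
  ]
end

p words_only(sample_tokens)
# prints ["My", "home", "is", "my", "castle"]
```

With the gem installed, the helper should accept a SRX::English::WordSplitter instance directly, since it relies only on the 'each' method shown above.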
== LICENSE
Copyright (C) 2011 Aleksander Pohl, Marcin Miłkowski, Jarosław Lipski
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
== FEEDBACK
- mailto:[email protected]