natural
Support for German stemmer
It would be great to have a stemmer for the German language. Maybe this is a good starting point? https://gist.github.com/2199965
I agree! One of my highest priorities for natural before fall 2012 is non-English stemmers. I was personally going to look into French, as I can likely handle that completely, but I was hoping to get native speakers to help me at least verify my work with other languages.
Would you be able either to handle the implementation or at least to help me verify its accuracy?
The algorithm you've attached, have you played with it much? Are you aware of any licensing restrictions on it?
Hi, I could try to do a simple base implementation on top of the Gist I provided, but I have to check the license first. Otherwise I can help you test yours.
But great to hear that this is on your top priorities list. :)
Feel free to take a stab at it!
Oops! I did not mean to close this.
+1 for Dutch stemming. Hopefully I can help out in some sort of way in the future.
You can use the JS Snowball port to do so:
https://github.com/fortnightlabs/snowball-js
It does change the capital letter U to lowercase though: http://code.google.com/p/urim/issues/detail?id=3
Added a Porter stemmer for Dutch. I should say that the Porter algorithm makes mistakes in Dutch and that my implementation fails in 305 of the 45669 cases in the Snowball file. That is less than 1% failure. The Snowball file also contains wrong examples; for instance, afvalstortplaats is stemmed as afvalstortplat, which is wrong: it should be afvalstortplaats.
Hugo
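For anyone who wants to reproduce this kind of accuracy check for another language, here is a minimal sketch. It assumes you have a list of input words and a parallel list of expected stems (for example, loaded from the Snowball test data). `evaluateStemmer` and `toyStem` are hypothetical names made up for illustration, not part of natural, and the three sample words and stems are invented rather than taken from the Snowball file.

```javascript
// Hypothetical helper: runs a stemmer over a word list and compares each
// result with the expected stem at the same index.
function evaluateStemmer(stem, words, expectedStems) {
  if (words.length !== expectedStems.length) {
    throw new Error('words and expectedStems must have the same length');
  }
  const failures = [];
  words.forEach((word, i) => {
    const actual = stem(word);
    if (actual !== expectedStems[i]) {
      failures.push({ word, expected: expectedStems[i], actual });
    }
  });
  return {
    total: words.length,
    failures,
    failureRate: words.length === 0 ? 0 : failures.length / words.length
  };
}

// Toy stand-in for a real stemmer: it only strips a trailing "en".
const toyStem = (word) => word.replace(/en$/, '');

// Invented sample data, only to exercise the helper.
const report = evaluateStemmer(
  toyStem,
  ['lopen', 'boeken', 'huis'],
  ['lop', 'boek', 'hui']
);
console.log(report.failures.length + ' of ' + report.total + ' failed');

// The figures quoted above: 305 failures out of 45669 words.
const quotedRate = ((305 / 45669) * 100).toFixed(2) + '%';
console.log(quotedRate); // 0.67%
```

Keeping the failing words in the report, rather than just a count, makes it easier to spot cases where the reference data itself looks wrong.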
News?
Hi, I could try to do a simple base implementation on top of the Gist I provided, but I have to check the license first.
I checked the license:
/*
* Original author: Joder Illi
*
* Copyright (c) 2010, FormBlitz AG
* All rights reserved.
* Implementation of the stemming algorithm from http://snowball.tartarus.org/algorithms/german/stemmer.html
* Copyright of the algorithm is: Copyright (c) 2001, Dr Martin Porter and can be found at http://snowball.tartarus.org/license.php
*
* Redistribution and use in source and binary forms, with or without
* modification, is covered by the standard BSD license.
*
*/
As I see it, BSD-licensed code can be integrated into an MIT-licensed code base as long as the added code keeps its original (BSD) license.
I am also considering jsSnowball, which is transpiled from the Java sources. It is licensed under BSD 3.0, which can be combined with the MIT license as well.
Source can be found here: https://github.com/mazko/jssnowball
See #663 for progress
#663 is merged.