inflections-hs

A question about titleize

Open dogweather opened this issue 8 years ago • 5 comments

Hi, before I get too far into the library, could you tell me if it can handle text with non-alphabetic characters like this?

it "handles Unicode double quotes" $ do
  titleize "“ABUSE OR NEGLECT” DEFINED." `shouldBe` "“Abuse or Neglect” Defined."

I was going down the exact same design path as inflections-hs: separating the parsing from the transformations. And then I discovered this library.

Sep 16 '17 21:09 dogweather

Off the top of my head, I'm not sure - maybe one of the other authors has a better idea. However, it should be easy to add and run the test that you suggested, and it would be a nice test to have anyway if we don't have it yet. Dependencies shouldn't be an issue since we use Stackage. Want to try and open a pull request?

Sep 16 '17 21:09 jsl

Great, I'll give it a shot. I'm a little new to Haskell. I'm using it to parse and clean up statutes.

Sep 17 '17 01:09 dogweather

All tests pass right away! Nice. But I'm unsure how the library would be used to titleize a phrase. I.e., what would the high-level code look like to titleize a Text like "the cat in the hat"?

My guess is: first use something like words to break on spaces, then apply mkWord or mkAcronym to each item in the list, using some kind of lookup table of common acronyms in the problem domain, and finally call titleize?
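
A rough sketch of that guess, assuming Text.Inflections exports mkWord, mkAcronym, SomeWord, and titleize with the types used elsewhere in this thread. The names knownAcronyms, tagToken, and titleizePhrase are hypothetical, and the acronym entries are made up for illustration:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Monad.Catch (MonadThrow)
import Data.Text (Text)
import qualified Data.Text as T
import Text.Inflections (SomeWord (..), mkAcronym, mkWord, titleize)

-- Hypothetical lookup table of domain acronyms (entries are invented).
knownAcronyms :: [Text]
knownAcronyms = ["ORS", "USC"]

-- Tag a token as an acronym if it is in the table, otherwise as a normal word.
-- Both constructors can throw, so the result stays in a MonadThrow context.
tagToken :: MonadThrow m => Text -> m SomeWord
tagToken t
  | upper `elem` knownAcronyms = SomeWord <$> mkAcronym upper
  | otherwise                  = SomeWord <$> mkWord t
  where
    upper = T.toUpper t

-- Break on whitespace, tag every token, then titleize the tagged list.
titleizePhrase :: MonadThrow m => Text -> m Text
titleizePhrase = fmap titleize . traverse tagToken . T.words
```

Tokens that contain punctuation would still be rejected by mkWord, so this only covers purely alphanumeric input.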

Sep 17 '17 03:09 dogweather

I.e., this test fails with an InflectionInvalidWord because a word can only contain alpha characters. I'm not sure how, or if, the library handles punctuation.

  it "converts a word with punctuation correctly" $ do
    neglect <- SomeWord <$> mkWord "“NEGLECT”"
    titleize [neglect] `shouldBe` "“Neglect”"

Sep 17 '17 03:09 dogweather

I think your intuition sounds reasonable, given that you'd need a dictionary of acronyms for your domain. A naive example (without acronym lookups etc.) for the string you want to convert could be the following:

λ: import Text.Inflections
λ: import qualified Data.Text as T
λ: import Control.Monad (liftM)
λ: liftM titleize $ mapM (fmap SomeWord . mkWord) (T.words (T.pack "the cat in the hat"))
"The Cat In The Hat"

In a quick read through the library, I didn't see a way to deal with punctuation either, i.e., anything that is not an acronym or a word containing only alpha characters. If that behavior really isn't implemented, you may be able to add it by defining a different WordType such as Ignored or Punctuation (see here). You could then perhaps create a more intelligent function that tags the different Words with their correct types and selectively applies the titleize function to those.
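
To make the tagging idea concrete, here is a minimal sketch that does not touch the library's types at all. It peels non-alphabetic characters off both ends of each token, changes the case of the alphabetic core only, and glues the pieces back together. Token, smallWords, and titleizeWithPunctuation are hypothetical names, and the small-word list is an assumption, not something inflections provides:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Char (isAlpha)
import Data.Text (Text)
import qualified Data.Text as T

-- Hypothetical token: an alphabetic core plus the non-alphabetic
-- characters (quotes, periods, ...) wrapped around it.
data Token = Token Text Text Text
  deriving (Eq, Show)

-- Split one whitespace-delimited chunk into leading punctuation,
-- core, and trailing punctuation.
tokenize :: Text -> Token
tokenize chunk = Token pre core post
  where
    (pre, rest) = T.span (not . isAlpha) chunk
    core        = T.dropWhileEnd (not . isAlpha) rest
    post        = T.takeWhileEnd (not . isAlpha) rest

-- Assumed list of words that stay lowercase unless they come first.
smallWords :: [Text]
smallWords = ["a", "an", "and", "in", "of", "or", "the"]

-- Upper-case the first character of an already lower-cased word.
capitalize :: Text -> Text
capitalize w = T.toUpper (T.take 1 w) <> T.drop 1 w

-- Re-case only the core; the surrounding punctuation passes through untouched.
titleizeToken :: Bool -> Token -> Text
titleizeToken isFirst (Token pre core post) = pre <> cased <> post
  where
    lower = T.toLower core
    cased
      | not isFirst && lower `elem` smallWords = lower
      | otherwise                              = capitalize lower

titleizeWithPunctuation :: Text -> Text
titleizeWithPunctuation =
  T.unwords . zipWith titleizeToken (True : repeat False) . map tokenize . T.words
```

With these definitions, titleizeWithPunctuation "“ABUSE OR NEGLECT” DEFINED." gives "“Abuse or Neglect” Defined.", which is the result the first test in this thread asks for. Acronyms are not handled here; that would need the lookup table discussed above.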

It has been a while since I looked at this code (and it has changed drastically since I wrote the initial version) so I may be off here. Hope this is somewhat helpful, though! Good luck and please keep us updated on what you come up with!

Sep 17 '17 22:09 jsl

Closing this issue for inactivity, but if the discussion is still relevant feel free to open a new issue 😄

Dec 12 '23 12:12 Centeno448