messydates

Expand the types of dates that can be extracted from texts with `as_messydate()`

Open henriquesposito opened this issue 3 years ago • 7 comments

Although we already do a very good job of expanding and converting dates from text, we should also support additional, more complex types of dates in text (e.g. negative and/or historical dates, approximate dates, date ranges, and sets of dates), as {unstruwwel} does. The obvious choice would be to rely on the {unstruwwel} package, but it is not on CRAN...

https://github.com/stefanieschneider/unstruwwel

henriquesposito, May 11 '22 11:05

Yes, let’s try and keep this a pretty independent package, and of course we cannot rely on GitHub-only packages. Perhaps make each of these extensions an additional issue to better keep track of them? Or add them as checkboxes here:

  • [ ] add text extraction for historical dates
  • [ ] add text extraction for BC/BCE dates
  • [ ] add text extraction for approximate dates
  • [ ] add text extraction for date ranges
  • [ ] add text extraction for sets of dates
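
To make a few of these checkboxes concrete, here is a rough sketch of how BC/BCE dates, approximate dates, and date ranges might be picked out of text with regular expressions. It is written in Python purely for illustration and is not how messydates or {unstruwwel} implement this. The `annotate()` helper, the patterns, and the output markers (`~` for approximate, `..` for ranges, in the style of ISO 8601-2) are all assumptions. Sets of dates and other historical forms are not covered.

```python
import re

# Hypothetical patterns for illustration only; real text needs many more variants.
RANGE = re.compile(
    r"\b(?:from|between)\s+(\d{4})\s+(?:to|and|until)\s+(\d{4})\b", re.IGNORECASE
)
BCE = re.compile(r"\b(\d{1,4})\s*(?:BCE|BC)\b")
APPROX = re.compile(r"\b(?:circa|ca\.|c\.|around|about)\s*(\d{4})\b", re.IGNORECASE)


def annotate(text):
    """Return an annotated year string for the first pattern that matches, else None."""
    m = RANGE.search(text)
    if m:
        return f"{m.group(1)}..{m.group(2)}"  # date range
    m = BCE.search(text)
    if m:
        # Naive choice: prefix a minus sign and zero-pad to four digits.
        # Astronomical year numbering would shift BCE years by one (1 BCE = year 0000).
        return f"-{int(m.group(1)):04d}"
    m = APPROX.search(text)
    if m:
        return f"{m.group(1)}~"  # approximate year
    return None


print(annotate("ruled from 1914 to 1918"))
print(annotate("assassinated in 44 BC"))
print(annotate("built circa 1850"))
```

Checking ranges first is a deliberate choice here, so that a phrase with two years is not reduced to a single year by a later pattern.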

jhollway, May 11 '22 21:05

  • [ ] add date inferences for text extraction (e.g. "signed on the last day of February 2004")
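
As an illustration of the kind of inference meant here, the sketch below resolves "the last day of <month> <year>" to a full date, leap years included. It is in Python for illustration only. The `infer_last_day()` helper and its pattern are hypothetical and handle just this one phrasing with English month names.

```python
import calendar
import re

MONTHS = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]
# Hypothetical pattern: only matches the literal phrase "last day of <month> <year>".
LAST_DAY = re.compile(r"last day of (\w+) (\d{4})", re.IGNORECASE)


def infer_last_day(text):
    """Return the ISO date for 'last day of <month> <year>', or None if not found."""
    m = LAST_DAY.search(text)
    if not m or m.group(1).lower() not in MONTHS:
        return None
    month = MONTHS.index(m.group(1).lower()) + 1
    year = int(m.group(2))
    # monthrange() returns (weekday of the first day, number of days in the month).
    day = calendar.monthrange(year, month)[1]
    return f"{year:04d}-{month:02d}-{day:02d}"


print(infer_last_day("signed on the last day of February 2004"))
```

Other inferences ("the first Monday of", "mid-1990s", and so on) would each need their own rule, which is part of why this is a larger feature than it first looks.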

henriquesposito, May 13 '22 08:05

  • [ ] extract multiple dates from text (currently only extracts first one per row)
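
A minimal sketch of the difference between taking the first match and taking all matches per row, again in Python for illustration only. The `extract_all()` helper is hypothetical and only recognises the "day Month year" form with English month names. How several dates from one row should then be stored (for example as a set) is left open.

```python
import re

MONTH_NAMES = (
    "January|February|March|April|May|June|"
    "July|August|September|October|November|December"
)
MONTH_NUM = {name: i + 1 for i, name in enumerate(MONTH_NAMES.split("|"))}
# Hypothetical pattern: day, English month name, four-digit year.
DATE = re.compile(rf"\b(\d{{1,2}}) ({MONTH_NAMES}) (\d{{4}})\b")


def extract_all(text):
    """Return every 'day Month year' date in the text as an ISO string, in order."""
    return [
        f"{year}-{MONTH_NUM[month]:02d}-{int(day):02d}"
        for day, month, year in DATE.findall(text)
    ]


print(extract_all("Signed on 3 May 1990 and ratified on 12 June 1992."))
```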

henriquesposito, May 16 '22 07:05

I think this is a watershed feature for this package. Do we want to offer this as part of the package when we write up the paper, or is it a non-core addition?

jhollway, May 19 '22 00:05

I tend to agree, but I am not sure this is a core addition (or that it would make it into the paper). Since we already extract spelled-out dates from text very well, we can decide at a later stage whether to add these features. For the future, maybe we should contact the developer of {unstruwwel} and ask whether there are any plans to get the package on CRAN before we start extending these functions.

henriquesposito, May 19 '22 07:05

  • [ ] extract dates in Roman numerals
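
A rough sketch of one way Roman-numeral years could be converted before the usual date parsing runs, in Python for illustration only. The helpers `roman_to_int()` and `replace_roman_years()` are hypothetical. The sketch assumes upper-case numerals between 2 and 3999 and skips single letters so that the pronoun "I" is left alone.

```python
import re

VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
# Candidate words made only of Roman-numeral letters, at least two characters long.
WORD = re.compile(r"\b[MDCLXVI]{2,}\b")
# Strict shape of a well-formed numeral (thousands, hundreds, tens, units).
STRICT = re.compile(r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")


def roman_to_int(numeral):
    """Convert a well-formed Roman numeral to an integer."""
    total = 0
    for ch, nxt in zip(numeral, numeral[1:] + " "):
        value = VALUES[ch]
        # A smaller value placed before a larger one is subtracted (e.g. IV, CM).
        total += -value if VALUES.get(nxt, 0) > value else value
    return total


def replace_roman_years(text):
    """Replace well-formed upper-case Roman numerals in the text with digits."""
    def convert(match):
        word = match.group()
        return str(roman_to_int(word)) if STRICT.fullmatch(word) else word
    return WORD.sub(convert, text)


print(replace_roman_years("I signed it in MMIV"))
```

Short upper-case tokens such as "CD", "CV" or "MIX" are also well-formed numerals, so a step like this would still need a check for false positives.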

henriquesposito, May 20 '22 08:05

  • [ ] test for false positives
  • [ ] consider other languages

henriquesposito, Feb 22 '24 09:02