Expand the types of dates that can be extracted from texts with `as_messydate()`
Although we already do a very good job of expanding and converting dates from text, we should also add support for the additional, more complex types of dates that {unstruwwel} handles (i.e. negative and/or historical dates, approximate dates, date ranges, and sets of dates). The obvious choice would be to rely on the {unstruwwel} package, but it is not on CRAN...
https://github.com/stefanieschneider/unstruwwel
Yes, let’s try and keep this a pretty independent package, and of course we cannot rely on GitHub-only packages. Perhaps make each of these extensions an additional issue to better keep track of them? Or add them as checkboxes here:
- [ ] add text extraction for historical dates
- [ ] add text extraction for BC/BCE dates
- [ ] add text extraction for approximate dates
- [ ] add text extraction for date ranges
- [ ] add text extraction for sets of dates
- [ ] add date inferences for text extraction (e.g. "signed on the last day of February 2004")
- [ ] extract multiple dates from text (currently only extracts first one per row)
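To make the last item concrete, here is a minimal base R sketch of returning every date in a row instead of only the first. The helper name `extract_all_dates()` and the two patterns it covers (ISO-style dates and "day month year", optionally followed by BC/BCE) are assumptions for illustration only; this is not how `as_messydate()` currently works.

```r
# Hypothetical helper, not part of {messydates}: returns every date-like
# string found in each element of a character vector.
extract_all_dates <- function(text) {
  months <- paste(month.name, collapse = "|")
  pattern <- paste0(
    "\\b\\d{4}-\\d{2}-\\d{2}\\b",                         # e.g. 2004-02-29
    "|\\b\\d{1,2} (", months, ") \\d{1,4}( (BCE|BC))?"    # e.g. 15 March 44 BC
  )
  # gregexpr() finds all matches per element; regmatches() extracts them
  regmatches(text, gregexpr(pattern, text, perl = TRUE))
}

extract_all_dates(c(
  "Signed on 12 May 1999 and ratified on 2001-03-04.",
  "Caesar died on 15 March 44 BC."
))
```

The result is a list with one character vector per input row, so rows with several dates, one date, or no date can all be handled before conversion.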
I think this is a watershed feature for this package. Do we want to offer this as part of the package when we write up the paper, or is it a non-core addition?
I tend to agree, but I am not sure this is a core addition (or that it would make it into the paper). Since we already extract spelled-out dates from text very well, we can decide whether to add these features at a later stage... For the future, maybe we should contact the developer of {unstruwwel} and ask whether there are plans to get the package on CRAN before we start extending these functions.
- [ ] extract dates in Roman numerals
- [ ] test for false positives
- [ ] consider other languages
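For the Roman numeral item, base R already ships `utils::as.roman()`, so a first pass could look roughly like the sketch below. The helper name `extract_roman_years()` is hypothetical, and the requirement of at least four numeral characters is an arbitrary assumption meant to reduce false positives such as the pronoun "I"; it would also miss short numerals, so it needs testing against real texts.

```r
# Hypothetical helper, not part of {messydates}: finds upper-case Roman
# numeral candidates of four or more characters and converts them to integers.
extract_roman_years <- function(text) {
  candidates <- regmatches(
    text,
    gregexpr("\\b[MDCLXVI]{4,}\\b", text, perl = TRUE)
  )
  # as.roman() parses the numeral; as.integer() drops the "roman" class
  lapply(candidates, function(x) as.integer(utils::as.roman(x)))
}

extract_roman_years("Signed in MMIV, I think, and first drafted in MCMXCIX.")
```

Upper-case words made only of these letters would still be picked up as candidates, which is one reason the false-positive tests listed above matter.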