newspaper4k
Regex issue while parsing the date from the URL
Issue by vashis
Thu May 17 09:49:16 2018
Originally opened as https://github.com/codelucas/newspaper/issues/566
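Example: for https://www.sciencedaily.com/releases/2018/05/180515105704.htm, the date is extracted from the URL as 2018/05/18, which is not correct: the 18 comes from the first two digits of the numeric file name 180515105704, not from a day segment in the path. The change below prevents this.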
STRICT_DATE_REGEX = r'(?<=\W)([\./\-]{0,1}(19|20)\d{2})[\./\-]{0,1}(([0-3]{0,1}[0-9][\./\-])|(\w{3,5}[\./\-]))([0-3]{0,1}[0-9][\./\-]{1})?'
It is better to use {1} at the end instead of {0,1}, so that the day digits only match when a separator follows them.
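A quick way to check the effect is to run both variants against the URL from the report. The sketch below is only illustrative: `LOOSE_DATE_REGEX` is an assumed "before" pattern that differs from the proposed one solely in the trailing quantifier, `matched_date_text` is a hypothetical helper, and the `example.com` URL is made up to show that a real day segment still matches.

```python
import re

# Assumed "before" variant: identical to the proposal except for the
# optional trailing separator ({0,1}) in the day group.
LOOSE_DATE_REGEX = r'(?<=\W)([\./\-]{0,1}(19|20)\d{2})[\./\-]{0,1}(([0-3]{0,1}[0-9][\./\-])|(\w{3,5}[\./\-]))([0-3]{0,1}[0-9][\./\-]{0,1})?'

# Pattern proposed in this issue: the day group requires exactly one separator.
STRICT_DATE_REGEX = r'(?<=\W)([\./\-]{0,1}(19|20)\d{2})[\./\-]{0,1}(([0-3]{0,1}[0-9][\./\-])|(\w{3,5}[\./\-]))([0-3]{0,1}[0-9][\./\-]{1})?'


def matched_date_text(pattern, url):
    """Hypothetical helper: return the substring the pattern treats as a date."""
    match = re.search(pattern, url)
    return match.group(0) if match else None


reported_url = 'https://www.sciencedaily.com/releases/2018/05/180515105704.htm'
# Made-up URL with a genuine year/month/day path, for comparison.
dated_url = 'https://example.com/2018/05/17/story.html'

print(matched_date_text(LOOSE_DATE_REGEX, reported_url))   # 2018/05/18
print(matched_date_text(STRICT_DATE_REGEX, reported_url))  # 2018/05/
print(matched_date_text(STRICT_DATE_REGEX, dated_url))     # 2018/05/17/
```

With the stricter quantifier, the reported URL yields only the year and month, while a path that really contains a day segment is still matched in full.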