Start and End indexes don't match found dates

Open Nogginboink opened this issue 6 years ago • 2 comments

I'm seeing, for instance, that the text 'September' is correctly found as a date, but the match comes back with start and end indices 12 characters apart, even though 'September' is only a 9-character string. The returned start and end indices appear to be wrong.

My sample text is:

I was going to go to the dance in September, but on 3/15 I got a message that the event is on Tuesday instead of the original 2018 schedule. Mark wanted to go on Jan. 13 but Sally thought 2/4/17 was better than 2018 March 3. He wasn't as sure as she was about the things in his bag.

We thought it would start at 10PM, but the actual start was at 9:48:32, or 2148 if you're a military person.

It all happened on Christmas, believe it or not!

Results

I get my first match on 'September' at startIndex 33 and endIndex 45. It should be startIndex 34 and endIndex 43.
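
To make the discrepancy concrete, here is a minimal check using plain string slicing on the start of the sample text. It does not call datefinder at all; the two index pairs are simply the values quoted above, and the variable names are only for illustration.

```python
# Start of the sample text, cut off shortly after the first date.
text = "I was going to go to the dance in September, but on 3/15 I got a message"

reported = (33, 45)  # indices returned alongside the match
expected = (34, 43)  # indices that actually delimit 'September'

# Slicing with the reported indices also picks up the surrounding characters.
print(repr(text[reported[0]:reported[1]]))  # ' September, '
print(repr(text[expected[0]:expected[1]]))  # 'September'
```

The reported span covers the leading space and the trailing comma and space, which accounts for the three extra characters.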

Nogginboink • Mar 21 '18 15:03

I haven't looked at this thoroughly yet, but I suspect the problem is here:

        ## sanitize date string
        ## replace unhelpful whitespace characters with single whitespace
        match_str = re.sub('[\n\t\s\xa0]+', ' ', match_str)
        match_str = match_str.strip(self.STRIP_CHARS)

        ## Save sanitized source string
        yield match_str, indices, captures

The found string is stripped, but the start and end indexes are not adjusted.

Nogginboink • Mar 21 '18 15:03

I've found that the following changes appear to work for me. If I can figure out this whole Git thing, I'll submit a pull request later:

        ## sanitize date string
        ## replace unhelpful whitespace characters with single whitespace
        #match_str = re.sub('[\n\t\s\xa0]+', ' ', match_str)

        ## strip leading characters and move the start index forward by the number removed
        preStripLength = len(match_str)
        match_str = match_str.lstrip(self.STRIP_CHARS)
        indices = (indices[0] + preStripLength - len(match_str), indices[1])

        ## strip trailing characters and move the end index back by the number removed
        preStripLength = len(match_str)
        match_str = match_str.rstrip(self.STRIP_CHARS)
        indices = (indices[0], indices[1] - (preStripLength - len(match_str)))
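
For anyone who wants to try the idea outside the library, here is a standalone sketch of the same adjustment. The helper name `strip_with_indices` and the `STRIP_CHARS` value below are stand-ins I made up for the example, not datefinder's own definitions, and the raw span `(33, 45)` is just the one reported above.

```python
# Stand-in for self.STRIP_CHARS; the library's real value may differ.
STRIP_CHARS = " \n\t,"


def strip_with_indices(match_str, indices, strip_chars=STRIP_CHARS):
    """Strip both ends of match_str and shrink (start, end) by the same amounts."""
    start, end = indices

    # Characters removed on the left move the start index forward.
    left_stripped = match_str.lstrip(strip_chars)
    start += len(match_str) - len(left_stripped)

    # Characters removed on the right move the end index back.
    fully_stripped = left_stripped.rstrip(strip_chars)
    end -= len(left_stripped) - len(fully_stripped)

    return fully_stripped, (start, end)


text = "I was going to go to the dance in September, but on 3/15 I got a message"
raw_indices = (33, 45)
raw_match = text[raw_indices[0]:raw_indices[1]]

match_str, indices = strip_with_indices(raw_match, raw_indices)
print(repr(match_str), indices)  # 'September' (34, 43)
```

One caveat: slicing the source text with the adjusted indices only reproduces the returned string if the whitespace-collapsing substitution stays disabled, because collapsing a run of whitespace inside the match changes its length.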

Nogginboink • Mar 21 '18 16:03