
Multiple hits on a single index

Open whitesided opened this issue 11 years ago • 4 comments

Maybe this isn't a bug but intended behavior? I'm not sure why it would be, unless the goal is to surface ambiguity in section identification and let the end user decide.

In 113hr2642eh we get three hits (all at the same index) on the same string:

[
  {
    "type": "usc",
    "match": "7 U.S.C. 950aaa-2(d)",
    "index": 811965,
    "usc": {
      "title": "7",
      "section": "950aaa-2",
      "subsections": [
        "d"
      ],
      "id": "usc/7/950aaa-2/d",
      "section_id": "usc/7/950aaa-2"
    }
  },
  {
    "type": "usc",
    "match": "7 U.S.C. 950aaa-2(d)",
    "index": 811965,
    "usc": {
      "title": "7",
      "section": "950aaa",
      "subsections": [],
      "id": "usc/7/950aaa",
      "section_id": "usc/7/950aaa"
    }
  },
  {
    "type": "usc",
    "match": "7 U.S.C. 950aaa-2(d)",
    "index": 811965,
    "usc": {
      "title": "7",
      "section": "2",
      "subsections": [
        "d"
      ],
      "id": "usc/7/2/d",
      "section_id": "usc/7/2"
    }
  }
]

I'm just going to skip subsequent hits on the same index and use the first one I find to cope with this for the moment.
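A minimal sketch of that workaround, assuming the results arrive as an array shaped like the output above. The helper name `firstHitPerIndex` is made up for illustration and is not part of the library:

```javascript
// Keep only the first hit reported at each index and drop later hits
// that share that index. `hits` is assumed to look like the JSON above.
function firstHitPerIndex(hits) {
  const seen = new Set();
  return hits.filter((hit) => {
    if (seen.has(hit.index)) return false;
    seen.add(hit.index);
    return true;
  });
}

// Trimmed copy of the three hits from this issue.
const sampleHits = [
  { match: "7 U.S.C. 950aaa-2(d)", index: 811965, usc: { id: "usc/7/950aaa-2/d" } },
  { match: "7 U.S.C. 950aaa-2(d)", index: 811965, usc: { id: "usc/7/950aaa" } },
  { match: "7 U.S.C. 950aaa-2(d)", index: 811965, usc: { id: "usc/7/2/d" } },
];

console.log(firstHitPerIndex(sampleHits).map((hit) => hit.usc.id));
```

This relies on the hit I want being listed first for a given index, which is what I see in the output above but is not something I know to be guaranteed.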

whitesided avatar Dec 05 '13 02:12 whitesided

This doesn't look intended to me. It looks like a very nice bug that deserves a very nice test case. Your workaround sounds right to me in the interim. Thanks for filing this, I'll work out a fix for this soon. I'm hoping to spend a bunch of time tomorrow or Friday on many of this project's open tickets.

konklone avatar Dec 05 '13 02:12 konklone

Never promise a timeline on a GitHub ticket! I got swamped, mostly with the /licensing repo. I'll get to this, and the other tickets, soon.

konklone avatar Dec 07 '13 00:12 konklone

So this is actually expected behavior. When the citator finds a cite where it can't know whether a hyphen is part of a single section number or separates two sections, it returns all of the candidates, erring on the side of letting the user decide.

Here's the code dealing with establishing ambiguity and parsing of ranges: https://github.com/unitedstates/citation/blob/master/citations/usc.js#L48-L65

This is because I built it originally to support a search engine, where you'd want to turn up too many results instead of too few. For a markup tool, I can see why you'd want to be stricter about it. But ultimately, the problem is that we can't always be certain whether a hyphen indicates two sections or one.

There are a couple of unambiguous situations. If there's a double section symbol (§§), it assumes a range. If there's a parenthesis before the hyphen, it also assumes a range: the parenthesis denotes a subsection, so the section-level identifier has already ended by the time the hyphen appears.
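Roughly, those rules can be sketched like this. It is an illustration only, not the code in usc.js: the function name `classifyHyphen` and its string return values are hypothetical, and it works on the raw matched text rather than on whatever the real parser sees.

```javascript
// Classify how the first hyphen in a cite should be read, following the
// two rules described above. Returns "single", "range", or "ambiguous".
function classifyHyphen(cite) {
  const hyphen = cite.indexOf("-");
  if (hyphen === -1) return "single"; // no hyphen, nothing to decide

  // Rule 1: a double section symbol signals a range of sections.
  if (cite.includes("§§")) return "range";

  // Rule 2: a parenthesis before the hyphen means the section-level
  // identifier has already ended, so the hyphen must start a range.
  const paren = cite.indexOf("(");
  if (paren !== -1 && paren < hyphen) return "range";

  // Otherwise the hyphen could belong to one section number ("950aaa-2")
  // or separate two sections ("950aaa" and "2").
  return "ambiguous";
}

console.log(classifyHyphen("7 U.S.C. 950aaa-2(d)")); // ambiguous
console.log(classifyHyphen("7 U.S.C. §§ 1-5"));      // range
console.log(classifyHyphen("7 U.S.C. 2(d)-3"));      // range
```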

@phearlez, how do you want to handle ambiguous sections? We could add an option that gets passed into the usc citator that instructs the processor whether to be generous or strict with ambiguous ranges. Or, you could make a client-side decision about it. And/or, the library could return an ambiguous: true flag on detected cites where it wasn't certain.
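For the client-side route, something like the following could stand in for an `ambiguous` flag until the library has one. It is only a sketch: `flagAmbiguous` is a hypothetical helper, and it assumes that several hits sharing one index always mean the cite was ambiguous, which holds for the example in this ticket but hasn't been checked more broadly.

```javascript
// Annotate each hit with `ambiguous: true` when other hits share its index.
// Returns new objects; the input hits are left untouched.
function flagAmbiguous(hits) {
  const counts = new Map();
  for (const hit of hits) {
    counts.set(hit.index, (counts.get(hit.index) || 0) + 1);
  }
  return hits.map((hit) => ({ ...hit, ambiguous: counts.get(hit.index) > 1 }));
}
```

A markup tool could then skip or queue for review anything flagged, while a search engine could keep every hit.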

konklone avatar May 10 '14 16:05 konklone

For my purposes it's sufficient to have the more aggressive hit listed first; I've coped with this already by simply always using the first hit, basically assuming that the more "greedy" hit will be ordered first. An option to avoid ambiguity would be fine as well.

Our auto-tagging runs on an assumption that there will still be some human eyes on things eventually; it's a helping hand, not a replacement for involvement. So I'm open to either way.

whitesided avatar May 12 '14 13:05 whitesided