No entity detection for lowercased entities?

Open Zoher15 opened this issue 3 years ago • 4 comments

How do I use BLINK for lowercased entities?

Zoher15 avatar Jun 08 '21 23:06 Zoher15

@Zoher15 Hi, BLINK uses FLAIR for entity detection, and FLAIR does not work very well on lowercased entities. Do you have cased data?

ledw avatar Jun 09 '21 16:06 ledw

@ledw My data is not always well cased (as is typical of data from Twitter, etc.). There was a paper about this from Dan Roth's group at UPenn: link. This might become a limitation for the amazing tool that BLINK looks to be.

Zoher15 avatar Jun 09 '21 16:06 Zoher15

@ledw Are there any plans to fix this? I know ELQ handles lowercase, but it is limited to only 512 tokens. I would also train it myself by converting a portion of the training entities to lowercase using the methodology in this paper, but the training is really resource intensive.

Zoher15 avatar Jul 15 '21 04:07 Zoher15

Since ELQ handles lowercase very accurately, why not just split your documents into chunks of 512 tokens or fewer? This isn't a problem where you need the entire document to make a decision. The entities are resolved using a much smaller local context window anyway, so I can't imagine you'd lose much accuracy. There might be a small accuracy hit for entities whose required resolution context is in a different chunk, but I would think that'd be a rare edge case.

shellshock1911 avatar Jul 28 '21 04:07 shellshock1911
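
For anyone who wants to try the chunking idea above, here is a minimal sketch of splitting a token list into windows of at most 512 tokens. It is not part of BLINK or ELQ: `chunk_tokens`, `max_len`, and `overlap` are hypothetical names made up for illustration. The overlap between windows is an added assumption, meant to reduce the edge case where an entity's context falls in a neighboring chunk. The sketch also assumes you already have a token list; the 512 limit presumably counts the model's own subword tokens, so splitting on whitespace will undercount and a smaller `max_len` may be safer.

```python
from typing import List, Tuple


def chunk_tokens(
    tokens: List[str], max_len: int = 512, overlap: int = 64
) -> List[Tuple[int, List[str]]]:
    """Split `tokens` into windows of at most `max_len` tokens.

    Consecutive windows share `overlap` tokens. Each result is a
    (start_offset, window) pair, so a mention found at position i in a
    window maps back to position start_offset + i in the full document.
    Hypothetical helper, not a BLINK or ELQ function.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")

    step = max_len - overlap
    chunks: List[Tuple[int, List[str]]] = []
    start = 0
    while start < len(tokens):
        chunks.append((start, tokens[start:start + max_len]))
        # Stop once the current window already reaches the end of the document.
        if start + max_len >= len(tokens):
            break
        start += step
    return chunks


if __name__ == "__main__":
    # Whitespace splitting is only a stand-in for the model's real tokenizer.
    document = "barack obama met angela merkel in berlin " * 200
    for offset, window in chunk_tokens(document.split(), max_len=512, overlap=64):
        print(offset, len(window))
```

Each window can then be passed to the entity linker separately. Mentions that appear in the overlapping region of two windows will be predicted twice and need to be de-duplicated using the returned offsets.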