
CoreML BERT crashing on long text

Open heysaik opened this issue 6 years ago • 6 comments

For documents with many words, BERT ends up crashing with the error: `Fatal error: 'try!' expression unexpectedly raised an error: App.TokenizerError.tooLong("Token indices sequence length is longer than the specified maximum\nsequence length for this BERT model (784 > 512. Running this\nsequence through BERT will result in indexing errors\".format(len(ids), self.max_len)")`

How do you solve this, or does BERT only work for paragraphs with fewer words? Can we increase the maxLen to 1024 or even 2048, or would that not work?

heysaik avatar Aug 30 '19 22:08 heysaik

Increasing maxLen wouldn't work, as the maximum sequence length is a property of the model itself (its positional embeddings are only trained for 512 positions).

One way to work around this would be to split your paragraph into slices of up to maxLen, potentially overlapping.
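
A minimal sketch of that kind of overlapping split, assuming the text has already been tokenized into `tokenIds`; the function name and the `maxLen`/`stride` parameters are illustrative, and in practice `maxLen` also has to leave room for the question and special tokens:

```swift
// Minimal sketch: split a token-ID sequence into overlapping windows.
func overlappingChunks(of tokenIds: [Int], maxLen: Int = 512, stride: Int = 128) -> [[Int]] {
    precondition(stride < maxLen, "stride must be smaller than maxLen")
    var chunks: [[Int]] = []
    var start = 0
    while start < tokenIds.count {
        let end = min(start + maxLen, tokenIds.count)
        chunks.append(Array(tokenIds[start..<end]))
        if end == tokenIds.count { break }
        start += maxLen - stride // advance, keeping `stride` tokens of overlap
    }
    return chunks
}
```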

julien-c avatar Aug 30 '19 22:08 julien-c

If I do that, then won't I get a bunch of answers for a particular question, one per slice? How would I know which answer to choose?

heysaik avatar Aug 30 '19 22:08 heysaik

You can just compare the output logit values and take the max.
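
A rough sketch of what that comparison could look like, assuming each slice's prediction has been extended to expose its best start/end logits; the `ChunkAnswer` struct and its fields are assumptions for illustration, not part of the current API:

```swift
// Hypothetical per-slice result; the current prediction API does not expose
// logits, so this struct and its fields are assumptions.
struct ChunkAnswer {
    let answer: String
    let startLogit: Double
    let endLogit: Double
}

// Keep the answer whose combined start+end logit is largest across all slices.
func bestAnswer(among results: [ChunkAnswer]) -> ChunkAnswer? {
    results.max { ($0.startLogit + $0.endLogit) < ($1.startLogit + $1.endLogit) }
}
```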

julien-c avatar Aug 31 '19 00:08 julien-c

How do you get these values? `prediction` only outputs `start`, `end`, `tokens`, and `answer`.

Sorry for all the questions, I'm not much of an expert in neural nets or machine learning. 😅

heysaik avatar Sep 01 '19 06:09 heysaik

Hmm, yeah, you would need to dive into the code and implement it. It's not going to work out of the box, unfortunately.

julien-c avatar Sep 01 '19 13:09 julien-c

Has anyone made a method for doing this? I have looked online and have been unable to find anything.

mbalfakeih avatar Oct 17 '19 13:10 mbalfakeih