joni
joni copied to clipboard
how to interrupt hanging thread?
call thread.interrupt() and nothing happened. so how to stop the hanging thread?
Charset _charset = Charset.forName("GB18030");
/* text containing irregular binary data will make thread hang */
Thread thread = new Thread(new Runnable() {
@Override
public void run() {
try {
String key = "a"; // any character
byte[] pattern = key.getBytes(_charset);
Regex regex = new Regex(pattern, 0, pattern.length, Option.IGNORECASE, GB18030Encoding.INSTANCE);
byte[] source = new byte[]{0x2f, 0x2f, (byte) 0xaf}; // text content.
/* Encoded by GB18030, It reads "//�" where � means that "0xaf" is wrong or unsupported? */
System.out.println(new String(source, _charset));
Matcher matcher = regex.matcher(source);
// search Interruptible ?
int idx=matcher.searchInterruptible(0, source.length, Option.DEFAULT);
System.out.println(idx+"");
} catch (InterruptedException e) {
System.out.println("InterruptedException");
e.printStackTrace();
}
}
});
thread.start();
new Thread(new Runnable() {
@Override
public void run() {
try {
Thread.sleep(500);
} catch (InterruptedException e) {
e.printStackTrace();
}
System.out.println("interrupt !!! ");
thread.interrupt(); // called but not working.
}
}).start();
v2.1.30
Hmmm I would have expected that to work. Trying locally.
Ok, so a bit of background here. The original intent of the interruptibility was to ensure long-running regular expressions could be stopped. By long running here we mean things like regular expressions with complex alternations... not regular expressions that get stuck because of bad character encoding.
This case, and likely many others, are not interruptible because they get stuck in a tight inner loop trying endlessly to advance a byte offset for a broken multibyte characters. In this case, the code get stuck here:
https://github.com/jruby/joni/blob/495de017b9a034f9f48dab891db06f1afb3f4b10/src/org/joni/Search.java#L204-L207
Adding an interrupt check here would add a lot of overhead to this search logic, so I'm reluctant to do that. We have discussed making joni more robust in the presence of bad character data, however, since there's several places like this that can get stuck.
So unfortunately right now there's no workaround to make interrupting work for your case. It should really be an error. We'll discuss a bit and see if we can figure out a good path forward.
what about this ?
int max = enc.maxLength();
if(max==1) {
while (s < end) {
if (lowerCaseMatch(target, targetP, targetEnd, text, s, textEnd, enc, buf, regex.caseFoldFlag)) return s;
s += enc.length(text, s, textEnd);
}
} else {
int count = s;
while (s < textRange) {
if (lowerCaseMatch(target, targetP, targetEnd, text, s, textEnd, enc, buf, regex.caseFoldFlag)) return s;
int val = enc.length(text, s, textEnd);
if (val > 0) {
s += val;
count += val;
} else {
if (max + s + val < count) {
s += 1;
count = s;
} else { //I believe it shouldn't lag too much behind ?
s += val;
count += 1;
}
}
}
}
besides SLOW_IC_FORWARD, I've also modified MAP_FORWARD,they are similiar.
I wonder is it bad though.
This looks like a decent change! We were discussing on the JRuby matrix how to fix this, and I believe a similar fix was suggested. @lopex What do you think?
I'm reluctant to such changes since they will introduce inconsistencies and diverge us from onigmo even more. The problem is explained more thoroughly in https://github.com/jruby/jcodings/issues/26
Also, https://github.com/jruby/jcodings/issues/26 shows that those infinite loops will exist in other places and it would not be ideal to randomly patch susceptible callsites.
For this very case, a quick fix would be to add an unsafe GB18030 version in the same fashion https://github.com/jruby/jcodings/tree/unsafe-encoding aims to.