Support text containing XML/SGML tags

Open simonmeoni opened this issue 10 years ago • 6 comments

I have a problem when I execute this code. I only removed the -sgml argument, and the problem occurs only when that argument is absent. The process never terminates: the program enters an infinite loop when it executes the process function. The infinite loop is at line 591 of the TreeTaggerWrapper.class file. I tried to debug it, but without success. Do you have any idea what the problem is? Thanks in advance, Simon

simonmeoni avatar Dec 11 '15 11:12 simonmeoni

TT4J wraps the actual text in tags like "<This-is-the-start-of-the-text />" and "<This-is-the-end-of-the-text />". If it doesn't see these tags again, unchanged, in the TreeTagger output, it will hang. Cf. line 969 in TreeTaggerWrapper 1.2.1.
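To illustrate the hang, here is a minimal sketch, not the actual TT4J reader. It assumes that without -sgml the marker lines come back with extra tagger columns appended, which this thread suggests but does not confirm. The class name, the `collect` helper, and the sample records are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class MarkerScanSketch {
    static final String START = "<This-is-the-start-of-the-text />";
    static final String END = "<This-is-the-end-of-the-text />";

    /**
     * Collects the records found between the two markers, matching the
     * markers exactly. Returns null when the end marker is never matched;
     * a reader blocked on a live process stream would wait forever there.
     */
    static List<String> collect(List<String> output) {
        List<String> records = new ArrayList<>();
        boolean inText = false;
        for (String line : output) {
            if (line.equals(START)) {
                inText = true;
                continue;
            }
            if (line.equals(END)) {
                return records;
            }
            if (inText) {
                records.add(line);
            }
        }
        return null; // end marker never seen
    }

    public static void main(String[] args) {
        // Markers echoed unchanged (assumed behaviour with -sgml).
        List<String> echoed = List.of(START, "a\tDT\ta", END);
        // Markers tagged like ordinary tokens (assumed behaviour without -sgml).
        List<String> tagged = List.of(
                START + "\tNN\t<unknown>", "a\tDT\ta", END + "\tNN\t<unknown>");

        System.out.println("echoed: " + collect(echoed));
        System.out.println("tagged: " + collect(tagged));
    }
}
```

In the second case the exact comparison never matches, so the loop runs out of input here, while a reader attached to a running process would keep waiting.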

reckart avatar Dec 11 '15 11:12 reckart

The problem is due to these two variables:

private static final String STARTOFTEXT = "<This-is-the-start-of-the-text />";
private static final String ENDOFTEXT = "<This-is-the-end-of-the-text />";

TreeTagger needs to ignore these SGML tags to work correctly with the wrapper. Is it possible not to send these two strings? I think the problem comes from here (line 1120 of TreeTaggerWrapper.class):

    void run()
    {
        try {
            final OutputStream os = _proc.getOutputStream();

            _pw = new PrintWriter(new BufferedWriter(
                new OutputStreamWriter(os, _model.getEncoding())));

            send(STARTOFTEXT);

            while (tokenIterator.hasNext()) {
                O token = tokenIterator.next();
                _lastTokenWritten = token;
                _tokensWritten++;
                send(getText(token));
            }

            send(ENDOFTEXT);
            send(_model.getFlushSequence());
        }
        catch (final Throwable e) {
            _exception = e;
        }
    }

Thanks in advance, Simon

simonmeoni avatar Dec 11 '15 13:12 simonmeoni

I have found the solution. I replaced line 969 of TreeTaggerWrapper.class with this:

                if (outRecord.contains(STARTOFTEXT)) {
                    inText = true;
                    if (TRACE) {
                        System.err.println("["+TreeTaggerWrapper.this+
                                "|TRACE] ("+_tokensRead+") START ["+outRecord+"]");
                    }
                    continue;
                }

                if (outRecord.contains(ENDOFTEXT)) {
                    if (TRACE) {
                        System.err.println("["+TreeTaggerWrapper.this+
                                "|TRACE] ("+_tokensRead+") COMPLETE ["+outRecord+"]");
                    }
                    break;
                }

and it works when I don't have the -sgml option :).

simonmeoni avatar Dec 11 '15 16:12 simonmeoni

Thanks for testing this. I'll implement a different solution though that doesn't change existing behavior. What I will do is: check if the "-sgml" flag is present (the default). If the flag is present, continue with the present code. If the flag is not present, try checking specifically if the token text is the start/end marker, probably using "startsWith" instead of "contains".
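Roughly, the idea could look like the sketch below. This is only an illustration of the plan, not the committed change: the class and the `hasSgmlFlag` and `isMarker` helpers are hypothetical names, and the exact comparison merely stands in for the existing check, which is not shown in this thread.

```java
import java.util.List;

public class MarkerMatchSketch {
    static final String STARTOFTEXT = "<This-is-the-start-of-the-text />";
    static final String ENDOFTEXT = "<This-is-the-end-of-the-text />";

    /** True when the configured TreeTagger arguments include "-sgml". */
    static boolean hasSgmlFlag(List<String> arguments) {
        return arguments.contains("-sgml");
    }

    /**
     * With -sgml the record is compared exactly (stand-in for the current
     * behaviour). Without -sgml only the beginning of the record is
     * checked, so columns appended after the marker do not break the match.
     */
    static boolean isMarker(String outRecord, String marker, boolean sgml) {
        return sgml ? outRecord.equals(marker) : outRecord.startsWith(marker);
    }

    public static void main(String[] args) {
        boolean sgml = hasSgmlFlag(List.of("-token", "-lemma"));
        String record = ENDOFTEXT + "\tNN\t<unknown>"; // hypothetical output line
        System.out.println(isMarker(record, ENDOFTEXT, sgml));
    }
}
```

Using "startsWith" rather than "contains" keeps the check anchored to the beginning of the record, so a marker string appearing later in a line is not mistaken for the marker itself.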

reckart avatar Dec 17 '15 20:12 reckart

@Alpha34587 could you please check if the changes I made work for you as well?

reckart avatar Dec 17 '15 21:12 reckart

Yes, the change sounds good to me :) Thanks!

simonmeoni avatar Dec 18 '15 07:12 simonmeoni