indextank-service
Indexing a docid with non-ASCII characters causes a 503 error
Trying to index a doc whose docid contains a "high ASCII" or Unicode character above 127 causes the following exception in restapi:
17669 05/02-00.50.12 RPC:ERRO Unexpected failure to run send_batch, reconnecting once @rpc.py:87
Traceback (most recent call last):
File "../api/rpc.py", line 77, in wrap
return att(*args, **kwargs)
File "../gen-py/flaptor/indextank/rpc/LogWriter.py", line 39, in send_batch
self.send_send_batch(batch)
File "../gen-py/flaptor/indextank/rpc/LogWriter.py", line 46, in send_send_batch
args.write(self._oprot)
File "../gen-py/flaptor/indextank/rpc/LogWriter.py", line 139, in write
self.batch.write(oprot)
File "../gen-py/flaptor/indextank/rpc/ttypes.py", line 1679, in write
iter138.write(oprot)
File "../gen-py/flaptor/indextank/rpc/ttypes.py", line 1441, in write
oprot.writeString(self.docid)
File "../api/thrift/protocol/TBinaryProtocol.py", line 123, in writeString
self.trans.write(str)
File "../api/thrift/transport/TTransport.py", line 164, in write
self.__wbuf.write(buf)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe2' in position 0: ordinal not in range(128)
To reproduce, run this in Python:
from indextank.client import ApiClient
c = ApiClient('<YOUR_API_URL>')
idx = c.create_index('testascii')
idx.add_document("â", {"text": "a"})
I think it's OK to reject docids with non-Latin-1 or non-ASCII characters, but it should return an HTTP 400 "bad request" instead of a 503 "service unavailable". (Or maybe docids are supposed to accept non-ASCII characters?)
Also, this seems to be related, but I'm not sure yet: when this happened during batch indexing, it seemed to cause a problem in the LogWriter, with the following stack trace:
ERROR [pool-1-thread-32] org.apache.thrift.server.TThreadPoolServer - [Error occurred during processing of message.] 2012-02-04 10:27:15,724
java.lang.IllegalStateException: Can't insert records to the live log without defining the index code
at com.google.common.base.Preconditions.checkState(Preconditions.java:145)
at com.flaptor.indextank.storage.RawLog.write(RawLog.java:61)
at com.flaptor.indextank.storage.LogWriterServer.send_batch(LogWriterServer.java:87)
at com.flaptor.indextank.rpc.LogWriter$Processor$send_batch.process(LogWriter.java:214)
at com.flaptor.indextank.rpc.LogWriter$Processor.process(LogWriter.java:193)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:253)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Finally, after all this happened, the LogWriter (slave) was using all the CPU even though no docs were being written, as if it were in a spin loop. I sent it a kill -3 to get a thread stack dump, and one or two threads were RUNNABLE at this line:
at org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:129)
at org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:60)
at com.flaptor.indextank.rpc.LogRecord.read(LogRecord.java:900)
...
I can create a separate issue for the LogWriter stuff if you want. But I'm not sure exactly what reproduces it yet.
Let me know if I can provide any more details.
We should support Unicode docids. Actually, docids are validated by __validate_docid in api/restapi.py.
Check the code at https://github.com/linkedin/indextank-service/blob/master/api/restapi.py#L45
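A rough sketch of what that could look like (Python 2; __validate_docid is the real function in api/restapi.py, but the helper names, messages and behavior below are assumptions for illustration, not the current code). It covers both options raised above: reject non-ASCII docids early with a 400, or accept them by normalizing to UTF-8 bytes before they ever reach the Thrift writer:

def _docid_error(docid):
    """Option 1: keep docids ASCII-only, but reject bad ones here with an
    HTTP 400 instead of letting the Thrift call fail later with a 503."""
    if not docid:
        return 'Document id cannot be empty'
    try:
        docid.encode('ascii') if isinstance(docid, unicode) else docid.decode('ascii')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return 'Document id must contain only ASCII characters'
    return None

def _normalize_docid(docid):
    """Option 2: accept Unicode docids by encoding them to UTF-8 bytes, so the
    Thrift layer only ever sees byte strings and never ASCII-encodes implicitly."""
    if isinstance(docid, unicode):
        return docid.encode('utf-8')
    return docid

Either way the 503 goes away; option 1 keeps the current ASCII-only behavior but fails cleanly, option 2 is what supporting Unicode docids would mean in practice.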
So it seems the code that sends the update to the LogStorage does not support non-ASCII docids.
I'm not a Python expert, but I dug around and noticed that Thrift uses StringIO, and I found this in the Python docs:
The StringIO object can accept either Unicode or 8-bit strings, but mixing the two may take some care. If both are used, 8-bit strings that cannot be interpreted as 7-bit ASCII (that use the 8th bit) will cause a UnicodeError to be raised when getvalue() is called.
from http://docs.python.org/library/stringio.html
Maybe this is what's happening (and why it's failing in the middle of a Thrift call)?
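To check that theory in isolation, here's a small Python 2 sketch. It assumes nothing beyond what the traceback shows: the transport's write buffer appears to be a cStringIO object, which rejects non-ASCII unicode at write() time, while the pure-Python StringIO quoted above only fails later, at getvalue(), when unicode and non-ASCII byte strings are mixed.

from cStringIO import StringIO

buf = StringIO()
try:
    buf.write(u'\xe2')                    # u'â', the same character as in the traceback
except UnicodeEncodeError as e:
    print 'write failed:', e              # 'ascii' codec can't encode character u'\xe2' ...

buf.write(u'\xe2'.encode('utf-8'))        # encoding to UTF-8 first writes cleanly
print repr(buf.getvalue())                # '\xc3\xa2'

Either way, the fix is the same: make sure only byte strings (e.g. UTF-8 encoded) reach the transport buffer.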
It also seems that, when batch indexing, a single document in the batch with this problem can cause the whole batch to fail and return a 503.
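Until that's handled server-side, a possible client-side workaround is to screen docids before building a batch (a sketch only; the data and the batching step are hypothetical):

# Split a batch by docid so one bad id can't turn the whole request into a 503.
docs = [(u'\xe2', {'text': 'a'}), ('ok-1', {'text': 'b'})]          # (docid, fields)
good = [(d, f) for (d, f) in docs if all(ord(c) < 128 for c in d)]
bad  = [(d, f) for (d, f) in docs if not all(ord(c) < 128 for c in d)]
# index `good` in one request; log or retry the `bad` ones individually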