nboost-index in docker container fails with `'ascii' codec can't decode byte 0xe2`
Hi, I tried to run both Elasticsearch and nboost as Docker containers as follows:
docker run -d -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" elasticsearch:7.4.2
docker run -d -p 8000:8000 koursaros/nboost:latest-pt --uhost host.docker.internal --uport 9200
However, when I try to index travel.csv from within the container, it fails with the following error:
docker exec -it <nboost-container-nameorid> nboost-index --host=host.docker.internal --file /opt/conda/lib/python3.6/site-packages/nboost/resources/travel.csv --index_name travel --delim ,
I:ESIndexer:[es.:ind: 29]:Setting up Elasticsearch index...
I:ESIndexer:[es.:ind: 32]:Creating index travel...
I:ESIndexer:[es.:ind: 37]:Indexing /opt/conda/lib/python3.6/site-packages/nboost/resources/travel.csv...
I:ESIndexer:[bas:csv: 59]:Estimating completion size...
Traceback (most recent call last):
File "/opt/conda/bin/nboost-index", line 8, in <module>
sys.exit(main())
File "/opt/conda/lib/python3.6/site-packages/nboost/indexers/cli.py", line 47, in main
indexer(**args).index()
File "/opt/conda/lib/python3.6/site-packages/nboost/indexers/es.py", line 39, in index
bulk(elastic, actions=act)
File "/opt/conda/lib/python3.6/site-packages/elasticsearch/helpers/actions.py", line 310, in bulk
for ok, item in streaming_bulk(client, actions, *args, **kwargs):
File "/opt/conda/lib/python3.6/site-packages/elasticsearch/helpers/actions.py", line 222, in streaming_bulk
actions, chunk_size, max_chunk_bytes, client.transport.serializer
File "/opt/conda/lib/python3.6/site-packages/elasticsearch/helpers/actions.py", line 73, in _chunk_actions
for action, data in actions:
File "/opt/conda/lib/python3.6/site-packages/nboost/indexers/es.py", line 38, in <genexpr>
act = (self.format(passage, cid=cid) for cid, passage in self.csv_generator())
File "/opt/conda/lib/python3.6/site-packages/nboost/indexers/base.py", line 60, in csv_generator
num_lines = count_lines(path)
File "/opt/conda/lib/python3.6/site-packages/nboost/helpers.py", line 117, in count_lines
count = sum(1 for _ in fileobj)
File "/opt/conda/lib/python3.6/site-packages/nboost/helpers.py", line 117, in <genexpr>
count = sum(1 for _ in fileobj)
File "/opt/conda/lib/python3.6/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 839: ordinal not in range(128)
But if I run nboost on my macOS machine, the indexer works just fine on the same file.
Any ideas what might have gone wrong?
Check what the default and preferred encodings are.
Start a Python REPL and run the following lines:
import sys
import locale
print(sys.getdefaultencoding())
print(locale.getpreferredencoding())
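For context, byte 0xe2 is the first byte of many UTF-8 multi-byte sequences, e.g. the curly apostrophe that often shows up in CSV text. A minimal sketch of what is going on (the example string is made up, not taken from travel.csv):

```python
# '’' (RIGHT SINGLE QUOTATION MARK) encodes to the three UTF-8 bytes
# 0xe2 0x80 0x99 — the same 0xe2 that appears in the traceback.
data = "it’s".encode("utf-8")

try:
    data.decode("ascii")
except UnicodeDecodeError as exc:
    print(exc)  # 'ascii' codec can't decode byte 0xe2 in position 2: ordinal not in range(128)

print(data.decode("utf-8"))  # decodes fine as UTF-8
```

So the file itself is fine; it is the container's default decoder that cannot handle it.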
Compare the output on macOS with the output inside the Docker container. I still haven't figured out why, but the error goes away when I set the encoding manually, like this:
with open(filename, 'r', encoding='encoding_name') as f:
You'll have to modify helpers.py and maybe one other file.
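As a sketch, the patched helper might look like the following. `count_lines` is the function named in the traceback; forcing `utf-8` is an assumption about how travel.csv is actually encoded, and the body is a guess reconstructed from the traceback, not the library's exact source:

```python
def count_lines(path):
    # Open with an explicit encoding instead of relying on
    # locale.getpreferredencoding(), which falls back to ASCII in
    # many Docker images where no UTF-8 locale is configured.
    with open(path, encoding="utf-8") as fileobj:
        return sum(1 for _ in fileobj)
```

Alternatively, if the image ships a `C.UTF-8` locale, setting it at container start (e.g. `docker run -e LANG=C.UTF-8 ...`) should change the preferred encoding to UTF-8 without touching the code.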
I'd try something else, though, because this repo looks abandoned.