paper-qa
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1980: character maps to <undefined>
Hi there,
I've just installed Paper-QA2 and tried to test it from the CLI. Asking a simple question raises a UnicodeDecodeError. The same error occurs when I ask the same question from a Jupyter notebook.
I'm on Windows 11 with Python 3.12.9.
> pqa ask 'What manufacturing challenges are unique to bispecific antibodies?'
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "C:\Users\darkc\Documents\R_projects\PapQA2_space\envs\PapQA2_py3_12\Scripts\pqa.exe\__main__.py", line 4, in <module>
File "C:\Users\darkc\Documents\R_projects\PapQA2_space\envs\PapQA2_py3_12\Lib\site-packages\paperqa\__init__.py", line 3, in <module>
from lmi import (
File "C:\Users\darkc\Documents\R_projects\PapQA2_space\envs\PapQA2_py3_12\Lib\site-packages\lmi\__init__.py", line 1, in <module>
from .constants import (
File "C:\Users\darkc\Documents\R_projects\PapQA2_space\envs\PapQA2_py3_12\Lib\site-packages\lmi\constants.py", line 3, in <module>
import litellm
File "C:\Users\darkc\Documents\R_projects\PapQA2_space\envs\PapQA2_py3_12\Lib\site-packages\litellm\__init__.py", line 762, in <module>
from .cost_calculator import completion_cost
File "C:\Users\darkc\Documents\R_projects\PapQA2_space\envs\PapQA2_py3_12\Lib\site-packages\litellm\cost_calculator.py", line 19, in <module>
from litellm.litellm_core_utils.llm_cost_calc.utils import (
File "C:\Users\darkc\Documents\R_projects\PapQA2_space\envs\PapQA2_py3_12\Lib\site-packages\litellm\litellm_core_utils\llm_cost_calc\utils.py", line 9, in <module>
from litellm.utils import get_model_info
File "C:\Users\darkc\Documents\R_projects\PapQA2_space\envs\PapQA2_py3_12\Lib\site-packages\litellm\utils.py", line 188, in <module>
json_data = json.load(f)
^^^^^^^^^^^^
File "C:\Users\darkc\Documents\R_projects\PapQA2_space\envs\PapQA2_py3_12\Lib\json\__init__.py", line 293, in load
return loads(fp.read(),
^^^^^^^^^
File "C:\Users\darkc\Documents\R_projects\PapQA2_space\envs\PapQA2_py3_12\Lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1980: character maps to <undefined>
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
Cell In[1], line 1
----> 1 from paperqa import Settings, ask
3 answer_response = ask(
4 "What manufacturing challenges are unique to bispecific antibodies?",
5 settings=Settings(temperature=0.5, paper_directory="Test"),
6 )
File c:\Users\darkc\Documents\R_projects\PapQA2_space\envs\PapQA2_py3_12\Lib\site-packages\paperqa\__init__.py:3
1 import warnings
----> 3 from lmi import (
4 EmbeddingModel,
5 HybridEmbeddingModel,
6 LiteLLMEmbeddingModel,
7 LiteLLMModel,
8 LLMModel,
9 LLMResult,
10 SentenceTransformerEmbeddingModel,
11 SparseEmbeddingModel,
12 embedding_model_factory,
13 )
15 from paperqa.agents import ask
16 from paperqa.agents.main import agent_query
File c:\Users\darkc\Documents\R_projects\PapQA2_space\envs\PapQA2_py3_12\Lib\site-packages\lmi\__init__.py:1
----> 1 from .constants import (
2 CHARACTERS_PER_TOKEN_ASSUMPTION,
3 EXTRA_TOKENS_FROM_USER_ROLE,
4 MODEL_COST_MAP,
5 )
6 from .cost_tracker import GLOBAL_COST_TRACKER, cost_tracking_ctx, enable_cost_tracking
7 from .embeddings import (
8 EmbeddingModel,
9 EmbeddingModes,
(...) 14 embedding_model_factory,
15 )
File c:\Users\darkc\Documents\R_projects\PapQA2_space\envs\PapQA2_py3_12\Lib\site-packages\lmi\constants.py:3
1 from sys import version_info
----> 3 import litellm
5 # Estimate from OpenAI's FAQ
6 # https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
7 CHARACTERS_PER_TOKEN_ASSUMPTION: float = 4.0
File c:\Users\darkc\Documents\R_projects\PapQA2_space\envs\PapQA2_py3_12\Lib\site-packages\litellm\__init__.py:762
759 openai_image_generation_models = ["dall-e-2", "dall-e-3"]
761 from .timeout import timeout
--> 762 from .cost_calculator import completion_cost
763 from litellm.litellm_core_utils.litellm_logging import Logging, modify_integration
764 from litellm.litellm_core_utils.get_llm_provider_logic import get_llm_provider
File c:\Users\darkc\Documents\R_projects\PapQA2_space\envs\PapQA2_py3_12\Lib\site-packages\litellm\cost_calculator.py:19
12 from litellm.constants import (
13 DEFAULT_MAX_LRU_CACHE_SIZE,
14 DEFAULT_REPLICATE_GPU_PRICE_PER_SECOND,
15 )
16 from litellm.litellm_core_utils.llm_cost_calc.tool_call_cost_tracking import (
17 StandardBuiltInToolCostTracking,
18 )
---> 19 from litellm.litellm_core_utils.llm_cost_calc.utils import (
20 _generic_cost_per_character,
21 generic_cost_per_token,
22 )
23 from litellm.llms.anthropic.cost_calculation import (
24 cost_per_token as anthropic_cost_per_token,
25 )
26 from litellm.llms.azure.cost_calculation import (
27 cost_per_token as azure_openai_cost_per_token,
28 )
File c:\Users\darkc\Documents\R_projects\PapQA2_space\envs\PapQA2_py3_12\Lib\site-packages\litellm\litellm_core_utils\llm_cost_calc\utils.py:9
7 from litellm import verbose_logger
8 from litellm.types.utils import ModelInfo, Usage
----> 9 from litellm.utils import get_model_info
12 def _is_above_128k(tokens: float) -> bool:
13 if tokens > 128000:
File c:\Users\darkc\Documents\R_projects\PapQA2_space\envs\PapQA2_py3_12\Lib\site-packages\litellm\utils.py:188
183 try:
184 # Python 3.9+
185 with resources.files("litellm.litellm_core_utils.tokenizers").joinpath(
186 "anthropic_tokenizer.json"
187 ).open("r") as f:
--> 188 json_data = json.load(f)
189 except (ImportError, AttributeError, TypeError):
190 with resources.open_text(
191 "litellm.litellm_core_utils.tokenizers", "anthropic_tokenizer.json"
192 ) as f:
File c:\Users\darkc\Documents\R_projects\PapQA2_space\envs\PapQA2_py3_12\Lib\json\__init__.py:293, in load(fp, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
274 def load(fp, *, cls=None, object_hook=None, parse_float=None,
275 parse_int=None, parse_constant=None, object_pairs_hook=None, **kw):
276 """Deserialize ``fp`` (a ``.read()``-supporting file-like object containing
277 a JSON document) to a Python object.
278
(...) 291 kwarg; otherwise ``JSONDecoder`` is used.
292 """
--> 293 return loads(fp.read(),
294 cls=cls, object_hook=object_hook,
295 parse_float=parse_float, parse_int=parse_int,
296 parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
File c:\Users\darkc\Documents\R_projects\PapQA2_space\envs\PapQA2_py3_12\Lib\encodings\cp1252.py:23, in IncrementalDecoder.decode(self, input, final)
22 def decode(self, input, final=False):
---> 23 return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1980: character maps to <undefined>
Thanks in advance for your help
Hello @Dobrokhotov1989! I'm here to help you with bugs, answer your questions, and assist you in becoming a contributor. While you wait for a human maintainer, feel free to share more about your issue! 🛠️
To resolve the UnicodeDecodeError related to the 'charmap' codec when using Paper-QA2 on Windows 11 with Python 3.12.9, you can modify the file reading process to specify the encoding explicitly. The parse_text function in paperqa/readers.py already includes a mechanism to handle this error by using utf-8 encoding and ignoring errors. Here is the relevant code snippet:
try:
    with path.open() as f:
        text = list(f) if split_lines else f.read()
except UnicodeDecodeError:
    with path.open(encoding="utf-8", errors="ignore") as f:
        text = f.read()
This code attempts to read the file normally, and if a UnicodeDecodeError occurs, it retries with utf-8 encoding and ignores any errors. This should resolve the issue you are encountering [1].
This isn't really a PaperQA issue, it's a litellm issue. What is your litellm version?
It seems you are hitting https://github.com/BerriAI/litellm/issues/10272
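For context, the traceback above shows the tokenizer JSON being opened in text mode without an explicit encoding (`.open("r")`) and then decoded through `cp1252.py`, the Windows locale default. Byte 0x81 has no mapping in that code page, hence the error. Below is a minimal sketch of the same failure mode. It is not litellm's actual code: the file name `sample.json` and its contents are made up for illustration, and cp1252 is forced explicitly so the failure reproduces on any OS.

```python
import json
import os
import tempfile

# "Á" (U+00C1) is encoded in UTF-8 as the bytes C3 81.
# Python's cp1252 codec has no mapping for byte 0x81.
payload = '{"token": "\u00c1"}'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json")  # hypothetical file, for illustration only
    with open(path, "w", encoding="utf-8") as f:
        f.write(payload)

    # Simulate the Windows locale default by forcing cp1252 on read.
    failed = False
    error_text = ""
    try:
        with open(path, "r", encoding="cp1252") as f:
            json.load(f)
    except UnicodeDecodeError as exc:
        failed = True
        error_text = str(exc)

    # Reading with an explicit UTF-8 encoding decodes the same bytes correctly.
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)

print("cp1252 read failed:", failed)
print("utf-8 read result:", data)
```

Passing `encoding="utf-8"` at the open call removes the dependence on the locale default. Python's UTF-8 mode (`PYTHONUTF8=1`) also changes that default, though whether it is enough for this particular import has not been verified in this thread.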
I had litellm 1.67.2. It was installed along with other dependencies in a fresh environment.
Following suggestions in the litellm issue thread, I downgraded to 1.67.1, and it solved the problem.
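To check whether an environment has the affected release, a small sketch along these lines may help. It assumes that only 1.67.2 is affected, since that is the only version reported as failing in this thread. `KNOWN_BAD`, `installed_litellm_version`, and `litellm_status` are hypothetical names, not part of litellm or PaperQA.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

# Assumption: only the version reported as failing in this thread is listed.
KNOWN_BAD = {"1.67.2"}


def installed_litellm_version() -> Optional[str]:
    """Return the installed litellm version, or None if it is not installed."""
    try:
        return version("litellm")
    except PackageNotFoundError:
        return None


def litellm_status(installed: Optional[str]) -> str:
    """Classify a version string against the known-bad set."""
    if installed is None:
        return "not installed"
    return "affected" if installed in KNOWN_BAD else "ok"


if __name__ == "__main__":
    found = installed_litellm_version()
    print(f"litellm {found}: {litellm_status(found)}")
```

If the status is "affected", pinning litellm to 1.67.1 is the workaround that resolved the problem here.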
Thanks for the bug report! Hopefully litellm fixes it soon.