
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1980: character maps to <undefined>

Open Dobrokhotov1989 opened this issue 7 months ago • 1 comments

Hi there,

I've just installed Paper-QA2 and tried to test it with the CLI. Asking a simple question raises a UnicodeDecodeError. The same error occurs when I ask the same question from a Jupyter notebook.

I'm on Windows 11, Python 3.12.9.

> pqa ask 'What manufacturing challenges are unique to bispecific antibodies?'
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\darkc\Documents\R_projects\PapQA2_space\envs\PapQA2_py3_12\Scripts\pqa.exe\__main__.py", line 4, in <module>
  File "C:\Users\darkc\Documents\R_projects\PapQA2_space\envs\PapQA2_py3_12\Lib\site-packages\paperqa\__init__.py", line 3, in <module>
    from lmi import (
  File "C:\Users\darkc\Documents\R_projects\PapQA2_space\envs\PapQA2_py3_12\Lib\site-packages\lmi\__init__.py", line 1, in <module>
    from .constants import (
  File "C:\Users\darkc\Documents\R_projects\PapQA2_space\envs\PapQA2_py3_12\Lib\site-packages\lmi\constants.py", line 3, in <module>
    import litellm
  File "C:\Users\darkc\Documents\R_projects\PapQA2_space\envs\PapQA2_py3_12\Lib\site-packages\litellm\__init__.py", line 762, in <module>
    from .cost_calculator import completion_cost
  File "C:\Users\darkc\Documents\R_projects\PapQA2_space\envs\PapQA2_py3_12\Lib\site-packages\litellm\cost_calculator.py", line 19, in <module>
    from litellm.litellm_core_utils.llm_cost_calc.utils import (
  File "C:\Users\darkc\Documents\R_projects\PapQA2_space\envs\PapQA2_py3_12\Lib\site-packages\litellm\litellm_core_utils\llm_cost_calc\utils.py", line 9, in <module>
    from litellm.utils import get_model_info
  File "C:\Users\darkc\Documents\R_projects\PapQA2_space\envs\PapQA2_py3_12\Lib\site-packages\litellm\utils.py", line 188, in <module>
    json_data = json.load(f)
                ^^^^^^^^^^^^
  File "C:\Users\darkc\Documents\R_projects\PapQA2_space\envs\PapQA2_py3_12\Lib\json\__init__.py", line 293, in load
    return loads(fp.read(),
                 ^^^^^^^^^
  File "C:\Users\darkc\Documents\R_projects\PapQA2_space\envs\PapQA2_py3_12\Lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1980: character maps to <undefined>
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[1], line 1
----> 1 from paperqa import Settings, ask
      3 answer_response = ask(
      4     "What manufacturing challenges are unique to bispecific antibodies?",
      5     settings=Settings(temperature=0.5, paper_directory="Test"),
      6 )

File c:\Users\darkc\Documents\R_projects\PapQA2_space\envs\PapQA2_py3_12\Lib\site-packages\paperqa\__init__.py:3
      1 import warnings
----> 3 from lmi import (
      4     EmbeddingModel,
      5     HybridEmbeddingModel,
      6     LiteLLMEmbeddingModel,
      7     LiteLLMModel,
      8     LLMModel,
      9     LLMResult,
     10     SentenceTransformerEmbeddingModel,
     11     SparseEmbeddingModel,
     12     embedding_model_factory,
     13 )
     15 from paperqa.agents import ask
     16 from paperqa.agents.main import agent_query

File c:\Users\darkc\Documents\R_projects\PapQA2_space\envs\PapQA2_py3_12\Lib\site-packages\lmi\__init__.py:1
----> 1 from .constants import (
      2     CHARACTERS_PER_TOKEN_ASSUMPTION,
      3     EXTRA_TOKENS_FROM_USER_ROLE,
      4     MODEL_COST_MAP,
      5 )
      6 from .cost_tracker import GLOBAL_COST_TRACKER, cost_tracking_ctx, enable_cost_tracking
      7 from .embeddings import (
      8     EmbeddingModel,
      9     EmbeddingModes,
   (...)     14     embedding_model_factory,
     15 )

File c:\Users\darkc\Documents\R_projects\PapQA2_space\envs\PapQA2_py3_12\Lib\site-packages\lmi\constants.py:3
      1 from sys import version_info
----> 3 import litellm
      5 # Estimate from OpenAI's FAQ
      6 # https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
      7 CHARACTERS_PER_TOKEN_ASSUMPTION: float = 4.0

File c:\Users\darkc\Documents\R_projects\PapQA2_space\envs\PapQA2_py3_12\Lib\site-packages\litellm\__init__.py:762
    759 openai_image_generation_models = ["dall-e-2", "dall-e-3"]
    761 from .timeout import timeout
--> 762 from .cost_calculator import completion_cost
    763 from litellm.litellm_core_utils.litellm_logging import Logging, modify_integration
    764 from litellm.litellm_core_utils.get_llm_provider_logic import get_llm_provider

File c:\Users\darkc\Documents\R_projects\PapQA2_space\envs\PapQA2_py3_12\Lib\site-packages\litellm\cost_calculator.py:19
     12 from litellm.constants import (
     13     DEFAULT_MAX_LRU_CACHE_SIZE,
     14     DEFAULT_REPLICATE_GPU_PRICE_PER_SECOND,
     15 )
     16 from litellm.litellm_core_utils.llm_cost_calc.tool_call_cost_tracking import (
     17     StandardBuiltInToolCostTracking,
     18 )
---> 19 from litellm.litellm_core_utils.llm_cost_calc.utils import (
     20     _generic_cost_per_character,
     21     generic_cost_per_token,
     22 )
     23 from litellm.llms.anthropic.cost_calculation import (
     24     cost_per_token as anthropic_cost_per_token,
     25 )
     26 from litellm.llms.azure.cost_calculation import (
     27     cost_per_token as azure_openai_cost_per_token,
     28 )

File c:\Users\darkc\Documents\R_projects\PapQA2_space\envs\PapQA2_py3_12\Lib\site-packages\litellm\litellm_core_utils\llm_cost_calc\utils.py:9
      7 from litellm import verbose_logger
      8 from litellm.types.utils import ModelInfo, Usage
----> 9 from litellm.utils import get_model_info
     12 def _is_above_128k(tokens: float) -> bool:
     13     if tokens > 128000:

File c:\Users\darkc\Documents\R_projects\PapQA2_space\envs\PapQA2_py3_12\Lib\site-packages\litellm\utils.py:188
    183 try:
    184     # Python 3.9+
    185     with resources.files("litellm.litellm_core_utils.tokenizers").joinpath(
    186         "anthropic_tokenizer.json"
    187     ).open("r") as f:
--> 188         json_data = json.load(f)
    189 except (ImportError, AttributeError, TypeError):
    190     with resources.open_text(
    191         "litellm.litellm_core_utils.tokenizers", "anthropic_tokenizer.json"
    192     ) as f:

File c:\Users\darkc\Documents\R_projects\PapQA2_space\envs\PapQA2_py3_12\Lib\json\__init__.py:293, in load(fp, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    274 def load(fp, *, cls=None, object_hook=None, parse_float=None,
    275         parse_int=None, parse_constant=None, object_pairs_hook=None, **kw):
    276     """Deserialize ``fp`` (a ``.read()``-supporting file-like object containing
    277     a JSON document) to a Python object.
    278 
   (...)    291     kwarg; otherwise ``JSONDecoder`` is used.
    292     """
--> 293     return loads(fp.read(),
    294         cls=cls, object_hook=object_hook,
    295         parse_float=parse_float, parse_int=parse_int,
    296         parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)

File c:\Users\darkc\Documents\R_projects\PapQA2_space\envs\PapQA2_py3_12\Lib\encodings\cp1252.py:23, in IncrementalDecoder.decode(self, input, final)
     22 def decode(self, input, final=False):
---> 23     return codecs.charmap_decode(input,self.errors,decoding_table)[0]

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1980: character maps to <undefined>

Thanks in advance for your help

Dobrokhotov1989 avatar Apr 26 '25 18:04 Dobrokhotov1989

Hello @Dobrokhotov1989! I'm here to help you with bugs, answer your questions, and assist you in becoming a contributor. While you wait for a human maintainer, feel free to share more about your issue! 🛠️

To resolve the UnicodeDecodeError related to the 'charmap' codec when using Paper-QA2 on Windows 11 with Python 3.12.9, you can modify the file reading process to specify the encoding explicitly. The parse_text function in paperqa/readers.py already includes a mechanism to handle this error by using utf-8 encoding and ignoring errors. Here is the relevant code snippet:

try:
    with path.open() as f:
        text = list(f) if split_lines else f.read()
except UnicodeDecodeError:
    with path.open(encoding="utf-8", errors="ignore") as f:
        text = f.read()

This code attempts to read the file normally, and if a UnicodeDecodeError occurs, it retries with utf-8 encoding and ignores any errors. This should resolve the issue you are encountering [1].


dosubot[bot] avatar Apr 26 '25 18:04 dosubot[bot]

This isn't really a PaperQA issue, it's a litellm issue. What is your litellm version?

It seems you are hitting https://github.com/BerriAI/litellm/issues/10272
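
For readers who want to see the failure mode in isolation, here is a minimal sketch. It assumes the cause is what the traceback suggests: a UTF-8 JSON file opened in text mode without an explicit encoding, so Windows falls back to cp1252, which has no mapping for byte 0x81. The file name and the `load_json_utf8` helper are hypothetical, and this is not the actual litellm patch.

```python
import json
import tempfile
from pathlib import Path


def load_json_utf8(path: Path) -> object:
    """Hypothetical helper: read JSON with an explicit encoding
    instead of relying on the platform default."""
    with path.open("r", encoding="utf-8") as f:
        return json.load(f)


# "Á" (U+00C1) is stored in UTF-8 as the two bytes 0xC3 0x81.
# Byte 0x81 is undefined in Python's cp1252 codec.
sample_path = Path(tempfile.mkdtemp()) / "sample_tokenizer.json"
sample_path.write_text(
    json.dumps({"token": "Á"}, ensure_ascii=False), encoding="utf-8"
)

# Simulate the Windows default by asking for cp1252 explicitly.
try:
    with sample_path.open("r", encoding="cp1252") as f:
        json.load(f)
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True

# Reading the same file with an explicit UTF-8 encoding works.
loaded = load_json_utf8(sample_path)
```

Passing `encoding="utf-8"` at the call site avoids depending on the locale of the machine that runs the import.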

jamesbraza avatar Apr 27 '25 07:04 jamesbraza

I had litellm 1.67.2. It was installed along with other dependencies in a fresh environment. Following suggestions in the litellm issue thread, I downgraded to 1.67.1, and it solved the problem.
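
A small sketch for checking whether an environment has the version that failed here. It relies only on what this thread reports (1.67.2 raised the error, 1.67.1 did not); the helper names are hypothetical, and other versions may or may not be affected.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

# The only version this thread reports as failing.
REPORTED_BAD_VERSION = "1.67.2"


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def hits_reported_error(litellm_version: Optional[str]) -> bool:
    """True only for the version reported as failing in this thread."""
    return litellm_version == REPORTED_BAD_VERSION


if hits_reported_error(installed_version("litellm")):
    # The downgrade described above can be done with:
    #   pip install "litellm==1.67.1"
    print("litellm 1.67.2 detected; consider pinning litellm==1.67.1")
```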

Dobrokhotov1989 avatar Apr 27 '25 17:04 Dobrokhotov1989

Thanks for the bug report! Hopefully litellm fixes it soon.

jamesbraza avatar Apr 27 '25 21:04 jamesbraza