uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering
In extract_pdf_nougat_qa.ipynb, ExtractClient(config) raises TypeError: 'NoneType' object is not callable
🐛 Describe the bug
In extract_pdf_nougat_qa.ipynb ( https://github.com/CambioML/uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering/blob/main/example/extract/extract_pdf_nougat_qa.ipynb ), running the following block raises TypeError: 'NoneType' object is not callable:
    data = [
        {"filename": input_file},
    ]
    config = ExtractPDFConfig(
        model_config=NougatModelConfig(
            model_name="facebook/nougat-small",
            batch_size=2,
        ),
        splitter=PARAGRAPH_SPLITTER,
    )
    nougat_client = ExtractClient(config)
Error:
    TypeError                                 Traceback (most recent call last)
    Cell In[13], line 12
          1 data = [
          2     {"filename": input_file},
          3 ]
          5 config = ExtractPDFConfig(
          6     model_config=NougatModelConfig(
          7         model_name = "facebook/nougat-small",
       (...)
         10     splitter=PARAGRAPH_SPLITTER,
         11 )
    ---> 12 nougat_client = ExtractClient(config)
         14 # output = nougat_client.run(data)
    File /opt/conda/envs/uniflow/lib/python3.10/site-packages/uniflow/flow/client.py:28, in ExtractClient.__init__(self, config)
         21 """Client constructor
         22
         23 Args:
         24     config (Config): Config for the flow
         25
         26 """
         27 self._config = config
    ---> 28 self._server = ExtractServer(asdict(self._config))

    File /opt/conda/envs/uniflow/lib/python3.10/site-packages/uniflow/flow/server.py:49, in ExtractServer.__init__(self, config)
         47 for i in range(self.num_thread):
         48     with OpScope(name="thread" + str(i)):
    ---> 49         self._flow_queue.put(self._flow_cls(**kwargs))

    File /opt/conda/envs/uniflow/lib/python3.10/site-packages/uniflow/flow/extract/extract_pdf_flow.py:33, in ExtractPDFFlow.__init__(self, model_config, splitter)
         24 """Extract PDF Flow Constructor.
         25
         26 Args:
         27     model_config (Dict[str, Any]): Model config.
         28     splitter (str): Splitter to use. Defaults to "".
         29 """
         30 super().__init__()
         31 self._extract_pdf_op = ExtractPDFOp(
         32     name="extract_pdf_op",
    ---> 33     model=CvModel(
         34         model_config=model_config,
         35     ),
         36 )
         37 self._process_pdf_op = ProcessPDFOp(name="process_pdf_op")
         38 self._split_op = SplitterOpsFactory.get(splitter)

    File /opt/conda/envs/uniflow/lib/python3.10/site-packages/uniflow/op/model/cv/model.py:28, in CvModel.__init__(self, model_config)
         19 def __init__(
         20     self,
         21     model_config: Dict[str, Any],
         22 ) -> None:
         23     """Initialize Preprocess Model class.
         24
         25     Args:
         26         model_config (Dict[str, Any]): Model config.
         27     """
    ---> 28     super().__init__(prompt_template=None, model_config=model_config)

    File /opt/conda/envs/uniflow/lib/python3.10/site-packages/uniflow/op/model/abs_model.py:29, in AbsModel.__init__(self, prompt_template, model_config)
         22 """Initialize Model class.
         23
         24 Args:
         25     prompt_template (PromptTemplate): Guided prompt template.
         26     model_config (Dict[str, Any]): Model config.
         27 """
         28 model_server_cls = ModelServerFactory.get(model_config["model_server"])
    ---> 29 self._model_server = model_server_cls(prompt_template, model_config)
         30 self._prompt_template = prompt_template
         31 self._num_samples = 1
    File /opt/conda/envs/uniflow/lib/python3.10/site-packages/uniflow/op/model/cv/model_server.py:36, in NougatModelServer.__init__(self, prompt_template, model_config)
         34 self._model_config = NougatModelConfig(**self._model_config)
         35 self.dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
    ---> 36 self.processor = NougatProcessor.from_pretrained(
         37     self._model_config.model_name, torch_dtype=self.dtype
         38 )
         39 self.model = VisionEncoderDecoderModel.from_pretrained(
         40     self._model_config.model_name, torch_dtype=self.dtype
         41 )
         42 self.device = "cuda" if torch.cuda.is_available() else "cpu"

    File /opt/conda/envs/uniflow/lib/python3.10/site-packages/transformers/processing_utils.py:465, in ProcessorMixin.from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, **kwargs)
        462 if token is not None:
        463     kwargs["token"] = token
    --> 465 args = cls._get_arguments_from_pretrained(pretrained_model_name_or_path, **kwargs)
        466 processor_dict, kwargs = cls.get_processor_dict(pretrained_model_name_or_path, **kwargs)
        468 return cls.from_args_and_dict(args, processor_dict, **kwargs)

    File /opt/conda/envs/uniflow/lib/python3.10/site-packages/transformers/processing_utils.py:511, in ProcessorMixin._get_arguments_from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
        508     else:
        509         attribute_class = getattr(transformers_module, class_name)
    --> 511     args.append(attribute_class.from_pretrained(pretrained_model_name_or_path, **kwargs))
        512 return args

    File /opt/conda/envs/uniflow/lib/python3.10/site-packages/transformers/models/auto/image_processing_auto.py:409, in AutoImageProcessor.from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
        407     return image_processor_class.from_dict(config_dict, **kwargs)
        408 elif image_processor_class is not None:
    --> 409     return image_processor_class.from_dict(config_dict, **kwargs)
        410 # Last try: we use the IMAGE_PROCESSOR_MAPPING.
        411 elif type(config) in IMAGE_PROCESSOR_MAPPING:

    TypeError: 'NoneType' object is not callable
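The innermost frames of the traceback are in transformers, reached through NougatProcessor.from_pretrained, so it may help to check whether the same call fails without uniflow in the picture. The following is a minimal sketch for that check, not a confirmed diagnosis. The helper names try_load and load_nougat_processor are hypothetical; the from_pretrained call and the dtype selection mirror what the model_server.py frame shows.

```python
from typing import Callable, Optional


def try_load(loader: Callable[[str], object], model_name: str) -> Optional[str]:
    """Call loader(model_name); return None on success, else 'ExcType: message'."""
    try:
        loader(model_name)
    except Exception as exc:  # broad on purpose: we only want to report the failure
        return f"{type(exc).__name__}: {exc}"
    return None


def load_nougat_processor(model_name: str) -> object:
    """Hypothetical helper mirroring the call in NougatModelServer.__init__.

    torch and transformers are imported lazily so that defining this
    function does not require either package to be installed.
    """
    import torch
    from transformers import NougatProcessor

    # Same dtype choice as in the traceback's model_server.py frame.
    dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
    return NougatProcessor.from_pretrained(model_name, torch_dtype=dtype)
```

Running `print(try_load(load_nougat_processor, "facebook/nougat-small") or "loaded OK")` in the same conda environment would show whether the error reproduces with transformers alone. If it does, the problem is likely in the installed transformers setup rather than in uniflow's flow code.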
Versions
    Collecting environment information...
    PyTorch version: 2.4.0.dev20240317+cu121
    Is debug build: False
    CUDA used to build PyTorch: 12.1
    ROCM used to build PyTorch: N/A

    OS: Ubuntu 20.04.6 LTS (x86_64)
    GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
    Clang version: Could not collect
    CMake version: version 3.16.3
    Libc version: glibc-2.31

    Python version: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0] (64-bit runtime)
    Python platform: Linux-5.15.0-1055-aws-x86_64-with-glibc2.31
    Is CUDA available: True
    CUDA runtime version: 12.1.105
    CUDA_MODULE_LOADING set to: LAZY
    GPU models and configuration: GPU 0: Tesla T4
    Nvidia driver version: 535.104.12
    cuDNN version: Could not collect
    HIP runtime version: N/A
    MIOpen runtime version: N/A
    Is XNNPACK available: True
    CPU:
    Architecture: x86_64
    CPU op-mode(s): 32-bit, 64-bit
    Byte Order: Little Endian
    Address sizes: 46 bits physical, 48 bits virtual
    CPU(s): 4
    On-line CPU(s) list: 0-3
    Thread(s) per core: 2
    Core(s) per socket: 2
    Socket(s): 1
    NUMA node(s): 1
    Vendor ID: GenuineIntel
    CPU family: 6
    Model: 85
    Model name: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
    Stepping: 7
    CPU MHz: 2499.998
    BogoMIPS: 4999.99
    Hypervisor vendor: KVM
    Virtualization type: full
    L1d cache: 64 KiB
    L1i cache: 64 KiB
    L2 cache: 2 MiB
    L3 cache: 35.8 MiB
    NUMA node0 CPU(s): 0-3
    Vulnerability Gather data sampling: Unknown: Dependent on hypervisor status
    Vulnerability Itlb multihit: KVM: Mitigation: VMX unsupported
    Vulnerability L1tf: Mitigation; PTE Inversion
    Vulnerability Mds: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
    Vulnerability Meltdown: Mitigation; PTI
    Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
    Vulnerability Retbleed: Vulnerable
    Vulnerability Spec rstack overflow: Not affected
    Vulnerability Spec store bypass: Vulnerable
    Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
    Vulnerability Spectre v2: Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
    Vulnerability Srbds: Not affected
    Vulnerability Tsx async abort: Not affected
    Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni
    Versions of relevant libraries:
    [pip3] numpy==1.26.4
    [pip3] pytorch-triton==3.0.0+989adb9a29
    [pip3] torch==2.4.0.dev20240317+cu121
    [conda] numpy 1.26.4 pypi_0 pypi
    [conda] pytorch-triton 3.0.0+989adb9a29 pypi_0 pypi
    [conda] torch 2.4.0.dev20240317+cu121 pypi_0 pypi
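The environment report lists only numpy, pytorch-triton, and torch, while the failing frames are in transformers and uniflow. A small sketch like the one below could collect the missing versions using only the standard library. The function name installed_versions is hypothetical, and the distribution names passed to it (in particular "uniflow") are assumptions about how the packages were installed.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed distribution names; adjust if the packages were installed under other names.
    for name, version in installed_versions(
        ["uniflow", "transformers", "torch", "numpy"]
    ).items():
        print(f"{name}=={version}" if version else f"{name}: not installed")
```

Pasting this output alongside the report above would make it easier to tell which transformers release the traceback came from.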