
colab notebook template?

runfish5 opened this issue 2 years ago · 16 comments

I followed the great README.md precisely and share my steps in a Colab notebook here: github gist

Everything went fine until I actually ran the inference:

path1 = '/content/BioGPT/'

from src.transformer_lm_prompt import TransformerLanguageModelPrompt

m = TransformerLanguageModelPrompt.from_pretrained(
        path1 + "checkpoints/RE-DTI-BioGPT", 
        "checkpoint_avg.pt", 
        path1 + "data/KD-DTI/relis-bin",
        tokenizer='moses', 
        bpe='fastbpe', 
        bpe_codes="data/bpecodes",
        max_len_b=1024,
        beam=1)

Upon running this I get the following error:

/usr/lib/python3.8/posixpath.py in join(a, *p)
     74     will be discarded.  An empty last part will result in a path that
     75     ends with a separator."""
---> 76     a = os.fspath(a)
     77     sep = _get_sep(a)
     78     path = a

TypeError: expected str, bytes or os.PathLike object, not NoneType
Full traceback:

TypeError                                 Traceback (most recent call last)
<ipython-input-21-41726c6a2fdf> in <module>
      3 from src.transformer_lm_prompt import TransformerLanguageModelPrompt
      4 
----> 5 m = TransformerLanguageModelPrompt.from_pretrained(
      6         path1 + "checkpoints/RE-DTI-BioGPT",
      7         "checkpoint_avg.pt",

2 frames
/usr/local/lib/python3.8/dist-packages/fairseq/models/fairseq_model.py in from_pretrained(cls, model_name_or_path, checkpoint_file, data_name_or_path, **kwargs)
    265         from fairseq import hub_utils
    266 
--> 267         x = hub_utils.from_pretrained(
    268             model_name_or_path,
    269             checkpoint_file,

/usr/local/lib/python3.8/dist-packages/fairseq/hub_utils.py in from_pretrained(model_name_or_path, checkpoint_file, data_name_or_path, archive_map, **kwargs)
     64         "vocab.json": "bpe_vocab",
     65     }.items():
---> 66         path = os.path.join(model_path, file)
     67         if os.path.exists(path):
     68             kwargs[arg] = path

/usr/lib/python3.8/posixpath.py in join(a, *p)
     74     will be discarded.  An empty last part will result in a path that
     75     ends with a separator."""
---> 76     a = os.fspath(a)
     77     sep = _get_sep(a)
     78     path = a

TypeError: expected str, bytes or os.PathLike object, not NoneType

I'm happy I made it this far, but I can't really solve this myself. Could you tell me what I did wrong?

runfish5 avatar Feb 09 '23 08:02 runfish5

It looks like a path issue (with the path1 that you added). Try running it from within the BioGPT directory.
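
A quick way to check that, as a minimal diagnostic sketch (the absolute paths are assumptions based on the notebook above):

import os

# The NoneType in the traceback suggests one of the expected files or
# directories did not resolve; verify each path before calling from_pretrained.
for p in ["/content/BioGPT/checkpoints/RE-DTI-BioGPT/checkpoint_avg.pt",
          "/content/BioGPT/data/KD-DTI/relis-bin",
          "/content/BioGPT/data/bpecodes"]:
    print(p, "->", "OK" if os.path.exists(p) else "MISSING")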

pkhoueiry avatar Feb 09 '23 13:02 pkhoueiry

Try replacing path1 + "checkpoints/RE-DTI-BioGPT" with os.path.join(path1, "checkpoints/RE-DTI-BioGPT"), and similarly for path1 + "data/KD-DTI/relis-bin".
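
For reference, a minimal sketch of that change, using the same arguments as the snippet at the top of the thread:

import os
from src.transformer_lm_prompt import TransformerLanguageModelPrompt

path1 = '/content/BioGPT/'

# os.path.join normalizes path separators instead of relying on string concatenation
m = TransformerLanguageModelPrompt.from_pretrained(
        os.path.join(path1, "checkpoints/RE-DTI-BioGPT"),
        "checkpoint_avg.pt",
        os.path.join(path1, "data/KD-DTI/relis-bin"),
        tokenizer='moses',
        bpe='fastbpe',
        bpe_codes="data/bpecodes",
        max_len_b=1024,
        beam=1)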

ShilpaSangappa avatar Feb 09 '23 15:02 ShilpaSangappa

Hi, thanks for your fast reply.

@pkhoueiry :

It looks like a path issue (with the path1 that you added). Try running it from within the BioGPT directory.

When I run the original code snippet with !pwd added in front, I get:

/content/BioGPT
---------------------------------------------------------------------------
/usr/local/lib/python3.8/dist-packages/fairseq/utils.py in split_paths(paths, separator)
     61 def split_paths(paths: str, separator=os.pathsep) -> List[str]:
     62     return (
---> 63         paths.split(separator) if "://" not in paths else paths.split(MANIFOLD_PATH_SEP)
     64     )
     65 

TypeError: argument of type 'NoneType' is not iterable
Full traceback (5 frames):

/usr/local/lib/python3.8/dist-packages/fairseq/models/fairseq_model.py in from_pretrained(cls, model_name_or_path, checkpoint_file, data_name_or_path, **kwargs)
    265         from fairseq import hub_utils
    266
--> 267         x = hub_utils.from_pretrained(
    268             model_name_or_path,
    269             checkpoint_file,

/usr/local/lib/python3.8/dist-packages/fairseq/hub_utils.py in from_pretrained(model_name_or_path, checkpoint_file, data_name_or_path, archive_map, **kwargs)
     71         utils.import_user_module(argparse.Namespace(user_dir=kwargs["user_dir"]))
     72
---> 73     models, args, task = checkpoint_utils.load_model_ensemble_and_task(
     74         [os.path.join(model_path, cpt) for cpt in checkpoint_file.split(os.pathsep)],
     75         arg_overrides=kwargs,

/usr/local/lib/python3.8/dist-packages/fairseq/checkpoint_utils.py in load_model_ensemble_and_task(filenames, arg_overrides, task, strict, suffix, num_shards, state)
    430
    431         if task is None:
--> 432             task = tasks.setup_task(cfg.task)
    433
    434         if "task_state" in state:

/usr/local/lib/python3.8/dist-packages/fairseq/tasks/__init__.py in setup_task(cfg, **kwargs)
     44     ), f"Could not infer task type from {cfg}. Available argparse tasks: {TASK_REGISTRY.keys()}. Available hydra tasks: {TASK_DATACLASS_REGISTRY.keys()}"
     45
---> 46     return task.setup_task(cfg, **kwargs)
     47
     48

/content/BioGPT/src/language_modeling_prompt.py in setup_task(cls, args, **kwargs)
    125             args (argparse.Namespace): parsed command-line arguments
    126         """
--> 127         paths = utils.split_paths(args.data)
    128         assert len(paths) > 0
    129         # find language pair automatically

/usr/local/lib/python3.8/dist-packages/fairseq/utils.py in split_paths(paths, separator)
     61 def split_paths(paths: str, separator=os.pathsep) -> List[str]:
     62     return (
---> 63         paths.split(separator) if "://" not in paths else paths.split(MANIFOLD_PATH_SEP)
     64     )
     65

TypeError: argument of type 'NoneType' is not iterable

runfish5 avatar Feb 10 '23 02:02 runfish5

Try replacing path1 + "checkpoints/RE-DTI-BioGPT" with os.path.join(path1, "checkpoints/RE-DTI-BioGPT"), and similarly for path1 + "data/KD-DTI/relis-bin".

@ShilpaSangappa Thank you for your fast reply; unfortunately I get the same error as in my comment just above.

runfish5 avatar Feb 10 '23 02:02 runfish5

Hi

I got an error when running:

import torch
from fairseq.models.transformer_lm import TransformerLanguageModel
m = TransformerLanguageModel.from_pretrained(
        "/content/fairseq/checkpoints/Pre-trained-BioGPT", 
        "checkpoint.pt", 
        "data",
        tokenizer='moses', 
        bpe='fastbpe', 
        bpe_codes="data/bpecodes",
        min_len=100,
        max_len_b=1024)
m.cuda()
src_tokens = m.encode("COVID-19 is")
generate = m.generate([src_tokens], beam=5)[0]
output = m.decode(generate[0]["tokens"])
print(output)

TypeError                                 Traceback (most recent call last)
in <module>
      1 import torch
      2 from fairseq.models.transformer_lm import TransformerLanguageModel
----> 3 m = TransformerLanguageModel.from_pretrained(
      4         "/content/fairseq/checkpoints/Pre-trained-BioGPT",
      5         "checkpoint.pt",

yywill avatar Feb 10 '23 08:02 yywill

I get the same error; I guess the problem is a missing data file.

sepehrasgarian avatar Feb 10 '23 18:02 sepehrasgarian

Two things:

1. The newly created checkpoints directory should be in the BioGPT folder, the one you cloned.
2. Run the script from the BioGPT folder, again, the one you cloned.

If not, you have path issues and will have to adapt the Python scripts to resolve them.
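
A minimal sketch of that setup in Colab (directory layout assumed from the notebook above):

# Change into the cloned repo so relative paths like "checkpoints/..."
# and "data/..." resolve against it.
%cd /content/BioGPT

import os
print(os.getcwd())                                 # expect /content/BioGPT
print(os.path.isdir("checkpoints/RE-DTI-BioGPT"))  # checkpoints inside the clone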

pkhoueiry avatar Feb 10 '23 19:02 pkhoueiry

Thanks for your reply. Maybe the error is because of data/KD-DTI/relis-bin, which has to be created first.

sepehrasgarian avatar Feb 10 '23 19:02 sepehrasgarian

@pkhoueiry Yes, I did things 1 and 2 (sorry, it didn't save in the GitHub gist); it's the same error as above. There may be path issues, but it would be better to resolve them to avoid additional problems. @sepehrasgarian's hint seems to have brought me one step forward, but there is yet another error:

/content/BioGPT/src/language_modeling_prompt.py in setup_task(cls, args, **kwargs)
    131             args.source_lang, args.target_lang = data_utils.infer_language_pair(paths[0])
    132         if args.source_lang is None or args.target_lang is None:
--> 133             raise Exception(
    134                 "Could not infer language pair, please provide it explicitly"
    135             )

Exception: Could not infer language pair, please provide it explicitly
Full traceback (5 frames):

/content/BioGPT
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-18-eb583ca22072> in <module>
      2 from src.transformer_lm_prompt import TransformerLanguageModelPrompt
      3 
----> 4 m = TransformerLanguageModelPrompt.from_pretrained(
      5         "checkpoints/RE-DTI-BioGPT",
      6         "checkpoint_avg.pt",

4 frames
/usr/local/lib/python3.8/dist-packages/fairseq/models/fairseq_model.py in from_pretrained(cls, model_name_or_path, checkpoint_file, data_name_or_path, **kwargs)
    265         from fairseq import hub_utils
    266 
--> 267         x = hub_utils.from_pretrained(
    268             model_name_or_path,
    269             checkpoint_file,

/usr/local/lib/python3.8/dist-packages/fairseq/hub_utils.py in from_pretrained(model_name_or_path, checkpoint_file, data_name_or_path, archive_map, **kwargs)
     71         utils.import_user_module(argparse.Namespace(user_dir=kwargs["user_dir"]))
     72 
---> 73     models, args, task = checkpoint_utils.load_model_ensemble_and_task(
     74         [os.path.join(model_path, cpt) for cpt in checkpoint_file.split(os.pathsep)],
     75         arg_overrides=kwargs,

/usr/local/lib/python3.8/dist-packages/fairseq/checkpoint_utils.py in load_model_ensemble_and_task(filenames, arg_overrides, task, strict, suffix, num_shards, state)
    430 
    431             if task is None:
--> 432                 task = tasks.setup_task(cfg.task)
    433 
    434             if "task_state" in state:

/usr/local/lib/python3.8/dist-packages/fairseq/tasks/__init__.py in setup_task(cfg, **kwargs)
     44     ), f"Could not infer task type from {cfg}. Available argparse tasks: {TASK_REGISTRY.keys()}. Available hydra tasks: {TASK_DATACLASS_REGISTRY.keys()}"
     45 
---> 46     return task.setup_task(cfg, **kwargs)
     47 
     48 

/content/BioGPT/src/language_modeling_prompt.py in setup_task(cls, args, **kwargs)
    131             args.source_lang, args.target_lang = data_utils.infer_language_pair(paths[0])
    132         if args.source_lang is None or args.target_lang is None:
--> 133             raise Exception(
    134                 "Could not infer language pair, please provide it explicitly"
    135             )

Exception: Could not infer language pair, please provide it explicitly
  1. How can I provide this so-called language pair, whose definition is not known to me?

  2. Do I also have to import the mosesdecoder?

  3. And lastly, would it be easier to use the Hugging Face implementation? Are there specific functionalities it does not support?

runfish5 avatar Feb 13 '23 01:02 runfish5

@pkhoueiry The file data/KD-DTI/relis-bin is missing from the repository. Can you please help with how to get or generate it?

I have done the things mentioned above, but it is still not working:

1. The newly created checkpoints directory is in the BioGPT folder, the one I cloned.
2. I run the script from the BioGPT folder, again, the one I cloned.

My working directory is BioGPT

Please find below the error for reference:

TypeError                                 Traceback (most recent call last)
in <module>
      1 import torch
      2 from src.transformer_lm_prompt import TransformerLanguageModelPrompt
----> 3 m = TransformerLanguageModelPrompt.from_pretrained("checkpoints/RE-DTI-BioGPT", "checkpoint_avg.pt", "data/KD-DTI/relis-bin", tokenizer='moses', bpe='fastbpe', bpe_codes="data/bpecodes", max_len_b=1024, beam=1)

5 frames
/usr/local/lib/python3.8/dist-packages/fairseq/utils.py in split_paths(paths, separator)
     61 def split_paths(paths: str, separator=os.pathsep) -> List[str]:
     62     return (
---> 63         paths.split(separator) if "://" not in paths else paths.split(MANIFOLD_PATH_SEP)
     64     )
     65

TypeError: argument of type 'NoneType' is not iterable

Thank you

karkeranikitha avatar Feb 13 '23 15:02 karkeranikitha

You have to generate it by running the preprocessing (very fast). Check the related page in the example folder: https://github.com/microsoft/BioGPT/tree/main/examples/RE-DTI
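
In Colab that would look roughly like the sketch below; it assumes the example folder provides a preprocess.sh as described on that page, and that any required raw data has already been obtained:

# Sketch only: run the example's preprocessing from inside the cloned repo
%cd /content/BioGPT/examples/RE-DTI
!bash preprocess.sh        # should produce the data/KD-DTI/relis-bin binaries
%cd /content/BioGPT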

pkhoueiry avatar Feb 13 '23 15:02 pkhoueiry

You have to generate it by running the preprocessing (very fast). Check the related page in the example folder: https://github.com/microsoft/BioGPT/tree/main/examples/RE-DTI

Thanks a lot @pkhoueiry. It worked.

There is a fastBPE error that I am trying to resolve now:

module 'fastBPE' has no attribute 'fastBPE'

karkeranikitha avatar Feb 13 '23 18:02 karkeranikitha

Sorry for sharing a secret GitHub gist in the issue opening; I've noticed it now and made the gist public.

From the README.md it was not obvious to me that I have to preprocess anything, but the preprocessing seems fairly simple. However, it's more involved for the KD-DTI dataset than for RE-DDI and RE-BC5CDR, so I'm trying to run BioGPT with RE-DDI first. I'm seeing another problem with preprocessing, which I report in #56.
I think it would be better to change the RE example in the README.md to RE-BC5CDR or RE-DDI.

runfish5 avatar Feb 14 '23 04:02 runfish5

Thank you for your help! The preprocessing issue seems to be solved. I fixed it by adding:

%env MOSES=/content/BioGPT/mosesdecoder
%env FASTBPE=/content/BioGPT/fastBPE

but now I get another error:

     25             import fastBPE
     26 
---> 27             self.bpe = fastBPE.fastBPE(codes)
     28             self.bpe_symbol = "@@ "
     29         except ImportError:

AttributeError: module 'fastBPE' has no attribute 'fastBPE'
Full traceback:

AttributeError                            Traceback (most recent call last)
<ipython-input-14-abe1cad175ae> in <module>
      4 from src.transformer_lm_prompt import TransformerLanguageModelPrompt
      5 
----> 6 m = TransformerLanguageModelPrompt.from_pretrained(
      7         "checkpoints/RE-DDI-BioGPT",
      8         "checkpoint_avg.pt",

3 frames
/usr/local/lib/python3.8/dist-packages/fairseq/models/fairseq_model.py in from_pretrained(cls, model_name_or_path, checkpoint_file, data_name_or_path, **kwargs)
    273         )
    274         logger.info(x["args"])
--> 275         return hub_utils.GeneratorHubInterface(x["args"], x["task"], x["models"])
    276 
    277     @classmethod

/usr/local/lib/python3.8/dist-packages/fairseq/hub_utils.py in __init__(self, cfg, task, models)
    106 
    107         self.tokenizer = encoders.build_tokenizer(cfg.tokenizer)
--> 108         self.bpe = encoders.build_bpe(cfg.bpe)
    109 
    110         self.max_positions = utils.resolve_max_positions(

/usr/local/lib/python3.8/dist-packages/fairseq/registry.py in build_x(cfg, *extra_args, **extra_kwargs)
     59             builder = cls
     60 
---> 61         return builder(cfg, *extra_args, **extra_kwargs)
     62 
     63     def register_x(name, dataclass=None):

/usr/local/lib/python3.8/dist-packages/fairseq/data/encoders/fastbpe.py in __init__(self, cfg)
     25             import fastBPE
     26 
---> 27             self.bpe = fastBPE.fastBPE(codes)
     28             self.bpe_symbol = "@@ "
     29         except ImportError:

AttributeError: module 'fastBPE' has no attribute 'fastBPE'

runfish5 avatar Feb 17 '23 08:02 runfish5

I ran !pip install fastBPE and it got through those lines, but stopped later at

generate = m.generate([src_tokens], beam=args.beam)[0]

With the error:

NameError                                 Traceback (most recent call last)
<ipython-input-8-abe1cad175ae> in <module>
     16 src_text= text3 # input text, e.g., a PubMed abstract
     17 src_tokens = m.encode(src_text)
---> 18 generate = m.generate([src_tokens], beam=args.beam)[0]
     19 output = m.decode(generate[0]["tokens"])
     20 print(output)

NameError: name 'args' is not defined

runfish5 avatar Feb 17 '23 09:02 runfish5

Yes, pip install fastBPE works, and you can replace args.beam with a variable or a value, e.g. 1 or 5.
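
Putting the fixes from this thread together, a hedged end-to-end sketch for the RE-DDI run (checkpoint and data paths are assumptions based on the comments above):

!pip install fastBPE       # the moses tokenizer may additionally need sacremoses

from src.transformer_lm_prompt import TransformerLanguageModelPrompt

# Run from the cloned BioGPT directory so the relative paths resolve
m = TransformerLanguageModelPrompt.from_pretrained(
        "checkpoints/RE-DDI-BioGPT",
        "checkpoint_avg.pt",
        "data/DDI/relis-bin",          # assumed output of the preprocessing step
        tokenizer='moses',
        bpe='fastbpe',
        bpe_codes="data/bpecodes",
        max_len_b=1024,
        beam=1)
m.cuda()
src_text = "..."                       # input text, e.g. a PubMed abstract
src_tokens = m.encode(src_text)
generate = m.generate([src_tokens], beam=1)[0]   # literal value instead of args.beam
output = m.decode(generate[0]["tokens"])
print(output)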

panamantis avatar Mar 02 '23 21:03 panamantis

SUCCESS!!!!!!! ⭐ ⭐ ⭐ ⭐ ⭐ Thank you all for your help. With your combined effort, I finished the template; here it is: https://gist.github.com/raven44099/edd254c6f5dbcfe5faad7701d1df88cf

runfish5 avatar Mar 10 '23 06:03 runfish5

@raven44099 I'm trying to run your Colab and it stumbles into the same error you posted. Did you run the whole KD-DTI preparation in order to create the relis-bin, including the installation of BERT-DTI? Because I don't see it in your notebook, and I wonder where I got things wrong.

lir0ni avatar Aug 05 '23 15:08 lir0ni

I didn't finish the KD-DTI dataset setup, sorry for not making that clear. I only finished it for RE-DDI.

runfish5 avatar Oct 02 '23 01:10 runfish5