
The paper-provided example cannot be reproduced!!

Open HuangChiEn opened this issue 1 year ago • 9 comments

I took the example from the README.md. (screenshot)

  1. The load-8bit response is acceptable, but it didn't give me any explanation. (screenshot)

  2. I thought load-in-8-bit might decrease performance, so I ran in fp16 mode only (no aggressive quantization), but I got even worse results: it still doesn't explain. (screenshots)

A bunch of out-of-control [SEG] tokens pops out?

About version and package :

accelerate                1.0.1
aiofiles                  23.2.1
aiohappyeyeballs          2.4.3
aiohttp                   3.10.10
aiosignal                 1.3.1
altair                    5.4.1
annotated-types           0.7.0
anyio                     4.6.2
async-timeout             4.0.3
attrs                     24.2.0
autocommand               2.2.2
backports.tarfile         1.2.0
bitsandbytes              0.41.1
certifi                   2024.8.30
charset-normalizer        3.4.0
click                     8.1.7
contourpy                 1.3.0
cycler                    0.12.1
deepspeed                 0.15.2
einops                    0.4.1
exceptiongroup            1.2.2
fastapi                   0.100.1
ffmpy                     0.4.0
filelock                  3.16.1
flash_attn                2.6.3
fonttools                 4.54.1
frozenlist                1.4.1
fsspec                    2024.9.0
gradio                    3.39.0
gradio_client             1.3.0
grpcio                    1.66.2
h11                       0.14.0
hjson                     3.1.0
httpcore                  1.0.6
httpx                     0.27.2
huggingface-hub           0.25.2
idna                      3.10
importlib_metadata        8.0.0
importlib_resources       6.4.5
inflect                   7.3.1
jaraco.collections        5.1.0
jaraco.context            5.3.0
jaraco.functools          4.0.1
jaraco.text               3.12.1
Jinja2                    3.1.4
joblib                    1.4.2
jsonschema                4.23.0
jsonschema-specifications 2024.10.1
kiwisolver                1.4.7
linkify-it-py             2.0.3
markdown-it-py            2.2.0
markdown2                 2.4.10
MarkupSafe                2.1.5
matplotlib                3.9.2
mdit-py-plugins           0.3.3
mdurl                     0.1.2
more-itertools            10.3.0
mpmath                    1.3.0
msgpack                   1.1.0
multidict                 6.1.0
narwhals                  1.9.3
networkx                  3.2.1
ninja                     1.11.1.1
numpy                     1.24.2
nvidia-ml-py              12.560.30
openai                    0.27.8
opencv-python             4.8.0.74
orjson                    3.10.7
packaging                 24.1
pandas                    2.2.3
peft                      0.4.0
Pillow                    9.4.0
pip                       24.2
platformdirs              4.2.2
propcache                 0.2.0
protobuf                  5.28.2
psutil                    6.0.0
py-cpuinfo                9.0.0
pycocotools               2.0.6
pydantic                  2.9.2
pydantic_core             2.23.4
pydub                     0.25.1
pyparsing                 3.2.0
python-dateutil           2.9.0.post0
python-multipart          0.0.12
pytz                      2024.2
PyYAML                    6.0.2
ray                       2.6.1
referencing               0.35.1
regex                     2024.9.11
requests                  2.31.0
rpds-py                   0.20.0
sacremoses                0.1.1
safetensors               0.4.5
scipy                     1.11.2
semantic-version          2.10.0
sentencepiece             0.2.0
setuptools                75.1.0
shortuuid                 1.0.11
six                       1.16.0
sniffio                   1.3.1
starlette                 0.27.0
sympy                     1.12
tokenizers                0.15.2
tomli                     2.0.1
torch                     2.1.2+cu121
torchaudio                2.1.2+cu121
torchvision               0.16.2+cu121
tqdm                      4.64.1
transformers              4.35.2
triton                    2.1.0
typeguard                 4.3.0
typing_extensions         4.12.2
tzdata                    2024.2
uc-micro-py               1.0.3
urllib3                   2.2.3
uvicorn                   0.23.2
websockets                11.0.3
wheel                     0.44.0
yarl                      1.15.2
zipp                      3.20.2

I encountered several issues, so I pinned the transformers version to the one LLaVA uses and modified the code according to this issue: https://github.com/haotian-liu/LLaVA/issues/968

What I'm afraid really affects the decoding strategy is https://github.com/salesforce/LAVIS/issues/571.

So, I replaced all private functions (i.e. `_expand_mask`) with their public equivalents to pass Python's static checks. Moreover, I placed a `raise RuntimeError` at the beginning of every function that would use them, and I never hit any RuntimeError. That means none of those private functions is used during inference.
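The guard trick above can be sketched as follows; the helper name `_expand_mask` and the stand-in `run_inference` are illustrative only, not the actual transformers code:

```python
# Sketch of the guard described above: put a RuntimeError at the top of each
# private helper so that any inference path reaching it fails loudly.
def _expand_mask(mask, dtype, tgt_len=None):
    # If this line ever executes during inference, the helper IS used.
    raise RuntimeError("_expand_mask was reached during inference")


def run_inference():
    # Stand-in for the real forward pass; here it never calls _expand_mask.
    return "inference finished without touching private helpers"


try:
    result = run_inference()
except RuntimeError as exc:
    result = f"guard fired: {exc}"
```

If inference completes without any RuntimeError, the guarded helpers are provably unused on that code path, which is exactly what the check above relies on.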


Any suggestion will be appreciated!!

HuangChiEn avatar Oct 15 '24 05:10 HuangChiEn

Haven't you looked at the training data? The ground truth LLaVA is trained to output is "Sure, it's the [SEG]." It was trained to respond like that, so you cannot force it to output the explanation. That said, the author's demo does seem inconsistent with this.
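For context, here is a rough sketch (my own illustration, not the authors' code, and the token id is hypothetical) of how a [SEG]-style special token is typically located in the generated ids so its hidden state can drive the mask decoder:

```python
SEG_TOKEN_ID = 32000  # hypothetical id assigned to the added [SEG] token

def seg_token_positions(output_ids):
    """Indices of every [SEG] token in the generated sequence; the hidden
    states at these positions are what feed the mask decoder."""
    return [i for i, tok in enumerate(output_ids) if tok == SEG_TOKEN_ID]

# A well-behaved answer such as "Sure, it's the [SEG]." yields exactly one hit.
ids = [1, 18585, 29892, 372, 29915, 32000, 29889]
hits = seg_token_positions(ids)
```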

jifeng35 avatar Oct 22 '24 01:10 jifeng35

> Haven't you looked at the training data? The ground truth LLaVA is trained to output is "Sure, it's the [SEG]." It was trained to respond like that, so you cannot force it to output the explanation. That said, the author's demo does seem inconsistent with this.

Yeah, I agree with your point; the model can hardly generate phrases it never saw in the training set. However, the demo gives more than just one such example. (screenshots)

So, I wonder how to reproduce such inference results by adjusting the prompt (we can only see that it is triggered by the phrase 'explain why')?


On the other hand, the reproduced results also show error-prone output; for example, the model generates a massive number of [SEG] tokens in the console. That's my other question. (screenshot)
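As a stopgap while debugging, degenerate output can at least be detected and trimmed; this is my own workaround sketch, not a fix for the underlying decoding issue:

```python
def count_seg_tokens(decoded_text: str) -> int:
    # How many [SEG] markers the decoded answer contains; more than one
    # suggests the generation has fallen into a repetition loop.
    return decoded_text.count("[SEG]")

def truncate_at_first_seg(decoded_text: str) -> str:
    # Keep everything up to and including the first [SEG] marker.
    head, sep, _tail = decoded_text.partition("[SEG]")
    return head + sep if sep else decoded_text

runaway = "Sure, it's the [SEG].[SEG][SEG][SEG]"
trimmed = truncate_at_first_seg(runaway)
```

Since the mask decoder only needs the first [SEG] in a single-object answer, truncating there keeps the segmentation usable even when the text output degenerates.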

HuangChiEn avatar Oct 22 '24 07:10 HuangChiEn

I don't think you need to dwell on this problem. If you want the results shown in the demo, consider looking into LISA++; it completes the demo tasks well and its dialogue is more natural.

jifeng35 avatar Oct 22 '24 13:10 jifeng35

> I don't think you need to dwell on this problem. If you want the results shown in the demo, consider looking into LISA++; it completes the demo tasks well and its dialogue is more natural.

I found the paper you mentioned, but I can't find its GitHub. You said LISA++ completes the demo tasks well; do you know where to get LISA++'s GitHub (source code and reproducible weights)?

HuangChiEn avatar Oct 23 '24 00:10 HuangChiEn

It seems there really is no code or weights; the LISA++ paper doesn't mention them either.

jifeng35 avatar Oct 23 '24 00:10 jifeng35

Then let's wait and see whether the LISA authors give any comment~

HuangChiEn avatar Oct 23 '24 03:10 HuangChiEn

Additional note (11/06): we have also tested a different checkpoint, "xinlai/LISA-13B-llama2-v1-explanatory"; I hadn't noticed before that the term "explanatory" may denote exactly the feature I'm requesting in this thread. Note that, to fully unlock the model's capability, we also turned off load-in-8-bit; fp16 is the only efficiency flag we kept. (screenshot)

However, we got even worse results. Text output (buggy output): (screenshot)

Segmentation output (doesn't segment correctly): dog_with_horn_masked_img_0

HuangChiEn avatar Nov 06 '24 01:11 HuangChiEn

@HuangChiEn Hi~ I have been working on reproducing LISA recently. Have you noticed issue #162? I see that you are all focusing on reproducing reasoning segmentation but rarely mention referring segmentation (e.g. validation on the refCOCO dataset). Have you successfully replicated the results on the refCOCO dataset?

fazhdo avatar Dec 02 '24 12:12 fazhdo

> @HuangChiEn Hi~ I have been working on reproducing LISA recently. Have you noticed issue #162? I see that you are all focusing on reproducing reasoning segmentation but rarely mention referring segmentation (e.g. validation on the refCOCO dataset). Have you successfully replicated the results on the refCOCO dataset?

Hello, I haven't fully tested it (i.e. run through the whole validation set), but the segmentation results are correct on the paper-provided examples.

On the other hand, I have successfully reproduced reasoning segmentation, though only the quantized xinlai/LISA-13B-llama2-v1-explanatory works (--precision='fp16' with --load_in_8bit or --load_in_4bit).

HuangChiEn avatar Dec 04 '24 01:12 HuangChiEn