NielsRogge
You just need to make sure to normalize your bounding boxes, as the model only has position embeddings for coordinates between 0 and 1000. See here: https://huggingface.co/docs/transformers/en/model_doc/layoutlm#usage-tips (it's equivalent for LiLT)
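A minimal sketch of that normalization (the helper name `normalize_box` is my own, not from the docs): scale pixel coordinates into the 0-1000 range using the page width and height.

```python
def normalize_box(box, width, height):
    # Scale (x0, y0, x1, y1) pixel coordinates into the 0-1000 range
    # expected by LayoutLM / LiLT position embeddings.
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]

# A box on a 600x800 page
print(normalize_box((15, 30, 300, 60), width=600, height=800))  # → [25, 37, 500, 75]
```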
I'd recommend using the Azure Read API with readingOrder="natural".
I'd recommend taking a look here: https://github.com/facebookresearch/detr/blob/3af9fa878e73b6894ce3596450a8d9b89d918ca9/datasets/coco.py#L74-L76. The data preparation is equivalent for MaskFormer/Mask2Former/OneFormer. Basically, COCO stores segmentation masks as polygons, so you need to convert them to a set...
> The solution seems for me to write a custom dataset converter to convert my polygon annotations to the custom RGB format (R channel for classID, G channel for instance...
@Robotatron it does support it, however the image processor (which can be used to speed up data preparation) doesn't. So I'd advise preparing the data yourself for the model,...
@cyh-0 MaskFormer outputs a binary mask + class for each of its object queries (`model.config.num_queries`). If an image contains 2 semantic categories, for instance, and the model uses 100 object...
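To illustrate how per-query masks and classes combine into a semantic map, here's a toy NumPy sketch of the mechanics (the shapes and random numbers are made up; in practice the image processor's `post_process_semantic_segmentation` does this for you):

```python
import numpy as np

# Toy MaskFormer-style outputs: Q queries, each with class scores over
# C classes (+1 "no object" slot) and an H x W mask logit map.
Q, C, H, W = 5, 3, 4, 4
rng = np.random.default_rng(0)
class_logits = rng.normal(size=(Q, C + 1))   # last column = "no object"
mask_logits = rng.normal(size=(Q, H, W))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Drop the "no object" column, sigmoid the masks, then weigh each
# query's mask by its class probabilities and take a per-pixel argmax.
class_probs = softmax(class_logits)[:, :-1]        # (Q, C)
mask_probs = 1.0 / (1.0 + np.exp(-mask_logits))    # (Q, H, W)
semantic = np.einsum("qc,qhw->chw", class_probs, mask_probs)
semantic_map = semantic.argmax(axis=0)             # (H, W) class id per pixel
print(semantic_map.shape)  # → (4, 4)
```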
Hi, Thanks for the kind words :) CLIP cannot really be used for image captioning out-of-the-box, as it only consists of 2 encoders (a vision and a text encoder). There...
Hi, We do provide a script to fine-tune CLIP and similar models on an (image, text) dataset here: https://github.com/huggingface/transformers/tree/main/examples/pytorch/contrastive-image-text. Alternatively have a look at the OpenCLIP repository which also provides...
Hi yes, here's a guide: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Donut/DocVQA/Creating_a_toy_DocVQA_dataset_for_Donut.ipynb
Creating a HF Dataset from scratch is explained here: https://huggingface.co/docs/datasets/image_dataset.
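As a sketch of the "imagefolder" layout that page describes (folder of images plus a `metadata.csv` linking `file_name` to extra columns; the dataset name and caption here are placeholders):

```python
import csv
import pathlib
import tempfile

# Build the directory layout that load_dataset("imagefolder", data_dir=...)
# expects: split folder with images and a metadata.csv alongside them.
root = pathlib.Path(tempfile.mkdtemp()) / "my_dataset" / "train"
root.mkdir(parents=True)
(root / "0001.png").write_bytes(b"")  # placeholder; real image bytes go here

with open(root / "metadata.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["file_name", "text"])        # header: image + caption column
    writer.writerow(["0001.png", "a toy caption"])
```

After that, `load_dataset("imagefolder", data_dir="my_dataset")` picks up the images and the metadata columns.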