TextDiffuser - How is the LayoutTransformer model trained?
Dear TextDiffuser author: I have read the paper and the source code carefully, but I still have some confusion.
The laion-ocr dataset provides ocr.txt, which uses 4-point annotations rather than rectangular bounding boxes, whereas the LayoutTransformer outputs [x1, y1, x2, y2] rectangular bboxes. Can we train the LayoutTransformer with laion-ocr, and what preprocessing is needed?
Also, at inference time the LayoutTransformer extracts the words wrapped in single quotes ('), calculates the width of each targeted word, and then removes the quotes from the caption. However, when training the LayoutTransformer, I am wondering how you build the connection between the caption and the OCR boxes. For example, the laion-ocr captions do not wrap the rendered words in quotes, so how can I extract the target words whose widths are calculated later?
Sincerely looking forward to your reply.
Thanks for your attention to TextDiffuser. [x1, y1, x2, y2] denotes the coordinates of the top-left and bottom-right points of the minimum horizontal rectangle enclosing the 4-point annotation.
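For clarity, here is a minimal sketch of that conversion (assuming each annotation is four (x, y) corner points; the function name is only illustrative, not from the released code):

def quad_to_rect(points):
    """Convert a 4-point annotation [(x, y), ...] into the minimum
    horizontal rectangle [x1, y1, x2, y2] enclosing all four corners."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return [min(xs), min(ys), max(xs), max(ys)]

# a tilted quadrilateral becomes its axis-aligned bounding box
print(quad_to_rect([(10, 20), (50, 22), (48, 60), (8, 58)]))  # [8, 20, 50, 60]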
We try to add single quotes (') to the caption according to the detected OCR results. For example, if the caption is [a cat holds a board saying hello world] and the words 'hello' and 'world' are detected, we transform the caption into [a cat holds a board saying 'hello world']. There may indeed be some noise, but most cases are good.
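A rough sketch of the idea (only illustrative; the matching heuristic below is an assumption, not the exact preprocessing script):

def quote_detected_words(caption, detected_words):
    """Wrap the span from the first to the last caption token that matches
    a detected OCR word in single quotes, e.g. 'hello world'."""
    tokens = caption.split()
    detected = {w.lower() for w in detected_words}
    hits = [i for i, tok in enumerate(tokens) if tok.strip(".,!?").lower() in detected]
    if not hits:
        return caption
    start, end = hits[0], hits[-1]
    tokens[start] = "'" + tokens[start]
    tokens[end] = tokens[end] + "'"
    return " ".join(tokens)

print(quote_detected_words("a cat holds a board saying hello world", ["hello", "world"]))
# a cat holds a board saying 'hello world'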
Is there a plan to release the LayoutTransformer training scripts? In fact, I tried to train it on laion-ocr, but the results are not good. During training, given a prompt and bboxes, the encoder part is the same as in the released code. However, I do not know how the loss is calculated, so in my training implementation I compute the L1 loss between the k-th predicted box and the k-th ground-truth box as follows:
# given a sample: a (caption, bbox) pair
k = random.randrange(min(len(bbox), MAX_BOX_NUM))   # index of the box to predict
known_box = bbox[:k]                                # boxes already known to the model
known_box = zero_padding_to_max_box_num(known_box)  # pad to shape [MAX_BOX_NUM, 4]
pred_box = decoder(encoder(caption), known_box)     # pred_box shape: [MAX_BOX_NUM, 4]
# L1 loss between the k-th predicted box and the k-th ground-truth box
loss = l1_loss(pred_box[k], bbox[k])
I am wondering how you build the loss?
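For reference, here is a sketch of an alternative I am considering, which applies the L1 loss to all real (non-padded) box slots at once using a mask; this is only my guess and is not assumed to match your actual implementation:

import torch
import torch.nn.functional as F

def masked_box_l1_loss(pred_box, gt_box, num_gt):
    """L1 loss over all ground-truth boxes, ignoring zero-padded slots.

    pred_box, gt_box: tensors of shape [MAX_BOX_NUM, 4]
    num_gt: number of real (non-padded) ground-truth boxes
    """
    mask = torch.zeros(pred_box.size(0), dtype=torch.bool)
    mask[:num_gt] = True
    return F.l1_loss(pred_box[mask], gt_box[mask])

# toy check: only the first 2 of 4 box slots contribute to the loss
pred = torch.rand(4, 4)
gt = torch.zeros(4, 4)
gt[:2] = torch.rand(2, 4)
print(masked_box_l1_loss(pred, gt, num_gt=2))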