TextDiffuser - How is the LayoutTransformer model trained?
Dear TextDiffuser author: I have read the paper and the source code carefully, but I still have some confusion.
The laion-ocr dataset provides ocr.txt, which uses 4-point annotations rather than rectangular bounding boxes, whereas the LayoutTransformer outputs [x1, y1, x2, y2] rectangular bboxes. Can we train the LayoutTransformer with laion-ocr, and what preprocessing is needed?
Also, at inference time the LayoutTransformer extracts the words wrapped in single quotes ('), calculates the width of each targeted word, and then removes the quotes from the caption. However, when training the LayoutTransformer, I am wondering how you build the connection between the caption and the OCR boxes. For example, the laion-ocr captions do not wrap the rendered words in quotes, so how can I extract the target words whose widths are calculated later?
Sincerely looking forward to your reply.
Thanks for your attention to TextDiffuser. [x1, y1, x2, y2] denotes the coordinates of the top-left and bottom-right points of the minimum horizontal rectangle enclosing the 4-point annotation.
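For clarity, here is a minimal sketch of that conversion (assuming each annotation is four (x, y) corner points; the function name is only illustrative, not from the released code):

def quad_to_rect(points):
    """Convert a 4-point annotation [(x, y), ...] into the minimum
    horizontal rectangle [x1, y1, x2, y2] enclosing all four corners."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return [min(xs), min(ys), max(xs), max(ys)]

# a tilted quadrilateral becomes its axis-aligned bounding box
print(quad_to_rect([(10, 20), (50, 22), (48, 60), (8, 58)]))  # [8, 20, 50, 60]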
We try to add single quotes (') to the caption according to the detected OCR results. For example, if the caption is [a cat holds a board saying hello world] and the words 'hello' and 'world' are detected, we transform the caption into [a cat holds a board saying 'hello world']. There may indeed be some noise, but most cases are good.
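A rough sketch of the idea (only illustrative; the matching heuristic below is an assumption, not the exact preprocessing script):

def quote_detected_words(caption, detected_words):
    """Wrap the span from the first to the last caption token that matches
    a detected OCR word in single quotes, e.g. 'hello world'."""
    tokens = caption.split()
    detected = {w.lower() for w in detected_words}
    hits = [i for i, tok in enumerate(tokens) if tok.strip(".,!?").lower() in detected]
    if not hits:
        return caption
    start, end = hits[0], hits[-1]
    tokens[start] = "'" + tokens[start]
    tokens[end] = tokens[end] + "'"
    return " ".join(tokens)

print(quote_detected_words("a cat holds a board saying hello world", ["hello", "world"]))
# a cat holds a board saying 'hello world'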
Is there a plan to release the LayoutTransformer training scripts? In fact, I tried to train it on laion-ocr, but the results are not good. During training, given a prompt and bboxes, the encoder part is the same as in the released code. However, I do not know how the loss is calculated, so in my training implementation I compute the L1 loss between the k-th predicted box and the k-th ground-truth box as follows:
# given a sample: a (caption, bbox) pair
k = random.randrange(min(len(bbox), MAX_BOX_NUM))   # index of the box to predict
known_box = bbox[:k]                                # boxes already known to the model
known_box = zero_padding_to_max_box_num(known_box)  # pad to shape [MAX_BOX_NUM, 4]
pred_box = decoder(encoder(caption), known_box)     # pred_box shape: [MAX_BOX_NUM, 4]
# L1 loss between the k-th predicted box and the k-th ground-truth box
loss = l1_loss(pred_box[k], bbox[k])
I am wondering how you build the loss?
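For reference, here is a sketch of an alternative I am considering, which applies the L1 loss to all real (non-padded) box slots at once using a mask; this is only my guess and is not assumed to match your actual implementation:

import torch
import torch.nn.functional as F

def masked_box_l1_loss(pred_box, gt_box, num_gt):
    """L1 loss over all ground-truth boxes, ignoring zero-padded slots.

    pred_box, gt_box: tensors of shape [MAX_BOX_NUM, 4]
    num_gt: number of real (non-padded) ground-truth boxes
    """
    mask = torch.zeros(pred_box.size(0), dtype=torch.bool)
    mask[:num_gt] = True
    return F.l1_loss(pred_box[mask], gt_box[mask])

# toy check: only the first 2 of 4 box slots contribute to the loss
pred = torch.rand(4, 4)
gt = torch.zeros(4, 4)
gt[:2] = torch.rand(2, 4)
print(masked_box_l1_loss(pred, gt, num_gt=2))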