Which type of Markdown is supported?

Open thomasfrederikhoeck opened this issue 1 year ago • 2 comments

Question

Which flavor of Markdown does the output use? Some Markdown flavors support strikethrough (~~strikethrough~~), such as here on GitHub, while others don't. The reason I'm asking is that we work a lot with contract types where words are simply deleted with a strikethrough, which can drastically change the meaning of the sentence.

thomasfrederikhoeck avatar Nov 21 '24 10:11 thomasfrederikhoeck

@thomasfrederikhoeck Good point! We have not yet considered strike-through text explicitly, but I would assume it would simply carry through. I know we looked carefully at bold and italic and made sure we preserved those.

What are your current findings?

PeterStaar-IBM avatar Nov 22 '24 05:11 PeterStaar-IBM

@PeterStaar-IBM My current finding is that it is not preserved, but I don't know whether that is because the OCR does not extract or recognize it (OCR engines sometimes think they are clever and remove it) or because the conversion to Markdown removes the strikethrough.

thomasfrederikhoeck avatar Nov 22 '24 10:11 thomasfrederikhoeck

In the meantime, we have added the possibility to represent these styles in DoclingDocument if the input format contains that information. The serializers should respect it.

cau-git avatar May 20 '25 18:05 cau-git
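
To check whether strikethrough actually survives in a given Markdown export, a quick scan for GitHub-style `~~...~~` spans can help. The following is only a minimal sketch: `find_strikethrough` and the sample string are hypothetical and not part of docling, and the pattern assumes the double-tilde syntax mentioned in the question. It does not skip code spans or fenced blocks.

```python
import re

# Assumption: strikethrough is written GitHub-style as ~~text~~.
# The lookarounds require non-space characters directly inside the tildes.
STRIKE_RE = re.compile(r"~~(?=\S)(.+?)(?<=\S)~~")


def find_strikethrough(markdown: str) -> list[str]:
    """Return the text of every ~~...~~ span found in the Markdown string."""
    return [match.group(1) for match in STRIKE_RE.finditer(markdown)]


# Hypothetical contract sentence, used purely for illustration.
sample = "The supplier ~~shall not~~ shall deliver within ~~30~~ 60 days."
struck = find_strikethrough(sample)
print(struck)
```

If the source document is known to contain struck-out words and this returns an empty list for the exported Markdown, the formatting was lost somewhere before or during serialization; the scan alone cannot tell which step dropped it.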