python-docx2txt

Python does not recognize the italic font in Docx

Open me-suzy opened this issue 1 year ago • 0 comments

The problem appears when I copy the content of an HTML page into a Word file with the docx extension. The library sees italic text only when the italic style is declared in an HTML class; otherwise it does not recognize it.

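So Python can fail to recognize italics in docx files for the following reasons:

  • The docx format does not store italic in a single place. A run can carry the italic toggle directly in its own run properties, as a <w:i/> element.
  • Italic can also be inherited from a character or paragraph style that the run refers to, so code that checks only the run's own properties reports such text as not italic.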
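A minimal sketch of what this looks like at the XML level, using only the standard library. The fragment below is hand-written for illustration, not taken from the test page, and the style name "Emphasis" and the helper `has_direct_italic` are assumptions made for the example. It shows that a check limited to a run's own `<w:i/>` element misses a run whose italic comes from a style.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hand-written fragment in the shape of word/document.xml (illustrative only).
SAMPLE = f"""
<w:p xmlns:w="{W}">
  <w:r><w:t>plain </w:t></w:r>
  <w:r><w:rPr><w:i/></w:rPr><w:t>direct italic </w:t></w:r>
  <w:r><w:rPr><w:rStyle w:val="Emphasis"/></w:rPr><w:t>styled italic</w:t></w:r>
</w:p>
"""


def has_direct_italic(run):
    """True if the run carries its own <w:i/> toggle and it is not switched off."""
    flag = run.find("w:rPr/w:i", NS)
    if flag is None:
        return False
    # <w:i/> with no w:val attribute means "on".
    return flag.get(f"{{{W}}}val", "true") not in ("0", "false", "off")


paragraph = ET.fromstring(SAMPLE)
report = [
    (run.findtext("w:t", default="", namespaces=NS), has_direct_italic(run))
    for run in paragraph.findall("w:r", NS)
]
for text, italic in report:
    print(repr(text), italic)
```

The third run is reported as not italic even though Word would normally render it in italics, because the toggle lives in the referenced style rather than in the run itself.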

You can test with this page, which has some italic paragraphs. Copy its content into a docx file and compare how the italics appear in Word and in Python.

https://neculaifantanaru.com/en/delight-my-gaze-with-something-that-reflects-the-harmony-of-nature-II.html

The Python library for handling docx files, docx2txt, ignores all properties that can affect the italic style. To resolve this issue, docx2txt may need to be modified to account for every property that can affect the italic style. Alternatively, a new library could be created that is specifically designed to identify the italic style in docx files.

Here are some specific solutions that could be implemented:

In short, the <em> and </em> or <i> and </i> tags in the HTML files must be taken into consideration, even if the classes do not include the italic style.
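On the HTML side, a rough sketch of treating `<em>` and `<i>` as italic regardless of any class could look like this. It uses the standard library `html.parser`; the class name `ItalicCollector` and the function `italic_spans` are made up for the example, and CSS-based italics are deliberately ignored.

```python
from html.parser import HTMLParser


class ItalicCollector(HTMLParser):
    """Collect (text, is_italic) pairs, treating <em> and <i> as italic."""

    ITALIC_TAGS = {"em", "i"}

    def __init__(self):
        super().__init__()
        self.depth = 0   # how many italic tags are currently open
        self.spans = []  # list of (text, is_italic)

    def handle_starttag(self, tag, attrs):
        if tag in self.ITALIC_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.ITALIC_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if data:
            self.spans.append((data, self.depth > 0))


def italic_spans(html):
    """Return the text pieces of an HTML string with an italic flag for each."""
    parser = ItalicCollector()
    parser.feed(html)
    parser.close()
    return parser.spans


spans = italic_spans("<p>A <em>quiet</em> and <i>bright</i> day</p>")
for text, italic in spans:
    print(repr(text), italic)
```

The comparison with the docx side then becomes a matter of checking whether every span flagged here is also flagged as italic after the paste into Word.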

  • The docx2txt library could be modified to read the b and i properties from the docx format. These properties indicate whether the text is bold or italic, respectively.
  • A new library could be created that uses a machine learning algorithm to identify the italic style in docx files. This algorithm could consider properties such as font, font size, and letter spacing.

Note that these solutions are only suggestions. Their implementation would require further research and testing.
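Reading the i property, as suggested above, could be sketched roughly as follows. This is not docx2txt code and not its API; `runs_with_italic`, `build_sample`, and the helpers are hypothetical names. The sketch reads `word/document.xml` and `word/styles.xml` straight from the archive with the standard library. It is simplified on purpose: it ignores `basedOn` style chains, document defaults, and the toggle semantics between style levels, and it looks only at runs that are direct children of a paragraph, so runs inside hyperlinks are skipped.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
VAL = f"{{{W}}}val"


def _toggle(rpr):
    """True/False for an explicit <w:i>, or None when rPr says nothing."""
    flag = rpr.find("w:i", NS) if rpr is not None else None
    if flag is None:
        return None
    return flag.get(VAL, "true") not in ("0", "false", "off")


def _italic_styles(archive):
    """Map style id -> True/False for styles that set italic themselves."""
    if "word/styles.xml" not in archive.namelist():
        return {}
    root = ET.fromstring(archive.read("word/styles.xml"))
    found = {}
    for style in root.findall("w:style", NS):
        value = _toggle(style.find("w:rPr", NS))
        if value is not None:
            found[style.get(f"{{{W}}}styleId")] = value
    return found


def runs_with_italic(docx_file):
    """Yield (text, is_italic) for runs in the main document part."""
    with zipfile.ZipFile(docx_file) as archive:
        styles = _italic_styles(archive)
        body = ET.fromstring(archive.read("word/document.xml"))
    for para in body.iter(f"{{{W}}}p"):
        p_style = para.find("w:pPr/w:pStyle", NS)
        inherited = styles.get(p_style.get(VAL)) if p_style is not None else None
        for run in para.findall("w:r", NS):
            rpr = run.find("w:rPr", NS)
            r_style = rpr.find("w:rStyle", NS) if rpr is not None else None
            italic = _toggle(rpr)              # 1. direct formatting
            if italic is None and r_style is not None:
                italic = styles.get(r_style.get(VAL))  # 2. character style
            if italic is None:
                italic = inherited             # 3. paragraph style
            text = "".join(t.text or "" for t in run.findall("w:t", NS))
            if text:
                yield text, bool(italic)


def build_sample():
    """Tiny in-memory archive with only the two parts this sketch reads.

    It is not a complete Word file; it exists so the example is runnable.
    """
    styles = (f'<w:styles xmlns:w="{W}"><w:style w:type="character" '
              'w:styleId="Emphasis"><w:rPr><w:i/></w:rPr></w:style></w:styles>')
    document = (f'<w:document xmlns:w="{W}"><w:body><w:p>'
                '<w:r><w:t>plain </w:t></w:r>'
                '<w:r><w:rPr><w:i/></w:rPr><w:t>direct </w:t></w:r>'
                '<w:r><w:rPr><w:rStyle w:val="Emphasis"/></w:rPr>'
                '<w:t>styled</w:t></w:r>'
                '</w:p></w:body></w:document>')
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/styles.xml", styles)
        archive.writestr("word/document.xml", document)
    buffer.seek(0)
    return buffer


result = list(runs_with_italic(build_sample()))
print(result)
```

For a real file, `runs_with_italic("pasted.docx")` would be called with a path instead of the in-memory sample (the file name here is a placeholder). Whether pasted `<em>` text ends up as direct formatting or as a character style depends on how Word imports the HTML, which is why the sketch checks both.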

You can see the entire discussion here:

https://learn.microsoft.com/en-us/answers/questions/1375484/python-fails-to-correctly-identify-all-italic-font

me-suzy, Sep 25 '23 07:09