[QUESTION] PDF and Docx import results in unformatted text.

Open afifa-glowlogix opened this issue 4 years ago • 16 comments

I followed https://github.com/ether/etherpad-lite/wiki/How-to-enable-importing-and-exporting-different-file-formats-with-AbiWord to install LibreOffice, and installed ep_document_import_hook https://github.com/mrbabbs/ep_document_import_hook to import documents into my pad, but the resulting pad text is unformatted. It has a lot of extra spaces, and the lists are changed to only 1.

afifa-glowlogix avatar Oct 03 '20 11:10 afifa-glowlogix

Our tests pass here.

https://github.com/mrbabbs/ep_document_import_hook -- are you sure you want to use this? Perhaps it's causing problems?

This looks to be a plugin issue unless you can replicate on say https://video.etherpad.com -- if you can, please provide the document.

JohnMcLear avatar Oct 03 '20 11:10 JohnMcLear

I think it has something to do with high fidelity of the document.

afifa-glowlogix avatar Oct 05 '20 06:10 afifa-glowlogix

Here is one of my documents: https://drive.google.com/file/d/1mAfZxHkR2ny5SMHVsAKF-fuY2A1RaT1u/view

afifa-glowlogix avatar Oct 05 '20 06:10 afifa-glowlogix

Did you try without plugins? Does it work on video.etherpad.com?

JohnMcLear avatar Oct 05 '20 07:10 JohnMcLear

Yeah. I've tried without the plugin as well, but got the same results, and video.etherpad.com is also generating the same result.

afifa-glowlogix avatar Oct 05 '20 09:10 afifa-glowlogix

https://video.etherpad.com/p/3HvCofvIJq1TsySXHaEv works...

JohnMcLear avatar Oct 05 '20 09:10 JohnMcLear

I think this is more "I want Etherpad to behave the same as Word/Docs", not "there is an actual problem". Etherpad formats content differently and behaves differently because it's entirely different software. Do you have a specific problem or???

JohnMcLear avatar Oct 05 '20 09:10 JohnMcLear

I'm seeing the document you gave us correctly imported with correct line listing. I'm also seeing Etherpad handle line numbers completely fine, by using 1.1 et al, not 1.a.

Please try to be coherent. Provide one specific example in one document and frame your question as that. See the new issue guidelines for some advice on how to create bug reports.

JohnMcLear avatar Oct 05 '20 09:10 JohnMcLear

[Two screenshots: "video etherpad 2" and "video etherpad"]

@JohnMcLear that's how it's showing up here. All the indentation is gone and the page number is displayed at the top. Plus, there is a sub-list under the Position heading in the original document.

Is there anything I'm missing? I may have missed something while setting up etherpad-lite on my system, but it's strange that I'm getting these results on video.etherpad.com.

afifa-glowlogix avatar Oct 05 '20 09:10 afifa-glowlogix

@JohnMcLear okay. I'm sorry I brought up a different document's format issue here. The link https://video.etherpad.com/p/3HvCofvIJq1TsySXHaEv is not displaying the document correctly for me.

afifa-glowlogix avatar Oct 05 '20 09:10 afifa-glowlogix

I'll take care of this as part of fixing https://github.com/ether/etherpad-lite/pull/4240. Hopefully it will be ready this week.

webzwo0i avatar Oct 05 '20 14:10 webzwo0i

The spaces issue should be fixed in the current develop branch. It would be great if you could test with the latest changes.

The indent issue is gone when using the XHTML converter of soffice, but we are not ready to switch converters yet. I'm not sure if this is a bug in LibreOffice or their intended behavior, so that needs further investigation. (I don't see any hint of the indentation with the standard HTML converter.)
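For anyone who wants to compare the two converters locally, here is a minimal sketch of how the two soffice invocations might be built. It is not Etherpad's actual import code: the helper names `build_convert_command` and `convert` are made up for illustration, and the filter string `html:XHTML Writer File:UTF8` is an assumption about how the XHTML export filter is selected on your LibreOffice version.

```python
import subprocess


def build_convert_command(source, outdir, use_xhtml=False):
    """Build a headless soffice command line (hypothetical helper).

    With use_xhtml=False the standard HTML export is requested; with
    use_xhtml=True the XHTML export filter is named explicitly.
    The filter string is an assumption, check it against your install.
    """
    target = "html:XHTML Writer File:UTF8" if use_xhtml else "html"
    return [
        "soffice",
        "--headless",
        "--convert-to", target,
        "--outdir", str(outdir),
        str(source),
    ]


def convert(source, outdir, use_xhtml=False):
    """Run the conversion; raises CalledProcessError if soffice fails."""
    subprocess.run(build_convert_command(source, outdir, use_xhtml), check=True)
```

Running `convert("document.docx", "out-html")` and `convert("document.docx", "out-xhtml", use_xhtml=True)` into separate directories and diffing the results should show whether the indentation survives in one output and not the other.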

The improper handling of nested lists in your document is a bug in LibreOffice's HTML converter. It creates a new OL for the level-2 nesting, but outside the OL of the first level. This means the a/b/c sub-list is at the same level as the outermost list; it just uses a/b/c instead of numbers. I'll look into LibreOffice's bug tracker and recent releases to find out if it's a known bug. If not, I don't think we can do anything. (This also needs more investigation to make sure I didn't make a mistake. My first impression is that we can't distinguish whether that list is nested or not.)
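To illustrate the nested-list problem described above, here is a small sketch that records the depth at which each `<ol>` opens. The two markup strings are hand-written examples of the two shapes, not actual LibreOffice output, and `ol_depths` is a made-up helper.

```python
from html.parser import HTMLParser


class OlDepthParser(HTMLParser):
    """Record the nesting depth at which each <ol> element opens."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.ol_depths = []

    def handle_starttag(self, tag, attrs):
        if tag == "ol":
            self.depth += 1
            self.ol_depths.append(self.depth)

    def handle_endtag(self, tag):
        if tag == "ol":
            self.depth -= 1


def ol_depths(html):
    """Return the depth of every <ol> in document order."""
    parser = OlDepthParser()
    parser.feed(html)
    return parser.ol_depths


# Hypothetical markup: a properly nested sub-list ...
NESTED = '<ol><li>Position<ol type="a"><li>Sub item</li></ol></li></ol>'
# ... versus a sub-list emitted as a sibling of the outer list.
FLAT = '<ol><li>Position</li></ol><ol type="a"><li>Sub item</li></ol>'

print(ol_depths(NESTED))  # [1, 2]
print(ol_depths(FLAT))    # [1, 1]
```

In the second shape both lists report depth 1, so nothing in the structure says the a/b/c list belongs under an item of the first list. The only difference left is the list type, which is not enough to tell a nested list from an unrelated one.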

So I'm sorry that two of the issues can't be solved easily, but we're getting more test coverage atm and hopefully this will ease the transition to the XHTML converter.

RE the printed page number, I'm going to fix this

webzwo0i avatar Dec 22 '20 20:12 webzwo0i

@afifa-glowlogix any feedback?

JohnMcLear avatar Jan 23 '21 14:01 JohnMcLear

@JohnMcLear Most of our users are non-technical and it was hard to make them understand this issue, so we ended up using Google Docs. Thanks for resolving the issue, though, @webzwo0i. I'll take some time to check it with our documents.

afifa-glowlogix avatar Jan 25 '21 09:01 afifa-glowlogix

I'm pushing this back a version as the majority of the support is in.

JohnMcLear avatar Feb 12 '21 11:02 JohnMcLear

Bump @webzwo0i

JohnMcLear avatar Mar 13 '21 14:03 JohnMcLear