
Error: Spaces make a new line

Open AlanReviews opened this issue 4 years ago • 13 comments

For some documents whose first row contains text with spaces, the CSV file contains a line break where the space should be. For example, if the first row has the cells a, "Hello world", and b, and I use Tabula to capture that row, the CSV file comes out as:

a,"Hello
world",b

I expect the CSV file to be:

a, "Hello world", b
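For anyone hitting the same thing: a line break inside a quoted field is still valid CSV, so a CSV-aware parser reads the output above as one row with three cells. A minimal Python sketch of a post-processing step, assuming the exported text fits in memory (the function name `flatten_cell_newlines` is made up for illustration and is not part of Tabula):

```python
import csv
import io
import re


def flatten_cell_newlines(csv_text: str) -> str:
    """Return csv_text with line breaks inside cells replaced by one space."""
    # newline="" leaves line endings untouched so the csv module can
    # handle quoted fields that span several physical lines.
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        # Collapse each run of CR/LF characters in a cell to a single space.
        writer.writerow([re.sub(r"[\r\n]+", " ", cell) for cell in row])
    return out.getvalue()


if __name__ == "__main__":
    broken = 'a,"Hello\nworld",b\n'
    print(flatten_cell_newlines(broken), end="")
```

The writer only quotes cells that need it, so the cleaned row may come out without quotes around Hello world; the cell contents are the same either way.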

AlanReviews avatar Nov 05 '20 20:11 AlanReviews

Hi. I can't raise an issue, so I'm posting here.

I've run into several issues parsing lattice and stream tables. Is it possible to add parameters such as a row or column tolerance, like in Camelot? Thanks.

jonardcaguioa avatar Jun 07 '21 12:06 jonardcaguioa

Hi, I can't raise an issue, so I am posting here.

Tabula opens to Keycloak (I don't even know what that is).

image

nharrisanalyst avatar Aug 14 '21 00:08 nharrisanalyst

Whatever Keycloak is, it's running on port 8080, blocking Tabula from using that port. Quit Keycloak, then restart Tabula, then Tabula should work.



jeremybmerrill avatar Aug 14 '21 00:08 jeremybmerrill

Hi guys, sorry for posting this here but there is nowhere else to do it. Maybe you should enable discussions on this repo, so that tabula users can help each other.

My post here is about line breaks inside table cells, like this (this is a screenshot of Tabula output): image

Tabula keeps those line breaks in the CSV, but that creates chaos when importing into Excel: every line break starts a new row, which splits the quoted text strings, etc.

image

Maybe there is a use case to keep those line breaks inside the cell but a "remove line breaks inside cells" flag would be a great feature.

Meanwhile, for my fellow users who are battling this problem: after much searching I've found a solution that works in both Notepad++ and Sublime. It is a regex find and replace (taken from a very helpful post on StackOverflow):

Use Notepad++ regex Find-and-Replace:

Find what:

(,"[^"]*?)[\r\n]+

Replace with:

$1

(There is a single space after $1.)

Repeatedly click "Replace All" until no more matches are found.

This works.
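For those who would rather script it than click "Replace All" by hand, here is a rough Python equivalent of the same find-and-replace. It is only a sketch: the function name `join_broken_cells` is made up, and it inherits the limits of the pattern, which only looks at quoted cells preceded by a comma (so the first cell in a row is never touched).

```python
import re

# Same pattern as the Notepad++ recipe: a comma, an opening quote, then
# non-quote characters up to a run of line-break characters.
BREAK_IN_QUOTED_CELL = re.compile(r'(,"[^"]*?)[\r\n]+')


def join_broken_cells(text: str) -> str:
    """Apply the replacement repeatedly until nothing matches any more."""
    while True:
        # r"\1 " is the Python spelling of "$1" followed by a single space.
        text, count = BREAK_IN_QUOTED_CELL.subn(r"\1 ", text)
        if count == 0:
            return text


if __name__ == "__main__":
    print(join_broken_cells('a,"Hello\nworld",b\n'), end="")
```

The loop plays the role of the repeated clicking: each pass removes at most one line break per quoted cell, so a cell with several breaks needs several passes.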

dakial avatar Sep 03 '21 12:09 dakial

My post here is about line breaks inside table cells

In Excel they are entered as <Ctrl><CR>.

flywire avatar Sep 03 '21 12:09 flywire

Hey! I can't raise an issue, so I'm writing here. Could you please tell me what to do about a "Java Heap Error" when my PDF is only 50 KB and has only 1 page?

I've already tried reinstalling both Java and Tabula multiple times, restarted the whole computer, cleared the "Local/Temp" folder, cleared ports and changed 8080 to 9999, but nothing changes. OS: Windows 10

The PDF is 100% valid: I opened it this morning without any trouble, but then it suddenly stopped working. Since then Tabula returns "Sorry, your file upload could not be processed. Please double-check that the file you uploaded is a valid PDF file and try again", and I cannot see any of my previous files; it just shows "First time using Tabula? Welcome!"

I will be very grateful for any help 🙏

The error is the following:

2022-01-08 17:33:25.054:INFO:oejsh.ContextHandler:main: Started o.e.j.w.WebAppContext@47089e5f{/,file:/C:/Users/User/AppData/Local/Temp/jetty-0.0.0.0-8080-tabula.jar-_-any-6136871881336378357.dir/webapp/,AVAILABLE}{file:/H:/Course/tabula-win-1.2.1/tabula/tabula.jar}
2022-01-08 17:33:25.056:WARN:oejsh.RequestLogHandler:main: !RequestLog
2022-01-08 17:33:25.068:INFO:oejs.ServerConnector:main: Started ServerConnector@2bb0e277{HTTP/1.1}{0.0.0.0:8080}
2022-01-08 17:33:25.069:INFO:oejs.Server:main: Started @8398ms
2022-01-08 17:33:35.840:WARN:oejs.ServletHandler:qtp1706234378-18: Error for /documents
java.lang.OutOfMemoryError: Java heap space
    at org.jruby.util.ByteList.ensure(ByteList.java:341)
    at org.jruby.util.io.EncodingUtils$1.resize(EncodingUtils.java:1197)
    at org.jruby.util.io.EncodingUtils.moreOutputBuffer(EncodingUtils.java:1513)
    at org.jruby.util.io.EncodingUtils.transcodeLoop(EncodingUtils.java:1415)
    at org.jruby.util.io.EncodingUtils.transcodeLoop(EncodingUtils.java:1312)
    at org.jruby.util.io.EncodingUtils.strTranscode0(EncodingUtils.java:936)
    at org.jruby.util.io.EncodingUtils.strTranscode(EncodingUtils.java:857)
    at org.jruby.util.io.EncodingUtils.strEncode(EncodingUtils.java:829)
    at org.jruby.RubyString.encode(RubyString.java:5398)
    at json.ext.Parser.convertEncoding(Parser.java:196)
    at json.ext.Parser.initialize(Parser.java:175)
    at json.ext.Parser$INVOKER$i$0$1$initialize.call(Parser$INVOKER$i$0$1$initialize.gen)
    at org.jruby.internal.runtime.methods.JavaMethod$JavaMethodN.call(JavaMethod.java:725)
    at org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:208)
    at org.jruby.internal.runtime.methods.JavaMethod$JavaMethodN.call(JavaMethod.java:741)
    at org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:278)
    at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:79)
    at org.jruby.RubyObject.callInit(RubyObject.java:348)
    at json.ext.Parser.newInstance(Parser.java:151)
    at json.ext.Parser$INVOKER$s$0$1$newInstance.call(Parser$INVOKER$s$0$1$newInstance.gen)
    at org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:212)
    at org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:208)
    at org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:338)
    at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:183)
    at org.jruby.ir.interpreter.InterpreterEngine.processCall(InterpreterEngine.java:324)
    at org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:74)
    at org.jruby.ir.interpreter.InterpreterEngine.interpret(InterpreterEngine.java:84)
    at org.jruby.internal.runtime.methods.MixedModeIRMethod.INTERPRET_METHOD(MixedModeIRMethod.java:179)
    at org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:165)
    at org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:200)
    at org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:318)
    at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:155)

undine-su-menulio avatar Jan 08 '22 15:01 undine-su-menulio


Howdy. I don't have any affiliation with this project. I just follow it on github. Can you share your pdf? I can try it out on my computer and see if I can figure out what is going on.

john-harrold avatar Jan 08 '22 15:01 john-harrold


Thank you! Here

undine-su-menulio avatar Jan 08 '22 15:01 undine-su-menulio


So I was able to extract the tables. Here is the result: tabula-un1.csv

I'm running on a Mac. It's weird, but I cannot figure out which version of Java I have installed. Running

java --version

says it cannot find it, but it's got to be installed somewhere because Tabula is running :). I don't want to muck around with Java and end up breaking it. Hopefully this extraction works for you.

john-harrold avatar Jan 08 '22 16:01 john-harrold


Thank you so much for your help! Wish you all the best things in the world and all the blessings! ☀️ Hope one day I will figure out what gets in the way in my case

undine-su-menulio avatar Jan 08 '22 16:01 undine-su-menulio

Great job, guys! Thanks so much!

Now... I'm not sure how to raise an issue because some links are broken...

Importing a table that contains '(' brackets causes the software to omit the entire line. Screenshot 2022-01-22 at 13 26 18 Screenshot 2022-01-22 at 13 26 37

ginozzzz avatar Jan 22 '22 06:01 ginozzzz

Great tool, and thank you for your great work! I just wanted to post an issue that I encountered: I'm using tabula-win-1.2.1 on Windows 11 and I'm exporting tables from PDF files that contain only tables (same columns and rows all the way down). Each PDF file contains 90-100 pages. The problem is that it doesn't export more than 3300 rows. I tested it with 9 different files and the result is the same: no more than 3300 rows. Thank you again for your work!

ictalbaniaco avatar Aug 31 '22 17:08 ictalbaniaco

Great tool! We use it to parse PDF files that appear to be the same format to us humans, but drive Tabula nuts in either stream or lattice mode. Culprits include formatting characters used by the PDF creators (different people, I assume). We find spaces, tabs and other unprintable characters in the CSV output. The files themselves always present as header, trailer and intermediate details, with page headers and trailers in the details.

May I suggest that since we humans recognize the format, allowing us to specify vertical columns (groups) and horizontal rows (elements) might remove reliance on recognizing said groups and elements. Were I a seasoned open-source developer who had time on his hands (I'm neither), I'd look at the code and see where / how this might work.

OldGuy1949 avatar Nov 19 '23 18:11 OldGuy1949