tika icon indicating copy to clipboard operation
tika copied to clipboard

TIKA-1735 - Adding DWGRead parser to Tika if available

Open monkmachine opened this issue 2 years ago • 23 comments

Thanks for your contribution to Apache Tika! Your help is appreciated!

Before opening the pull request, please verify that

  • there is an open issue on the Tika issue tracker which describes the problem or the improvement. We cannot accept pull requests without an issue because the change wouldn't be listed in the release notes.
  • the issue ID TIKA-1735
    • is referenced in the title of the pull request
    • and placed in front of your commit messages surrounded by square brackets ([TIKA-XXXX] Issue or pull request title)
  • commits are squashed into a single one (or few commits for larger changes)
  • Tika is successfully built and unit tests pass by running mvn clean test
  • there should be no conflicts when merging the pull request branch into the recent main branch. If there are conflicts, please try to rebase the pull request branch on top of a freshly pulled main branch
  • if you add new module that downstream users will depend upon add it to relevant group in tika-bom/pom.xml.

We will be able to faster integrate your pull request if these conditions are met. If you have any questions how to fix your problem or about using Tika in general, please sign up for the Tika mailing list. Thanks!

monkmachine avatar May 12 '22 06:05 monkmachine

@tballison @monkmachine

Or do you want to use our current parser only if the dwg executable is not available.

I would vote +1 on use current parser only if the dwg executable is not available.

nddipiazza avatar May 13 '22 14:05 nddipiazza

should we use TestContainers to test this within a docker container to make sure it works? or is it sufficient to just run test only if dwgread is installed?

nddipiazza avatar May 13 '22 14:05 nddipiazza

@tballison @monkmachine

Or do you want to use our current parser only if the dwg executable is not available.

I would vote +1 on use current parser only if the dwg executable is not available.

Yeah that's what I think is the best solution :) Thanks Nicolas for the initial code, hopefully I haven't butchered yours too much. I did manage to find we hold text in "value_text" and "text" fields. MTEXT fields also contain formatting which I will remove (I've done a quick stab at some of the main ones contain within the documents I have but will do them all before I finish the PR)

monkmachine avatar May 13 '22 14:05 monkmachine

should we use TestContainers to test this within a docker container to make sure it works? or is it sufficient to just run test only if dwgread is installed?

Would this be on the Apache hosted build server? If so would you construct the test as I wouldn't know the path to dwgread?

monkmachine avatar May 13 '22 15:05 monkmachine

hopefully I haven't butchered yours too much.

are you kidding me? you're awesome! all looks great to me.

Would this be on the Apache hosted build server?

yes - that's right.

on the tika pipes module we had simply used the TestContainers java test utility and it is pretty easy. You can see that code for an example

nddipiazza avatar May 13 '22 16:05 nddipiazza

are you kidding me? you're awesome! all looks great to me.

+1 Thank you!

tballison avatar May 13 '22 17:05 tballison

should we use TestContainers to test this within a docker container to make sure it works? or is it sufficient to just run test only if dwgread is installed?

Maybe? I worry about the build times and network resources to pull the images. But testing is critical. Can we may do @Nightly/@Weekly/@ShortlyBeforeTheNextRelease or something?

If we go down this route, we should consider doing the same for tesseract.

tballison avatar May 13 '22 17:05 tballison

@nddipiazza @tballison This looks messy, can you advise a way to clean it up? A better way of doing it? Still think its worth having the comments there? https://github.com/apache/tika/blob/28ceff420948ede5e18ba10a4f1d6d1751f30b3b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-cad-module/src/main/java/org/apache/tika/parser/dwg/DWGReadParser.java#L272-L318

monkmachine avatar May 13 '22 21:05 monkmachine

@nddipiazza @tballison This looks messy, can you advise a way to clean it up? A better way of doing it? Still think its worth having the comments there?

OMG, what a mess. The output, not you.

What I've done before is a regex pattern+matcher that captures the escape sequence first OR then the controls ~/(\\)|(\[A-Z0-9]{1,5})/, capture group(2) (and skip it), append group 1 to tail.

That's a rough answer and probably wrong, but see what you can do.

The braces...hmmmm... Maybe take a second pass and do the same thing? You can't just add this in the OR ~/{[^}]{0,50}}/ because that'll not correctly process escaped } within the brackets.

tballison avatar May 13 '22 21:05 tballison

@nddipiazza @tballison This looks messy, can you advise a way to clean it up? A better way of doing it? Still think its worth having the comments there?

OMG, what a mess. The output, not you.

What I've done before is a regex pattern+matcher that captures the escape sequence first OR then the controls ~/(\)|([A-Z0-9]{1,5})/, capture group(2) (and skip it), append group 1 to tail.

That's a rough answer and probably wrong, but see what you can do.

The braces...hmmmm... Maybe take a second pass and do the same thing? You can't just add this in the OR ~/{[^}]{0,50}}/ because that'll not correctly process escaped } within the brackets.

I threw together a somewhat working example. I think there are still some things I'm missing: https://github.com/tballison/tika-addons/blob/main/DWGReadDev/src/test/java/TestRegexCleaners.java

Obv, we'll want to make the patterns static, etc.

tballison avatar May 16 '22 16:05 tballison

@nddipiazza @tballison This looks messy, can you advise a way to clean it up? A better way of doing it? Still think its worth having the comments there?

OMG, what a mess. The output, not you. What I've done before is a regex pattern+matcher that captures the escape sequence first OR then the controls ~/()|([A-Z0-9]{1,5})/, capture group(2) (and skip it), append group 1 to tail. That's a rough answer and probably wrong, but see what you can do. The braces...hmmmm... Maybe take a second pass and do the same thing? You can't just add this in the OR ~/{[^}]{0,50}}/ because that'll not correctly process escaped } within the brackets.

I threw together a somewhat working example. I think there are still some things I'm missing: https://github.com/tballison/tika-addons/blob/main/DWGReadDev/src/test/java/TestRegexCleaners.java

Obv, we'll want to make the patterns static, etc.

Will take a look @tballison , thanks for your help. I've been cleaning up the code to match the checkstyle (which I've only learnt about today) and testing my janky regexes (in the current form) against some documents I have. Like I said I managed to build Tika Server and check the config was working correctly so been a successful few hours today :) Will take a look at your example tomorrow and hopefully at some point this week find some time to check the stop method on the other pull request. We can then look to create a guide/script on how to install Tika Server as a windows service using Daemon.

monkmachine avatar May 16 '22 18:05 monkmachine

Help! @tballison @nddipiazza Any reason why this section would sometimes write extra lines out? On some json files when cleaning up it writes out the file correctly then appends another section which makes the json invalid? I'm getting extra lines in some (very few) files which I can't share unfortunately. I can't for the life of me work out why.... I've even tried removing the replaceAll

https://github.com/apache/tika/blob/a3e2be9b69de6a0550f1211aa9ff8df96c60f767/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-cad-module/src/main/java/org/apache/tika/parser/dwg/DWGReadParser.java#L112-L122

monkmachine avatar May 18 '22 16:05 monkmachine

If I read byte by byte (i.e. byte[] bytes = new byte[1];) I get the correct result: image

If I read with anything other than byte by byte I get added bytes/strings from some other part of the file: image

It's only doing it on this one json file, I can reproduce it every time on this one.

Confused.com......

    public void jsonConvert() throws FileNotFoundException, IOException {



    	 try (FileInputStream fis = new FileInputStream("c:\\temp1\\dwgreadout.json");
                 FileOutputStream fos = new FileOutputStream("c:\\temp1\\dwgreadoutClean.json")) {
             byte[] bytes = new byte[1000];
             while (fis.read(bytes) != -1) {
                 byte[] fixedBytes = new String(bytes, StandardCharsets.UTF_8)
                 		
                         //.replaceAll(dwgc.getCleanDwgReadRegexToReplace(), dwgc.getCleanDwgReadReplaceWith())
                         //.replaceAll(" nan ", " 0 ")
                         //.replaceAll(" nan,", " 0,")
                         .getBytes(StandardCharsets.UTF_8);
                 String st = new String(fixedBytes, StandardCharsets.UTF_8);
                 fos.write(fixedBytes, 0, fixedBytes.length);
                 
                 
             }
         } 
    }

monkmachine avatar May 18 '22 18:05 monkmachine

If I use buffer reader I get the correct output but it's slower: 3s vs 10s (it's quite a large file)




    	    //FileInputStream fis = new FileInputStream("c:\\temp1\\dwgreadout.json");
            FileOutputStream fos = new FileOutputStream("c:\\temp1\\dwgreadoutClean.json");
            try (BufferedReader br = new BufferedReader(new FileReader("c:\\temp1\\dwgreadout.json"))) 
            {

                String sCurrentLine;
                while ((sCurrentLine = br.readLine()) != null) 
                {
                	sCurrentLine = sCurrentLine
                	    .replaceAll(" nan ", " 0 ")
                            .replaceAll(" nan,", " 0,") +"\n";
                	fos.write(sCurrentLine.getBytes(), 0, sCurrentLine.getBytes().length);
                }
            
                 //fos.write(fixedBytes, 0, fixedBytes.length);
                 
                 
             }}

monkmachine avatar May 18 '22 18:05 monkmachine

If I use buffer reader I get the correct output but it's slower: 3s vs 10s (it's quite a large file)




    	    //FileInputStream fis = new FileInputStream("c:\\temp1\\dwgreadout.json");
            FileOutputStream fos = new FileOutputStream("c:\\temp1\\dwgreadoutClean.json");
            try (BufferedReader br = new BufferedReader(new FileReader("c:\\temp1\\dwgreadout.json"))) 
            {

                String sCurrentLine;
                while ((sCurrentLine = br.readLine()) != null) 
                {
                	sCurrentLine = sCurrentLine
                	    .replaceAll(" nan ", " 0 ")
                            .replaceAll(" nan,", " 0,") +"\n";
                	fos.write(sCurrentLine.getBytes(), 0, sCurrentLine.getBytes().length);
                }
            
                 //fos.write(fixedBytes, 0, fixedBytes.length);
                 
                 
             }}

I agree that it's slower, but it is more likely to be correct. I don't understand the problem as thoroughly as you do, but running regexes against a part of a file at a time is guaranteed to fail depending on where the breaks in the parts are.

tballison avatar May 18 '22 19:05 tballison

Can you guarantee that reading per line will be ok on this json-disaster? If so, that's the way to go.

The other thing is that you'll want to specify the encoding on your reader. It is much better to work with the String than the bytes.

tballison avatar May 18 '22 19:05 tballison

Can you tell if they're writing utf8? Are there any ascii accented data items or non-ascii characters that you can use to figure out what they're default encoding is?

tballison avatar May 18 '22 19:05 tballison

If NaN is the only problem, is there any way to tell jackson to be lax? Maybe something like: https://stackoverflow.com/questions/63516411/convert-nan-value-into-null-when-parsing-json-using-jackson-library

tballison avatar May 18 '22 19:05 tballison

No, that probably won't work. Sorry. If you send me some examples, I can try some things.

tballison avatar May 18 '22 19:05 tballison

No, that probably won't work. Sorry. If you send me some examples, I can try some things.

Yeah we'd be ok if Jackson allowed "nan" as well as "NaN" as we could use JsonReadFeature.ALLOW_NON_NUMERIC_NUMBERS.

monkmachine avatar May 18 '22 20:05 monkmachine

Can you tell if they're writing utf8? Are there any ascii accented data items or non-ascii characters that you can use to figure out what they're default encoding is?

If you can have a scan of the library to work it out? https://github.com/LibreDWG/libredwg/tree/master/src

monkmachine avatar May 18 '22 20:05 monkmachine

Can you guarantee that reading per line will be ok on this json-disaster? If so, that's the way to go.

The other thing is that you'll want to specify the encoding on your reader. It is much better to work with the String than the bytes.

I've got to test it tomorrow with my files but looks to be promising.

monkmachine avatar May 18 '22 20:05 monkmachine

No, that probably won't work. Sorry. If you send me some examples, I can try some things.

Can't send you examples unfortunately :( I did manage to process over 1000's files today with it and compared to a .net libary (CAD .Net) it does look better. I also managed to find a few more formatting's that are output that weren't on the two websites.

monkmachine avatar May 18 '22 20:05 monkmachine