How to parse RTF files?

Open ZedZipDev opened this issue 4 years ago • 10 comments

The RTF parser was removed, and I cannot find a way to parse RTF files. Is it still possible?

ZedZipDev avatar Nov 12 '21 11:11 ZedZipDev

Yes, it was removed from the .NET Core version because I couldn't find a good RTF parsing library for .NET Core. Give me some time to find a new one.

tonyqus avatar Nov 13 '21 00:11 tonyqus

Yes, I understand. The RTF format is a "wild" one ;-). By the way, I have used some code in my SQLCLR function to parse RTF: RTF in, plain text out. It works fine now, and I can share it.

ZedZipDev avatar Nov 13 '21 08:11 ZedZipDev

I'm reviewing RockNHawk's RTFTextParser code: https://github.com/RockNHawk/Toxy.NetCore/blob/netcore/ToxyFramework/Parsers/RTFTextParser.cs

Do you think this ToHtml method can meet your needs?

tonyqus avatar Nov 14 '21 21:11 tonyqus

It is probably what I need, but I created a small test app and tried to parse the RTF files from your \testdata folder.

        // Test app
        static void TestParseRTFFromSample1()
        {
            // Alternatives: "htmlrtf1.rtf", "Simple text.rtf"
            string path = HelperClass.GetRTFPath("Blank.rtf");
            var parser = new RTFTextParser(new ParserContext(path));
            string result = parser.Parse(); // <-- the error is thrown here
            Console.WriteLine("Result:{0}", result);
        }

        // RTFTextParser.Parse() in RTFTextParser.cs
        public override string Parse()
        {
            using (var fs = new FileStream(Context.Path, FileMode.Open))
            {
                var html = Rtf.ToHtml(fs); // <-- fails here
                return html;
            }
        }

The exception text:

System.TypeInitializationException
  HResult=0x80131534
  Message=The type initializer for 'RtfPipe.TextEncoding' threw an exception.
  Source=RtfPipe
  StackTrace:
   at RtfPipe.TextEncoding.get_RtfDefault()
   at RtfPipe.RtfStreamReader..ctor(Stream stream, Int32 bufferSize)
   at RtfPipe.RtfStreamReader..ctor(Stream stream)
   at RtfPipe.RtfSource.op_Implicit(Stream value)
   at Toxy.Parsers.RTFTextParser.Parse() in D:\MyProjects3\NET\Toxy.NetCore-netcore\ToxyFramework\Parsers\RTFTextParser.cs:line 20
   at ConsoleApp1.Program.TestParseRTFFromSample1() in D:\MyProjects3\NET\Toxy.NetCore-netcore\ConsoleApp1\Program.cs:line 21
   at ConsoleApp1.Program.Main(String[] args) in D:\MyProjects3\NET\Toxy.NetCore-netcore\ConsoleApp1\Program.cs:line 13

  This exception was originally thrown at this call stack:
    [External Code]

Inner Exception 1:
ArgumentException: 'Windows-1252' is not a supported encoding name. For information on defining a custom encoding, see the documentation for the Encoding.RegisterProvider method. (Parameter 'name')


ZedZipDev avatar Nov 15 '21 08:11 ZedZipDev
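
The inner exception above is what .NET Core reports when a legacy code page such as Windows-1252 is requested before a code-page provider has been registered. The thread does not show the exact fix that was applied, so the following is only a minimal sketch of a caller-side workaround, assuming the standard `CodePagesEncodingProvider` class. The names `EncodingSetup` and `GetWindows1252` are hypothetical, and on older .NET Core versions the `System.Text.Encoding.CodePages` NuGet package may need to be referenced.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of Toxy or RtfPipe.
static class EncodingSetup
{
    public static Encoding GetWindows1252()
    {
        // Make legacy code pages (including Windows-1252) available
        // to Encoding.GetEncoding on .NET Core.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Without the registration above, this lookup throws the
        // ArgumentException shown in the inner exception.
        return Encoding.GetEncoding("Windows-1252");
    }
}
```

Because the failure occurs inside a type initializer, the `Encoding.RegisterProvider` call presumably has to run before the first call into RtfPipe, for example at the top of `Main`.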

OK, that can be fixed either in the caller code or in the RtfPipe NuGet package, and then it works. But my idea is to have something that works like an IFilter and extracts plain text. For example, if I see Hello World in Word, I'd like to receive the text Hello World, but instead I receive something like this: <div style="font-size:12pt;font-family:&quot;Times New Roman&quot;, serif;"><p style="text-align:justify;font-size:10.5pt;margin:0;"><br>Hello World</p></div>

SQL Server full-text search (FTS) will then index all of these markup words, which is not correct.

ZedZipDev avatar Nov 15 '21 08:11 ZedZipDev

Your requirement makes sense. I also have concerns about RtfPipe. The extracted HTML is not what most users need.

tonyqus avatar Nov 15 '21 12:11 tonyqus
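
To illustrate the gap between the HTML output and the plain text wanted for indexing, here is a rough sketch of a post-processing step that strips tags from an HTML string using only base class library calls. It is not something proposed in this thread, `HtmlText.ToPlainText` is a hypothetical name, and regex-based tag stripping is an assumption that only holds for simple markup like the example quoted earlier (it does not handle scripts, style blocks, or comments).

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Toxy, RtfPipe, or BracketPipe.
static class HtmlText
{
    public static string ToPlainText(string html)
    {
        // Turn <br> and closing block-level tags into line breaks first,
        // so that paragraphs do not run together.
        string text = Regex.Replace(
            html,
            @"<br\s*/?>|</(p|div|li|tr|h[1-6])\s*>",
            "\n",
            RegexOptions.IgnoreCase);

        // Drop every remaining tag, including attributes such as style="...".
        text = Regex.Replace(text, @"<[^>]+>", string.Empty);

        // Decode entities such as &quot; and &amp;.
        text = WebUtility.HtmlDecode(text);

        // Collapse long runs of blank lines and trim the ends.
        text = Regex.Replace(text, @"\n{3,}", "\n\n");
        return text.Trim();
    }
}
```

A possible use would be `HtmlText.ToPlainText(html)` on the string returned by the parser, so that full-text indexing only sees the visible words. A real HTML parser would be a safer choice for anything beyond simple documents.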

By the way, I have tested this recommendation and it works fine:

https://stackoverflow.com/questions/46119392/how-do-i-convert-an-rtf-string-to-a-markdown-string-and-back-c-net-core-or/54755138#54755138

It really does extract plain text from an RTF file.

ZedZipDev avatar Nov 15 '21 12:11 ZedZipDev

Can you add this (BracketPipe-like, see the previous link) implementation to your framework?

ZedZipDev avatar Nov 19 '21 07:11 ZedZipDev

I still have some concerns about this method. It converts RTF to HTML and then converts the HTML to Markdown, but the result is still not plain text.

Instead, I'm investigating whether this post will work.

tonyqus avatar Nov 20 '21 06:11 tonyqus

English extraction works, but something is wrong with the extraction of Far East characters (e.g. Chinese).

tonyqus avatar Nov 20 '21 07:11 tonyqus