toxy
How to parse RTF files?
The RTF parser was removed and I cannot find a way to parse RTF files. Is it still possible?
Yes, it was removed from the .NET Core version because I could not find a good library for parsing RTF on .NET Core. Give me some time to find a new RTF library for .NET Core.
Yes, I understand. The RTF format is "wild" ;-). By the way, I have used some code in my SQLCLR function to parse RTF: RTF in, pure text out. It now works fine, and I can share it.
I'm reviewing RockNHawk's RTFTextParser code: https://github.com/RockNHawk/Toxy.NetCore/blob/netcore/ToxyFramework/Parsers/RTFTextParser.cs
Do you think this ToHtml method can meet your needs?
It is probably what I need, but I created a small test app and tried to parse the RTF files from your \testdata folder, and it fails:
static void TestParseRTFFromSample1()
{
    string path = HelperClass.GetRTFPath("Blank.rtf"); // ("htmlrtf1.rtf"); // ("Simple text.rtf");
    var parser = new RTFTextParser(new ParserContext(path));
    string result = parser.Parse(); // <-- error is here
    Console.WriteLine("Result:{0}", result);
}

public override string Parse()
{
    using (var fs = new FileStream(Context.Path, FileMode.Open))
    {
        var html = Rtf.ToHtml(fs); // <-- exception is thrown from this call
        return html;
    }
}
The exception text:
System.TypeInitializationException
HResult=0x80131534
Message=The type initializer for 'RtfPipe.TextEncoding' threw an exception.
Source=RtfPipe
StackTrace:
at RtfPipe.TextEncoding.get_RtfDefault()
at RtfPipe.RtfStreamReader..ctor(Stream stream, Int32 bufferSize)
at RtfPipe.RtfStreamReader..ctor(Stream stream)
at RtfPipe.RtfSource.op_Implicit(Stream value)
at Toxy.Parsers.RTFTextParser.Parse() in D:\MyProjects3\NET\Toxy.NetCore-netcore\ToxyFramework\Parsers\RTFTextParser.cs:line 20
at ConsoleApp1.Program.TestParseRTFFromSample1() in D:\MyProjects3\NET\Toxy.NetCore-netcore\ConsoleApp1\Program.cs:line 21
at ConsoleApp1.Program.Main(String[] args) in D:\MyProjects3\NET\Toxy.NetCore-netcore\ConsoleApp1\Program.cs:line 13
This exception was originally thrown at this call stack:
[External Code]
Inner Exception 1:
ArgumentException: 'Windows-1252' is not a supported encoding name. For information on defining a custom encoding, see the documentation for the Encoding.RegisterProvider method. (Parameter 'name')
OK, this can be fixed either in the caller's code or in the RtfPipe NuGet package.
With that fix, it works.
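For anyone hitting the same exception, here is a minimal sketch of the caller-side fix, following the hint in the exception message about Encoding.RegisterProvider. The class and method names (RtfEncodingSetup, EnsureWindows1252) are hypothetical and not part of Toxy or RtfPipe. It also assumes that CodePagesEncodingProvider is available, which on older .NET Core versions may require the System.Text.Encoding.CodePages NuGet package.

```csharp
using System;
using System.Text;

public static class RtfEncodingSetup
{
    // Hypothetical helper: call once at startup, before any RTF parsing.
    // Registers the code-page encodings (including Windows-1252) that
    // .NET Core does not expose by default.
    public static Encoding EnsureWindows1252()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Once the provider is registered, this lookup no longer throws
        // ArgumentException.
        return Encoding.GetEncoding("Windows-1252");
    }
}
```

Calling RtfEncodingSetup.EnsureWindows1252() at the start of Main, before constructing RTFTextParser, should be enough for the test app above. Whether the registration belongs in caller code or inside the library is a design choice.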
But my idea is to have something that works like an IFilter, i.e. extracts pure text. For example, if I see
Hello World
in Word, I would like to receive the text
Hello World
but instead I receive something like this:
<div style="font-size:12pt;font-family:"Times New Roman", serif;"><p style="text-align:justify;font-size:10.5pt;margin:0;"><br>Hello World</p></div>
SQL Server full-text search (FTS) will then index all of the markup words as well, which is not correct.
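As a stopgap, the HTML produced by ToHtml could be reduced to plain text before indexing. The sketch below is a rough illustration only: HtmlTextSketch and ToPlainText are hypothetical names, and regex-based tag stripping is an assumption that suits simple markup like the sample above, not arbitrary HTML (it ignores script/style content and comments, for instance).

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlTextSketch
{
    // Hypothetical helper: naive HTML-to-text reduction for simple markup.
    public static string ToPlainText(string html)
    {
        // Turn <br> and closing </p>, </div> tags into line breaks
        // so that paragraphs do not run together.
        string withBreaks = Regex.Replace(
            html, @"<(br|/p|/div)\s*/?>", "\n", RegexOptions.IgnoreCase);

        // Drop every remaining tag.
        string noTags = Regex.Replace(withBreaks, @"<[^>]+>", string.Empty);

        // Decode entities such as &amp; and &nbsp;, then trim whitespace.
        return WebUtility.HtmlDecode(noTags).Trim();
    }
}
```

For the sample output above, ToPlainText should return just "Hello World". A dedicated RTF-to-text extractor would still be the cleaner solution, since this approach depends on the shape of the intermediate HTML.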
Your requirement makes sense. I also have concerns about RtfPipe. The extracted HTML is not what most users need.
By the way, I have tested this recommendation and it works fine:
https://stackoverflow.com/questions/46119392/how-do-i-convert-an-rtf-string-to-a-markdown-string-and-back-c-net-core-or/54755138#54755138
It really does extract pure text from an RTF file.
Can you add this implementation (BracketPipe-like, see the previous link) to your framework?
I still have some concerns about this method. It converts RTF to HTML and then converts the HTML to Markdown, but the result is still not plain text.
Instead, I'm investigating whether this post will work.
English extraction works, but something is wrong with the extraction of Far East characters (e.g. Chinese).
