TextParse.jl

Handle line breaks encapsulated in XML tags

Open andreasnoack opened this issue 7 years ago • 1 comment

This is admittedly a pretty exotic feature request, but I have some CSVs where the last column contains mixed text including XML, and the text within the XML tags can contain newline characters that shouldn't be interpreted as row breaks when parsing the file. Two such examples:

<PAGE_AUTHORS>&#xD;\n&#xD;\n&#xD;\n&#xD;\n&#xD;\nHACKETT;Ark. &#xC3;&#xA2;&#xC2;&#x80;&#xC2;&#x94; A sheriff;admin;About the Author</PAGE_AUTHORS>

and

<PAGE_AUTHORS>K G Rana;\nMax Planck Institute of Microstructure Physics;Weinberg 2;D-06120 Halle;Germany;\nMax Planck Institute for Chemical Physics of Solids;N&#xC3;&#xB6;thnitzer Str. 40;D-01187 Dresden;O Meshcheriakova;J K&#xC3;&#xBC;bler;\nInstitut f&#xC3;&#xBC;r Festk&#xC3;&#xB6;rperphysik;Technische Universit&#xC3;&#xA4;t Darmstadt;D-64289 Darmstadt;B Ernst;J Karel;R Hillebrand;E Pippel;P Werner;A K Nayak;C Felser;S S P Parkin</PAGE_AUTHORS>

The first example is taken from the file 20160810171500.gkg.csv in the GDELT2 dataset.
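
For illustration, the requested behaviour can be approximated by a preprocessing pass that glues physical lines back together while a `<PAGE_AUTHORS>` element is still open. This is only a minimal sketch in Python, not TextParse.jl code: the helper name `join_records`, the comma-delimited sample data, and the choice to replace each embedded newline with a space are all assumptions made for the example, and it assumes the `\n` in the samples above stands for a real newline character.

```python
def join_records(lines, tag="PAGE_AUTHORS"):
    """Merge physical lines into logical records.

    A line break that falls between <tag> and </tag> is treated as part
    of the field and replaced by a single space (an arbitrary choice).
    `join_records` is a hypothetical helper, not part of any library.
    """
    open_tag, close_tag = f"<{tag}>", f"</{tag}>"
    records, buffer, unclosed = [], [], 0
    for line in lines:
        buffer.append(line)
        # Count how many elements are still open after this physical line.
        unclosed += line.count(open_tag) - line.count(close_tag)
        if unclosed <= 0:
            records.append(" ".join(buffer))
            buffer, unclosed = [], 0
    if buffer:
        # Unterminated element at end of input: emit what we have.
        records.append(" ".join(buffer))
    return records


# Made-up two-row sample; only the tag name comes from the examples above.
raw = (
    "1,foo,<PAGE_AUTHORS>K G Rana;\n"
    "Max Planck Institute;Germany</PAGE_AUTHORS>\n"
    "2,bar,plain text"
)
records = join_records(raw.split("\n"))
for record in records:
    print(record)
```

The joined records can then be handed to an ordinary line-oriented parser. The sketch only counts literal tag strings, so it would not cope with attributes on the tag or with malformed markup.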

andreasnoack · Feb 23 '18 14:02

Are these columns surrounded by quotation marks? If not, we would have to add XML support to handle this, which doesn't seem like a good idea :) Or am I misunderstanding something here?

davidanthoff · Mar 17 '19 22:03
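
The quoting question matters because, under the usual CSV convention, a field wrapped in double quotes may contain line breaks without ending the row. The sketch below shows this with Python's standard `csv` module rather than TextParse.jl, and the sample data is invented for the example; it says nothing about how the GDELT2 files are actually quoted.

```python
import csv
import io

# Invented sample: the second row's last field is quoted and spans two lines.
data = (
    "id,authors\n"
    '1,"<PAGE_AUTHORS>K G Rana;\nMax Planck Institute</PAGE_AUTHORS>"\n'
    "2,plain text\n"
)

# newline="" leaves line endings untouched so the csv module can handle them.
rows = list(csv.reader(io.StringIO(data, newline="")))

for row in rows:
    print(row)
```

Here the reader yields three rows, and the embedded newline survives inside the quoted field. If the last column is not quoted, the parser has no generic signal that the line break belongs to the field, which is why the request would require XML awareness.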