unstructured
convert tail to text/emphasised tag and process all emphasised descendants
cf. the discussion at #2362
@stdweird, thanks for the contribution! Do you have an HTML doc handy that this PR fixes, which could be added to the unit tests?
@cragwolfe not sure if you want real data or not, but e.g.:
<html>
<body>
<div>a
<ul>b
<li>c1</li>d1
<li>c2</li>d2
</ul>e
</div>f<br>g
</body>
</html>
The main intent should be to keep the result as close as possible to the original text (e.g. `efg` after the list items), but right now retrieving all text is the higher priority (at least for me). I don't think even this PR does that (`g` is still lost, I think), but it is already an improvement.
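To make the expectation concrete, here is a minimal sketch of which fragments in the sample above count as element text and which as tail text (text that follows a closing tag, or a void tag such as `<br>`). It is not the code from this PR and does not use the library's parser: it relies only on the standard-library `html.parser`, and the names `TextTailCollector`, `collect_fragments`, and `VOID_TAGS` are hypothetical, chosen for illustration.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so any text right after them is tail text.
# (Illustrative subset, not an exhaustive list.)
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}

SAMPLE = """<html>
<body>
<div>a
<ul>b
<li>c1</li>d1
<li>c2</li>d2
</ul>e
</div>f<br>g
</body>
</html>"""


class TextTailCollector(HTMLParser):
    """Collect every non-blank text fragment, labelled 'text' or 'tail'."""

    def __init__(self):
        super().__init__()
        self.fragments = []
        self._in_tail = False

    def handle_starttag(self, tag, attrs):
        # After a normal opening tag we are inside the element (its text);
        # after a void tag we are already past it (its tail).
        self._in_tail = tag in VOID_TAGS

    def handle_endtag(self, tag):
        # Anything after a closing tag is the tail of the element just closed.
        self._in_tail = True

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            kind = "tail" if self._in_tail else "text"
            self.fragments.append((kind, stripped))


def collect_fragments(html):
    parser = TextTailCollector()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return parser.fragments


fragments = collect_fragments(SAMPLE)
all_text = [value for _, value in fragments]
tails = [value for kind, value in fragments if kind == "tail"]
print(all_text)
print(tails)
```

A unit test built on the sample could then check that none of the tail fragments (`d1`, `d2`, `e`, `f`, `g`) go missing from the partitioned output, which is the "retrieve all text" priority described above.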
@stdweird - Are you still working on this?