Alroy
This is how I am using trafilatura (1.6.2):

```
web_content = "".join(
    extract(
        original_html,
        include_formatting=True,
        include_tables=True,
        include_comments=False,
        include_links=False,
        output_format="xml",
    )
)  # type: ignore
```

It is skipping all the...
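A side note on the snippet above: `extract` returns `None` when it finds no usable content, and `"".join(None)` then raises a `TypeError` that the `# type: ignore` comment hides from the type checker. A minimal sketch of a guard follows. `extract_or_empty`, `always_none`, and `echo` are hypothetical names used only for illustration, and the stand-in functions replace `trafilatura.extract` so the example runs without the library.

```python
from typing import Callable, Optional


def extract_or_empty(
    extractor: Callable[..., Optional[str]], html: str, **options
) -> str:
    """Call an extract-like function and map a None result to an empty string."""
    result = extractor(html, **options)
    return result if result is not None else ""


# Hypothetical stand-ins for an extract-like function (not trafilatura itself).
def always_none(html: str, **options) -> Optional[str]:
    """Simulates an extraction that finds nothing."""
    return None


def echo(html: str, **options) -> Optional[str]:
    """Simulates a successful extraction by returning the input unchanged."""
    return html


print(repr(extract_or_empty(always_none, "<html></html>", output_format="xml")))
print(repr(extract_or_empty(echo, "<p>hi</p>")))
```

With the real library, the same guard would be applied to the return value of `extract(...)` before any further string handling.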
- https://kickstarter.mycaptain.in/privacy-policy?_gl=1%2a1q217ks%2a_ga%2aMTE2NDUzMDMyOC4xNjk2MTY0MDU2%2a_ga_98XLX61WLZ%2aMTY5NjE2NDA1NS4xLjEuMTY5NjE2NDA1NS42MC4wLjA
- https://www.shopify.com/legal/privacy

I am using trafilatura (1.6.2) like this:

```
web_content = "".join(
    extract(
        original_html,
        include_formatting=True,
        include_tables=True,
        include_comments=False,
        include_links=False,
        output_format="xml",
        favor_recall=True,
    )
)  # type: ignore
```
This is how I am using trafilatura (1.6.2):

```
web_content = "".join(
    extract(
        original_html,
        include_formatting=True,
        include_tables=True,
        include_comments=False,
        include_links=False,
        output_format="xml",
        favor_recall=True,
    )
)  # type: ignore
```

It is skipping the...
Site: https://stackoverflow.com/legal/privacy-policy#:~:text=We%20will%20only%20process%20your,be%20shared%20with%20other%20parties.

One of the tables has a list inside a cell, and that content is missing from the output. This is what trafilatura generated: ` Marketing our services...
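To narrow a report like this down, it can help to list the text of every list item nested in a table cell and check which ones are absent from the extracted output. Below is a rough sketch using only the standard library. `CellListCollector`, `missing_items`, and both sample strings are made up for illustration; they are not taken from the Stack Overflow page or from real trafilatura output.

```python
from html.parser import HTMLParser
from typing import List


class CellListCollector(HTMLParser):
    """Collect the text of each top-level <li> found inside a <td> or <th>."""

    def __init__(self) -> None:
        super().__init__()
        self.cell_depth = 0  # number of table cells currently open
        self.li_depth = 0    # number of <li> elements currently open in a cell
        self.buffer: List[str] = []
        self.items: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self.cell_depth += 1
        elif tag == "li" and self.cell_depth > 0:
            if self.li_depth == 0:
                self.buffer = []  # start a fresh item
            self.li_depth += 1

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.cell_depth > 0:
            self.cell_depth -= 1
        elif tag == "li" and self.li_depth > 0:
            # Assumes <li> tags are explicitly closed in the input.
            self.li_depth -= 1
            if self.li_depth == 0:
                text = " ".join("".join(self.buffer).split())
                if text:
                    self.items.append(text)

    def handle_data(self, data):
        if self.li_depth > 0:
            self.buffer.append(data)


def missing_items(html: str, extracted: str) -> List[str]:
    """Return cell list items whose text does not appear in the extracted output."""
    collector = CellListCollector()
    collector.feed(html)
    collector.close()
    normalized = " ".join(extracted.split())
    return [item for item in collector.items if item not in normalized]


sample_html = """
<table><tr>
  <td>Purpose</td>
  <td><ul><li>Account data</li><li>Usage data</li></ul></td>
</tr></table>
"""
sample_output = "<row><cell>Purpose</cell><cell>Account data</cell></row>"
print(missing_items(sample_html, sample_output))
```

The substring check is deliberately naive: it only compares whitespace-normalized text, so items that are reworded or split across elements in the output would be reported as missing.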
For this website, https://www.enpass.io/privacy-notice/, the entire content within the main tag is ignored. Only content from div tags before the main tag is extracted. ``` web_content = "".join( extract( web_content, include_formatting=True, include_tables=True,...
@adbar Do you plan on using GPT models to aid in semantic segmentation anytime soon? https://docs.unstructured.io/open-source/core-functionality/chunking
- Check the results for this site: https://www.sofi.com/online-privacy-policy/#global-privacy-control (expected: 2024-07-01)
- https://www.tesla.com/legal/privacy (expected: 2023-05-01)

@adbar
extract( web_content, include_formatting=False, include_tables=True, include_comments=False, include_links=True, output_format="xml", favor_recall=True, config=config, ) ) # type: ignore With this config, URLs are not showing up. What is the issue? How can it be...
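One way to confirm whether any links survived extraction is to parse the XML output and look for link elements. The sketch below assumes that links are serialized as `ref` elements carrying a `target` attribute; treat that as an assumption about the output format to verify against your own output. `link_targets` is a hypothetical helper name, and the sample string is invented rather than real trafilatura output.

```python
import xml.etree.ElementTree as ET
from typing import List


def link_targets(xml_output: str) -> List[str]:
    """Return the target attribute of every <ref> element in the XML string."""
    root = ET.fromstring(xml_output)
    # Assumption: links appear as <ref target="...">text</ref>.
    return [ref.get("target", "") for ref in root.iter("ref")]


sample = (
    '<doc><main><p>See the <ref target="https://example.org/policy">policy</ref>'
    " for details.</p></main></doc>"
)
print(link_targets(sample))
```

An empty list from a page that visibly contains links would point to the links being dropped during extraction and not during later processing.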