Auto-CORPus
attrs for paragraphs - no longer uses ID
The config_pmc.json file fails on new HTML files from PMC because the id attribute is no longer present on the paragraph (p) tag.
I suggest changing it to the following, where the first entry includes Valentina's fix for some articles and the second entry is new, to handle new PMC files:
"paragraphs": {
    "data": {},
    "defined-by": [
        {
            "tag": "p",
            "attrs": {"id": "_*[pP\\-|pP|Par]*\\d+"}
        },
        {
            "tag": "p",
            "attrs": {"class": "p"}
        }
    ]
},
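
To check how the two "defined-by" entries behave on old and new PMC markup, here is a minimal sketch using only the Python standard library. It is not Auto-CORPus's own matching code: the names RULES, rule_matches, ParagraphMatcher and matched_rules are hypothetical, and it assumes each "attrs" value is a regular expression searched against the attribute value (per whitespace-separated token for class).

```python
import re
from html.parser import HTMLParser

# Hypothetical copy of the "defined-by" list from the config above.
RULES = [
    {"tag": "p", "attrs": {"id": r"_*[pP\-|pP|Par]*\d+"}},
    {"tag": "p", "attrs": {"class": "p"}},
]


def rule_matches(rule, tag, attrs):
    """Return True if the tag name and every attrs pattern in the rule match."""
    if tag != rule["tag"]:
        return False
    for name, pattern in rule["attrs"].items():
        value = attrs.get(name)
        if value is None:
            return False
        # Assumption: class is compared token by token, other attributes whole.
        tokens = value.split() if name == "class" else [value]
        if not any(re.search(pattern, token) for token in tokens):
            return False
    return True


class ParagraphMatcher(HTMLParser):
    """Records, for each matching start tag, the index of the first rule that matched."""

    def __init__(self):
        super().__init__()
        self.matched = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        for index, rule in enumerate(RULES):
            if rule_matches(rule, tag, attr_map):
                self.matched.append(index)
                break


def matched_rules(html):
    """Return the list of rule indices matched by the tags in html, in document order."""
    parser = ParagraphMatcher()
    parser.feed(html)
    return parser.matched
```

Under these assumptions, a paragraph with an id such as "P12" is picked up by the first entry, and a paragraph that only carries class="p" is picked up by the second. Note that with search semantics the pattern "p" would also match other class names containing the letter p, so anchoring it may be worth considering if the real matcher behaves the same way.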