Unclear documentation in Parse MEDLINE XML - delete = True or False?
Version 0.5.1. The documentation on Parse MEDLINE XML in the README differs a bit from the docstring in the medline_parser script.
Readme: delete: boolean if False means paper got updated so you might have two XMLs for the same paper. You can delete the record of deleted paper because it got updated.
Script: An iterator of dictionary containing information about articles in NLM format (see parse_article_info). Articles that have been deleted will be added with no information other than the field delete being True
I'm somewhat confused. The README seems to indicate that delete = False means the paper was updated, while the script says delete = True means the paper was deleted. These don't seem like natural opposites. Doesn't updated mean that the previous paper was deleted?
Readme for reference:
MEDLINE XML has a different XML format than PubMed Open Access. The structure of XML files can be found in MEDLINE/PubMed DTD [here](https://www.nlm.nih.gov/databases/dtd/). You can use the function `parse_medline_xml` to parse that format. This function will return list of dictionaries, where each element contains:
- pmid: PubMed ID
- pmc: PubMed Central ID
- doi: DOI
- other_id: Other IDs found, each separated by `;`
- title: title of the article
- abstract: abstract of the article
- authors: authors, each separated by `;`
- mesh_terms: list of MeSH terms with corresponding MeSH ID, each separated by `;` e.g. `'D000161:Acoustic Stimulation; D000328:Adult; ...'`
- publication_types: list of publication type list each separated by `;` e.g. `'D016428:Journal Article'`
- keywords: list of keywords, each separated by `;`
- chemical_list: list of chemical terms, each separated by `;`
- pubdate: Publication date. Defaults to year information only.
- journal: journal of the given paper
- medline_ta: this is abbreviation of the journal name
- nlm_unique_id: NLM unique identification
- issn_linking: ISSN linkage, typically use to link with Web of Science dataset
- country: Country extracted from journal information field
- reference: string of PMID each separated by `;` or list of references made to the article
- delete: boolean if False means paper got updated so you might have two XMLs for the same paper. You can delete the record of deleted paper because it got updated.
- languages: list of languages, separated by `;`
- vernacular_title: vernacular title. Defaults to empty string whenever non-available.
Grateful for clarification, as I've had some duplication issues.
Hi, thanks for raising this. You've definitely highlighted a point of confusion in the documentation. I've been looking into this while working on a related fix, and I'd like to share my understanding of the situation.
First, to address your specific question:
> Doesn't updated mean that the previous paper was deleted?
In the context of PubMed data, "updated" and "deleted" are separate events. An "updated" article replaces a previous record with the same PMID, while a "deleted" article is explicitly marked for removal via a <DeleteCitation> tag.
Looking at the project's issue history, it seems there's a connection between a few different issues. As I see it:
- The original intent to handle deleted articles with a `delete` flag appears to have been discussed in issue #17.
- More recently, issue #166 reported that this functionality was missing, likely due to a regression.
- My recently merged PR #167 was an attempt to fix it by re-implementing this logic in the current codebase.
So, with these latest changes, the parser's behavior should now be clearer:
- When a `<DeleteCitation>` is found, it yields a dictionary like `{'pmid': '...', 'delete': True}`.
- For a standard `<PubmedArticle>`, the `'delete'` flag in the output will be `False`.
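For the duplication problem specifically, here is a minimal sketch of how a consumer might reconcile records under that behavior. It is not part of the library: `reconcile` is a hypothetical helper, the sample records are hand-written stand-ins for parser output, and it assumes records are processed oldest first (baseline files before update files), so that a later record with the same PMID counts as the update.

```python
def reconcile(records):
    """Collapse a stream of parsed records into one entry per PMID.

    Assumption: records arrive oldest first, so a later record for the
    same PMID replaces the earlier one, and a record with delete=True
    removes whatever was stored for that PMID.
    """
    latest = {}
    for record in records:
        pmid = record["pmid"]
        if record.get("delete", False):
            # Explicit deletion: drop the stored record, if any.
            latest.pop(pmid, None)
        else:
            # First sighting or an update: keep the newest version.
            latest[pmid] = record
    return latest


# Hand-written sample records in the shape described above.
sample = [
    {"pmid": "1", "title": "First version", "delete": False},
    {"pmid": "2", "title": "To be removed", "delete": False},
    {"pmid": "1", "title": "Revised version", "delete": False},
    {"pmid": "2", "delete": True},
]

papers = reconcile(sample)
print(sorted(papers))        # ['1']
print(papers["1"]["title"])  # Revised version
```

In this sketch an update is detected purely by a repeated PMID, not by the `delete` flag, which is why `'delete': False` on its own says nothing about whether a record is an update.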
Based on this, the README quote you mentioned does seem to need clarification.
> delete: boolean if False means paper got updated so you might have two XMLs for the same paper. You can delete the record of deleted paper because it got updated.
The statement that 'delete': False means a paper was 'updated' can be misleading. One could argue that deletion is a special case of updating, but the library's implementation now makes a clear distinction that the documentation should probably reflect. I'd be happy to open a new PR with a suggested change for the documentation if the maintainers think it's a good idea.
Thanks again for bringing this up.
> I'd be happy to open a new PR with a suggested change for the documentation if the maintainers think it's a good idea.
That'd be wonderful!