pubmed_parser

Unclear documentation in Parse MEDLINE XML - delete = True or False?

Open callebalik opened this issue 7 months ago • 2 comments

Version 0.5.1. The documentation on Parse MEDLINE XML in the README differs a bit from the docstring in the medline_parser script.

Readme: delete : boolean if False means paper got updated so you might have two

Script: An iterator of dictionary containing information about articles in NLM format (see parse_article_info). Articles that have been deleted will be added with no information other than the field delete being True

I'm somewhat confused. The README seems to indicate that delete = False -> paper updated, while the script indicates that delete = True -> paper deleted. But these don't seem like natural opposites. Doesn't updated mean that the previous paper was deleted?

Readme for reference: MEDLINE XML has a different XML format than PubMed Open Access. The structure of XML files can be found in MEDLINE/PubMed DTD [here](https://www.nlm.nih.gov/databases/dtd/). You can use the function `parse_medline_xml` to parse that format. This function will return list of dictionaries, where each element contains:

  • pmid : PubMed ID
  • pmc : PubMed Central ID
  • doi : DOI
  • other_id : Other IDs found, each separated by ;
  • title : title of the article
  • abstract : abstract of the article
  • authors : authors, each separated by ;
  • mesh_terms : list of MeSH terms with corresponding MeSH ID, each separated by ; e.g. 'D000161:Acoustic Stimulation; D000328:Adult; ...'
  • publication_types : list of publication type list each separated by ; e.g. 'D016428:Journal Article'
  • keywords : list of keywords, each separated by ;
  • chemical_list : list of chemical terms, each separated by ;
  • pubdate : Publication date. Defaults to year information only.
  • journal : journal of the given paper
  • medline_ta : this is abbreviation of the journal name
  • nlm_unique_id : NLM unique identification
  • issn_linking : ISSN linkage, typically use to link with Web of Science dataset
  • country : Country extracted from journal information field
  • reference : string of PMID each separated by ; or list of references made to the article
  • delete : boolean if False means paper got updated so you might have two
  • languages : list of languages, separated by ;
  • vernacular_title: vernacular title. Defaults to empty string whenever non-available.

XMLs for the same paper. You can delete the record of deleted paper because it got updated.

Grateful for clarification, as I've had some duplication issues.

callebalik, May 13 '25 15:05

Hi, thanks for raising this. You've definitely highlighted a point of confusion in the documentation. I've been looking into this while working on a related fix, and I'd like to share my understanding of the situation.

First, to address your specific question:

Doesn't updated mean that the previous paper was deleted?

In the context of PubMed data, "updated" and "deleted" are separate events. An "updated" article replaces a previous record with the same PMID, while a "deleted" article is explicitly marked for removal via a <DeleteCitation> tag.
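To make the distinction concrete, here is a minimal sketch that uses only the standard library on a hand-written sample. The PMIDs are made up, and the helper name `split_pmids` is hypothetical, not part of pubmed_parser; the element names are the ones from the MEDLINE/PubMed DTD as I understand it.

```python
import xml.etree.ElementTree as ET

# Hand-written sample, not real MEDLINE data (PMIDs are invented).
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">111</PMID>
    </MedlineCitation>
  </PubmedArticle>
  <DeleteCitation>
    <PMID Version="1">222</PMID>
    <PMID Version="1">333</PMID>
  </DeleteCitation>
</PubmedArticleSet>"""


def split_pmids(xml_text):
    """Return (PMIDs carried by full article records, PMIDs marked for deletion)."""
    root = ET.fromstring(xml_text)
    # Regular (new or updated) records live under <PubmedArticle>.
    present = [e.text for e in root.findall("./PubmedArticle/MedlineCitation/PMID")]
    # Deletions are listed as bare PMIDs under <DeleteCitation>.
    deleted = [e.text for e in root.findall("./DeleteCitation/PMID")]
    return present, deleted


present, deleted = split_pmids(SAMPLE)
print("article records:", present)
print("marked for deletion:", deleted)
```

An updated article would simply show up as another `<PubmedArticle>` with a PMID you have already seen, with no deletion marker involved.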

Looking at the project's issue history, it seems there's a connection between a few different issues. As I see it:

  • The original intent to handle deleted articles with a delete flag appears to have been discussed in issue #17.
  • More recently, issue #166 reported that this functionality was missing, likely due to a regression.
  • My recently merged PR #167 was an attempt to fix it by re-implementing this logic in the current codebase.

So, with these latest changes, the parser's behavior should now be clearer:

  • When a <DeleteCitation> is found, it yields a dictionary like {'pmid': '...', 'delete': True}.
  • For a standard <PubmedArticle>, the 'delete' flag in the output will be False.
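
For the duplication problem, something along these lines might work on the consumer side. This is only a sketch under assumptions: the records below are hand-made stand-ins for parser output, `apply_records` is a hypothetical helper (not part of pubmed_parser), and the records are assumed to arrive in chronological order (baseline files first, then update files).

```python
def apply_records(records):
    """Collapse a stream of parsed records into one entry per PMID.

    A later record with the same PMID replaces the earlier one (update);
    a record with delete=True removes the PMID altogether (deletion).
    """
    current = {}
    for rec in records:
        pmid = rec["pmid"]
        if rec.get("delete"):
            # Deletion stub: drop whatever we had stored for this PMID.
            current.pop(pmid, None)
        else:
            # New or updated article: the latest version wins.
            current[pmid] = rec
    return current


# Invented records shaped like the dictionaries described above.
records = [
    {"pmid": "1", "title": "old version", "delete": False},
    {"pmid": "2", "title": "to be removed", "delete": False},
    {"pmid": "1", "title": "new version", "delete": False},  # update of PMID 1
    {"pmid": "2", "delete": True},                           # deletion of PMID 2
]

result = apply_records(records)
print(result)
```

Keying on the PMID handles both cases in one pass: updates overwrite, deletions remove.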

Based on this, the README quote you mentioned does seem to need clarification.

XMLs for the same paper. You can delete the record of deleted paper because it got updated.

The statement that 'delete': False means a paper was 'updated' can be misleading. One could argue that deletion is a special case of updating, but the library's implementation now makes a clear distinction that the documentation should probably reflect. I'd be happy to open a new PR with a suggested change for the documentation if the maintainers think it's a good idea.

Thanks again for bringing this up.

iacopy, Jun 13 '25 09:06

I'd be happy to open a new PR with a suggested change for the documentation if the maintainers think it's a good idea.

That'd be wonderful!

Michael-E-Rose, Jun 14 '25 05:06