bib2df

Problems parsing .bib from Web of Science

Open gorkang opened this issue 4 years ago • 12 comments

I downloaded a .bib file from Web of Science (savedrecs.zip) and there are multiple issues when reading it. The solution shown in #21 doesn't work here :(

Most of them seem to be related to what you, @ottlngr, mentioned in #21 (key-value pairs not separated by linebreaks):

  • AUTHORS: The authors not in the first line are lost
  • ABSTRACT: Only the first line of the abstract is imported

But other issues seem to have a different cause:

  • A bunch of extra columns appear (for a simplified case, see [A] below)

[A] single_reference.zip: When reading this .bib reference, the following lines of the abstract create new columns (the first word of the line becomes the column title, and the text in the cell is whatever comes after the "="):

  • benefits and harms; n = 451) or non-evidence-based (e.g., relative risks
  • on benefits only; n = 446) patient information about a cancer screening
  • non-evidence-based patient information (n = 446), a mean of 33.1% of
  • whereas with evidence-based patient information (n = 451), only half as

So, the first of those creates a BENEFITS column containing the text "451) or non-evidence-based (e.g., relative risks"

Please, let me know if I can be of any help testing/debugging this.

gorkang avatar Jul 19 '19 12:07 gorkang
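The mechanism reported above can be reproduced with a minimal sketch. It assumes a deliberately naive, hypothetical line-by-line parser (`naive_parse`) that treats any line containing "=" as a new key-value pair; it is not bib2df's actual code, and the first line of the sample abstract is made up. Only the two continuation lines are taken from the report.

```python
# Hypothetical, simplified parser used only to illustrate the symptom.
# It is NOT bib2df's implementation.
def naive_parse(lines):
    fields = {}
    for line in lines:
        if "=" not in line:
            continue  # continuation lines without "=" are silently dropped
        key, value = line.split("=", 1)
        words = key.split()
        if not words:
            continue  # nothing before the "=" to use as a column name
        # The first word before the "=" becomes the column name.
        fields[words[0].upper()] = value.strip()
    return fields


# Shortened stand-in for one abstract; the first line is invented.
entry = [
    "Abstract = {{First line of a made-up abstract (e.g., absolute",
    "   benefits and harms; n = 451) or non-evidence-based (e.g., relative risks",
    "   on benefits only; n = 446) patient information about a cancer screening",
]

fields = naive_parse(entry)
for name, value in fields.items():
    print(name, "->", value)
```

With this input the abstract keeps only its first line, and the two continuation lines turn into spurious BENEFITS and ON columns, which matches the behaviour described in the report.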

Hi, thanks for your message.

This seems to happen because of the multi-line values in this particular .bib file. I'll have to play with it a bit to see what can be improved in bib2df to avoid this behaviour.

ottlngr avatar Jul 26 '19 21:07 ottlngr
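One possible way to handle multi-line values is to merge physical lines until the braces of a field are balanced again. The following is only a sketch under assumptions: `merge_multiline_fields` is a hypothetical helper (not part of bib2df or flex_bib), it expects the field lines of a single entry without the `@article{...,` header and the closing brace, and the sample entry is made up apart from one abstract line quoted in this issue.

```python
# Hypothetical helper: joins continuation lines by tracking brace depth.
# Assumes braces inside values are balanced and not escaped.
def merge_multiline_fields(lines):
    merged = []
    buffer = []
    depth = 0
    for line in lines:
        buffer.append(line.strip())
        depth += line.count("{") - line.count("}")
        if depth <= 0:
            # All braces opened by this field are closed: the field is complete.
            merged.append(" ".join(buffer))
            buffer = []
            depth = 0
    if buffer:
        # Unbalanced trailing input is kept rather than dropped.
        merged.append(" ".join(buffer))
    return merged


# Made-up field lines of one entry (header and final "}" removed).
field_lines = [
    "Author = {Doe, Jane and",
    "   Roe, Richard},",
    "Title = {{A made-up title}},",
    "Abstract = {{First line of a made-up abstract",
    "   benefits and harms; n = 451) or non-evidence-based (e.g., relative risks",
    "   last made-up line.}},",
    "Year = {{2019}},",
]

merged = merge_multiline_fields(field_lines)
for field in merged:
    print(field)
```

Because the merge relies on brace depth rather than on "=", an "=" inside the abstract does not start a new field here. The approach would break on escaped or unbalanced braces inside a value.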

Any news on this issue? I have the same problem. I have downloaded a bib file from Web of Science and anything after a line break (e.g. all of the abstracts) is excluded from the dataframe. I really like your package otherwise, and hope that you are able to resolve this critical problem!

jjsantana avatar Apr 03 '20 20:04 jjsantana

@ottlngr we ran into the same issue (our code builds on bib2df). Maybe the function here could constitute the basis for a solution (not sure how robust it is): https://github.com/paulcbauer/flex_bib/blob/master/merge_bib_lines.R

@jjsantana maybe this helps: https://github.com/paulcbauer/flex_bib#caveats

paulcbauer avatar Jun 30 '20 09:06 paulcbauer

@paulcbauer I added a test case that covers this issue. Of course it fails at the moment, but feel free to try integrating your function and see if the test succeeds.

ottlngr avatar Jul 02 '20 19:07 ottlngr

I added some code (optional argument merge_lines + function to merge lines). I am not sure whether (and how) it interacts with the separate_names argument. Also, there may be a nicer way to integrate it into your functions.

--

Dr. Paul C. Bauer

Mannheim Centre for European Social Research

University of Mannheim

Email: [email protected]

Current research: "Believing and Sharing Information by Fake Sources https://osf.io/mrxvc" Websites: Homepage http://www.paulcbauer.eu/, GoogleScholar https://scholar.google.ch/citations?user=zRqPQ_kAAAAJ&hl=en&oi=ao, ResearchGate https://www.researchgate.net/profile/Paul_Bauer4, www.tweetingpoliticians.com, SSRN http://papers.ssrn.com/sol3/cf_dev/AbsByAuth.cfm?per_id=1911340, Twitter https://twitter.com/p_c_bauer, Github https://github.com/paulcbauer

paulcbauer avatar Jul 07 '20 09:07 paulcbauer

Cool, thanks for the effort. I will have a closer look at it.

ottlngr avatar Jul 10 '20 09:07 ottlngr

Cool thanks. There was some sort of error message but I didn't know how relevant it is.

paulcbauer avatar Jul 10 '20 10:07 paulcbauer

@paulcbauer's suggestion to use the merge_bib_lines function from https://github.com/paulcbauer/flex_bib#caveats works for me as a temporary solution (thank you!). It can also process .bib files that contain multiple entries.

xiaofanliang avatar Dec 28 '20 03:12 xiaofanliang

Hi there

Wondered if there was an update on this issue. I'm unable to import full abstracts from WoS .bib files and cannot get the above solutions to work. Thanks.

robertberryuk avatar May 14 '21 11:05 robertberryuk

Apologies - I did get @paulcbauer's merge_bib_lines function to work and it solved the issue with import of incomplete abstracts - many thanks.

robertberryuk avatar May 14 '21 14:05 robertberryuk

The problem I have now is that the merge_bib_lines function does not parse text properly when the character "=" is encountered. Any ideas? Thanks

robertberryuk avatar May 20 '21 09:05 robertberryuk

There should be some regex workaround. I just don't have any time right now to look into this (hopefully in the next weeks). Sorry!

paulcbauer avatar Jun 03 '21 10:06 paulcbauer
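As an illustration of what such a regex workaround might look like (an assumption, not the fix that was eventually used and not code from merge_bib_lines), a line can be treated as the start of a new field only when a single identifier-like key is the first thing on the line before the "=". The pattern `FIELD_START` and the helper `split_key_value` are hypothetical names.

```python
import re

# A key is one token of letters, digits, "_" or "-" at the start of the line,
# followed by "=". Anything else before the "=" means it is ordinary text.
FIELD_START = re.compile(r"^\s*([A-Za-z][A-Za-z0-9_-]*)\s*=\s*(.*)$")


def split_key_value(line):
    """Return (KEY, value) if the line starts a field, otherwise None."""
    match = FIELD_START.match(line)
    if match is None:
        return None
    return match.group(1).upper(), match.group(2)


samples = [
    "Abstract = {{Some made-up text",
    "   benefits and harms; n = 451) or non-evidence-based (e.g., relative risks",
    "   non-evidence-based patient information (n = 446), a mean of 33.1% of",
]

for sample in samples:
    print(split_key_value(sample))
```

Only the first sample is recognised as a field; the two abstract lines quoted in this issue are left alone. A continuation line that happens to begin with something like "n = 446" would still be misread, so the pattern could be tightened further, for example by requiring the value to start with "{" if the file always wraps values in braces.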