goscholar

Google Scholar scraper written in Go

Install

$ go get github.com/sotetsuk/goscholar

For the command-line tool:

$ go get github.com/sotetsuk/goscholar/cmd/goscholar
$ goscholar -h

Build

You can also build the command-line tool from source with the build command.

$ git clone git@github.com:sotetsuk/goscholar.git
$ goscholar/build

Options:

--dev: apply go fmt to all files and save dependencies

After the build command finishes, you will find the cross-compiled binaries in the bin directory.

Features

  • API for Go
  • API for command line
  • search by keywords, title, and author
  • find by <cluster-id>
  • search the articles citing <cluster-id>
  • JSON output
  • recursive crawling is not implemented

Go API

Example

// create a Query and generate the search URL
q := Query{Keywords:"nature 2015", Author:"y bengio", Title:"Deep learning"}
url := q.SearchUrl()

// send the request to the URL and fetch the document
doc, err := Fetch(url)
if err != nil {
	log.Error(err)
	return
}

// parse articles
ch := make(chan *Article, 10)
go ParseDocument(ch, doc)
for a := range ch {
	fmt.Println("---")
	fmt.Println(a)
}
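
The range loop in the example only ends once the channel is closed, so the example appears to rely on ParseDocument closing it when parsing finishes (the source does not say so explicitly). The following is a minimal sketch of that producer/consumer pattern. The Article struct, produce, and collect are hypothetical stand-ins, not goscholar's actual types or functions.

```go
package main

import "fmt"

// Article is a hypothetical stand-in for goscholar's Article type.
type Article struct {
	Title string
}

// produce is a hypothetical stand-in for ParseDocument: it sends each
// article on ch and closes the channel when it is done.
func produce(ch chan<- *Article, titles []string) {
	defer close(ch) // closing the channel ends the consumer's range loop
	for _, t := range titles {
		ch <- &Article{Title: t}
	}
}

// collect drains ch and returns the titles in the order received.
func collect(ch <-chan *Article) []string {
	var got []string
	for a := range ch {
		got = append(got, a.Title)
	}
	return got
}

func main() {
	ch := make(chan *Article, 10)
	go produce(ch, []string{
		"Deep learning",
		"Deep learning via Hessian-free optimization",
	})
	for _, t := range collect(ch) {
		fmt.Println("---")
		fmt.Println(t)
	}
}
```

The buffer size of 10 only lets the producer run ahead of the consumer; the loop would also terminate with an unbuffered channel, as long as the producer closes it.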

Command line API

Example

$ goscholar search --keywords "deep learning nature" --author "y bengio" --after 2015 --num 1 | jq .
[
  {
    "title": {
      "name": "Deep learning",
      "url": "http://www.nature.com/nature/journal/v521/n7553/abs/nature14539.html"
    },
    "year": "2015",
    "cluster_id": "5362332738201102290",
    "num_cite": "499",
    "num_ver": "7",
    "info_id": "0qfs6zbVakoJ",
    "link": {
      "name": "psu.edu",
      "url": "http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.436.894&rep=rep1&type=pdf",
      "format": "PDF"
    },
    "bibtex": "@article{lecun2015deep, title={Deep learning}, author={LeCun, Yann and Bengio, Yoshua and Hinton, Geoffrey}, journal={Nature}, volume={521}, number={7553}, pages={436--444}, year={2015}, publisher={Nature Publishing Group}}",
    "author": [
      "LeCun, Yann",
      "Bengio, Yoshua",
      "Hinton, Geoffrey"
    ],
    "journal": "Nature",
    "booktitle": "",
    "volume": "521",
    "number": "7553",
    "pages": "436--444",
    "publisher": "Nature Publishing Group"
  }
]
$ goscholar find 15502119379559163003 | jq .
[
  {
    "title": {
      "name": "Deep learning via Hessian-free optimization",
      "url": "http://machinelearning.wustl.edu/mlpapers/paper_files/icml2010_Martens10.pdf"
    },
    "year": "2010",
    "cluster_id": "15502119379559163003",
    "num_cite": "269",
    "num_ver": "",
    "info_id": "e6RSJHGXItcJ",
    "link": {
      "name": "wustl.edu",
      "url": "http://machinelearning.wustl.edu/mlpapers/paper_files/icml2010_Martens10.pdf",
      "format": "PDF"
    },
    "bibtex": "@inproceedings{martens2010deep, title={Deep learning via Hessian-free optimization}, author={Martens, James}, booktitle={Proceedings of the 27th International Conference on Machine Learning (ICML-10)}, pages={735--742}, year={2010}}",
    "author": [
      "Martens, James"
    ],
    "journal": "",
    "booktitle": "Proceedings of the 27th International Conference on Machine Learning (ICML-10)",
    "volume": "",
    "number": "",
    "pages": "735--742",
    "publisher": ""
  }
]
$ goscholar cite 15502119379559163003 --num 1 | python -mjson.tool
[
  {
    "title": {
      "name": "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups",
      "url": "http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6296526"
    },
    "year": "2012",
    "cluster_id": "3674494786452480182",
    "num_cite": "1559",
    "num_ver": "27",
    "info_id": "tmCGO4pt_jIJ",
    "link": {
      "name": "toronto.edu",
      "url": "http://www.cs.toronto.edu/~asamir/papers/SPM_DNN_12.pdf",
      "format": "PDF"
    },
    "bibtex": "@article{hinton2012deep, title={Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups}, author={Hinton, Geoffrey and Deng, Li and Yu, Dong and Dahl, George E and Mohamed, Abdel-rahman and Jaitly, Navdeep and Senior, Andrew and Vanhoucke, Vincent and Nguyen, Patrick and Sainath, Tara N and others}, journal={Signal Processing Magazine, IEEE}, volume={29}, number={6}, pages={82--97}, year={2012}, publisher={IEEE}}",
    "author": [
      "Hinton, Geoffrey",
      "Deng, Li",
      "Yu, Dong",
      "Dahl, George E",
      "Mohamed, Abdel-rahman",
      "Jaitly, Navdeep",
      "Senior, Andrew",
      "Vanhoucke, Vincent",
      "Nguyen, Patrick",
      "Sainath, Tara N",
      "others"
    ],
    "journal": "Signal Processing Magazine, IEEE",
    "booktitle": "",
    "volume": "29",
    "number": "6",
    "pages": "82--97",
    "publisher": "IEEE"
  }
]

(This article cites cluster 15502119379559163003, "Deep learning via Hessian-free optimization".)
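
Since every command prints a JSON array, the output can be consumed from another Go program. The sketch below decodes an abridged copy of the find output shown above. The Link and Record types and the decode and count helpers are hypothetical names chosen for illustration; only the JSON keys are taken from the sample output.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// Link mirrors the "title" and "link" objects in the output.
// The Go names are hypothetical; the JSON keys come from the sample.
type Link struct {
	Name   string `json:"name"`
	URL    string `json:"url"`
	Format string `json:"format"`
}

// Record mirrors one element of the JSON array printed by the CLI.
type Record struct {
	Title     Link     `json:"title"`
	Year      string   `json:"year"`
	ClusterID string   `json:"cluster_id"`
	NumCite   string   `json:"num_cite"`
	NumVer    string   `json:"num_ver"`
	InfoID    string   `json:"info_id"`
	Link      Link     `json:"link"`
	Bibtex    string   `json:"bibtex"`
	Author    []string `json:"author"`
	Journal   string   `json:"journal"`
	Booktitle string   `json:"booktitle"`
	Volume    string   `json:"volume"`
	Number    string   `json:"number"`
	Pages     string   `json:"pages"`
	Publisher string   `json:"publisher"`
}

// sample is abridged from the "goscholar find" output above.
const sample = `[{
  "title": {"name": "Deep learning via Hessian-free optimization",
            "url": "http://machinelearning.wustl.edu/mlpapers/paper_files/icml2010_Martens10.pdf"},
  "year": "2010",
  "cluster_id": "15502119379559163003",
  "num_cite": "269",
  "num_ver": "",
  "author": ["Martens, James"]
}]`

// decode parses the JSON array into records.
func decode(data []byte) ([]Record, error) {
	var recs []Record
	if err := json.Unmarshal(data, &recs); err != nil {
		return nil, err
	}
	return recs, nil
}

// count converts a numeric string field to an int, treating "" as 0.
func count(s string) (int, error) {
	if s == "" {
		return 0, nil
	}
	return strconv.Atoi(s)
}

func main() {
	recs, err := decode([]byte(sample))
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for _, r := range recs {
		n, _ := count(r.NumCite)
		fmt.Printf("%s (%s): cited %d times\n", r.Title.Name, r.Year, n)
	}
}
```

Numeric fields arrive as strings and may be empty (num_ver in the find example), which is why count treats "" as zero. The cluster ID is kept as a string because values such as 15502119379559163003 do not fit in an int64.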

Usage

goscholar: Google Scholar crawler and scraper written in Go

Usage:
  goscholar search [--keywords=<keywords>] [--author=<author>] [--title=<title>]
                   [--after=<year>] [--before=<year>] [--num=<num>] [--start=<start>]
                   [--user-agent=<user-agent>]
  goscholar find <cluster-id> [--user-agent=<user-agent>]
  goscholar cite <cluster-id> [--after=<year>] [--before=<year>] [--num=<num>] [--start=<start>]
                              [--user-agent=<user-agent>]
  goscholar -h | --help
  goscholar --version

Query-options:
  <cluster-id>
  --keywords=<keywords>
  --author=<author>
  --title=<title>

Search-options:
  --after=<year>
  --before=<year>
  --num=<num>
  --start=<start>

Others:
  -h --help
  --version
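
To drive the command-line tool from another program, the search flags listed above can be assembled into an argument list. The sketch below is one way to do that, assuming the goscholar binary is on the PATH. SearchOptions and searchArgs are hypothetical helpers, not part of goscholar; the flag names are the ones in the usage text. The command is only constructed here, not executed.

```go
package main

import (
	"fmt"
	"os/exec"
	"strconv"
)

// SearchOptions is a hypothetical helper type. Each field maps to one
// flag of "goscholar search" in the usage text; zero values are omitted.
type SearchOptions struct {
	Keywords string
	Author   string
	Title    string
	After    int
	Before   int
	Num      int
	Start    int
}

// searchArgs builds the argument list for "goscholar search" using the
// --flag=value form shown in the usage text.
func searchArgs(o SearchOptions) []string {
	args := []string{"search"}
	addStr := func(flag, v string) {
		if v != "" {
			args = append(args, "--"+flag+"="+v)
		}
	}
	addInt := func(flag string, v int) {
		if v > 0 {
			args = append(args, "--"+flag+"="+strconv.Itoa(v))
		}
	}
	addStr("keywords", o.Keywords)
	addStr("author", o.Author)
	addStr("title", o.Title)
	addInt("after", o.After)
	addInt("before", o.Before)
	addInt("num", o.Num)
	addInt("start", o.Start)
	return args
}

func main() {
	args := searchArgs(SearchOptions{
		Keywords: "deep learning nature",
		Author:   "y bengio",
		After:    2015,
		Num:      1,
	})
	// Construct the command without running it; calling cmd.Output()
	// would execute it and return the JSON printed on stdout.
	cmd := exec.Command("goscholar", args...)
	fmt.Println(cmd.Args)
}
```

Passing each flag as a single --flag=value element avoids shell quoting issues for values that contain spaces, since os/exec does not go through a shell.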

Dependencies

Related Work

goscholar is inspired by scholar.py

Contribute

Contributing is more than welcome! See Issues for what is required.

License

MIT License