pubstats
Publication statistics
This repository establishes simple statistics for a set of conferences.
Using the DBLP data set, we extract publications from the top conferences and aggregate them on a per-author basis. For different subgroups (e.g., security, embedded systems, or OS), we then calculate per-author statistics and present them in an overview.
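As a rough illustration of the per-author aggregation, here is a minimal sketch. It assumes each publication is a dict with an `authors` list and uses a plain paper count as the statistic. The sample records, the area names, and the helper `papers_per_author` are hypothetical and not taken from the repository, which may compute different or additional metrics.

```python
from collections import Counter

# Hypothetical sample records; the real data comes from DBLP.
pubs_by_area = {
    "security": [
        {"authors": ["Alice", "Bob"], "year": 2020},
        {"authors": ["Alice"], "year": 2021},
    ],
    "os": [
        {"authors": ["Bob", "Carol"], "year": 2021},
    ],
}


def papers_per_author(pubs):
    """Count how many of the given publications each author appears on."""
    counts = Counter()
    for pub in pubs:
        # Use a set so a duplicated name on one paper is counted once.
        counts.update(set(pub["authors"]))
    return counts


# Per-area statistics: one Counter per subgroup.
per_area = {area: papers_per_author(pubs) for area, pubs in pubs_by_area.items()}

# Aggregate statistics: sum the per-area counters.
overall = sum(per_area.values(), Counter())

# The most prolific author in one area, as a list of (name, count) pairs.
top_security = per_area["security"].most_common(1)
```

Keeping one counter per area and summing them afterwards means the per-area and aggregate views come from the same pass over the data.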
Processing happens in two stages, parsing and then analysis:
- parse_dblp.py extracts all publications and dumps them into pickle files grouped by area (this is slow, as DBLP is a 3GB XML file). To process such a large XML file, we use a stream processor that simply dumps interesting publications into Pub objects (see pubs.py).
- top_authors.py leverages the pickle files to compute per-area statistics and aggregate statistics.
- author_cliques leverages the pickle files to calculate per-area author cliques.
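The streaming idea behind the parsing stage can be sketched as follows. This is not the code in parse_dblp.py: it is a minimal example that uses the standard library's `xml.etree.ElementTree.iterparse`, a hypothetical `AREAS` mapping, and plain dicts standing in for the Pub objects from pubs.py. The tiny in-memory sample also sidesteps the character entities in the real DBLP dump, which need the DBLP DTD and are not handled here.

```python
import io
import pickle
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical venue-to-area mapping, for illustration only.
AREAS = {"security": {"CCS"}, "os": {"SOSP"}}


def stream_pubs(source, areas):
    """Yield (area, pub) pairs for conference papers in tracked venues."""
    depth = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            continue
        depth -= 1
        if depth != 1:  # act only when a direct child of the root closes
            continue
        if elem.tag == "inproceedings":
            venue = elem.findtext("booktitle", default="")
            for area, venues in areas.items():
                if venue in venues:
                    yield area, {
                        "venue": venue,
                        "title": elem.findtext("title", default=""),
                        "authors": [a.text for a in elem.findall("author")],
                        "year": int(elem.findtext("year", default="0")),
                    }
        elem.clear()  # free the record's children once it has been handled


def group_by_area(source, areas=AREAS):
    """Collect the streamed publications into one list per area."""
    by_area = defaultdict(list)
    for area, pub in stream_pubs(source, areas):
        by_area[area].append(pub)
    return dict(by_area)


SAMPLE = b"""<dblp>
<inproceedings key="conf/a/1"><author>Alice</author><author>Bob</author>
<title>T1</title><booktitle>CCS</booktitle><year>2020</year></inproceedings>
<article key="journals/j/1"><author>Carol</author>
<title>T2</title><journal>J</journal><year>2020</year></article>
<inproceedings key="conf/b/2"><author>Alice</author>
<title>T3</title><booktitle>SOSP</booktitle><year>2021</year></inproceedings>
</dblp>"""

pubs_by_area = group_by_area(io.BytesIO(SAMPLE))

# One pickle payload per area; writing them to files is left out here.
pickles = {area: pickle.dumps(pubs) for area, pubs in pubs_by_area.items()}
```

Because records are handled one at a time as their closing tag arrives, the full document tree is never built in memory, which is what makes a multi-gigabyte input workable.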
Using/Howto
- Easy mode: check out the homepage
- make all to download DBLP data, pickle, and create the HTML data
- make fresh to update DBLP data and pickle it
- make topauthors to create the top author pages
- make cliques to create the cliques
Contributing
Ideas, comments, or improvements are welcome! Please reach out to Mathias Payer to discuss. You can also reach out to @gannimo on Twitter.
Changelog
- 2023-08-21 random bugfixes and conference updates
- 2023-02-06 adjusted SE/DB conferences based on feedback
- 2021-02-09 fixed VLDB conference and added ICDE and PODS for the database community; added ASE and ISSTA for the software engineering community
- 2021-01-11 added HPCA for architecture and adjusted paper length calculation for DAC
- 2021-01-09 removed tutorials and short papers (by parsing pages data)
- 2021-01-05 figures for overview page
- 2021-01-04 new overview table across areas
- 2021-01-02 added author cliques
- 2020-12-30 first version with author statistics
Acknowledgements
This code and page were developed by Mathias Payer, initially over the 2020 holiday break. The site includes feedback and suggestions from too many people to list. Thank you for that!
We use information from DBLP and CSRankings for anti-aliasing of authors. The idea for the statistics was inspired by Davide's Software Security Circus.
License
All data in this repository is licensed under CC BY-NC-ND 4.0.