Recently, I tried to gather in a single place the various pieces of code I had written to handle scientific papers. One feature I was missing, and would like to add, was the ability to automatically fetch the references of a given paper. For arXiv papers, I had a simple solution using the LaTeX sources, but I wanted something more universal, taking a plain PDF file as input (thanks John for the suggestion, and Al for the tips on existing software solutions).
I compared three existing tools that extract references from a PDF file:
- pdfextract from Crossref, very easy to use, written in Ruby.
- Grobid, more advanced (using machine learning models), written in Java, but quite easy to use too.
- Cermine, using the same approach as Grobid, but I could not get it to build on my computer. I used their REST service instead.
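For illustration, querying a hosted extraction service such as Cermine's can be sketched in a few lines of Python. This is only a sketch under assumptions: the endpoint URL, the `application/binary` content type and the idea that references come back as `<ref>` elements in a JATS-like XML document are my recollection of the Cermine documentation and should be checked against the current service. The function names are hypothetical.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumed URL of the public Cermine service; replace it with your own instance.
CERMINE_URL = "http://cermine.ceon.pl/extract.do"


def build_cermine_request(pdf_bytes, url=CERMINE_URL):
    """Prepare a POST request carrying the raw PDF bytes as its body."""
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        headers={"Content-Type": "application/binary"},
        method="POST",
    )


def reference_strings(xml_text):
    """Return the flattened text of every non-empty <ref> element.

    Assumes the references are serialized as <ref> elements (JATS-like XML).
    """
    root = ET.fromstring(xml_text)
    refs = []
    for ref in root.iter("ref"):
        # Join all nested text nodes and collapse runs of whitespace.
        text = " ".join("".join(ref.itertext()).split())
        if text:
            refs.append(text)
    return refs


def extract_references(pdf_path, url=CERMINE_URL):
    """Send a PDF file to the service and return its reference strings."""
    with open(pdf_path, "rb") as handle:
        request = build_cermine_request(handle.read(), url)
    with urllib.request.urlopen(request, timeout=120) as response:
        return reference_strings(response.read().decode("utf-8"))
```

Calling `extract_references("paper.pdf")` would then give one plain string per cited item, which is enough for a quick visual comparison between tools.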
I downloaded some articles to get a (hopefully) representative set, composed of 147 different papers from various journals (I did not have access to some of them, so I could not fetch the full dataset). I ran pdfextract, Grobid and Cermine on each of them and compared the results.
The raw results are available here for each paper, and I generated a single-page comparison to ease the visual diff between the three results, available here (note that this webpage is very heavy, around 16 MB).
From a brief comparison of the results, the machine-learning-based tools (Cermine and Grobid) seem to give far better results than the simple approach taken by pdfextract, at the expense of being more difficult to build and run. Cermine returns a lot of information, too much in my opinion, and I think Grobid gives the most reusable and complete results. Feel free to compare them yourself.
I also found ParsCit, which may be of interest. However, you first need to extract the text from your PDF file. I have not tested it in depth yet.
This tweet tends to confirm my results, namely that Grobid is the best one.
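For readers who want to try Grobid on their own papers, here is a minimal sketch of how its reference extraction could be queried from Python. It rests on several assumptions: a Grobid service running locally on port 8070, its `/api/processReferences` endpoint accepting the PDF in a multipart form field named `input`, and a TEI XML answer containing `biblStruct` elements. The function names are hypothetical, and only the title of each cited work is read.

```python
import uuid
import urllib.request
import xml.etree.ElementTree as ET

# Assumed address of a locally running Grobid service.
GROBID_URL = "http://localhost:8070/api/processReferences"
TEI = "{http://www.tei-c.org/ns/1.0}"  # TEI XML namespace


def build_grobid_request(pdf_bytes, url=GROBID_URL):
    """Prepare a multipart/form-data POST with the PDF in the 'input' field."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        b"--" + boundary.encode() + b"\r\n",
        b'Content-Disposition: form-data; name="input"; filename="paper.pdf"\r\n',
        b"Content-Type: application/pdf\r\n\r\n",
        pdf_bytes,
        b"\r\n--" + boundary.encode() + b"--\r\n",
    ])
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "multipart/form-data; boundary=" + boundary},
        method="POST",
    )


def cited_titles(tei_xml):
    """Return one title per <biblStruct>, preferring the article-level title."""
    root = ET.fromstring(tei_xml)
    titles = []
    for bibl in root.iter(TEI + "biblStruct"):
        title = bibl.find(f"{TEI}analytic/{TEI}title")
        if title is None:
            # Fall back to the title of the book or journal.
            title = bibl.find(f"{TEI}monogr/{TEI}title")
        if title is not None and title.text:
            titles.append(title.text.strip())
    return titles


def extract_titles(pdf_path, url=GROBID_URL):
    """Send a PDF file to Grobid and return the titles of the cited works."""
    with open(pdf_path, "rb") as handle:
        request = build_grobid_request(handle.read(), url)
    with urllib.request.urlopen(request, timeout=120) as response:
        return cited_titles(response.read().decode("utf-8"))
```

The multipart body is built by hand only to keep the sketch free of third-party dependencies; a library such as `requests` would make that part shorter.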
In case it is useful, here is a small web service written in Python that lets a user upload a paper, parses its citations and tries to assess the open-access availability of the cited papers. It uses CERMINE, as it was the easiest way to go: it offers a web API, which lets me distribute a script that simply works, without any additional requirements.