Let’s add some metadata on arXiv!

Posted on December 26, 2015 in OpenAccess • 6 min read

This article contains ideas and explanations around this code. Many references to it will be done through this article.

Disclaimer: The above code is here as a proof of concept and to back this article with some code. It is clearly not designed (nor scalable) to run in production. However, the reference_fetcher part was giving good results on the arXiv papers I tested it on.

Nowadays, most of the published scientific papers are available online, either directly on the publisher’s website, or as preprints on Open access repositories. For physics and computer science, most of them are available on the arXiv.org repository (a major, worldwide, Open access repository managed by Cornell). All published papers get a unique (global) identifier, called a DOI, which can be used to identify them and link to them. For instance, if one gets to https://dx.doi.org/10.1103%2FPhysRevB.47.7312 it is automatically redirected to the Physical Review B website, on the page of the paper with DOI 10.1103%2FPhysRevB.47.7312. This is really useful to target a paper, and identify it uniquely, in a machine-readable way and in a way that will last. However, very little use seems to be done of this system. This is why I had the idea to put some extra metadata on published papers, using such systems.

From now on, I will mainly focus on arXiv for two main reasons. First, it is Open access, so it is accessible everywhere (and not depending on the rights from a particular institution) and reusable, and second, arXiv provides sources for most of the papers, which is of great interest as we will see below. arXiv gives a unique identifier to the preprints. Correspondence between DOIs and arXiv identifiers can be made quite easily as some publishers push back DOI to arXiv upon publication, and authors manually update the fields on arXiv for the rest of the publishers.

Using services such as Crossref or the publisher’s website, it is really easy to get a formatted bibliography (plaintext, BibTeX, …) from a given identifier (e.g. see some codes for DOI or arXiv id for BibTeX output). Then, writing a bibliography should be as easy as keeping track of a list of identifiers!

Let’s make a graph of citations!

In scientific papers, references are usually a plaintext list of papers used as reference, at the end of the article. This list follows some rules and formats, but there exist a wide variety of different formats, and it is often really difficult to parse them automatically (see http://arxiv.org/abs/1506.06690 for an example of references format).

If one wants to fetch automatically the references from a given paper (to download them in batch for instance), he would basically have to parse a PDF file, find the references section, and parse each textual item, which is really difficult and error-prone. Some repositories, such as arXiv, offers sources for the published preprints. In this case, one can deal with a LaTeX-formatted bibliography (a thebibliography environment, not a full BiBTeX though), which is a bit better, but still a pity to deal with. When referencing an article, nobody uses DOIs!

First idea is then to try to automatically fetch references for arXiv preprints and mark them as relationships between articles.

Fortunately, arXiv provides bbl source files for most of the articles (which are LaTeX-formatted bibliography). We can them avoid having to parse a PDF file, and directly get some structured text, but bibliography is still in plaintext, without any machine-readable identifier. Here comes Crossref which offers a wonderful API to try to fetch a DOI from a plain text (see http://labs.crossref.org/resolving-citations-we-dont-need-no-stinkin-parser/). And it gives surprisingly good results!

This automatic fetching of DOI for references of a given arXiv papers is available in this code.

Then, one can simply write a simple API accepting POST requests to add papers to a database, fetch referenced papers, and mark relationships between them. This is how https://github.com/Phyks/arxiv_metadata began.

If you post a paper to it, identified either by its DOI (and a valid associated arXiv id is found) or directly by its arXiv id, it will add it to the database, resolve its references and mark relationships in database between this paper and the references papers. One can then simply query the graph of “citations”, in direct or reverse order, to get any papers cited by a given one, or citing a given one.

The only similar service I know of on the web is the one provided by SAO/NASA ADS. See for instance how it deals with the introductory paper. It is quite fantastic for giving both the papers citing this one and cited by this one, in a browsable form, but its core is not open-source (or I did not find it), and I have no idea how it works in the background. There is no easily accessible API, and it works only in some very specific fields (typically Physics).

Let’s add even more relations!

Now that we have a base API to add papers and relationships between them to a database, we can imagine going one step further and mark any kind of relations between the papers.

For instance, one can find that a given paper could be another reference for another one, which was not citing it. We could then collaboratively work to put extra metadata on scientific papers, such as extra references, which would be useful to everyone.

Such relationships could also be similar to, introductory_course, etc. This is quite limitless and the above code can already handle it. :)

Let’s go one step further and add tags!

So, by now, we can have uniquely identified papers, with any kind of relationships between them, which we can crowdsource. Let’s take some time to look at how arXiv stores papers.

They classify them by “general categories” (e.g. cond-mat which is a (very) large category called “Condensed Matter”) and subcategories (e.g. cond-mat.quant-gas for “Quantum gases” under “Condensed Matter”). A RSS feed is offered for all these categories, and researchers usually follow the subcategory of their research area to keep up to date with published articles.

Although some article are released under multiple categories, most of them only have one category, very often because they do not fit anywhere else, but sometimes because the author did not think it could be relevant in another field. Plus some researchers work at the edge of two fields, and following everything published in these two fields is a very time-consuming task.

Next step is then to collaboratively tag articles. We could get tags as targeted as we want, or as general as we want, and everyone could follow the tags they want. Plus doing it collaboratively allows someone who finds an article interesting for its field, which was not the author’s field, to make it appear in the feed of his colleagues.


We finally have the tools to mark relations between papers, to annotate them, complete them, and tag them. And all of this collaboratively. With DOIs and similar unique identifiers, we have the ability to get rid of the painful plaintext citations and references and use easily machine-manageable identifiers, while still getting some nicely rendered BibTeX citations automagically.

People are already doing this kind of things for webpages (identified by their URL) with Reddit or HackerNews and so on, let’s do the same for scientific papers! :)

A demo instance should be available at http://arxiv.phyks.me/. This may not be very stable or highly available though.

EDIT (14/05/2017): The demo instance is no longer available. But the repo is still out there if you want to give it a try!