So you want to retrieve data from Crossref using R. If you want to jump straight to the coding part, skip ahead to “Ready to dive into some code?”, but if you are new to APIs and/or Crossref’s role in the scholarly metadata ecosystem, let me invite you to read more below.
Crossref and the research nexus
At Crossref, we make research objects easy to find, cite, link, assess, and reuse. Our goal is to connect all research knowledge through open and persistent metadata. We have termed this interconnected knowledge ecosystem the Research Nexus.
The metadata we collect and make available are the identifiers and properties of different items in the scholarly record. We don’t collect the full text or the actual datasets that constitute a scholarly paper; instead, we are interested in the relationships between the paper, the authors, institutions, journals, funding agencies, protocols, and more.
We make this metadata open through our REST API.
Our data is readily available. You have three access levels to the API metadata:
Public. Free, fully anonymous.
Polite. Free, you provide your email. (Recommended).
We can use this information to contact you in case of issues. We get rid of it after 90 days.
Plus. Our paid premium service. You get:
A service level agreement guaranteeing you extra service and support, giving you a consistent and predictable experience.
Additional features such as snapshots and priority service/rate limits.
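To make the difference between the public and polite levels concrete, here is a minimal sketch in base R of what the request URLs look like. The helper `crossref_url()` and the query text are hypothetical, introduced only for illustration; the `mailto` query parameter is how the Crossref REST API receives the email address. No request is sent here, the code only builds the URLs.

```r
# Hypothetical helper (not part of rcrossref): builds a Crossref REST API URL.
# Passing an email adds the `mailto` parameter, which places the request
# in the polite pool; leaving it out keeps the request anonymous (public).
crossref_url <- function(query, email = NULL, rows = 5) {
  url <- paste0(
    "https://api.crossref.org/works",
    "?query=", utils::URLencode(query, reserved = TRUE),
    "&rows=", rows
  )
  if (!is.null(email)) {
    url <- paste0(url, "&mailto=", utils::URLencode(email, reserved = TRUE))
  }
  url
}

public_url <- crossref_url("research nexus")
polite_url <- crossref_url("research nexus", email = "name@example.com")

print(public_url)
print(polite_url)
```

In practice we rarely build these URLs by hand, because rcrossref does it for us once the email address is configured.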
What’s an API, by the way?
An API (application programming interface) is a software intermediary that allows two applications to talk to each other. This means that you don’t download the data directly, as you would with a data file associated with a research paper; instead, a client sends a request to the API, and the API returns the data from the server. In this document, we use the rcrossref package to build the requests that the Crossref API handles to return the data from Crossref servers.
flowchart LR
A(Client) --> B{API}
B --> C(Server)
C --> B
B --> A
Ready to dive into some code?
Libraries
We’ll make use of the rcrossref package (Chamberlain et al. 2022) to interact with the Crossref API from the comfort of our R environment. We’ll also load gt to quickly visualize better-looking data tables.
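A minimal sketch of loading both packages, assuming they are already installed from CRAN (the commented line shows one way to install them if they are not):

```r
# install.packages(c("rcrossref", "gt"))  # run once if the packages are missing

library(rcrossref)  # client for the Crossref REST API
library(gt)         # formatted display tables
```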
To use the polite access level, we can share our email address with Crossref by typing it directly into the R environment file. First, open the file by executing:
file.edit("~/.Renviron")
Then, add the email address to be shared with Crossref as crossref_email = name@example.com
Finally, save the file and restart your R session.
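After restarting, it can be worth checking that R actually sees the variable. The sketch below reads `crossref_email` with base R and, as a fallback that lasts only for the current session, sets the placeholder address from above if nothing is found. The fallback is an illustrative assumption; replace the placeholder with your own address.

```r
# Read the variable that was added to ~/.Renviron.
email <- Sys.getenv("crossref_email")

# Fallback for the current session only (placeholder address, replace it).
if (!nzchar(email)) {
  Sys.setenv(crossref_email = "name@example.com")
  email <- Sys.getenv("crossref_email")
}

message("Email shared with Crossref: ", email)
```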
Let’s start with some examples:
Funder-related output
A funding organization, e.g. the German Research Foundation, may wish to know how many of the publications it has funded are openly available to the public.
First, we can do a general query to get the specific ID tied to the name of a given institution and store it in a variable that we’ll call funders.
funders <- cr_funders(query = "German Research Foundation")
Typically, the output of these calls is a JSON object, but with the rcrossref package we get nested tables that we can pass to gt to quickly obtain better-looking data tables.
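A minimal sketch of that step, assuming an internet connection and that the query result keeps its matches in the `data` element. The column names listed in `wanted` are assumptions, so the code keeps only those that are actually present.

```r
library(rcrossref)
library(gt)

funders <- cr_funders(query = "German Research Foundation")

# Keep a few identifying columns if they exist (names are assumptions).
wanted <- intersect(c("id", "name", "location"), names(funders$data))

# Show the first matches as a formatted table.
gt(head(funders$data[, wanted], 5))
```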
Notice that the specific organization that we are interested in is in the first row. We can also store this id for future use.
fund_id <- "501100001659"  # ID corresponding to the Deutsche Forschungsgemeinschaft
With this information, we can store the funder id of our interest in a new variable, and use it to retrieve additional information, e.g. how many papers are openly available to the public?
If we use our newly stored funder id to retrieve data specifically from this organization, we’ll see that it has 223006 works associated with it at the time of writing:
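A minimal sketch of that lookup, assuming an internet connection. The funder id is passed through the `dois` argument of `cr_funders()`, and the whole result is printed rather than assuming the name of the column that holds the work count. The number returned will differ from the one quoted above as new works are registered.

```r
library(rcrossref)

fund_id <- "501100001659"  # Deutsche Forschungsgemeinschaft

# Without works = TRUE, the call returns a summary of the funder.
funder_summary <- cr_funders(dois = fund_id)
print(funder_summary)
```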
The following code block shows how we can add parameters to refine our results, for example, to retrieve only records that have the full text available:
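A minimal sketch of such a call, assuming an internet connection and using the `has_full_text` filter from rcrossref. The numbered comments mirror the notes that follow.

```r
library(rcrossref)

fund_id <- "501100001659"  # Deutsche Forschungsgemeinschaft

funder_works <- cr_funders(
  dois   = fund_id,                  # (1) the funder id
  works  = TRUE,                     # (2) return works, not a summary
  filter = c(has_full_text = TRUE),  # (3) only works with full text available
  limit  = 10                        # (4) cap the number of results
)

head(funder_works$data)
```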
1. We use the id to retrieve works specifically linked to our funder.
2. This parameter lets us retrieve works instead of a summary.
3. This filter lets us retrieve works that have the full text available.
4. We can limit the number of results (for the sake of efficiency).
Of course, we can add further parameters to refine our results; for example, we can include additional filters to specify that we want journal articles. The following code chunk also selects some relevant columns out of the entire list of options.
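A minimal sketch of such a refined query, assuming an internet connection. The `type` filter value `journal-article` is the Crossref work type for journal articles. The column names in `wanted` are assumptions about the returned table, so only those that exist are kept.

```r
library(rcrossref)
library(gt)

fund_id <- "501100001659"  # Deutsche Forschungsgemeinschaft

journal_articles <- cr_funders(
  dois   = fund_id,
  works  = TRUE,
  filter = c(has_full_text = TRUE, type = "journal-article"),
  limit  = 10
)

# Keep a handful of relevant columns (names are assumptions, checked first).
wanted <- intersect(
  c("doi", "title", "container.title", "issued", "type"),
  names(journal_articles$data)
)

gt(journal_articles$data[, wanted])
```

Selecting columns after the call keeps the request simple; for larger result sets, asking the API for fewer fields may be more efficient.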