Get journal-level metadata from Crossref’s API using R

Author: Luis M. Montilla
Affiliation: Crossref

Code
library(tidyverse)
library(httr2)
library(DT)

First, let’s walk through an example that queries a single ISSN. Here, the object called issn_list is a character vector of three ISSNs that serves as a minimal example.

Code
issn_list <- c("2167-8359",
               "2804-3871",
               "1932-6203")

You will probably need to work with a longer list of ISSNs. In that case, save the list as a .csv file in the main directory of this workbook and use the following command to read the data into your work environment.

Code
issn_list <- read.csv("the name of your file.csv")

Add the name of your CSV file within the quotation marks.
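Note that read.csv() returns a data frame rather than a vector, so the ISSNs usually have to be pulled out of a column before they can be indexed like the issn_list defined earlier. A minimal sketch, assuming the file has a header row and a column named issn (the column name and the temporary file are assumptions made for illustration):

```r
# Write a tiny example file so the sketch is self-contained.
# The column name "issn" is an assumption; use the header of your own file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("issn", "2167-8359", "2804-3871", "1932-6203"), csv_path)

# read.csv() returns a data frame, so extract the column as a character vector.
issn_table <- read.csv(csv_path, stringsAsFactors = FALSE)
issn_list <- trimws(issn_table$issn)

# Drop empty entries and duplicates before building requests.
issn_list <- unique(issn_list[nzchar(issn_list)])
issn_list
```

With the vector in this shape, issn_list[1] refers to a single ISSN string, as in the rest of this example.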

Then we can define the base URL and the specific endpoint that we will query.

Code
base_url <- "https://api.crossref.org/"
endpoint <- "journals"

For this example, we will query only the first element of our list.

Code
my_journals <- request(base_url) |> 
  req_url_path_append(endpoint, issn_list[1]) |> 
  req_url_query(mailto="lmontilla@crossref.org") |> 
  req_perform()
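Before sending anything, it can help to check which URL the pipeline produces. The sketch below builds the same kind of request without performing it and reads the url field of the request object. The address name@example.org is a placeholder to replace with your own.

```r
library(httr2)

base_url <- "https://api.crossref.org/"
endpoint <- "journals"

# Build the request only; nothing is sent until req_perform() is called.
# "name@example.org" is a placeholder address, not a real contact.
preview <- request(base_url) |>
  req_url_path_append(endpoint, "2167-8359") |>
  req_url_query(mailto = "name@example.org")

# The assembled URL is stored in the request object.
preview$url
```

Checking the URL this way is a cheap safeguard against a misplaced slash or a misspelled endpoint before any request reaches the API.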

Then we can parse the JSON body of the response into an R list.

Code
my_table <- my_journals |> 
  resp_body_json() 

Now we can extract specific information from the response, for example, as a table. The following chunk extracts information from the ‘message’ level of the response and then applies a custom function to build a table with some elements of interest:

Code
my_table |> 
  pluck('message') |> 
        {\(y) {
          #browser() 
          tibble(
            "issn" = y |> pluck("ISSN"),
            "publisher" = y |> pluck("publisher"),
            "current_dois" = y |> pluck("counts", "current-dois"),
            "backfile_dois" = y |> pluck("counts", "backfile-dois"),
            "total_dois" = y |> pluck("counts", "total-dois")
            ) |> 
            cbind(
              y |> 
                pluck('coverage-type') |> 
                unlist() |> 
                data.frame() |>
                rownames_to_column('variable') |> 
                pivot_wider(names_from = variable, 
                            values_from = `unlist.pluck.y...coverage.type...`)
            )
          }}() |> 
  datatable()
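pluck() walks down a nested list one name at a time and returns NULL when a field is absent, and tibble() silently drops NULL columns. A small sketch on a hand-made list (the values are placeholders, not real Crossref data) shows this behaviour and how the .default argument can keep the column in place:

```r
library(purrr)

# Hand-made stand-in for a parsed response; the values are placeholders.
toy <- list(
  message = list(
    publisher = "Example Publisher",
    counts = list(`total-dois` = 10L)
  )
)

# A field that exists is returned as stored.
total <- toy |> pluck("message", "counts", "total-dois")

# A field that is absent yields NULL rather than an error.
missing_field <- toy |> pluck("message", "counts", "current-dois")

# Supplying .default keeps a value (here NA) in place of the missing field.
with_default <- toy |>
  pluck("message", "counts", "current-dois", .default = NA_integer_)
```

Using .default is a design choice: it keeps the table's columns stable across journals at the cost of having to handle NA values later.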

We can use these steps as a reference to use more powerful functions that will let us make all our requests sequentially. The first thing that we’ll do is build (not execute) a list of requests, including adding a rate variable that will keep our requests within the safe limits of the Public and Polite API.

Warning

We limit each IP address to sending 50 reqs/sec. Please avoid getting blocked by identifying yourself and keeping your rates below this value.

Code
list_queries <- issn_list |> 
  map(\(x){
    request("https://api.crossref.org/journals/") |>
      req_url_path_append(x) |>
      req_url_query() |>
      req_throttle(rate = 30/60)
})

The rate is expressed in requests per second, so 30/60 allows 30 requests every 60 seconds.
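To make the arithmetic concrete, the sketch below converts the rate into a delay between requests and a rough lower bound on the running time for a given number of ISSNs. The figure of 500 ISSNs is an arbitrary example, and the estimate ignores network latency.

```r
rate <- 30 / 60                 # requests per second, as passed to req_throttle()
seconds_between <- 1 / rate     # delay between consecutive requests, in seconds

n_issn <- 500                   # arbitrary example size, not from the original text

# The first request goes out immediately; each later one waits for the delay.
min_minutes <- (n_issn - 1) * seconds_between / 60

# The chosen rate stays far below the documented limit of 50 requests per second.
within_limit <- rate <= 50
```

A conservative rate like this trades speed for safety; raising it shortens the run but leaves less margin below the limit.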

Now that we have our list of requests, each with its own rate specification, we can use a specific function to perform the queries sequentially.

Code
my_item <- list_queries |> 
  req_perform_sequential(progress = TRUE) 
Waiting 2s for throttling delay ■■■■■■■■■■■■■■■                 
Waiting 2s for throttling delay ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■  
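By default, req_perform_sequential() stops at the first request that fails, for example when an ISSN is not found. A possible variation, sketched here as a hypothetical helper named perform_all(), keeps going and separates successful responses from failures afterwards:

```r
library(httr2)

# Hypothetical helper: perform every request, continuing past failures.
perform_all <- function(requests) {
  responses <- req_perform_sequential(
    requests,
    on_error = "continue",  # keep going instead of stopping at the first error
    progress = TRUE
  )
  list(
    ok = resps_successes(responses),    # responses that completed successfully
    failed = resps_failures(responses)  # error objects for the failed requests
  )
}
```

Calling perform_all(list_queries) would then return a list whose ok element can be processed like my_item, while failed can be inspected to see which ISSNs need attention.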

This new object contains a list of responses. We can parse the JSON body of each response into an R list and explore its content. For example, let’s do this for the first element of the list.

Code
my_item |> 
  pluck(1) |> 
  resp_body_json() 

We can write a custom function to loop over our new list and:

  1. Parse each response in the list as JSON.
  2. Extract the “message” element, which contains the information we are looking for (in this case).
  3. Build a table with the relevant information.
  4. Finish the loop and merge the tables into a single master table.

Code
my_df <- my_item |> 
  map(\(x){
    #browser()
    resp_body_json(x, simplifyVector = TRUE) |> 
        pluck('message') |> 
        {\(y) {
          #browser() 
          tibble(
            "issn" = y |> pluck("ISSN"),
            "publisher" = y |> pluck("publisher"),
            "current_dois" = y |> pluck("counts", "current-dois"),
            "backfile_dois" = y |> pluck("counts", "backfile-dois"),
            "total_dois" = y |> pluck("counts", "total-dois")
            ) |> 
            cbind(
              y |> 
                pluck('coverage-type') |> 
                unlist() |> 
                data.frame() |>
                rownames_to_column('variable') |> 
                pivot_wider(names_from = variable, 
                            values_from = `unlist.pluck.y...coverage.type...`)
            )
          }}()
  }) |> 
   bind_rows()
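The coverage-type step above depends on the column name that data.frame() generates automatically, which is easy to break when the code is edited. One possible alternative, shown on a hand-made nested list whose names and values are placeholders, flattens the list with unlist() and turns the named vector directly into a one-row tibble:

```r
library(tibble)

# Placeholder structure standing in for the 'coverage-type' element;
# the names and numbers are invented for illustration.
coverage <- list(
  current  = list(abstracts = 0.5, orcids = 0.1),
  backfile = list(abstracts = 0.2)
)

# unlist() joins the nested names with a dot, e.g. "current.abstracts".
# as.list() keeps those names, and as_tibble() makes each one a column.
coverage_wide <- as_tibble(as.list(unlist(coverage)))
coverage_wide
```

This avoids both the intermediate data frame and the pivot, so there is no generated column name to keep in sync with the code.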

We can use the DT package to display the result as an interactive table.

Code
my_df |> 
  datatable()

Other resources