REST API serving snapshots (`/snapshots`)

Legacy

Area: distribution-querying
Language: Java
Description: Serves bulk snapshots of XML and JSON metadata.
Production URLs:
Quality: Sentry, no SONAR
Upstream services:
Upstream data:
Downstream data:
Source Code:
Products:

Snapshot File Contents and Formats

Available snapshot files:

  • all.json.tar.gz
  • all.xml.tar.gz

AWS Access Control

The bulk extracts are stored in the AWS S3 bucket “org.crossref.snapshots”. The bucket is available for read-only access by the “service-snapshot” IAM user. This user has an explicit policy attached to it, so it does not show up in the S3 Permissions interface. The CS services use this user’s access key to grant access to the bucket.

The configuration in the Content System is determined by the following properties in deployment-common.properties:

  • qs.snapshot.aws-bucket-name
  • qs.snapshot.aws-access-key
  • qs.snapshot.aws-secret-key
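
As a rough illustration, a deployment check could confirm that all three properties are present before the service starts. The sketch below is not part of the Content System: the parser handles only simple key=value lines, and the sample values are placeholders.

```python
REQUIRED_KEYS = (
    "qs.snapshot.aws-bucket-name",
    "qs.snapshot.aws-access-key",
    "qs.snapshot.aws-secret-key",
)


def parse_properties(text):
    """Parse simple 'key=value' lines; blank lines and '#'/'!' comments are skipped."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def missing_snapshot_keys(props):
    """Return the required snapshot keys that are absent or empty."""
    return [k for k in REQUIRED_KEYS if not props.get(k)]


# Placeholder values only (hypothetical); real credentials never belong in examples.
sample = """
# snapshot bucket access
qs.snapshot.aws-bucket-name=org.crossref.snapshots
qs.snapshot.aws-access-key=EXAMPLE_ACCESS_KEY
"""
print(missing_snapshot_keys(parse_properties(sample)))
```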

Crossref Access Control

Anyone can use the service to browse the snapshot structure, but downloading is restricted to Plus members: a member needs both the Metadata Plus service and its accompanying access token to download a snapshot. The membership and token data is transferred from Sugar every 4 hours during normal business hours (US/Eastern).

To download a snapshot, the member must include a Crossref-Plus-API-Token HTTP header carrying their access token in the request:

Crossref-Plus-API-Token: Bearer XXX

When a member requests the download URL with the access token, their HTTP client is redirected to download the snapshot from S3 using a time-limited, secure URL. That URL expires after 15 minutes and must be used before then. The time limit is determined by qs.snapshot.url-maximum-age.
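
A client that stores the redirect URL for later use may want to check whether it is still usable. The sketch below assumes the 15-minute limit stated above and treats the exact expiry instant as already expired; the function names are illustrative and do not come from the service.

```python
from datetime import datetime, timedelta, timezone

# Assumed to mirror qs.snapshot.url-maximum-age (15 minutes per the text above).
URL_MAXIMUM_AGE = timedelta(minutes=15)


def url_expires_at(issued_at, maximum_age=URL_MAXIMUM_AGE):
    """Return the moment the time-limited URL stops working."""
    return issued_at + maximum_age


def is_url_usable(issued_at, now, maximum_age=URL_MAXIMUM_AGE):
    """True while 'now' is strictly before the expiry time."""
    return now < url_expires_at(issued_at, maximum_age)


issued = datetime(2019, 7, 1, 12, 0, tzinfo=timezone.utc)
print(is_url_usable(issued, issued + timedelta(minutes=14)))  # True
print(is_url_usable(issued, issued + timedelta(minutes=15)))  # False
```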

The base URL for viewing the snapshot organization is

https://api.crossref.org/snapshots

The navigation interface is HTML. It is built on a schedule from the S3 bucket item details and cached locally as an org.crossref.qs.snapshot.Listing, configured via org.crossref.qs.snapshot.SnapshotController.

An update can be forced via JMX, but this must be done for each deployment, e.g.

$ curl \
  "http://svc1a:8080/jmx/exec/qs.snapshot:name=Controller/updateListing" \
  "http://svc1b:8080/jmx/exec/qs.snapshot:name=Controller/updateListing"

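The same per-deployment update could be scripted. This is a minimal sketch that reuses the host names and JMX path from the curl example; the host list, port default, and function names are assumptions, and a real deployment list may differ.

```python
from urllib.request import urlopen

# Host names taken from the curl example; treat them as placeholders.
DEPLOYMENTS = ["svc1a", "svc1b"]


def update_listing_url(host, port=8080):
    """Build the JMX URL that triggers updateListing on one deployment."""
    return f"http://{host}:{port}/jmx/exec/qs.snapshot:name=Controller/updateListing"


def force_listing_update(hosts=DEPLOYMENTS, timeout=30):
    """Call the JMX endpoint on every host and return {host: HTTP status}."""
    results = {}
    for host in hosts:
        with urlopen(update_listing_url(host), timeout=timeout) as response:
            results[host] = response.status
    return results
```

Calling force_listing_update() only works from a machine that can reach the service hosts.
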
Navigation can be done without a Plus access token. Downloading does require the access token in the authorization header. For example,

$ curl \
  -L \
  -o journals.xml.tar.gz \
  -H 'Authorization: Bearer XXX' \
  'https://api.crossref.org/snapshots/monthly/2018/03/journals.xml.tar.gz'

To download all of a month’s snapshots you could use wget, but we generally expect members to use their own automation to download only the files they want.
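
Such automation might look roughly like the sketch below. It sends the token in the Crossref-Plus-API-Token header described earlier in this section and assumes the service answers with a standard HTTP redirect to S3, which urlopen follows on its own. The function names and the choice of local file name are illustrative only.

```python
import shutil
from urllib.request import Request, urlopen

BASE_URL = "https://api.crossref.org/snapshots"


def download_request(key, token):
    """Build the request for one snapshot key, e.g. 'monthly/2018/03/journals.xml.tar.gz'."""
    return Request(
        f"{BASE_URL}/{key}",
        headers={"Crossref-Plus-API-Token": f"Bearer {token}"},
    )


def download_snapshot(key, token, destination):
    """Stream one snapshot to disk; urlopen follows the redirect to S3."""
    with urlopen(download_request(key, token)) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)


def download_wanted(keys, token):
    """Download only the listed keys, saving each under its file name."""
    for key in keys:
        download_snapshot(key, token, key.rsplit("/", 1)[-1])
```

For example, download_wanted(["monthly/2018/03/journals.xml.tar.gz"], token) would fetch a single file rather than the whole month.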

A shortcut directs to the most recently uploaded set of files, for example

https://api.crossref.org/snapshots/monthly/latest
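
The URL layout shown in these examples can be captured in a small helper. This is a sketch based only on the URLs in this section; the helper name is hypothetical, and other path layouts under the base URL are not covered.

```python
BASE_URL = "https://api.crossref.org/snapshots"

# Shortcut to the most recently uploaded set of files.
LATEST_URL = f"{BASE_URL}/monthly/latest"


def monthly_url(year, month, filename=None):
    """Build the URL of a monthly snapshot directory, or of one file inside it."""
    path = f"{BASE_URL}/monthly/{year:04d}/{month:02d}"
    return f"{path}/{filename}" if filename else path


print(monthly_url(2018, 3, "journals.xml.tar.gz"))
# https://api.crossref.org/snapshots/monthly/2018/03/journals.xml.tar.gz
```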

Usage Data

Each request for a download URL is logged in a “snapshots_usage_YYYYMM” table in the “usages” MySQL database. The data can be requested using the following URL (note: a “from” date is required to produce a result).

http://api.crossref.org/snapshots/usage?from=2019-07-01&until=2019-07-30

This returns a tab-separated list of records. Each record has “id”, “memberid”, “key”, and “requested” columns.
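
The response can be read with any tab-separated parser. The sketch below assumes the four columns appear in the order listed and that the response has no header row; both are assumptions, and the sample record is made up for illustration.

```python
import csv
import io

# Column order assumed from the description above.
COLUMNS = ["id", "memberid", "key", "requested"]


def parse_usage(text):
    """Turn the tab-separated response body into a list of dicts."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [dict(zip(COLUMNS, row)) for row in reader if row]


# Hypothetical record, not real usage data.
sample = "1\t42\tmonthly/2018/03/journals.xml.tar.gz\t2019-07-02T10:15:00\n"
records = parse_usage(sample)
print(records[0]["key"])
```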

You can limit the data by providing query parameters. The parameters are

Parameter  Meaning
memberid   Select only records with the given member id. The value is a decimal integer.
key        Select only records with the given S3 key to the downloaded item. The value is a string, e.g. “key=monthly/2018/03/journals.xml.tar.gz”.
from       Select only records requested at or after the given timestamp. The value is a string formatted YYYY-MM-DDTHH:MM:SS.
until      Select only records requested before the given timestamp (exclusive). The value is a string formatted YYYY-MM-DDTHH:MM:SS.
orderby    Order the results by the named columns (“memberid”, “key”, “requested”). Reverse the order using “desc” (i.e. descending), e.g. “orderby=requested+desc”.
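
A usage query can be assembled from these parameters as follows. The helper is hypothetical and only builds the URL; it relies on urlencode turning the space in “requested desc” into the “+” shown above, and it omits any parameter that is not given.

```python
from urllib.parse import urlencode

USAGE_URL = "http://api.crossref.org/snapshots/usage"


def usage_url(from_, until=None, memberid=None, key=None, orderby=None):
    """Build a usage query URL; 'from_' is required, the rest are optional filters."""
    params = {
        "from": from_,
        "until": until,
        "memberid": memberid,
        "key": key,
        "orderby": orderby,
    }
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{USAGE_URL}?{query}"


print(usage_url("2019-07-01", until="2019-07-30"))
# http://api.crossref.org/snapshots/usage?from=2019-07-01&until=2019-07-30
print(usage_url("2019-07-01T00:00:00", memberid=42, orderby="requested desc"))
```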