New JSON Snapshots

The new JSON snapshots tool will be compatible with the new ES REST API.

Data flow

graph LR;
  api(REST API<br/>https://api.crossref.org/v1/works);
  tool(Snapshots tool - download);
  verify(Snapshots tool - verification);
  s3-success(Snapshots S3 bucket<br/>monthly/);
  s3-fail(Snapshots S3 bucket<br/>failed/);
  api-read(REST API<br/>https://api.crossref.org/snapshots);
  api --> tool;
  tool --> verify;
  verify -- success --> s3-success;
  verify -- failure --> s3-fail;
  s3-success --> api-read;

The new JSON snapshots tool downloads data from the /works route of the REST API. It should use the PLUS REST API so that limited references are included in the downloaded metadata. The tool uses parallel threads to download the DOIs created on each day separately, covering every day from 2002-07 through the last day of the previous month. In parallel with the download, the retrieved DOI metadata is grouped into JSON files (a few thousand DOIs per file), and those files are added to a .tar.gz file.
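The download and packaging steps described above could be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the function names, the 2002-07-01 start day, the `fetch_day` callback (standing in for the real REST API requests), the numbered file names, and the `{"items": [...]}` file layout are all assumptions made for the example.

```python
import io
import json
import tarfile
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta
from typing import Callable, Iterable, Iterator, List


def days_to_download(today: date, start: date = date(2002, 7, 1)) -> Iterator[date]:
    """Yield every day from `start` through the last day of the previous month."""
    last_day = today.replace(day=1) - timedelta(days=1)
    current = start
    while current <= last_day:
        yield current
        current += timedelta(days=1)


def download_all(days: Iterable[date],
                 fetch_day: Callable[[date], List[dict]],
                 workers: int = 8) -> Iterator[dict]:
    """Fetch each day's records on a thread pool and yield them one by one.

    `fetch_day` is a hypothetical callback that returns the metadata of all
    DOIs created on the given day.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for records in pool.map(fetch_day, days):
            yield from records


def batches(items: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Group items into lists of at most `size` elements."""
    batch: List[dict] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def write_snapshot(path: str, items: Iterable[dict], per_file: int = 3000) -> int:
    """Write items as numbered JSON files into a .tar.gz; return the file count."""
    count = 0
    with tarfile.open(path, "w:gz") as tar:
        for count, batch in enumerate(batches(items, per_file), start=1):
            payload = json.dumps({"items": batch}).encode("utf-8")
            info = tarfile.TarInfo(name=f"{count - 1}.json")  # assumed naming
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return count
```

In this sketch, a single thread writes the archive while the pool downloads, which keeps the tar file consistent without extra locking.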

When the snapshot file is finished, it undergoes verification. The following conditions are checked:

  • whether the size of the snapshot file is larger than the size of the snapshot file from the previous month
  • whether the size of the snapshot file is smaller than twice the size of the snapshot file from the previous month
  • whether the number of DOIs is larger than the number of DOIs in the snapshot file from the previous month
  • whether the number of DOIs is smaller than twice the number of DOIs in the snapshot file from the previous month
  • whether the number of DOIs is consistent with the number of JSON files in the snapshot
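The checks listed above could be expressed roughly as in the sketch below. The `SnapshotStats` type and the `verify` function are hypothetical names. The consistency rule used here (every JSON file except possibly the last one holds exactly `dois_per_file` DOIs) is an assumption, since the exact rule is not specified in this document.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SnapshotStats:
    size_bytes: int
    doi_count: int
    json_file_count: int


def verify(new: SnapshotStats, previous: SnapshotStats, dois_per_file: int) -> List[str]:
    """Return a list of failed checks; an empty list means the snapshot passed."""
    failures: List[str] = []
    if not new.size_bytes > previous.size_bytes:
        failures.append("file size did not grow")
    if not new.size_bytes < 2 * previous.size_bytes:
        failures.append("file size at least doubled")
    if not new.doi_count > previous.doi_count:
        failures.append("DOI count did not grow")
    if not new.doi_count < 2 * previous.doi_count:
        failures.append("DOI count at least doubled")
    # Assumed consistency rule: all files are full except possibly the last.
    expected_files = (new.doi_count + dois_per_file - 1) // dois_per_file
    if new.json_file_count != expected_files:
        failures.append("DOI count inconsistent with number of JSON files")
    return failures
```

Returning the list of failed checks rather than a single boolean makes it easier to see why a snapshot was rejected when it is examined later.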

If the new snapshot passes verification, it is uploaded to S3 and is immediately available to PLUS users through https://api.crossref.org/snapshots. Basic statistics of the snapshot file (file size and number of DOIs) are also uploaded to S3 and will be used to verify the next month's snapshot.

If verification fails, the snapshot is uploaded to S3, along with its statistics, under the failed/ prefix. It is not available to PLUS users, but we can examine it further.
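The routing between the two S3 prefixes could look roughly like the sketch below. Only the monthly/ and failed/ prefixes come from the data flow diagram. The year/month key layout, the object names, the stats field names, and both function names are hypothetical, and the actual upload call is left out.

```python
import json
from typing import Dict


def destination_keys(year: int, month: int, passed: bool) -> Dict[str, str]:
    """Choose S3 keys for the snapshot and its statistics."""
    prefix = "monthly" if passed else "failed"
    base = f"{prefix}/{year:04d}/{month:02d}"  # assumed key layout
    return {
        "snapshot": f"{base}/all.json.tar.gz",  # assumed object name
        "stats": f"{base}/stats.json",          # assumed object name
    }


def stats_document(size_bytes: int, doi_count: int) -> str:
    """Serialize the statistics that next month's verification will read."""
    return json.dumps({"size_bytes": size_bytes, "doi_count": doi_count}, sort_keys=True)
```

Because the statistics are stored in both cases, a failed snapshot can be compared with the previous month's figures during later examination.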