Work metadata in REST API

Tags	xml-transformation xml json

graph LR; deposit(Deposit); qs(Query System); restapi(REST API); oaipmh(OAI-PMH); dbs(Oracle DBs); cddb(CDDB); deposit --> dbs; deposit --> cddb; dbs --> qs; cddb --> qs; qs --> oaipmh; qs --> restapi; oaipmh --> users; restapi --> users;

This page covers the workflow from the pusher to the user. Most of the code involved is in Cayenne.

The works ingestion mechanism uses a folder configured as [:dir :data]. It contains log files and the subfolders feed-in, feed-processed, and feed-failed, described below.

The pusher uses the /feeds API route to push files to Cayenne. Each file is sent in the body of an HTTP POST request, one file per request. This happens every 30 minutes. Content is verified using MD5 checksums. Cayenne saves the files in the feed-in folder.
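The checksum step can be sketched as follows. This is a minimal illustration, not Cayenne's actual code: how the expected digest reaches the server (header, parameter, or otherwise) is not stated here, so it is passed in as a plain argument, and the function names md5_hex and save_if_valid are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_hex(body: bytes) -> str:
    """Return the hex-encoded MD5 digest of a request body."""
    return hashlib.md5(body).hexdigest()


def save_if_valid(body: bytes, expected_md5: str, feed_in: Path, file_name: str) -> bool:
    """Write the body into feed-in only if its MD5 digest matches.

    Assumption: the expected digest is hex-encoded. Returns True when
    the file was saved and False when the checksum did not match.
    """
    if md5_hex(body) != expected_md5.strip().lower():
        return False
    feed_in.mkdir(parents=True, exist_ok=True)
    (feed_in / file_name).write_bytes(body)
    return True
```

A body whose digest does not match is never written, so only verified files are left in feed-in for the ingest process to pick up.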

A separate process reads the ingest files from feed-in and ingests them, applying modifications to the works index. Files ingested successfully are moved to feed-processed. Files that couldn’t be ingested due to errors are moved to feed-failed.
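The move-on-success, move-on-failure behaviour can be sketched like this. It is an illustrative outline only: the function process_feed_dir and the ingest callback are hypothetical names, and the real process may handle ordering, logging, and error types differently.

```python
import shutil
from pathlib import Path
from typing import Callable, Dict


def process_feed_dir(data_dir: Path, ingest: Callable[[Path], None]) -> Dict[str, int]:
    """Ingest every *.body file in feed-in and move it according to the outcome."""
    feed_in = data_dir / "feed-in"
    processed = data_dir / "feed-processed"
    failed = data_dir / "feed-failed"
    for folder in (feed_in, processed, failed):
        folder.mkdir(parents=True, exist_ok=True)

    counts = {"processed": 0, "failed": 0}
    for path in sorted(feed_in.glob("*.body")):
        try:
            ingest(path)  # hypothetical callback that updates the works index
        except Exception:
            # Any ingestion error sends the file to feed-failed.
            shutil.move(str(path), str(failed / path.name))
            counts["failed"] += 1
        else:
            shutil.move(str(path), str(processed / path.name))
            counts["processed"] += 1
    return counts
```

Because each file is moved out of feed-in whether or not ingestion succeeds, the folder only ever holds files that have not been attempted yet.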

An ingest file has some basic metadata encoded in its name. Each ingest file has a name of the form <provider>-<content type>-<file id>.body. The provider is currently always crossref. The content type is either unixsd or update.
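As an illustration of the naming scheme, the sketch below splits a file name into its three parts. It assumes that the provider and content type never contain a hyphen while the file id may; that assumption, and the names IngestFileName and parse_ingest_file_name, are not taken from Cayenne.

```python
from typing import NamedTuple

KNOWN_CONTENT_TYPES = {"unixsd", "update"}
SUFFIX = ".body"


class IngestFileName(NamedTuple):
    provider: str
    content_type: str
    file_id: str


def parse_ingest_file_name(name: str) -> IngestFileName:
    """Parse '<provider>-<content type>-<file id>.body' into its parts."""
    if not name.endswith(SUFFIX):
        raise ValueError(f"not an ingest file: {name!r}")
    stem = name[: -len(SUFFIX)]
    # Split on the first two hyphens only, so a file id may contain hyphens.
    parts = stem.split("-", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"unexpected ingest file name: {name!r}")
    provider, content_type, file_id = parts
    if content_type not in KNOWN_CONTENT_TYPES:
        raise ValueError(f"unknown content type: {content_type!r}")
    return IngestFileName(provider, content_type, file_id)
```

For example, parse_ingest_file_name("crossref-update-42.body") yields provider crossref, content type update, and file id 42.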

A file of type unixsd contains the full information for a single DOI in XML UNIXSD format. During ingestion, it is either added to the works index as a new document or fully replaces the existing document in the index. More specifically, Elasticsearch’s bulk API is used with the index action.
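For reference, a bulk request with index actions is a newline-delimited JSON body in which each action line is followed by the document source. The sketch below only builds such a body. It assumes the UNIXSD XML has already been converted to a JSON-like dict, that the DOI is used as the document id, and that the index is named work; all three are assumptions for illustration, as is the function name.

```python
import json
from typing import Any, Dict


def bulk_index_body(docs_by_id: Dict[str, Dict[str, Any]], index_name: str = "work") -> str:
    """Build a newline-delimited bulk body made of index actions.

    An index action adds the document if the id is new and fully
    replaces the stored document if the id already exists.
    """
    lines = []
    for doc_id, doc in docs_by_id.items():
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(doc))
    # The bulk API requires every line, including the last, to end with a newline.
    return "".join(line + "\n" for line in lines)
```

The add-or-replace semantics come from the index action itself, so the ingester does not need to check whether the DOI is already indexed.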

Note: UNIXREF is NOT used as the ingest file format. UNIXREF contains effectively only a subset of the information available for a DOI. For example, UNIXREF does not contain crm-item elements.

If the ingested item is a journal paper and the journal ISSN is given, the ASJC journal subject names are also attached to the item’s metadata. This uses the journal ES index to look up the subject names for a given ISSN.

A file of type update contains a JSON structure representing a sequence of updates to the is-referenced-by-count field. Each update in the sequence contains four fields: the type of action (currently always "set"), the DOI, the field name (currently always is-cited-by-count), and the value of the field. During ingestion, partial updates of the indexed documents are performed. More specifically, we use Elasticsearch’s bulk API with a series of update actions.

Note: There is a difference in the field name between the ingest file (is-cited-by-count) and the index/REST API JSON output (is-referenced-by-count). The mapping is done by Cayenne during the ingestion.
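The update path, including the field name mapping, might look roughly like the sketch below. The exact JSON layout of an update file is not specified here, so the key names action, doi, field, and value are hypothetical, as are the index name work, the use of the DOI as document id, and the function name.

```python
import json
from typing import Any, Dict, List

# Field name in the ingest file -> field name in the index / REST API output.
FIELD_NAME_MAP = {"is-cited-by-count": "is-referenced-by-count"}


def bulk_update_body(updates: List[Dict[str, Any]], index_name: str = "work") -> str:
    """Build a newline-delimited bulk body made of partial update actions."""
    lines = []
    for update in updates:
        if update["action"] != "set":
            raise ValueError(f"unsupported action: {update['action']!r}")
        field = FIELD_NAME_MAP.get(update["field"], update["field"])
        lines.append(json.dumps({"update": {"_index": index_name, "_id": update["doi"]}}))
        # A partial update sends only the changed field under "doc".
        lines.append(json.dumps({"doc": {field: update["value"]}}))
    return "".join(line + "\n" for line in lines)
```

Each update action carries a single field rather than a whole document, which keeps the payload small when many counts change at once.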

Note: It would be possible to update is-referenced-by-count the same way as all the other fields, by sending the entire XML UNIXSD file. The reason this field is updated separately is performance. We need to perform a lot of is-referenced-by-count updates, and using the ES bulk API with update actions is faster.

Other notes:

The pusher uses two media types:

application/vnd.crossref.unixsd+xml - for the entire documents in XML UNIXSD format
application/vnd.crossref.update+json - for the lists of is-referenced-by-count updates in JSON format
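To show how the two types relate to the file content types, here is a small sketch that builds (but does not send) a push request. It assumes the media type is sent in the Content-Type header, which is not stated above; the base URL passed in and the function name build_feed_request are also hypothetical.

```python
import urllib.request

# Ingest file content type -> media type used by the pusher.
MEDIA_TYPES = {
    "unixsd": "application/vnd.crossref.unixsd+xml",
    "update": "application/vnd.crossref.update+json",
}


def build_feed_request(base_url: str, content_type: str, body: bytes) -> urllib.request.Request:
    """Build a POST request to the /feeds route carrying one file in its body."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/feeds",
        data=body,
        method="POST",
        # Assumption: the media type travels in the Content-Type header.
        headers={"Content-Type": MEDIA_TYPES[content_type]},
    )
```

Passing the returned object to urllib.request.urlopen would send it; that step is left out so the sketch has no network side effects.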