graph LR; deposit(Deposit); qs(Query System); restapi(REST API); oaipmh(OAI-PMH); dbs(Oracle DBs); cddb(CDDB); deposit --> dbs; deposit --> cddb; dbs --> qs; cddb --> qs; qs --> oaipmh; qs --> restapi; oaipmh --> users; restapi --> users;
This covers the workflow from pusher to user. It is mostly Cayenne code.
graph LR; pusher(Pusher); feedfolder(feed folder); indexjourn(Cayenne ES Journal Index); ingest(Cayenne Task Ingest Feed); index(Cayenne ES Work Index); api(Cayenne /v1/works); pusher --> |UNIXSD|feedfolder; pusher --> |update|feedfolder; feedfolder --> |UNIXSD|ingest; feedfolder --> |update|ingest; indexjourn --> ingest; ingest --> |ES index|index; ingest --> |ES update|index; index --> api;
The ingestion mechanism for works uses a folder configured as [:dir :data]. It contains log files and the subfolders feed-in, feed-processed and feed-failed mentioned below.
The pusher uses the /feeds API route to push files to Cayenne. Each file is sent in the body of an HTTP POST request, one file per request. This is done every 30 minutes. There is a content verification mechanism based on MD5 checksums. The files are saved by Cayenne in the feed-in folder.
A separate process reads the ingest files from feed-in and ingests them, applying modifications to the works index. Files ingested successfully are moved to feed-processed. Files that couldn’t be ingested due to errors are moved to feed-failed.
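The folder lifecycle described above can be sketched as follows. This is a minimal illustration, not Cayenne's actual implementation (which is not Python): the function name `process_feed_dir` and the `ingest` callback are hypothetical, and only the folder names feed-in, feed-processed and feed-failed come from the description.

```python
import shutil
from pathlib import Path
from typing import Callable


def process_feed_dir(data_dir: Path, ingest: Callable[[Path], None]) -> dict:
    """Ingest every file in feed-in, then move it to feed-processed or feed-failed."""
    feed_in = data_dir / "feed-in"
    processed = data_dir / "feed-processed"
    failed = data_dir / "feed-failed"
    for folder in (feed_in, processed, failed):
        folder.mkdir(parents=True, exist_ok=True)

    outcome = {"processed": [], "failed": []}
    for path in sorted(feed_in.glob("*.body")):
        try:
            # Hypothetical callback that applies the file to the works index.
            ingest(path)
            target, key = processed, "processed"
        except Exception:
            # Any ingestion error sends the file to feed-failed.
            target, key = failed, "failed"
        shutil.move(str(path), str(target / path.name))
        outcome[key].append(path.name)
    return outcome
```

Moving the file only after the ingest attempt finishes means a file that is still in feed-in has not yet been handled, which keeps the three folders a simple record of state.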
An ingest file has some basic metadata encoded in its name. Each ingest file has a name of the form <provider>-<content type>-<file id>.body. Provider is currently always crossref. Content type can be unixsd or update.
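As an illustration of the naming scheme, the sketch below splits a file name into its three parts. The names `IngestFileName` and `parse_ingest_file_name` are hypothetical, and the code assumes that the provider and content type never contain a hyphen, while the file id may.

```python
from typing import NamedTuple


class IngestFileName(NamedTuple):
    provider: str
    content_type: str
    file_id: str


def parse_ingest_file_name(name: str) -> IngestFileName:
    """Split '<provider>-<content type>-<file id>.body' into its parts."""
    suffix = ".body"
    if not name.endswith(suffix):
        raise ValueError(f"not an ingest file: {name!r}")
    stem = name[: -len(suffix)]
    # Split on the first two hyphens only, so hyphens in the file id survive.
    parts = stem.split("-", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"unexpected ingest file name: {name!r}")
    return IngestFileName(*parts)
```

For example, `parse_ingest_file_name("crossref-unixsd-0a1b-2c3d.body")` yields the provider `crossref`, the content type `unixsd` and the file id `0a1b-2c3d`.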
A file of type unixsd contains the full information for a single DOI in XML UNIXSD format. During ingestion, it is either added to the works index as a new document, or fully replaces the existing document in the index. More specifically, Elasticsearch’s bulk API with the index action is used.
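To make the index action concrete, here is a rough sketch of how a bulk request body with index actions is laid out. The function name `bulk_index_body`, the index name `work` and the use of the DOI as the document id are assumptions made for illustration; the conversion from UNIXSD XML to the indexed JSON document is out of scope here.

```python
import json


def bulk_index_body(docs: dict, index_name: str = "work") -> str:
    """Build a newline-delimited bulk body with one index action per document."""
    lines = []
    for doc_id, doc in docs.items():
        # Action line: add the document, or fully replace one with the same id.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        # Source line: the complete document.
        lines.append(json.dumps(doc))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"
```

Because an index action carries the whole document, sending it for an id that already exists overwrites the previous version rather than merging with it.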
Note: UNIXREF is NOT used as the ingest file format. UNIXREF contains effectively only a subset of the information available for a DOI. For example, UNIXREF does not contain
If the ingested item is a journal paper and the journal ISSN is given, the ASJC journal subject names are also attached to the item’s metadata. This uses the
journal ES index to look up the subject names for a given ISSN.
A file of type update contains a JSON structure representing a sequence of updates of the is-referenced-by-count field. Each update in the sequence contains four fields: the type of action (currently always set), the DOI, the field name (currently always is-cited-by-count) and the value of the field. During ingestion, partial updates of the indexed documents are performed. More specifically, we use Elasticsearch’s bulk API with a series of update actions.
Note: There is a difference in the field name between the ingest file (
is-cited-by-count) and the index/REST API JSON output (
is-referenced-by-count). The mapping is done by Cayenne during the ingestion.
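The field-name mapping and the partial updates can be sketched together. This is only an illustration: the JSON key names `action`, `doi`, `field` and `value`, the index name `work` and the use of the DOI as the document id are all assumptions, since the exact layout of the update file is not given here.

```python
import json

# Ingest-file field name -> field name used in the index and REST API output.
FIELD_NAME_MAP = {"is-cited-by-count": "is-referenced-by-count"}


def bulk_update_body(updates: list, index_name: str = "work") -> str:
    """Build a newline-delimited bulk body with one update action per entry."""
    lines = []
    for update in updates:
        if update["action"] != "set":
            raise ValueError(f"unsupported action: {update['action']!r}")
        field = FIELD_NAME_MAP.get(update["field"], update["field"])
        # Action line: partial update of the document identified by the DOI.
        lines.append(json.dumps({"update": {"_index": index_name, "_id": update["doi"]}}))
        # Partial document: only the mapped field is changed.
        lines.append(json.dumps({"doc": {field: update["value"]}}))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"
```

Each update action sends only the changed field, so the rest of the indexed document is left as it is.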
Note: It would be possible to update is-referenced-by-count the same way as all the other fields, by sending the entire XML UNIXSD file. The reason this field is updated separately is performance: we need to perform a lot of is-referenced-by-count updates, and using the ES bulk API with update actions is faster than re-indexing whole documents.
The pusher uses two content types:
application/vnd.crossref.unixsd+xml - for entire documents in XML UNIXSD format
application/vnd.crossref.update+json - for lists of is-referenced-by-count updates in JSON format
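A push of a single file might look roughly like the sketch below, which only builds the request and does not send it. Several details are assumptions rather than documented behaviour: that the type is carried in the Content-Type header, that the MD5 checksum travels in a Content-MD5 header as a base64-encoded digest, and that /feeds sits directly under the base URL. The function name `build_feed_request` is hypothetical.

```python
import base64
import hashlib
import urllib.request

UNIXSD_TYPE = "application/vnd.crossref.unixsd+xml"
UPDATE_TYPE = "application/vnd.crossref.update+json"


def build_feed_request(base_url: str, body: bytes, content_type: str) -> urllib.request.Request:
    """Build (but do not send) an HTTP POST carrying one ingest file."""
    if content_type not in (UNIXSD_TYPE, UPDATE_TYPE):
        raise ValueError(f"unsupported content type: {content_type!r}")
    # Assumption: the checksum is sent as a base64-encoded MD5 digest.
    checksum = base64.b64encode(hashlib.md5(body).digest()).decode("ascii")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/feeds",
        data=body,
        method="POST",
        headers={"Content-Type": content_type, "Content-MD5": checksum},
    )
```

The request could then be sent with `urllib.request.urlopen(request)`, one call per file.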