OAI-PMH Harvesting tool

Legacy

Area: tools-libraries
Language: Java
Description: OAI-PMH Harvesting tool
Production URLs:
Quality: No Sentry, no SONAR
Upstream services:
Upstream data:
Source Code:

Download the tool from GitLab.

The Crossref OAI-PMH harvesting tool gathers all data associated with an OAI-PMH request for the verbs ListIdentifiers, ListRecords, and ListSets. The tool handles resumption tokens and retries failed requests. It simply outputs a stream of harvested data.

To use the tool, download the crossref-harvesting-tool.jar file to a local directory, for example, the $HOME/lib/ directory. To run the tool, use the command

java -jar $HOME/lib/crossref-harvesting-tool.jar --help

The tool will show the usage notice

usage: org.crossref.tools.harvester.CrossrefHarvesterTool --output file --access-token token --attempt-wait milliseconds --maximum-attempts count url...
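
When the tool has to be launched from another Java program, the same invocation can be assembled with ProcessBuilder. The following is a minimal sketch, assuming the jar sits in $HOME/lib/ and that java is on the PATH. The class HarvesterCommand and its methods are hypothetical and not part of the tool; only options listed in the usage notice are used, and treating the access token as optional is an assumption.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of the harvesting tool.
final class HarvesterCommand {

    // Builds "java -jar <jar> [--access-token <token>] <url>...".
    static List<String> build(String jarPath, String accessToken, List<String> urls) {
        List<String> command = new ArrayList<>();
        command.add("java");
        command.add("-jar");
        command.add(jarPath);
        // Assumption: the token may be omitted, so it is only added when given.
        if (accessToken != null && !accessToken.isEmpty()) {
            command.add("--access-token");
            command.add(accessToken);
        }
        command.addAll(urls);
        return command;
    }

    // Starts the tool, forwards its output to this process, and returns its exit code.
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: HarvesterCommand access-token url...");
            System.exit(2);
        }
        String jar = System.getProperty("user.home") + "/lib/crossref-harvesting-tool.jar";
        List<String> urls = Arrays.asList(args).subList(1, args.length);
        System.exit(run(build(jar, args[0], urls)));
    }
}
```

Passing the arguments as a list avoids shell quoting problems with the & characters in OAI-PMH URLs.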

Since the command’s invocation is cumbersome, the helper script harvester is available. Download it to a local directory, for example, the $HOME/bin/ directory.

To view recently changed records, run

harvester --access-token 'AT' 'http://oai.crossref.org/oai?verb=ListIdentifiers&from=2017-03-05'

Replace AT with the Plus user’s access token. This will output the following

2017-03-06 13:44:39.130-0500 INFO harvesting url=http://oai.crossref.org/oai?verb=ListIdentifiers&from=2017-03-05; created=2017-03-06 13:44:39,111-0500
request.url: http://oai.crossref.org/oai?verb=ListIdentifiers&from=2017-03-05
request.begin: 2017-03-06 13:44:39,144-0500
request.end: 2017-03-06 13:44:44,435-0500
handling.begin: 2017-03-06 13:44:45,325-0500
identifier.count: 1
identifier.datestamp: 2017-03-06
identifier.identifier: info:doi/10.1002%2Faic.10055
identifier.doi: 10.1002/aic.10055
identifier.count: 2
identifier.datestamp: 2017-03-06
identifier.identifier: info:doi/10.1002%2Faic.10074
identifier.doi: 10.1002/aic.10074
[...]
identifier.count: 400
identifier.datestamp: 2017-03-05
identifier.identifier: info:doi/10.1371%2Fjournal.pbio.1000121.s006
identifier.doi: 10.1371/journal.pbio.1000121.s006
handling.end: 2017-03-06 13:44:46,877-0500
2017-03-06 13:44:46.878-0500 INFO resuming using c3dc49c5-7a16-4b9b-84da-3b7df6ee2379
2017-03-06 13:44:46.878-0500 INFO harvested url=http://oai.crossref.org/oai?resumptionToken=902fba5e-167e-4b01-a395-377fe47ede56&verb=ListIdentifiers&from=2017-03-05; created=2017-03-06 13:44:45,384-0500
2017-03-06 13:44:46.879-0500 INFO harvesting url=http://oai.crossref.org/oai?verb=ListIdentifiers&resumptionToken=c3dc49c5-7a16-4b9b-84da-3b7df6ee2379&from=2017-03-05; created=2017-03-06 13:44:46,878-0500
request.url: http://oai.crossref.org/oai?verb=ListIdentifiers&resumptionToken=c3dc49c5-7a16-4b9b-84da-3b7df6ee2379&from=2017-03-05
request.begin: 2017-03-06 13:44:46,879-0500
[...]
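
In the sample output, each identifier.identifier value is the DOI with an info:doi/ prefix and percent-encoding (%2F for /). The tool already emits the decoded form as identifier.doi, so the following is only an illustrative sketch of that relationship. It assumes the prefix and encoding seen above hold for every identifier; the class and method names are hypothetical.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the harvesting tool.
final class DoiIdentifiers {
    private static final String PREFIX = "info:doi/";

    // Converts "info:doi/10.1002%2Faic.10055" into "10.1002/aic.10055".
    static String toDoi(String identifier) {
        if (!identifier.startsWith(PREFIX)) {
            throw new IllegalArgumentException("not an info:doi identifier: " + identifier);
        }
        String encoded = identifier.substring(PREFIX.length());
        // URLDecoder performs form decoding and would turn '+' into a space,
        // so literal plus signs are protected before decoding.
        return URLDecoder.decode(encoded.replace("+", "%2B"), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // prints 10.1002/aic.10055
        System.out.println(toDoi("info:doi/10.1002%2Faic.10055"));
    }
}
```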

The records are written as pairs of keys and values. This format makes it easy to select the data you want and then extract it. For example, if you are only interested in the DOIs from the above harvest, run

harvester 'http://oai.crossref.org/oai?verb=ListIdentifiers&from=2017-03-05' | grep '^identifier.doi:' | cut -d' ' -f 2

The output is

10.1002/aic.10055
10.1002/aic.10074
10.1057/palgrave.jors.2601799
10.1057/palgrave.jors.2601810
10.1057/palgrave.jors.2601814
10.1057/palgrave.jors.2601811
10.1057/palgrave.jors.2601815
[...]
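
The same selection can be done in Java when the harvest is consumed by a program rather than a shell pipeline. This is a minimal sketch, assuming every data line has the form "key: value" as in the sample output; the class KeyFilter is hypothetical.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration; not part of the harvesting tool.
final class KeyFilter {

    // Returns the values of all lines that start with "<key>: ", in input order.
    static List<String> valuesOf(String key, Stream<String> lines) {
        String prefix = key + ": ";
        return lines
                .filter(line -> line.startsWith(prefix))
                .map(line -> line.substring(prefix.length()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws Exception {
        // Reads the harvester output from standard input and prints one DOI per line.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8))) {
            valuesOf("identifier.doi", in.lines()).forEach(System.out::println);
        }
    }
}
```

Unlike cut -f 2, this keeps the whole value even if it contains spaces.
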

The set of keys is

handling.begin
handling.end
request.url
request.begin
request.end
set.count
set.setspec
set.setName
identifier.count
identifier.datestamp
identifier.identifier
identifier.doi
record.count
record.datestamp
record.identifier
record.doi
record.metadata

The values of the keys are mostly self-evident. The ‘handling’ keys appear only once and record when handling begins and ends. The ‘request’ keys appear for the initial request and for each subsequent resumption request. The ‘set’, ‘identifier’, and ‘record’ keys appear for each set, identifier, or record in the response, respectively.
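
Because each set, identifier, or record is reported as a run of lines sharing a key prefix, the stream can be grouped into one map per item. The sketch below assumes, based on the sample output, that each item starts with its count key (for example identifier.count) and that every value fits on a single line after the first ": ". Log lines and keys of other kinds are skipped. The class HarvestGrouper is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the harvesting tool.
final class HarvestGrouper {

    // Groups lines of one kind ("set", "identifier", or "record") into one map per item.
    static List<Map<String, String>> group(String kind, List<String> lines) {
        List<Map<String, String>> items = new ArrayList<>();
        Map<String, String> current = null;
        String keyPrefix = kind + ".";
        for (String line : lines) {
            int sep = line.indexOf(": ");
            if (sep < 0 || !line.startsWith(keyPrefix)) {
                continue; // log line, or a key of another kind
            }
            String key = line.substring(0, sep);
            String value = line.substring(sep + 2);
            // Assumption: "<kind>.count" marks the start of a new item.
            if (key.equals(kind + ".count")) {
                current = new LinkedHashMap<>();
                items.add(current);
            }
            if (current != null) {
                current.put(key, value);
            }
        }
        return items;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
                "request.url: http://oai.crossref.org/oai?verb=ListIdentifiers&from=2017-03-05",
                "identifier.count: 1",
                "identifier.datestamp: 2017-03-06",
                "identifier.doi: 10.1002/aic.10055",
                "identifier.count: 2",
                "identifier.datestamp: 2017-03-06",
                "identifier.doi: 10.1002/aic.10074");
        for (Map<String, String> item : group("identifier", sample)) {
            System.out.println(item.get("identifier.datestamp") + " " + item.get("identifier.doi"));
        }
    }
}
```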

To write the harvested data to a file, use the --output option.

To alter the retry behavior, use the --attempt-wait option to set the wait between retry attempts (in milliseconds) and the --maximum-attempts option to set the maximum number of retries (per request).
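
These two options describe a common retry pattern: try a request up to a maximum number of times, waiting a fixed interval between attempts. The following is a generic sketch of that pattern, not the tool's actual implementation. The class Retry and its parameters are hypothetical, and whether the tool counts the first try toward the maximum is not stated here; this sketch counts it.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Generic fixed-wait retry; hypothetical and not taken from the tool's source.
final class Retry {

    // Calls the request until it succeeds or maximumAttempts tries have failed.
    static <T> T call(Callable<T> request, int maximumAttempts, long attemptWaitMillis)
            throws Exception {
        if (maximumAttempts < 1) {
            throw new IllegalArgumentException("maximumAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maximumAttempts; attempt++) {
            try {
                return request.call();
            } catch (InterruptedException e) {
                throw e; // do not retry when the thread is asked to stop
            } catch (Exception e) {
                last = e;
                if (attempt < maximumAttempts) {
                    Thread.sleep(attemptWaitMillis); // wait before the next attempt
                }
            }
        }
        throw last; // every attempt failed; report the final failure
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated request that fails twice and then succeeds.
        String result = call(() -> {
            if (++calls[0] < 3) {
                throw new IOException("simulated failure");
            }
            return "ok";
        }, 5, 10);
        // prints: ok after 3 attempts
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```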