Metadata Bucket Updates

Component
  Area:              greenfield
  Description:       A repository containing tools to manage metadata updates
  Quality:           No Sentry, no SONAR
  Upstream services:
  Upstream data:
  Related services:
  Tags:

Metadata Bucket Updates is a repository that contains a number of tools for managing updates to metadata.

Metadata Bucket Builder

Metadata Bucket Builder can build an initial metadata bucket from a snapshot, creating keys in accordance with the metadata bucket spec.

It is built using Python. Python was chosen because it is a good fit for an ad-hoc script like bucket_builder/build_bucket.py and because it has good support in AWS Lambda, which will be used for other metadata update tools.

A command to build the bucket might look something like this:

AWS_PROFILE=crossref-staging python bucket_builder/build_bucket.py --snapshot_path /path/to/extracted/snapshot --destination_bucket=crossref-metadata-bucket-temp

Or, if you don’t want to build to S3, you can build to a local directory:

python bucket_builder/build_bucket.py --snapshot_path /path/to/extracted/snapshot --destination_directory /home/my-user/some-dir
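
For illustration, a minimal sketch of what a builder along these lines might do, assuming the extracted snapshot is a directory tree of per-record XML files and that keys simply mirror each file's path relative to the snapshot root (both assumptions; the authoritative layout is the metadata bucket spec):

import argparse
import pathlib

import boto3


def build_bucket(snapshot_path, destination_bucket):
    s3 = boto3.client("s3")
    root = pathlib.Path(snapshot_path)
    # Walk the extracted snapshot and upload each record to S3.
    for xml_file in root.rglob("*.xml"):
        # Hypothetical key layout: mirror the snapshot's relative path.
        key = str(xml_file.relative_to(root))
        s3.upload_file(str(xml_file), destination_bucket, key)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--snapshot_path", required=True)
    parser.add_argument("--destination_bucket", required=True)
    args = parser.parse_args()
    build_bucket(args.snapshot_path, args.destination_bucket)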

Metadata Kafka Pusher (no longer used)

Metadata Kafka Pusher subscribes to changes to data in the metadata bucket and pushes the keys related to those changes to a Kafka topic.

It is built using Python. Python was used here because it is a good choice for AWS Lambda, which is where this will be run. Keys in the consumed bucket should be created in accordance with the spec defined by the metadata bucket.

Messages will be pushed to the following topics:

  • metadata_s3_update_xml
  • metadata_s3_update_citation

Messages are encoded as JSON and will typically have the following format:

{
  "s3_key": "The S3 key that triggered the message",
  "s3_bucket": "The S3 bucket from which the message was triggered"
}
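
As an illustration of how messages in this format could be produced, here is a minimal sketch of a Lambda handler, assuming S3 event notifications as the trigger and the kafka-python client; the routing rule (keys ending in .xml go to the XML topic, everything else to the citation topic) is an assumption, not documented behaviour:

import json
import os
from urllib.parse import unquote_plus

from kafka import KafkaProducer

# One producer per Lambda container, reused across invocations.
producer = KafkaProducer(
    bootstrap_servers=os.environ.get("KAFKA_HOST", "localhost:9094"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)


def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        # Assumed routing rule: .xml keys to the XML topic,
        # anything else to the citation topic.
        if key.endswith(".xml"):
            topic = os.environ.get("KAFKA_TOPIC_XML", "metadata_s3_update_xml")
        else:
            topic = os.environ.get("KAFKA_TOPIC_CITATION", "metadata_s3_update_citation")
        producer.send(topic, {"s3_key": key, "s3_bucket": bucket})
    producer.flush()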

The following environment variables can be configured:

Environment Variable   Default                       Purpose
KAFKA_HOST             localhost:9094                The Kafka hosts to be used as bootstrap servers
KAFKA_TOPIC_XML        metadata_s3_update_xml        The destination topic for XML metadata
KAFKA_TOPIC_CITATION   metadata_s3_update_citation   The destination topic for JSON citation count data
REPLICATION_FACTOR     1                             The replication factor of KAFKA_TOPIC_XML and KAFKA_TOPIC_CITATION
PARTITION_COUNT        1                             The partition count of KAFKA_TOPIC_XML and KAFKA_TOPIC_CITATION
CREATE_TOPICS          (unset)                       When set to any value, the code will try to create KAFKA_TOPIC_XML and KAFKA_TOPIC_CITATION before using them
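
A sketch of how these variables might be honoured, using kafka-python's admin client; the function name maybe_create_topics and the exact wiring are illustrative assumptions:

import os

from kafka.admin import KafkaAdminClient, NewTopic
from kafka.errors import TopicAlreadyExistsError


def maybe_create_topics():
    # CREATE_TOPICS only has to be present; its value is ignored.
    if "CREATE_TOPICS" not in os.environ:
        return
    admin = KafkaAdminClient(
        bootstrap_servers=os.environ.get("KAFKA_HOST", "localhost:9094"))
    topics = [
        NewTopic(
            name=os.environ.get(var, default),
            num_partitions=int(os.environ.get("PARTITION_COUNT", "1")),
            replication_factor=int(os.environ.get("REPLICATION_FACTOR", "1")),
        )
        for var, default in (
            ("KAFKA_TOPIC_XML", "metadata_s3_update_xml"),
            ("KAFKA_TOPIC_CITATION", "metadata_s3_update_citation"),
        )
    ]
    try:
        admin.create_topics(topics)
    except TopicAlreadyExistsError:
        pass  # Topics already exist; nothing to do.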

Consumers

Consumption of data pushed to the above topics is carried out by:

Service    Purpose
REST API   Uses the published keys to download data from S3 and update its internal indexes
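
For illustration, a consumer along these lines might look like the following sketch, assuming the kafka-python and boto3 clients; the indexing step itself is elided:

import json

import boto3
from kafka import KafkaConsumer

s3 = boto3.client("s3")
consumer = KafkaConsumer(
    "metadata_s3_update_xml",
    bootstrap_servers="localhost:9094",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    # Each message carries the bucket and key of the object that changed.
    obj = s3.get_object(Bucket=message.value["s3_bucket"],
                        Key=message.value["s3_key"])
    body = obj["Body"].read()
    # A real consumer (e.g. the REST API) would now update its indexes.
    print(f"fetched {len(body)} bytes for {message.value['s3_key']}")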