Metadata Bucket Updates

Component
  Area:              greenfield
  Description:       A repository containing tools to manage metadata updates
  Quality:           No Sentry, no SONAR
  Upstream services:
  Upstream data:
  Related services:
  Tags:

Metadata Bucket Updates is a repository that contains a number of tools for managing updates to metadata.

Metadata Bucket Builder

Metadata Bucket Builder can build an initial metadata bucket from a snapshot, creating keys in accordance with the metadata bucket spec.

It is built using Python. Python was chosen because it is a good fit for an ad-hoc script like bucket_builder/build_bucket.py and because it has good support in AWS Lambda, which will be used for other metadata update tools.

A command to build the bucket might look something like this:

AWS_PROFILE=crossref-staging python bucket_builder/build_bucket.py --snapshot_path /path/to/extracted/snapshot --destination_bucket=crossref-metadata-bucket-temp

Or, if you don’t want to build to S3, you can build to a local directory:

python bucket_builder/build_bucket.py --snapshot_path /path/to/extracted/snapshot --destination_directory /home/my-user/some-dir
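
For illustration, a minimal sketch of what a builder along these lines might do, assuming the extracted snapshot is a directory tree of per-record XML files and that keys simply mirror each file's path relative to the snapshot root (both assumptions; the authoritative layout is the metadata bucket spec):

import argparse
import pathlib

import boto3


def build_bucket(snapshot_path, destination_bucket):
    s3 = boto3.client("s3")
    root = pathlib.Path(snapshot_path)
    # Walk the extracted snapshot and upload each record to S3.
    for xml_file in root.rglob("*.xml"):
        # Hypothetical key layout: mirror the snapshot's relative path.
        key = str(xml_file.relative_to(root))
        s3.upload_file(str(xml_file), destination_bucket, key)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--snapshot_path", required=True)
    parser.add_argument("--destination_bucket", required=True)
    args = parser.parse_args()
    build_bucket(args.snapshot_path, args.destination_bucket)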

Metadata Kafka Pusher (no longer used)

Metadata Kafka Pusher subscribes to changes to data in the metadata bucket and pushes the keys related to those changes to a Kafka topic.

It is built using Python. Python was used here because it is a good choice for AWS Lambda, which is where this will be run. Keys in the consumed bucket should be created in accordance with the spec defined by the metadata bucket.

Messages will be pushed to the following topics:

  • metadata_s3_update_xml
  • metadata_s3_update_citation

Messages are encoded as JSON and will typically have the following format:

{
  "s3_key": "The S3 key that triggered the message",
  "s3_bucket": "The S3 bucket from which the message was triggered"
}
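
As an illustration of how messages in this format could be produced, here is a minimal sketch of a Lambda handler, assuming S3 event notifications as the trigger and the kafka-python client; the routing rule (keys ending in .xml go to the XML topic, everything else to the citation topic) is an assumption, not documented behaviour:

import json
import os
from urllib.parse import unquote_plus

from kafka import KafkaProducer

# One producer per Lambda container, reused across invocations.
producer = KafkaProducer(
    bootstrap_servers=os.environ.get("KAFKA_HOST", "localhost:9094"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)


def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        # Assumed routing rule: .xml keys to the XML topic,
        # anything else to the citation topic.
        if key.endswith(".xml"):
            topic = os.environ.get("KAFKA_TOPIC_XML", "metadata_s3_update_xml")
        else:
            topic = os.environ.get("KAFKA_TOPIC_CITATION", "metadata_s3_update_citation")
        producer.send(topic, {"s3_key": key, "s3_bucket": bucket})
    producer.flush()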

The following environment variables can be configured:

Environment Variable   Default                       Purpose
KAFKA_HOST             localhost:9094                The Kafka hosts to be used as bootstrap servers
KAFKA_TOPIC_XML        metadata_s3_update_xml        The destination topic for XML metadata
KAFKA_TOPIC_CITATION   metadata_s3_update_citation   The destination topic for JSON citation count data
REPLICATION_FACTOR     1                             The replication factor of KAFKA_TOPIC_XML and KAFKA_TOPIC_CITATION
PARTITION_COUNT        1                             The partition count of KAFKA_TOPIC_XML and KAFKA_TOPIC_CITATION
CREATE_TOPICS          (unset)                       When set to any value, the code will try to create KAFKA_TOPIC_XML and KAFKA_TOPIC_CITATION before using them
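
A sketch of how these variables might be honoured, using kafka-python's admin client; the function name maybe_create_topics and the exact wiring are illustrative assumptions:

import os

from kafka.admin import KafkaAdminClient, NewTopic
from kafka.errors import TopicAlreadyExistsError


def maybe_create_topics():
    # CREATE_TOPICS only has to be present; its value is ignored.
    if "CREATE_TOPICS" not in os.environ:
        return
    admin = KafkaAdminClient(
        bootstrap_servers=os.environ.get("KAFKA_HOST", "localhost:9094"))
    topics = [
        NewTopic(
            name=os.environ.get(var, default),
            num_partitions=int(os.environ.get("PARTITION_COUNT", "1")),
            replication_factor=int(os.environ.get("REPLICATION_FACTOR", "1")),
        )
        for var, default in (
            ("KAFKA_TOPIC_XML", "metadata_s3_update_xml"),
            ("KAFKA_TOPIC_CITATION", "metadata_s3_update_citation"),
        )
    ]
    try:
        admin.create_topics(topics)
    except TopicAlreadyExistsError:
        pass  # Topics already exist; nothing to do.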

Consumers

Consumption of data pushed to the above topics is carried out by:

Service    Purpose
REST API   Uses the published keys to download data from S3 and update its internal indexes
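
For illustration, a consumer along these lines might look like the following sketch, assuming the kafka-python and boto3 clients; the indexing step itself is elided:

import json

import boto3
from kafka import KafkaConsumer

s3 = boto3.client("s3")
consumer = KafkaConsumer(
    "metadata_s3_update_xml",
    bootstrap_servers="localhost:9094",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    # Each message carries the bucket and key of the object that changed.
    obj = s3.get_object(Bucket=message.value["s3_bucket"],
                        Key=message.value["s3_key"])
    body = obj["Body"].read()
    # A real consumer (e.g. the REST API) would now update its indexes.
    print(f"fetched {len(body)} bytes for {message.value['s3_key']}")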