Metadata Bucket Updates
Field | Value |
---|---|
Component | |
Area | greenfield |
Description | A repository containing tools to manage metadata updates |
Quality | No Sentry, no SONAR |
Upstream services | |
Upstream data | |
Related services | |
Tags | |
Metadata Bucket Updates is a repository that contains a number of tools for managing updates to metadata.
Metadata Bucket Builder
Metadata Bucket Builder builds an initial metadata bucket from a snapshot, laying out the bucket in accordance with the spec defined by the metadata bucket.
It is written in Python, which is a good fit both for ad-hoc scripts such as bucket_builder/build_bucket.py and for AWS Lambda, where the other metadata update tools will run.
A command to build the bucket might look something like this:
AWS_PROFILE=crossref-staging python bucket_builder/build_bucket.py --snapshot_path /path/to/extracted/snapshot --destination_bucket=crossref-metadata-bucket-temp
Or, if you don’t want to build to S3, you can build to a local directory instead:
python bucket_builder/build_bucket.py --snapshot_path /path/to/extracted/snapshot --destination_directory /home/my-user/some-dir
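The actual key layout is defined by the metadata bucket spec, so the following is only a rough sketch of what a builder of this kind does: walk the extracted snapshot and copy each file to the destination, either an S3 bucket or a local directory. The key scheme (the file path relative to the snapshot root) and the internals are assumptions, not the real contents of bucket_builder/build_bucket.py.

# Simplified sketch of a snapshot-to-bucket builder; the real key layout is
# defined by the metadata bucket spec, so treat this as illustrative only.
import argparse
import pathlib
import shutil

import boto3


def build(snapshot_path, destination_bucket=None, destination_directory=None):
    snapshot = pathlib.Path(snapshot_path)
    s3 = boto3.client("s3") if destination_bucket else None
    for source in snapshot.rglob("*"):
        if not source.is_file():
            continue
        # Assumed key scheme: the path of the file relative to the snapshot root.
        key = source.relative_to(snapshot).as_posix()
        if destination_bucket:
            s3.upload_file(str(source), destination_bucket, key)
        else:
            target = pathlib.Path(destination_directory) / key
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copyfile(source, target)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--snapshot_path", required=True)
    parser.add_argument("--destination_bucket")
    parser.add_argument("--destination_directory")
    args = parser.parse_args()
    build(args.snapshot_path, args.destination_bucket, args.destination_directory)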
Metadata Kafka Pusher (no longer used)
Metadata Kafka Pusher subscribes to changes to data in the metadata bucket and pushes the keys related to those changes to a Kafka topic.
It is written in Python, which is a good fit for AWS Lambda, where the pusher runs. Keys in the consumed bucket should be created in accordance with the spec defined by the metadata bucket.
Messages will be pushed to the following topics:
- metadata_s3_update_xml
- metadata_s3_update_citation
Messages are encoded as JSON and will typically have the following format:
{
"s3_key": "The S3 key that triggered the message",
"s3_bucket": "The S3 bucket from which the message was triggered"
}
The following environment variables can be configured:
Environment Variable | Default | Purpose |
---|---|---|
KAFKA_HOST | localhost:9094 | Used to configure the kafka hosts to be used as bootstrap servers |
KAFKA_TOPIC_XML | metadata_s3_update_xml | The destination topic for xml metadata |
KAFKA_TOPIC_CITATION | metadata_s3_update_citation | The destination topic for json citation count data |
REPLICATION_FACTOR | 1 | Used to configure the replication factor of KAFKA_TOPIC_XML and KAFKA_TOPIC_CITATION |
PARTITION_COUNT | 1 | Used to configure the partition count of KAFKA_TOPIC_XML and KAFKA_TOPIC_CITATION |
CREATE_TOPICS | (unset) | When set to any value, the code will try to create KAFKA_TOPIC_XML and KAFKA_TOPIC_CITATION before using them |
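As a rough sketch of how an S3-triggered Lambda of this kind can be wired together (the routing of keys to topics by file suffix and the use of kafka-python are assumptions, not a description of the actual handler):

# Hedged sketch of the S3-triggered Lambda; topic routing by key suffix is an
# assumption, not necessarily how the real pusher chooses between topics.
import json
import os

from kafka import KafkaProducer

KAFKA_HOST = os.environ.get("KAFKA_HOST", "localhost:9094")
KAFKA_TOPIC_XML = os.environ.get("KAFKA_TOPIC_XML", "metadata_s3_update_xml")
KAFKA_TOPIC_CITATION = os.environ.get("KAFKA_TOPIC_CITATION", "metadata_s3_update_citation")

producer = KafkaProducer(
    bootstrap_servers=KAFKA_HOST.split(","),
    value_serializer=lambda value: json.dumps(value).encode("utf-8"),
)


def handler(event, context):
    # Each S3 event record carries the bucket and key that changed.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        topic = KAFKA_TOPIC_XML if key.endswith(".xml") else KAFKA_TOPIC_CITATION
        producer.send(topic, {"s3_key": key, "s3_bucket": bucket})
    producer.flush()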
Consumers
Consumption of data pushed to the above topics is carried out by the following services:
Service | Purpose |
---|---|
REST API | Uses the published keys to download data from S3 and update its internal indexes |
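For reference, a consumer of these topics only needs to decode the JSON message, fetch the referenced object from S3, and update its own index. A minimal sketch, assuming kafka-python and boto3; the index_document step is hypothetical and stands in for whatever indexing the consuming service (such as the REST API) does internally:

# Minimal consumer sketch: read update messages, fetch the object, index it.
import json

import boto3
from kafka import KafkaConsumer

s3 = boto3.client("s3")
consumer = KafkaConsumer(
    "metadata_s3_update_xml",
    "metadata_s3_update_citation",
    bootstrap_servers=["localhost:9094"],
    value_deserializer=lambda value: json.loads(value.decode("utf-8")),
)

for message in consumer:
    update = message.value
    obj = s3.get_object(Bucket=update["s3_bucket"], Key=update["s3_key"])
    body = obj["Body"].read()
    # index_document(update["s3_key"], body)  # hypothetical indexing step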