DR-0500: Denormalize relationships for /relationships REST API endpoint

Summary

We query the /relationships endpoint based on properties of the relationship itself, and on properties of the subject and object items. To scale this, we need to denormalize all queried fields into one table. We didn’t want to do this initially, but found that it was necessary at the scale and characteristics of the data we have.

Context and Problem Statement

We serve relationship matches through the /relationships REST API endpoint, and allow filtering for fields that are intrinsic to the relationship:

  • subject.id
  • object.id
  • relationship-type
  • from-updated-time
  • until-updated-time
  • asserted-by
  • not-asserted-by
  • linked-by
  • not-linked-by

and those that are intrinsic to the subject or object item:

  • subject.collection
  • subject.type
  • subject.registration-agency
  • object.collection
  • object.type
  • object.registration-agency

We know we want to these add additional filters:

  • subject.id.domain
  • object.id.domain
  • subject.steward
  • object.steward

These are naturally modelled (simplified):

erDiagram
    relationship_current }|--|| item_info : subject_pk
    relationship_current }|--|| item_info : object_pk
    relationship_current }|--|| item_info : asserted_by
    relationship_current }|--|| item_info : linked_by

    relationship_current {
        bigint subject_pk
        int relationship_type
        bigint object_pk
        time updated_time
        bigint asserted_by_pk
        bigint linked_by_pk
    }

    item_info {
        bigint item_pk
        string type
        string subtype
    }

To query these, it’s necessary to JOIN the tables on query. Performance was satisfactory until we hit a few hundred million, at which point it decreased to the point of timing out.

We determined through experimentation with the query planner that the query plan can change drastically based on the characteristics of the data and query.

Decision Drivers

  • Fast querying of known use cases.
    • Better UX
    • Cost and scalability
  • Flexibility to meet future use cases.
  • Cost of maintaining PostgreSQL database.
  • Cost of maintaining code.
  • Up-to-dateness of endpoint, and ability to maintain throughput.
  • Ability to scale up to full dataset (hundreds of millions of items, billions of relationships)
  • Allow for stable sorting of resultset so we can paginate with cursors.

Considered Options

There was one obvious solution (full denormalize), and one more creative solution (distinct values). Sample query plans provided at the end of the page.

Option A: Original

Query the relationship_current table. JOIN onto item_info on subject_pk and object_pk where there are filters on those values.

Pros:

  • Simple, obvious data model.
  • Doesn’t rely on extra synchronization.

Cons:

  • Hit a performance problem when we scaled up.

The fundamental issue was that querying only by relationship_current fields was fast. Querying only by item_info fields in the subject or object position was acceptable, if not fast. But in combination, the query planner struggled.

This is because the sort field (relationship.pk) wasn’t present on the item_info table, requiring a full table scan, then sort.

Depending on different characteristics and table statistics, the query plan also switched.

Option B: Full Denormalize

Replicate all relevevant fields onto the relationships table. Make a new table, called relationship_api with this denormalized data.

Pros:

  • Querying does’t involve any JOINs meaning greater predictability from the query planner.
  • Queries for supported indexes are very fast (of the order of 5 milliseconds where before they took upwards of 5000 milliseconds)
  • Query plan very simple.
  • Query planner consistent and stable across different queries.
  • Simple to understand, maintain and extend.
  • Can deal with unexpected cardinalities (e.g. adding the ‘domains’ filter).
  • Allows us to add filters when we need them.

Cons:

  • Requires indexes for each combination of values.
    • Unanticipated queries can be slow.
    • Requires product management understanding of all queries.
    • Index bloat potentially slows down ingestion and increases storage.
  • Requires separate maintenance job to keep table synchronized

This was the obvious solution, but we were hesitant because of the potential downsides. However, performance was excellent when tested with 250 million rows.

Option C: Distinct Values

Create a table that represents all combinations of filterable values for a given Item. For example, this table might have the following entries

value_pk type subtype
1 work journal
2 work journal-article
3 work book
4 work book-chapter
5 org
6 person

We could therefore store only the value_pk on the relationships table. Querying for this would involve filtering the values table for matching items, and then JOIN.

For example, filtering subject.type=work would return the values (1, 2, 3, 4) which we could then JOIN to the subject_value_pk.

In theory this would drastically cut down the number of columns we would need to store.

However, in testing, we ran into the same issues as the original concept. The JOIN query plans were unpredictable depending on the number of matched rows. The sort by relationship.pk, necessary for pagination, also hit some painful sorts. Especially for values that returned large amounts of the table.

We also had the question of adding more fields. This table is effectively the product of all values on all item columns, although it only contains combinations that occur. This table could multiply rapidly if the characteristics of the data change. In theory, if we continue to add more low-cardiality fields (e.g. small vocabularies), this might not happen. But adding domain (which has of the order of 50,000 values) and steward (which has of the order of 20,000 values) would quickly multiply the size of the table. This would make the JOIN query plan even slower and more unpredictable.

Pros:

  • Only takes 2 columns (subject and object) in the relationships table
    • Cuts down index bloat
    • Cuts down storage
    • Cuts down impact of updates
  • Can add fields in future without altering the relationships table

Cons:

  • Unpredictable at scale.
  • Higher cardinatlioty fields would be risky to add.

Decision Outcome

Option B: Full denormalization

Positive Consequences

  • Very fast queries for supported queries.
  • Simpler code.
  • Extensible to more fields.

Negative Consequences

  • Some delay in updating the API (of the order of 1 hour in testing)
  • Potential for index and storage bloat.

Appendix

Query Plan for option A:

EXPLAIN ANALYZE
SELECT
    relationship_match_current.pk,
    subject_item_pk,
    relationship_type_pk,
    object_item_pk,
    asserted_at,
    subject_item_info.type AS subject_type,
    subject_item_info.subtype AS subject_subtype,
    object_item_info.type AS object_type,
    object_item_info.subtype AS object_subtype,
    item_tree_asserting_party_pk,
    matching_party_pk
FROM relationship_match_current
LEFT JOIN item_info_view subject_item_info 
    ON subject_item_pk = subject_item_info.root_item_pk 
LEFT JOIN item_info_view object_item_info 
    ON object_item_pk = object_item_info.root_item_pk 
JOIN item_identifier subject_item_identifier 
    ON subject_item_pk = subject_item_identifier.item_pk
JOIN identifier_ra_lookup subject_identifier_ra_lookup
    ON subject_item_identifier.pk = subject_identifier_ra_lookup.item_identifier_pk
JOIN ra_lookup subject_ra_lookup 
    ON subject_identifier_ra_lookup.ra_lookup_pk = subject_ra_lookup.pk
JOIN item_identifier object_item_identifier 
    ON object_item_pk = object_item_identifier.item_pk
JOIN identifier_ra_lookup object_identifier_ra_lookup
    ON object_item_identifier.pk = object_identifier_ra_lookup.item_identifier_pk
JOIN ra_lookup object_ra_lookup 
    ON object_identifier_ra_lookup.ra_lookup_pk = object_ra_lookup.pk
WHERE subject_ra_lookup.registration_agency_pk = 2
AND relationship_type_pk = 2
AND object_ra_lookup.registration_agency_pk = 3
ORDER by pk 
LIMIT 1000;

/* 
Limit  (cost=2361495.79..2361612.47 rows=1000 width=92) (actual time=14329.027..14671.499 rows=1000 loops=1)
  ->  Gather Merge  (cost=2361495.79..2388825.23 rows=234236 width=92) (actual time=14329.025..14671.460 rows=1000 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Sort  (cost=2360495.77..2360788.56 rows=117118 width=92) (actual time=14325.267..14325.330 rows=687 loops=3)
              Sort Key: relationship_match_current.pk
              Sort Method: top-N heapsort  Memory: 289kB
              Worker 0:  Sort Method: top-N heapsort  Memory: 287kB
              Worker 1:  Sort Method: top-N heapsort  Memory: 289kB
              ->  Hash Join  (cost=1311218.27..2354074.32 rows=117118 width=92) (actual time=12942.155..14323.710 rows=5513 loops=3)
                    Hash Cond: (object_identifier_ra_lookup.ra_lookup_pk = object_ra_lookup.pk)
                    ->  Parallel Hash Left Join  (cost=1310674.54..2345228.98 rows=3161915 width=96) (actual time=12940.336..14129.622 rows=4400266 loops=3)
                          Hash Cond: (relationship_match_current.object_item_pk = object_item_info.root_item_pk)
                          ->  Parallel Hash Left Join  (cost=1152026.26..2042782.69 rows=3161915 width=76) (actual time=10318.796..11353.472 rows=4400266 loops=3)
                                Hash Cond: (relationship_match_current.subject_item_pk = subject_item_info.root_item_pk)
                                ->  Parallel Hash Join  (cost=993377.99..1758862.40 rows=3161915 width=56) (actual time=7731.136..8991.911 rows=4400266 loops=3)
                                      Hash Cond: (relationship_match_current.object_item_pk = object_item_identifier.item_pk)
                                      ->  Parallel Hash Join  (cost=490014.79..1121595.21 rows=3763779 width=52) (actual time=3871.243..4622.224 rows=4400266 loops=3)
                                            Hash Cond: (relationship_match_current.subject_item_pk = subject_item_identifier.item_pk)
                                            ->  Parallel Seq Scan on relationship_match_current  (cost=0.00..460375.62 rows=5557342 width=52) (actual time=0.034..806.719 rows=4413152 loops=3)
                                                  Filter: (relationship_type_pk = 2)
                                                  Rows Removed by Filter: 1709868
                                            ->  Parallel Hash  (cost=438578.13..438578.13 rows=3135173 width=8) (actual time=2264.095..2264.100 rows=3102515 loops=3)
                                                  Buckets: 262144  Batches: 64  Memory Usage: 7808kB
                                                  ->  Parallel Hash Join  (cost=274220.55..438578.13 rows=3135173 width=8) (actual time=1497.667..2037.397 rows=3102515 loops=3)
                                                        Hash Cond: (subject_identifier_ra_lookup.item_identifier_pk = subject_item_identifier.pk)
                                                        ->  Hash Join  (cost=757.86..109787.61 rows=3135173 width=8) (actual time=2.023..385.838 rows=3102515 loops=3)
                                                              Hash Cond: (subject_identifier_ra_lookup.ra_lookup_pk = subject_ra_lookup.pk)
                                                              ->  Parallel Seq Scan on identifier_ra_lookup subject_identifier_ra_lookup  (cost=0.00..98819.35 rows=3888935 width=12) (actual time=0.013..144.773 rows=3116533 loops=3)
                                                              ->  Hash  (cost=533.41..533.41 rows=17956 width=4) (actual time=1.988..1.989 rows=17956 loops=3)
                                                                    Buckets: 32768  Batches: 1  Memory Usage: 888kB
                                                                    ->  Seq Scan on ra_lookup subject_ra_lookup  (cost=0.00..533.41 rows=17956 width=4) (actual time=0.008..1.159 rows=17956 loops=3)
                                                                          Filter: (registration_agency_pk = 2)
                                                                          Rows Removed by Filter: 4317
                                                        ->  Parallel Hash  (cost=192993.86..192993.86 rows=4629186 width=16) (actual time=765.357..765.358 rows=3703170 loops=3)
                                                              Buckets: 262144  Batches: 128  Memory Usage: 6176kB
                                                              ->  Parallel Seq Scan on item_identifier subject_item_identifier  (cost=0.00..192993.86 rows=4629186 width=16) (actual time=0.025..345.458 rows=3703170 loops=3)
                                      ->  Parallel Hash  (cost=435762.51..435762.51 rows=3888935 width=12) (actual time=2249.728..2249.731 rows=3116533 loops=3)
                                            Buckets: 262144  Batches: 128  Memory Usage: 5536kB
                                            ->  Parallel Hash Join  (cost=166420.04..435762.51 rows=3888935 width=12) (actual time=1368.722..1971.852 rows=3116533 loops=3)
                                                  Hash Cond: (object_item_identifier.pk = object_identifier_ra_lookup.item_identifier_pk)
                                                  ->  Parallel Seq Scan on item_identifier object_item_identifier  (cost=0.00..192993.86 rows=4629186 width=16) (actual time=0.059..353.821 rows=3703170 loops=3)
                                                  ->  Parallel Hash  (cost=98819.35..98819.35 rows=3888935 width=12) (actual time=555.341..555.341 rows=3116533 loops=3)
                                                        Buckets: 262144  Batches: 128  Memory Usage: 5536kB
                                                        ->  Parallel Seq Scan on identifier_ra_lookup object_identifier_ra_lookup  (cost=0.00..98819.35 rows=3888935 width=12) (actual time=0.039..201.536 rows=3116533 loops=3)
                                ->  Parallel Hash  (cost=83337.57..83337.57 rows=3894857 width=28) (actual time=536.778..536.778 rows=3116533 loops=3)
                                      Buckets: 131072  Batches: 128  Memory Usage: 4128kB
                                      ->  Parallel Seq Scan on item_info_view subject_item_info  (cost=0.00..83337.57 rows=3894857 width=28) (actual time=0.016..180.314 rows=3116533 loops=3)
                          ->  Parallel Hash  (cost=83337.57..83337.57 rows=3894857 width=28) (actual time=533.069..533.069 rows=3116533 loops=3)
                                Buckets: 131072  Batches: 128  Memory Usage: 4128kB
                                ->  Parallel Seq Scan on item_info_view object_item_info  (cost=0.00..83337.57 rows=3894857 width=28) (actual time=0.035..188.027 rows=3116533 loops=3)
                    ->  Hash  (cost=533.41..533.41 rows=825 width=4) (actual time=1.340..1.340 rows=825 loops=3)
                          Buckets: 1024  Batches: 1  Memory Usage: 38kB
                          ->  Seq Scan on ra_lookup object_ra_lookup  (cost=0.00..533.41 rows=825 width=4) (actual time=0.021..1.288 rows=825 loops=3)
                                Filter: (registration_agency_pk = 3)
                                Rows Removed by Filter: 21448
Planning Time: 25.793 ms
Execution Time: 14672.202 ms
*/

Query plan for option B:


-- STM
EXPLAIN ANALYZE SELECT *
FROM relationship_api ra 
WHERE relationship_type_pk = 70
AND item_tree_asserting_party_pk = 138700
AND object_subtype_pk = 4
AND asserted_at >= '2024-01-01'
AND asserted_at <= '2024-02-01'
ORDER BY pk
LIMIT 1000;

/*
Limit  (cost=0.56..8.61 rows=1 width=66) (actual time=0.027..0.568 rows=1000 loops=1)
  ->  Index Scan using stm_data_citations_idx on relationship_api ra  (cost=0.56..8.61 rows=1 width=66) (actual time=0.027..0.534 rows=1000 loops=1)
        Index Cond: ((relationship_type_pk = 70) AND (item_tree_asserting_party_pk = 138700) AND (object_subtype_pk = 4) AND (asserted_at >= '2024-01-01 00:00:00+00'::timestamp with time zone) AND (asserted_at <= '2024-02-01 00:00:00+00'::timestamp with time zone))
Planning Time: 0.081 ms
Execution Time: 0.595 ms
*/

-- DataCite
EXPLAIN ANALYZE SELECT *
FROM relationship_api ra 
WHERE relationship_type_pk = 2
AND object_ra_pk = 3
AND asserted_at >= '2024-01-01'
AND asserted_at <= '2024-02-01'
ORDER BY pk
LIMIT 1000;

/*
Limit  (cost=0.56..213.77 rows=1000 width=66) (actual time=8.192..8.539 rows=1000 loops=1)
  ->  Index Scan using datacite_data_citations_idx on relationship_api ra  (cost=0.56..1558699.13 rows=7310787 width=66) (actual time=8.192..8.505 rows=1000 loops=1)
        Index Cond: ((relationship_type_pk = 2) AND (object_ra_pk = 3) AND (asserted_at >= '2024-01-01 00:00:00+00'::timestamp with time zone) AND (asserted_at <= '2024-02-01 00:00:00+00'::timestamp with time zone))
Planning Time: 0.057 ms
Execution Time: 8.564 ms
*/

-- Event Data
EXPLAIN ANALYZE SELECT *
FROM relationship_api ra 
WHERE relationship_type_pk = 2
AND asserted_at >= '2024-01-01'
AND asserted_at <= '2024-02-01'
ORDER BY pk
LIMIT 1000;

/*
Limit  (cost=0.56..174.77 rows=1000 width=66) (actual time=6.317..6.647 rows=1000 loops=1)
  ->  Index Scan using event_data_idx on relationship_api ra  (cost=0.56..1669498.34 rows=9583587 width=66) (actual time=6.316..6.613 rows=1000 loops=1)
        Index Cond: ((relationship_type_pk = 2) AND (asserted_at >= '2024-01-01 00:00:00+00'::timestamp with time zone) AND (asserted_at <= '2024-02-01 00:00:00+00'::timestamp with time zone))
Planning Time: 0.066 ms
Execution Time: 6.677 ms
*/

Query Plan for option C:

EXPLAIN ANALYZE
SELECT
    relationship_api.pk,
    subject_item_pk,
    relationship_type_pk,
    object_item_pk,
    asserted_at,
    subject_item_structure.type_pk AS subject_type_pk,
    subject_item_structure.subtype_pk AS subject_subtype_pk,
    object_item_structure.type_pk AS object_type_pk,
    object_item_structure.subtype_pk AS object_subtype_pk,
    item_tree_asserting_party_pk,
    matching_party_pk
FROM relationship_api
JOIN item_structure subject_item_structure 
ON subject_item_structure_pk = subject_item_structure.pk 
JOIN item_structure object_item_structure 
ON object_item_structure_pk = object_item_structure.pk 
WHERE subject_item_structure.ra_pk = 2
AND relationship_type_pk = 2
AND object_item_structure.ra_pk = 3
ORDER by pk 
LIMIT 1000;

/*
Limit  (cost=0.74..3699.96 rows=1000 width=60) (actual time=137.704..539.370 rows=1000 loops=1)
  ->  Nested Loop  (cost=0.74..1991544.96 rows=538369 width=60) (actual time=137.702..539.325 rows=1000 loops=1)
        ->  Nested Loop  (cost=0.59..1933842.03 rows=2320751 width=60) (actual time=137.693..539.078 rows=1000 loops=1)
              ->  Index Scan using relationship_api_pkey on relationship_api  (cost=0.44..1604473.36 rows=13230068 width=60) (actual time=0.076..346.889 rows=1477150 loops=1)
                    Filter: (relationship_type_pk = 2)
                    Rows Removed by Filter: 463725
              ->  Memoize  (cost=0.15..0.17 rows=1 width=8) (actual time=0.000..0.000 rows=0 loops=1477150)
                    Cache Key: relationship_api.object_item_structure_pk
                    Cache Mode: logical
                    Hits: 1477119  Misses: 31  Evictions: 0  Overflows: 0  Memory Usage: 3kB
                    ->  Index Scan using item_structure_pkey on item_structure object_item_structure  (cost=0.14..0.16 rows=1 width=8) (actual time=0.002..0.002 rows=0 loops=31)
                          Index Cond: (pk = relationship_api.object_item_structure_pk)
                          Filter: (ra_pk = 3)
                          Rows Removed by Filter: 1
        ->  Memoize  (cost=0.15..0.17 rows=1 width=8) (actual time=0.000..0.000 rows=1 loops=1000)
              Cache Key: relationship_api.subject_item_structure_pk
              Cache Mode: logical
              Hits: 994  Misses: 6  Evictions: 0  Overflows: 0  Memory Usage: 1kB
              ->  Index Scan using item_structure_pkey on item_structure subject_item_structure  (cost=0.14..0.16 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=6)
                    Index Cond: (pk = relationship_api.subject_item_structure_pk)
                    Filter: (ra_pk = 2)
Planning Time: 1.987 ms
Execution Time: 539.481 ms
 */
Last modified June 7, 2024: docs: more handover updates (f04b34d)