PostgreSQL Arrays vs Junction Tables: When Breaking Normalization Rules Makes Sense

The Debate That Changed My Mind

I had a heated discussion today about a PostgreSQL table using array columns. My immediate reaction was textbook: “Why aren’t we using relationships instead of arrays? We’re breaking normalization!”

For years, I’ve held firm to the principle that relational databases should embrace relationships. Arrays felt like a code smell—a shortcut that sacrifices data integrity for convenience. But this conversation made me question whether my stance was principled or just archaic.

So I dug in. What are the real use cases and trade-offs of using arrays in a relational database? The answer surprised me: sometimes breaking the rules is the right choice.

The Bottom Line Up Front

PostgreSQL arrays offer roughly half the storage footprint and up to 10x faster reads for containment queries compared to junction tables—but only when used appropriately. Arrays excel for small, rarely modified collections that share a lifecycle with their parent row, such as tags on a blog post or permission roles. However, they sacrifice referential integrity entirely (no foreign keys exist for array elements), require expensive full-row rewrites on every modification, and can severely degrade write performance when combined with GIN indexes.

The decision hinges on three critical factors: modification frequency, referential integrity requirements, and expected collection size. For most transactional applications requiring data consistency, junction tables remain the default choice. Arrays should be adopted as a deliberate performance optimization, either for well-established patterns (like blog tagging) or when you have concrete evidence that junction tables create a measurable bottleneck.

Just because PostgreSQL supports arrays doesn’t make them the right architectural choice by default. They’re a powerful tool for specific problems—not a replacement for proper relational modeling.

When Arrays Actually Outperform Normalized Relationships

The performance numbers are striking. Benchmark data from production systems shows array columns with GIN indexes completing containment queries in 869 microseconds versus 8.32 milliseconds for equivalent junction table joins—approximately 9.6x faster at scale (25,000 posts, 10,000 tags, 15 tags per post). For asymmetric relationships where one side contains few elements while the other contains many, arrays can achieve 20-30x performance gains over traditional joins.

Storage efficiency provides another compelling advantage. Array models consume approximately 55% less storage than equivalent relational models in Crunchy Data’s benchmarks (199MB versus 446MB for 1.7 million records with 4.7 million relationships). This reduction comes from eliminating junction table overhead—no separate table, no additional primary keys, no foreign key columns.

The canonical use case is tagging systems where tags are treated as a unit rather than independently queried. Dimitri Fontaine, a PostgreSQL major contributor, specifically identifies tags as “a classic example of a good use case for PostgreSQL arrays.” Other strong candidates include:

  • Permission roles (small, fixed sets like ['admin', 'user', 'editor'])
  • IP address whitelists
  • Product attributes in e-commerce
  • Analytics event storage where data is written once and read many times

Here’s the key insight from production experience: array size matters far less than modification frequency. An array containing 10,000 elements written once and read thereafter is entirely valid. An array of 50 elements appended to on every incoming request becomes a performance bottleneck because PostgreSQL must copy and rewrite the entire row for every modification—arrays are atomic values under MVCC with no in-place updates.
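
To make that cost concrete, here is the kind of hot-path append that becomes a bottleneck. This is just a sketch; the sessions table and request_ids column are hypothetical:

-- Each append copies and rewrites the entire row (and re-TOASTs the
-- array if it has grown past ~2KB), so doing this per request adds up:
UPDATE sessions
SET request_ids = array_append(request_ids, 12345)
WHERE id = 42;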

GIN Indexing: Fast Searches with a Hidden Cost

The GIN (Generalized Inverted Index) makes array containment queries practical by creating an inverted index mapping each element to rows containing it. GIN lookups are approximately 3x faster than GiST lookups for equivalent queries. Creating a GIN index is straightforward:

CREATE INDEX idx_tags ON articles USING GIN(tags);

This index accelerates the containment (@>), overlap (&&), and contained-by (<@) operators. Finding all posts containing a specific tag becomes a simple indexed query:

SELECT * FROM posts WHERE tags @> ARRAY['postgresql'];

However, there’s a critical gotcha: the = ANY() operator does not use GIN indexes and forces sequential scans. Always use containment operators instead:

-- BAD: Forces sequential scan despite GIN index
SELECT * FROM posts WHERE 'postgresql' = ANY(tags);

-- GOOD: Uses GIN index efficiently
SELECT * FROM posts WHERE tags @> ARRAY['postgresql'];

The trade-off appears in write performance. GIN indexes suffer from write multiplication: a single row with 10 array elements creates 10 index entries. GitLab documented production issues where GIN pending list cleanup on a table receiving over 1,000 updates per minute cost between 465ms and 3,155ms per flush cycle. For write-heavy workloads, consider either disabling fastupdate (CREATE INDEX ... WITH (fastupdate = off)) to avoid bursty cleanup costs, or simply using junction tables with B-tree indexes instead.
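
For reference, the setting is applied at index creation time. Recreating the earlier idx_tags index with fastupdate disabled would look like this:

-- Writes go straight into the index, avoiding bursty pending-list
-- flushes at the cost of slightly slower individual updates:
DROP INDEX IF EXISTS idx_tags;
CREATE INDEX idx_tags ON articles USING GIN (tags) WITH (fastupdate = off);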

TOAST Storage Creates Hidden Costs

When array data exceeds approximately 2KB, PostgreSQL moves it to TOAST (The Oversized-Attribute Storage Technique) storage. This threshold exists to ensure four tuples fit on a single 8KB heap page. Once TOASTed, array updates become significantly more expensive: PostgreSQL must fetch all chunks, decompress the data, apply the modification, recompress, and write back.

PostgreSQL 14 introduced LZ4 compression as an alternative to the default PGLZ algorithm. LZ4 provides 60-70% faster compression with decompression speeds matching uncompressed data access, though PGLZ achieves approximately 7% better compression ratios. For applications using large arrays, setting the column’s storage strategy to LZ4 can substantially reduce CPU overhead:

ALTER TABLE posts ALTER COLUMN tags SET COMPRESSION lz4;

The practical implication is that arrays work best when kept under 2KB. This typically accommodates arrays with fewer than 100-200 integer elements or 50-100 short text elements, depending on exact data characteristics.
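
A quick way to check whether a column is approaching that threshold is pg_column_size(), which reports the stored (possibly compressed) size of individual values. A sketch using the posts.tags column from earlier:

-- Find the rows whose arrays are closest to the ~2KB TOAST threshold:
SELECT id, pg_column_size(tags) AS tag_bytes
FROM posts
ORDER BY tag_bytes DESC NULLS LAST
LIMIT 10;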

The Referential Integrity Problem

Here’s where arrays show their biggest weakness: PostgreSQL does not support foreign key constraints on array elements. A patch for “Array ELEMENT Foreign Keys” was proposed around 2012-2013 by 2ndQuadrant but never merged due to lack of consensus on behavior semantics. A 2017 Google Summer of Code project attempted revival but also never completed.

This limitation has significant implications. If you store tag IDs in an array and later delete a tag from the tags table, the orphaned ID remains in arrays indefinitely—no ON DELETE CASCADE or automatic cleanup exists. Nothing prevents inserting non-existent IDs into arrays. The only enforcement mechanism is custom triggers:

-- Reject any INSERT or UPDATE whose tag_ids contain an ID missing from tags
CREATE OR REPLACE FUNCTION validate_array_fk()
RETURNS TRIGGER AS $$
BEGIN
    IF EXISTS (
        SELECT 1 FROM unnest(NEW.tag_ids) AS tid
        WHERE tid NOT IN (SELECT id FROM tags)
    ) THEN
        RAISE EXCEPTION 'Invalid tag ID in array';
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER check_tag_ids
    BEFORE INSERT OR UPDATE ON posts
    FOR EACH ROW EXECUTE FUNCTION validate_array_fk();

However, trigger-based validation requires triggers on both sides—the array table (for inserts/updates) and the referenced table (for deletes). This adds performance overhead, maintenance complexity, and still cannot replicate cascade behaviors automatically.
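
For completeness, here is a minimal sketch of the delete-side guard, assuming the posts.tag_ids and tags schema above. Note that it blocks the delete outright rather than cascading:

-- Block deleting a tag that any posts.tag_ids array still references:
CREATE OR REPLACE FUNCTION prevent_referenced_tag_delete()
RETURNS TRIGGER AS $$
BEGIN
    IF EXISTS (SELECT 1 FROM posts WHERE tag_ids @> ARRAY[OLD.id]) THEN
        RAISE EXCEPTION 'Tag % is still referenced in posts.tag_ids', OLD.id;
    END IF;
    RETURN OLD;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER block_tag_delete
    BEFORE DELETE ON tags
    FOR EACH ROW EXECUTE FUNCTION prevent_referenced_tag_delete();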

The practical guidance from PostgreSQL experts is clear: “Arrays are for data that shares the same lifecycle as the parent row. Not for relationships spanning across different tables.” If you need referential integrity, use junction tables.

Understanding Independent Querying and Referencing

The phrase “elements never need independent querying or referencing” is critical but often misunderstood. Let me unpack this with concrete examples, because this is where arrays break down as a data model.

What “independent querying” means

Independent querying means you need to ask questions about the elements themselves, not just filter by them.

Array-friendly (no independent querying):

-- Finding posts WITH a specific tag
SELECT * FROM posts WHERE tags @> ARRAY['postgresql'];

You’re using tags as a filter, but you don’t care about the tags themselves.

Array-hostile (requires independent querying; these examples assume a post_tags(post_id, tag) junction table):

-- How many posts use each tag?
SELECT tag, COUNT(*) FROM post_tags GROUP BY tag;

-- What are the top 10 most popular tags?
SELECT tag, COUNT(*) as usage FROM post_tags
GROUP BY tag ORDER BY usage DESC LIMIT 10;

-- Which tags have been used in the last week?
SELECT DISTINCT tag FROM post_tags pt
JOIN posts p ON pt.post_id = p.id
WHERE p.created_at > NOW() - INTERVAL '7 days';

With arrays, these queries become significantly more complex:

-- How many posts use each tag? (with arrays)
SELECT tag, COUNT(*)
FROM posts, unnest(tags) as tag
GROUP BY tag;

This works, but it has real drawbacks:

  • It relies on unnest(), syntax many developers won’t recognize
  • It forces a full table scan (PostgreSQL must unnest every row’s array)
  • It cannot be efficiently indexed for this access pattern
  • It is much slower at scale

What “independent referencing” means

Independent referencing means other tables need to point to individual elements, not just the collection.

Consider this scenario:

-- With arrays: tags stored as text[] in posts table
CREATE TABLE posts (
    id SERIAL PRIMARY KEY,
    title TEXT,
    tags TEXT[]  -- ['postgresql', 'databases', 'performance']
);

-- Now you want to add tag descriptions
CREATE TABLE tag_metadata (
    tag_name TEXT PRIMARY KEY,
    description TEXT,
    created_at TIMESTAMP
);

This works, but nothing ties the two tables together: you cannot create a foreign key between tag_metadata.tag_name and the elements of posts.tags, in either direction. PostgreSQL simply doesn’t support foreign keys on array elements, so a tag can appear in posts.tags with no matching metadata row, and vice versa.

A more complex example with user permissions:

-- Storing user roles as an array
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    email TEXT,
    roles TEXT[]  -- ['admin', 'editor', 'viewer']
);

-- Now you want to track when roles were granted
CREATE TABLE role_grants (
    user_id INTEGER REFERENCES users(id),
    role TEXT,  -- How do you enforce this exists in users.roles?
    granted_at TIMESTAMP,
    granted_by INTEGER
);

There’s no way to create a foreign key ensuring role_grants.role exists in users.roles[]. You’re relying on application-level enforcement, which is fragile.

The junction table alternative provides proper referential integrity:

-- Normalized approach
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    email TEXT
);

CREATE TABLE roles (
    id SERIAL PRIMARY KEY,
    name TEXT UNIQUE
);

CREATE TABLE user_roles (
    user_id INTEGER REFERENCES users(id),
    role_id INTEGER REFERENCES roles(id),
    granted_at TIMESTAMP,
    granted_by INTEGER,
    PRIMARY KEY (user_id, role_id)
);

Now the grant metadata (granted_at, granted_by) lives directly on the relationship, and both user_id and role_id are protected by proper foreign keys, ensuring referential integrity.

The real-world complexity: Requirements evolve

Here’s where it gets tricky—requirements change. You start with simple tags in an array, then six months later:

  • Marketing wants a “tag analytics dashboard”
  • Product wants “suggest similar posts based on tag overlap”
  • SEO team wants “tag landing pages with metadata”
  • Content team wants “merge these duplicate tags”

Suddenly you need independent querying and referencing, but you’ve built on arrays. The migration path is painful:

-- Migrating from arrays to normalized structure
CREATE TABLE tags (
    id SERIAL PRIMARY KEY,
    name TEXT UNIQUE
);

CREATE TABLE post_tags (
    post_id INTEGER REFERENCES posts(id),
    tag_id INTEGER REFERENCES tags(id),
    PRIMARY KEY (post_id, tag_id)
);

-- Painful data migration (run inside a transaction; DISTINCT guards
-- against arrays that contain duplicate tags, which would otherwise
-- violate the post_tags primary key)
INSERT INTO tags (name)
SELECT DISTINCT unnest(tags) FROM posts;

INSERT INTO post_tags (post_id, tag_id)
SELECT DISTINCT p.id, t.id
FROM posts p
CROSS JOIN unnest(p.tags) tag_name
JOIN tags t ON t.name = tag_name;

ALTER TABLE posts DROP COLUMN tags;

Decision criteria summary

“Never need independent querying or referencing” means:

  • Elements are always accessed through their parent row
  • You never aggregate/count/analyze elements across rows
  • No other tables need to point to individual elements
  • Elements have no independent existence or attributes
  • The relationship is truly one-directional (parent owns elements, not vice versa)

If any of these assumptions break—or might break in the future—junction tables become the better choice. When in doubt, start with junction tables: optimizing one later is far easier than retrofitting referential integrity onto arrays.

Query Patterns: A Study in Contrasts

Junction tables use standard SQL patterns familiar to all developers. Finding posts with a specific tag requires a join:

SELECT p.* FROM posts p
JOIN post_tags pt ON p.id = pt.post_id
WHERE pt.tag_id = 5;

Arrays use PostgreSQL-specific containment operators (here assuming tag IDs stored in an integer array):

SELECT * FROM posts WHERE tags @> ARRAY[5];

Adding relationships shows the asymmetry in complexity. Junction tables use simple inserts:

INSERT INTO post_tags (post_id, tag_id) VALUES (1, 5);

Arrays require updates:

UPDATE posts SET tags = array_append(tags, 5) WHERE id = 1;

The array approach simplifies queries requiring multiple tag matches. Finding posts with all of three specific tags requires complex multi-joins with junction tables but reduces to a single containment check with arrays. Crunchy Data’s benchmarks showed this pattern completing in 120ms with arrays versus 950ms with junction tables—a 7x improvement.
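
To illustrate the asymmetry, here is that all-three-tags query written both ways. A sketch assuming text tags for the array version and the posts/post_tags/tags schema used earlier for the junction version:

-- Arrays: a single containment check
SELECT * FROM posts
WHERE tags @> ARRAY['postgresql', 'databases', 'performance'];

-- Junction tables: one common pattern using GROUP BY / HAVING
SELECT p.*
FROM posts p
JOIN post_tags pt ON pt.post_id = p.id
JOIN tags t ON t.id = pt.tag_id
WHERE t.name IN ('postgresql', 'databases', 'performance')
GROUP BY p.id
HAVING COUNT(DISTINCT t.name) = 3;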

ORM support presents a challenge for arrays. SQLAlchemy requires explicit MutableList wrappers to detect in-place mutations; standard .append() operations are not automatically detected without flag_modified() calls. ActiveRecord support exists through gems but with limitations for multi-dimensional arrays. Most ORMs lack native support for array-based many-to-many relationships.

Normalization in the Modern Era

Arrays technically violate First Normal Form (1NF), which requires columns to contain atomic values. However, modern perspectives argue this violation is acceptable when elements represent a single multi-valued attribute sharing the parent row’s lifecycle, and when individual elements never need independent querying or referencing.

The contemporary consensus treats denormalization not as bypassing normalization but as a performance optimization applied to a normalized model. The recommended approach is to start normalized, measure actual query patterns, then denormalize based on evidence. This is critical: denormalization should be a response to measured performance problems, not a preemptive architectural choice.

PostgreSQL’s advanced array features—GIN indexes, the intarray extension, native containment operators, and seamless UNNEST/ARRAY_AGG conversions—enable arrays to serve legitimate use cases that would have been impractical in databases with more limited type systems. However, these features also represent PostgreSQL’s expansion into NoSQL and data warehouse territory. The database is evolving to compete with document stores and analytical systems, adopting flexible schema patterns to capture broader market share.

This cross-pollination brings valuable tools, but it doesn’t invalidate the relational model’s core principles. Not every PostgreSQL feature is appropriate for every use case. The existence of array support doesn’t make it the right default—it makes it an available optimization when evidence supports its use.

A Decision Framework

The choice reduces to three key questions—but remember, junction tables should be your default unless you have specific evidence that arrays solve a measured problem:

  1. Does this data share its lifecycle completely with the parent row? If the data could exist independently or needs separate referential integrity, use junction tables.

  2. How frequently will the collection be modified after initial creation? Write-once or rarely-modified data favors arrays; frequently-updated collections favor junction tables.

  3. What is the expected collection size? Collections typically under 100 elements favor arrays; unbounded or large collections favor junction tables.

Important: Even when arrays seem appropriate, consider whether this is a well-established pattern (like blog tags) or a novel use case. For novel cases, start with junction tables until performance metrics prove arrays are necessary.

How common scenarios map to each approach:

  • User tags on content → Arrays: classic use case, simpler schema, better read performance
  • Permission roles → Arrays: small fixed set, rarely modified
  • Blog post authors → Junction table: referential integrity required, authors write multiple posts
  • E-commerce product categories → Junction table: need category counts, bidirectional queries, hierarchy
  • Analytics event storage → Arrays: write-once, batch processing
  • Social following/followers → Junction table: high modification frequency, bidirectional relationships
  • IP address whitelists → Arrays: same lifecycle as service account, rarely modified
  • Product-variant relationships → Junction table: additional attributes needed on relationship

Practical Implementation Guidelines

When choosing arrays:

  • Create GIN indexes for any column that will be searched by containment
  • Consider the intarray extension for integer arrays—it provides specialized operators that can be 20-30x faster than generic array operators
  • Monitor GIN index health using pgstattuple’s pgstatginindex() function to track pending list size
  • Set reasonable array size limits through CHECK constraints to prevent unbounded growth
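
The last two items might look like this in practice. A sketch reusing the posts.tags column from earlier; the index name idx_posts_tags is hypothetical, and pgstatginindex() comes from the pgstattuple extension:

-- Inspect the GIN pending list to track cleanup pressure:
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT * FROM pgstatginindex('idx_posts_tags');

-- Cap array growth so rows stay comfortably below the ~2KB TOAST threshold:
ALTER TABLE posts ADD CONSTRAINT tags_max_50
    CHECK (cardinality(tags) <= 50);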

When choosing junction tables:

  • Create appropriate indexes on foreign key columns (PostgreSQL does not automatically index FK columns)
  • Consider composite indexes for common query patterns
  • Use ON DELETE CASCADE or ON DELETE SET NULL to maintain referential integrity automatically
  • For read-heavy workloads, consider materialized views that pre-compute common joins
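
For instance, a tagging junction table that applies the first and third points might look like this sketch, using the posts and tags tables from earlier:

-- ON DELETE CASCADE removes stale links automatically when either side goes away:
CREATE TABLE post_tags (
    post_id INTEGER REFERENCES posts(id) ON DELETE CASCADE,
    tag_id  INTEGER REFERENCES tags(id) ON DELETE CASCADE,
    PRIMARY KEY (post_id, tag_id)
);

-- The primary key already serves post-first lookups;
-- index tag_id separately for tag-first queries:
CREATE INDEX idx_post_tags_tag_id ON post_tags (tag_id);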

For teams uncertain about requirements: Start with junction tables. They provide maximum flexibility and data integrity with well-understood migration paths. Converting from junction tables to arrays is straightforward using array_agg(). Converting from arrays to junction tables is significantly harder, requiring data migration, application changes, and potentially complex handling of any orphaned references that accumulated without FK protection.
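
That junction-to-array conversion is a one-time aggregation, roughly like this sketch (again assuming the posts/tags/post_tags schema above):

-- Collapse the junction table into a denormalized array column:
ALTER TABLE posts ADD COLUMN tags TEXT[];

UPDATE posts p
SET tags = sub.tag_names
FROM (
    SELECT pt.post_id, array_agg(t.name ORDER BY t.name) AS tag_names
    FROM post_tags pt
    JOIN tags t ON t.id = pt.tag_id
    GROUP BY pt.post_id
) sub
WHERE p.id = sub.post_id;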

Conclusion: Optimization, Not Default

Arrays represent “document storage in disguise” within PostgreSQL—trading relational guarantees for locality and simplicity. They deliver measurable performance benefits (containment reads up to roughly 10x faster, about half the storage) for appropriate use cases but impose real costs (no referential integrity, expensive modifications, limited ORM support) that must be weighed deliberately.

After this deep dive, I’ve arrived at a nuanced position: arrays should be treated as a performance optimization based on measured evidence, not as a default architectural choice.

PostgreSQL has been expanding into territory traditionally occupied by NoSQL databases and data warehouses. Array columns, JSONB types, and other “flexible” features represent cross-pollination—PostgreSQL adopting capabilities to expand its market share and compete with document stores. These features are powerful tools in the right context, but they’re not automatically better just because they’re possible.

Just because PostgreSQL allows arrays doesn’t mean they’re the appropriate choice for most relationship modeling. The relational model’s constraints—normalization, foreign keys, referential integrity—exist for good reasons: data consistency, maintainability, and preventing subtle bugs that emerge months or years later.

The critical question isn’t “Can we use arrays?” but “Have we measured a performance problem that arrays specifically solve?”

For well-known use cases with established patterns (blog post tags, user preference lists, analytics event attributes), arrays can be adopted confidently. These scenarios have been battle-tested across thousands of applications. But for anything else, the burden of proof should be high:

  • Do we have concrete performance metrics showing junction tables are a bottleneck?
  • Are we certain these elements will never need independent querying?
  • Do we understand the migration cost if requirements change?
  • Is the team comfortable maintaining PostgreSQL-specific query patterns?

Without affirmative answers to these questions, junction tables remain the principled default. They provide:

  • Database-enforced referential integrity
  • Standard SQL patterns any developer understands
  • Flexibility for evolving requirements
  • Clear migration paths in any direction

My initial reaction to seeing arrays in production wasn’t wrong—it was incomplete. Arrays can be the right tool, but only when chosen deliberately as an optimization for a measured problem, not as a shortcut to avoid proper relationship modeling.

The relational model has survived decades because it solves real problems. PostgreSQL’s extensions are valuable additions to the toolbox, not replacements for foundational principles. Use arrays when you have evidence they’re needed. Default to proper relationships when you don’t.


What’s been your experience with PostgreSQL arrays? Have you encountered cases where they solved problems elegantly, or where they created technical debt? I’d love to hear your perspective.