To JOIN or not to JOIN

Ent Framework's design discourages people from using SQL JOINs. Instead, it relies on in-app parallel Promise merging and automatic query batching, for 2 main reasons:

  1. It allows working with microshards seamlessly (JOINs cannot be run efficiently across database boundaries).

  2. It holistically solves the N+1 Selects problem.

Types of Joins

In web development, JOINs are oftentimes heavily abused. There are 3 main use cases where people traditionally use JOINs, and only 2 of them are legit.

Type 1: Statistical Queries and OLAP

When you have a large database, you sometimes need to pull some statistical information out of it. E.g. to answer questions like how many users registered and performed some action within a time frame, or how much money the service earned. Oftentimes, building an SQL query with JOINs and running it against a replica database is the easiest solution.

This use case is not that frequent though. Although it's a fully legit use of JOINs, it is relatively rare. Also, the larger your service becomes, the higher the chances that you'll need a data warehouse solution for offline analysis instead (like Snowflake or a Presto-backed service).

What distinguishes such a use case is that you run a small number of very heavy queries (OLAP pattern).
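For illustration, a Type 1 query might look like the sketch below (the schema, table, and column names are made up for this example). It scans a lot of rows, joins several tables, and returns just a handful of aggregated ones:

-- Illustrative only: how many users registered in January 2025
-- also made at least one payment, grouped by signup day.
SELECT
  date_trunc('day', users.created_at) AS signup_day,
  count(DISTINCT users.id) AS paying_users
FROM users
JOIN payments ON payments.user_id = users.id
WHERE users.created_at >= '2025-01-01'
  AND users.created_at < '2025-02-01'
GROUP BY 1
ORDER BY 1;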

Type 2: Precise Query Optimization

Sometimes you just want to squeeze the maximum performance out of your database when running an OLTP load (i.e. a large number of very fast queries). Here you use JOINs for computational performance reasons: instead of transmitting 2 large lists from the database and intersecting them on the client (throwing away the vast majority of the transferred, non-matching rows), you ask the database server to do the intersection internally, utilizing indexes.

But again, although it's a fully legit use case for JOINs, the need for it is relatively rare.
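A sketch of such a query is below (again, the table and column names are made up). The JOIN here is genuinely about performance: the server intersects the two sets via indexes and returns only the matching rows, instead of shipping both full lists to the app:

-- Illustrative only: active subscribers from one country. The server
-- intersects the two sets using indexes and returns only the matches.
SELECT users.id, users.email
FROM users
JOIN subscriptions ON subscriptions.user_id = users.id
WHERE users.country = 'DE'
  AND subscriptions.status = 'active'
LIMIT 100;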

Type 3: Parent-Children Loading and N+1 Selects Problem

And here comes the most frequent use case where JOINs are traditionally used (actually, abused) in all mainstream ORMs. It is not related to slow queries, nor to intersecting large lists and throwing away non-matching items. The use case is purely about loading some objects and then their parents (or children), i.e. loading a sub-graph of a graph-like structure.

In fact, this simple use case accounts for the vast majority of queries in real life.

Let's see how it's done in Prisma:

const commentsWithDetails = await prisma.comment.findMany({
  where: {
    id: {
      in: commentIDs,
    },
  },
  include: {
    author: true, // Include the author of the comment.
    topic: {
      include: {
        creator: true, // Include the creator of the topic.
      },
    },
  },
});

Why is this suboptimal? Because such an approach forces us to assume that, at this level of abstraction, we have the complete list of comment IDs, which is almost never the case.

Consider that we only know one comment ID at a time, but still want to pull the objects related to that comment:

async function loadCommentWithDetails(id: string) {
  return prisma.comment.findUnique({
    where: {
      id
    },
    include: {
      author: true, // Include the author of the comment.
      topic: {
        include: {
          creator: true, // Include the creator of the topic.
        },
      },
    },
  });
}

Such an API has 2 fundamental flaws when using traditional ORMs without built-in query batching:

  1. If you need to load 100 comments, what would you even do? Try using Promise.all() with loadCommentWithDetails() for 100 IDs, and you'll get 100 database queries with JOINs.

  2. "Load comments with details"—with what exact details? In one place of the code you'll need authors and topics, and in another one, you may only need the direct comment creator. Would you build a separate function with boilerplate for that?

Painful Boilerplate Analogy

The above "type 3" of JOINs is not quite what JOINs are designed for: people use it to "duct tape" the real problem in a boilerplatish way.

This is reminiscent of the early days of the Web, when people emitted their HTML as plain text, escaping values explicitly in every place:

<!-- PHP code from the 1990s. Beware: your eyes will bleed! -->
<p>Hello, <?php echo htmlspecialchars($first); ?>
<?php echo htmlspecialchars($last); ?>!</p>

Thousands of projects were written this way.

Type 3 JOIN is not much different conceptually.

Think about the data duplication such JOINs produce over the wire. Consider the following query:

SELECT
  comments.id,
  comments.text,
  users.id AS author_id,
  users.name AS author_name
FROM comments
JOIN users ON users.id = comments.author_id

The resulting data sent from the database server is:

id | text    | author_id | author_name
---+---------+-----------+------------
1  | hello   | 42        | Alice
2  | my      | 42        | Alice
3  | dear    | 42        | Alice
4  | friend  | 101       | Bob
5  | bye now | 101       | Bob

Does it hurt your sense of engineering perfection? Does it smell to you?

  1. The author_id+author_name pair of values (42+Alice) is repeated 3 times in the payload for the first 3 comments, and then 101+Bob is repeated 2 times for the last 2 comments. Now imagine that the users table has way more columns, as it would in practice.

  2. Another smell is the "author_" prefix: although minor, it's clearly naming boilerplate. You need to introduce a name-mapping convention between the column names in the JOIN result and the fields of your ORM objects (be it gluing parts with "_", with ".", or whatever).

Such things relate not so much to real resource utilization (the difference is marginal) as to design and architecture smells.
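For comparison, here is roughly what travels over the wire with the batched approach for the same example data: each users row is transferred exactly once, under its natural column names, so no renaming convention is needed (a sketch based on the table above):

-- Each author row travels once; no "author_" prefix convention needed.
SELECT id, text, author_id FROM comments WHERE id IN(1, 2, 3, 4, 5);
SELECT id, name FROM users WHERE id IN(42, 101);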

Round Trip Latency Consideration

To be fair, there is still one benefit in using type 3 JOINs: when you fetch comments, topics and users all at once, you only have 1 round-trip to the database server:

-- Traditional ORM's way: 1 round-trip.
SELECT *
FROM comments
JOIN topics ON topics.id = comments.topic_id
JOIN users authors ON authors.id = comments.author_id
JOIN users creators ON creators.id = topics.creator_id

I.e. you send 1 request and get 1 response (with duplicated data, but still).

If your backend-to-database network connection is slow (say, one query takes 50 ms, which happens with commercial, highly vendor-lock-in-prone solutions), then this consideration is significant.

So, in slow networks, JOINs win over Ent Framework's automatic batching approach: at 50 ms per round trip, the 3 dependent queries below cost up to ~150 ms versus ~50 ms for a single JOIN query.

-- Ent Framework's way: 3 round-trips.
SELECT * FROM comments WHERE id IN(...);
SELECT * FROM topics WHERE id IN(...);
SELECT * FROM users WHERE id IN(...);

What's the catch?

50 ms for a database query round trip is not the norm. That's the catch.

In real life (and since the 1990s), your network to the database is not slow. Quite the opposite: it is very fast, with sub-millisecond latency. Otherwise your entire backend would become so painfully slow in all other places that you couldn't manage it.

Databases are designed to serve queries fast, with low latency and at high concurrency. That is what databases are for. Let's use the microscope for science, not for hammering nails.

  • False assumption: a 50 ms database query is the norm; JOINs are there to minimize round trips and solve the N+1 Selects problem; take my money, dear Vercel & Co.

  • Reality: you have troubles with your app design if the query latency is longer than 1-2 ms; and when queries are batched, round-trip time does not affect the overall latency much.

The traditional ORM query above produces JOINs for one sole purpose: to work around the N+1 Selects problem.

To learn more about batching, "parallel calls", and how the event loop works in Node, check out the Loaders and Custom Batching article.
