home of bencornia.com

Ambiguous Groups

Published 2025 January 3

I have been writing some reports in sql and kept running into confusion about how the group by clause works. The majority of my distress had to do with the interplay between columns in the group by and the select. Let's see what the 1986 SQL standard 1 has to say about it.

Query Specification Syntax Rules #

If R is a grouped table, then each <column specification> in each <value expression> shall reference a grouping column or be specified within a <set function specification>. If R is not a grouped table and any <value expression> includes a <set function specification>, then every <column specification> in every <value expression> shall be specified within a <set function specification>

Ok lets break this down into parts:

If R is a grouped table, then each <column specification> in each <value expression> shall reference a grouping column or be specified within a <set function specification>.

This rule is describing what columns are allowed in a select clause if the query contains a group by clause. Only grouped columns or columns used in an aggregate function are allowed in the select clause.

If R is not a grouped table and any<value expression>includes a<set function specification>, then every <column specification> in every <value expression> shall be specified within a <set function specification>

The second part of this rule deals with non-grouped tables with an aggregate function in the select clause. If any columns in the select clause are included in an aggregate function then every column must also be included in an aggregate function.

So why do grouped tables and aggregate functions require special handling? Well, it has everything to do with avoiding amibiguity in the query result. Here are a couple examples.

Example Usage #

create table employee(
  id integer primary key,
  name text,
  age integer,
  company text,
  salary integer
);

insert into employee(name, age, company, salary)
values      ('Richard Stonebraker', 40, 'Apple', 500000),
            ('Raymond Chamberlin', 45, 'Apple', 100000),
            ('Donald Boyce', 80, 'Apple', 200000),
            ('Michael Hipp', 78, 'Adobe', 30000),
            ('Edgar Ellison', 23, 'Adobe', 70000),
            ('Larry Codd', 66, 'Microsoft', 90000);

Grouped Tables

select      company, avg(salary)
from        employee
group by    company

# +-----------+------------------+
# |  company  |   avg(salary)    |
# +-----------+------------------+
# | Adobe     | 50000.0          |
# | Apple     | 266666.666666667 |
# | Microsoft | 90000.0          |
# +-----------+------------------+
# Improper usage of grouped table and aggregate functions
select      company, name
from        employee
group by    company

# +-----------+---------------------+
# |  company  |        name         | <-- The non-grouped name column is ambiguous
# +-----------+---------------------+     in relation to the grouped company column
# | Adobe     | Michael Hipp        |
# | Apple     | Richard Stonebraker |
# | Microsoft | Larry Codd          |
# +-----------+---------------------+

Non-grouped Tables with Aggregate Functions

# Proper usage of non-grouped table and aggregate functions
select      avg(age), avg(salary)
from        employee;

# +------------------+-------------+
# |     avg(age)     | avg(salary) |
# +------------------+-------------+
# | 55.3333333333333 | 165000.0    |
# +------------------+-------------+
# Improper usage of non-grouped table and aggregate functions
select      avg(age), salary
from        employee;

# +------------------+--------+
# |     avg(age)     | salary | <-- The non-aggregated salary column is ambiguous
# +------------------+--------+     in relation to the avg(age) aggregated column
# | 55.3333333333333 | 500000 |
# +------------------+--------+

Gotchas #

MySQL and SQLite do not enforce this rule, which was part of my confusion. In addition, if you include a non-grouped column in a grouped table but every row is the same then you will get an ambiguously correct result.

create table employee(
  id integer primary key,
  name text,
  age integer,
  company text,
  salary integer
);

insert into employee(name, age, company, salary)
values      ('Richard Stonebraker', 40, 'Apple', 500000),
            ('Edgar Ellison', 40, 'Adobe', 70000),
            ('Larry Codd', 40, 'Microsoft', 90000);

select      company, age
from        employee
group by    company

# +-----------+-----+
# |  company  | age | <-- The age column is ambiguously correct.
# +-----------+-----+     We've been tricked!
# | Adobe     | 40  |
# | Apple     | 40  |
# | Microsoft | 40  |
# +-----------+-----+

This is an unlikely query but it demonstrates that not following standard SQL can easily get you in trouble.

References #


  1. National Bureau of Standards (1986). Federal Information Processing Standards Publication: for information systems - database language - SQL Retrieved from Internet Archive website ↩︎