Ambiguous Groups
Published 2025 January 3I have been writing some reports in sql and kept running into confusion about how the group by
clause works. The majority of my distress had to do with the interplay between columns in the group by
and the select
. Let's see what the 1986 SQL standard 1 has to say about it.
Query Specification Syntax Rules #
If R is a grouped table, then each
<column specification>
in each<value expression>
shall reference a grouping column or be specified within a<set function specification>
. If R is not a grouped table and any<value expression>
includes a<set function specification>
, then every<column specification>
in every<value expression>
shall be specified within a<set function specification>
Ok lets break this down into parts:
If R is a grouped table, then each
<column specification>
in each<value expression>
shall reference a grouping column or be specified within a<set function specification>
.
This rule is describing what columns are allowed in a select clause if the query contains a group by clause. Only grouped columns or columns used in an aggregate function are allowed in the select clause.
If R is not a grouped table and any
<value expression>
includes a<set function specification>
, then every<column specification>
in every<value expression>
shall be specified within a<set function specification>
The second part of this rule deals with non-grouped tables with an aggregate function in the select clause. If any columns in the select clause are included in an aggregate function then every column must also be included in an aggregate function.
So why do grouped tables and aggregate functions require special handling? Well, it has everything to do with avoiding amibiguity in the query result. Here are a couple examples.
Example Usage #
create table employee(
id integer primary key,
name text,
age integer,
company text,
salary integer
);
insert into employee(name, age, company, salary)
values ('Richard Stonebraker', 40, 'Apple', 500000),
('Raymond Chamberlin', 45, 'Apple', 100000),
('Donald Boyce', 80, 'Apple', 200000),
('Michael Hipp', 78, 'Adobe', 30000),
('Edgar Ellison', 23, 'Adobe', 70000),
('Larry Codd', 66, 'Microsoft', 90000);
Grouped Tables
select company, avg(salary)
from employee
group by company
# +-----------+------------------+
# | company | avg(salary) |
# +-----------+------------------+
# | Adobe | 50000.0 |
# | Apple | 266666.666666667 |
# | Microsoft | 90000.0 |
# +-----------+------------------+
# Improper usage of grouped table and aggregate functions
select company, name
from employee
group by company
# +-----------+---------------------+
# | company | name | <-- The non-grouped name column is ambiguous
# +-----------+---------------------+ in relation to the grouped company column
# | Adobe | Michael Hipp |
# | Apple | Richard Stonebraker |
# | Microsoft | Larry Codd |
# +-----------+---------------------+
Non-grouped Tables with Aggregate Functions
# Proper usage of non-grouped table and aggregate functions
select avg(age), avg(salary)
from employee;
# +------------------+-------------+
# | avg(age) | avg(salary) |
# +------------------+-------------+
# | 55.3333333333333 | 165000.0 |
# +------------------+-------------+
# Improper usage of non-grouped table and aggregate functions
select avg(age), salary
from employee;
# +------------------+--------+
# | avg(age) | salary | <-- The non-aggregated salary column is ambiguous
# +------------------+--------+ in relation to the avg(age) aggregated column
# | 55.3333333333333 | 500000 |
# +------------------+--------+
Gotchas #
MySQL and SQLite do not enforce this rule, which was part of my confusion. In addition, if you include a non-grouped column in a grouped table but every row is the same then you will get an ambiguously correct result.
create table employee(
id integer primary key,
name text,
age integer,
company text,
salary integer
);
insert into employee(name, age, company, salary)
values ('Richard Stonebraker', 40, 'Apple', 500000),
('Edgar Ellison', 40, 'Adobe', 70000),
('Larry Codd', 40, 'Microsoft', 90000);
select company, age
from employee
group by company
# +-----------+-----+
# | company | age | <-- The age column is ambiguously correct.
# +-----------+-----+ We've been tricked!
# | Adobe | 40 |
# | Apple | 40 |
# | Microsoft | 40 |
# +-----------+-----+
This is an unlikely query but it demonstrates that not following standard SQL can easily get you in trouble.
References #
-
National Bureau of Standards (1986). Federal Information Processing Standards Publication: for information systems - database language - SQL Retrieved from Internet Archive website ↩︎