Wednesday 25 May 2011

SQL revision - the GROUP BY clause

Every now and then I like to go back and have a look at something I think I know, and double-check my assumptions and understanding. In this case I’m going to look at the GROUP BY clause in SQL.

Partitions

Firstly, let’s look at partitions in sets. This is important because it directly relates to the ‘groups’ returned by a GROUP BY clause.

“A partition of a set X is a set of nonempty subsets of X such that every element x in X is in exactly one of these subsets.” *

We can derive certain properties of the subsets that make up a partition:

  • The union of all the subsets returns the original set
  • The intersection of any two distinct subsets is empty
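As a quick sanity check, the definition and the two properties can be sketched in a few lines of Python. The set X, the candidate subsets and the helper name is_partition are all made up for illustration.

```python
from itertools import combinations

# Hypothetical example set and a candidate partition of it.
X = {1, 2, 3, 4, 5, 6}
blocks = [{1, 2}, {3}, {4, 5, 6}]


def is_partition(blocks, universe):
    """Return True if `blocks` is a partition of `universe`."""
    # Every subset must be non-empty.
    nonempty = all(len(b) > 0 for b in blocks)
    # The union of the subsets must give back the original set.
    covers = set().union(*blocks) == universe
    # Any two distinct subsets must have an empty intersection.
    disjoint = all(a.isdisjoint(b) for a, b in combinations(blocks, 2))
    return nonempty and covers and disjoint


valid = is_partition(blocks, X)
# The element 3 appears in both subsets, so this is not a partition.
overlapping = is_partition([{1, 2, 3}, {3, 4, 5, 6}], X)
# The elements 3..6 are missing, so the union does not give back X.
incomplete = is_partition([{1, 2}], X)
```

Here valid is True, while overlapping and incomplete are both False.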

We can think of this like dividing a pizza into pieces. Each piece is one subset of the partition, no two pieces overlap, and joining all the pieces together gives us the original pizza.


The ‘groups’ returned by a GROUP BY clause effectively form a partition of the original set of data, with each group being one of the subsets.

The GROUP BY clause

When we use a GROUP BY clause we take the set of rows produced by the FROM and WHERE clauses of a query and put those rows into groups (the subsets of a partition) based on the values of the columns specified in the GROUP BY clause.
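To make this concrete, here is a minimal sketch using Python's built-in sqlite3 module and an in-memory database. The orders table and its data are invented for the example, and the ORDER BY is only there to make the output order predictable.

```python
import sqlite3

# Hypothetical table and data, held in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 10), ("bob", 5), ("alice", 20), ("carol", 1), ("bob", 7)],
)

# FROM and WHERE produce the working set of rows (carol's row is filtered
# out); GROUP BY then puts the remaining rows into one group per customer.
rows = conn.execute(
    "SELECT customer, COUNT(*), SUM(amount) "
    "FROM orders "
    "WHERE amount > 1 "
    "GROUP BY customer "
    "ORDER BY customer"
).fetchall()

print(rows)  # [('alice', 2, 30), ('bob', 2, 12)]
```

The four rows that survive the WHERE clause end up in exactly two groups, and every one of those rows belongs to exactly one group.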

Each group becomes a single row in the result table. Each column in the row must be a characteristic of the group, not of a single row in the group. Therefore the SELECT list must be made up of grouping columns or aggregate functions. Note also that groups, by definition, must have at least one row (i.e. they can’t be empty). This means that COUNT(*) in a grouped query never returns zero, although COUNT(column) still can if every value of that column in the group is NULL. Groups are also distinct: no two rows of the result share the same combination of grouping-column values.
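The point about empty groups is easy to check. The sketch below again uses sqlite3 with an invented visits table. It compares an ungrouped COUNT(*) with a grouped one on an empty table, then shows that only COUNT(column), which skips NULLs, can report zero for a group.

```python
import sqlite3

# Hypothetical table; it starts out empty.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (page TEXT, referrer TEXT)")

# Without GROUP BY, an aggregate over an empty table still returns one row.
ungrouped_empty = conn.execute("SELECT COUNT(*) FROM visits").fetchall()

# With GROUP BY there are no groups at all, so no rows come back.
grouped_empty = conn.execute(
    "SELECT page, COUNT(*) FROM visits GROUP BY page"
).fetchall()

conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [("home", None), ("home", None), ("about", "search")],
)

# COUNT(*) counts rows in the group, so it is at least 1 for every group.
# COUNT(referrer) ignores NULLs, so it can be 0 for a group.
grouped = conn.execute(
    "SELECT page, COUNT(*), COUNT(referrer) "
    "FROM visits GROUP BY page ORDER BY page"
).fetchall()

print(ungrouped_empty)  # [(0,)]
print(grouped_empty)    # []
print(grouped)          # [('about', 1, 1), ('home', 2, 0)]
```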

The resulting table of a GROUP BY is called a group table. All subsequent operations are executed on the rows in the group table rather than the original rows.
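One operation that runs against the group table is the HAVING clause, which filters whole groups rather than individual rows. A small sketch, assuming the same kind of invented orders table and sqlite3:

```python
import sqlite3

# Hypothetical table and data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 10), ("alice", 20), ("bob", 5), ("carol", 50)],
)

# HAVING and ORDER BY both work on the grouped rows: the filter and the
# sort use the per-customer total, not the individual order amounts.
rows = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders "
    "GROUP BY customer "
    "HAVING SUM(amount) >= 30 "
    "ORDER BY total DESC"
).fetchall()

print(rows)  # [('carol', 50), ('alice', 30)]
```

None of alice's individual orders reaches 30, but her group does, so she stays in the result; bob's group is dropped as a whole.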

For grouping purposes, NULL values in a grouping column are generally treated as equal to one another, so they all fall into a single group.
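This is worth seeing in action, because it differs from how NULL behaves in an ordinary comparison. The sketch below uses sqlite3 and a made-up staff table; other databases should behave the same way for GROUP BY, but I have only run this against SQLite.

```python
import sqlite3

# Hypothetical table; two members of staff have no department.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (name TEXT, dept TEXT)")
conn.executemany(
    "INSERT INTO staff VALUES (?, ?)",
    [("ann", "sales"), ("ben", None), ("cat", None),
     ("dan", "sales"), ("eve", "ops")],
)

rows = conn.execute("SELECT dept, COUNT(*) FROM staff GROUP BY dept").fetchall()

# Turn the (dept, count) pairs into a dict so the row order does not matter.
counts = dict(rows)
print(counts == {None: 2, "ops": 1, "sales": 2})  # True

# In a plain comparison, NULL = NULL is not true: it evaluates to NULL.
null_equals_null = conn.execute("SELECT NULL = NULL").fetchone()[0]
print(null_equals_null)  # None
```

Both NULL departments land in one group with a count of 2, even though NULL = NULL on its own does not evaluate to true.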


The non-aggregated columns in the SELECT list can be a subset of the columns in the GROUP BY clause, but the SELECT list can never contain a non-aggregated column that is missing from the GROUP BY clause. (This is the traditional rule; some databases relax it.)
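The first half of that rule can be sketched as follows, again with sqlite3 and an invented staff table. The query groups by two columns but selects only one of them. I have not shown the failing case because SQLite happens to accept non-grouped columns in the SELECT list, so it would not raise the error that stricter databases do.

```python
import sqlite3

# Hypothetical table and data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (dept TEXT, city TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO staff VALUES (?, ?, ?)",
    [("sales", "Leeds", 100), ("sales", "York", 200),
     ("sales", "York", 300), ("ops", "Leeds", 400)],
)

# Grouping is by (dept, city), but only dept is selected. There is still
# one output row per (dept, city) group, so 'sales' appears twice.
rows = conn.execute(
    "SELECT dept, SUM(salary) "
    "FROM staff "
    "GROUP BY dept, city "
    "ORDER BY dept, city"
).fetchall()

print(rows)  # [('ops', 400), ('sales', 100), ('sales', 500)]
```

The groups themselves are still distinct; the output rows only look repeated because the city column that tells them apart has been left out of the SELECT list.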

References

* Partition of a set, Wikipedia
