SAS Proc SQL Group By returns multiple rows per group

SAS
SQL
Coding
Author

Vinh Nguyen

Published

April 6, 2013

Just wanted to note that for traditional SQL implementations (e.g., MySQL, MS-SQL), the Group By statement used to aggregate a variable by certain variable(s) returns 1 row for each group. When a column that is not unique within a group is also selected, then the row that's returned is determined somehow by the DB software.

In contrast, SAS's Proc SQL will return multiple rows for each group (the number of original rows), with the aggregated variable repeated for each row within a group. Here's an example:

```{sas eval=FALSE} data foo ; infile datalines dlm=” ” ; input name $ week $ sales ; datalines ; bob 1 20000 bob 2 30000 jane 1 40000 jane 2 50000 mike 1 60000 mike 2 70000 kevin 1 80000 kevin 2 90000 ; run ;

proc sql ; create table foo_agg as select a.name , a.week , sum(a.sales) as total_sales from foo as a group by name ; quit ; run ;

proc export data=foo_agg outfile=“foo_agg.csv” DBMS=csv REPLACE ; run ;




The content of `foo_agg.csv` looks like

name,week,total_sales bob,2,50000 bob,1,50000 jane,1,90000 jane,2,90000 kevin,1,170000 kevin,2,170000 mike,2,130000 mike,1,130000


An analogous return from the SQL code in MySQL or MS-SQL might look something like

name,week,total_sales bob,2,50000 jane,1,90000 kevin,1,170000 mike,2,130000 ```

In SAS's Proc SQL, one would need to use the Select Distinct statement in order to remove the duplicate rows.

Note that when combining the Group By statement with a Join, these multiple records per group still hold.

SAS's implementation is not necessarily bad as it gives the user's more flexibility in returning an aggregated variable with every row without re-joining the aggregated table with the original table. The user just has to remember this behavior ;).