Skip links and keyboard navigation

Code snippets

Counting rows and unique persons

Counting number of rows in a dataset

Standard SQL / SAS Proc SQLSAS Data stepR (Base R)Stata
select      count(*)
from        dset;
Use Proc SQL
nrow(dset)
count


Counting number of unique persons

Standard SQL / SAS Proc SQLSAS Data stepR (Base R)Stata
select    count(distinct id)
from      dset;
Use Proc SQL
length(unique(dset$id))
#or
length(unique(dset[["id"]]))
by id, sort: gen nvals = _n == 1
count if nvals

Basic Joins

​Type of JoinStandard SQL /
SAS Proc SQL
SAS Data stepR (Base R)Stata
​Inner Join
create table mergeddata as
select      d1.id,
            d2.var
from        dset1 d1
inner join  dset2 d2
on          d1.id = d2.id;

/* or */

create table mergeddata as
select      d1.id,
            d2.var
from        dset1 d1,
            dset2 d2
where       d1.id = d2.id;
data mergeddata;
  merge  dset1 (in = d1)
         dset2 (in = d2);
  by     id;
  if     d1 and d2;
run;
mergeddata <- merge(
  dset1,
  dset2,
  by="id"
)
merge 1:1 id using
"c:\data\dset2.dta"
​Left Join
create table mergeddata as
select      d1.id,
            d2.var
from        dset1 d1
left join   dset2 d2
on          d1.id = d2.id;
data mergeddata;
  merge  dset1 (in = d1)
         dset2 (in = d2);
  by     id;
  if     d1;
run;
mergeddata <- merge(
  dset1,
  dset2,
  by="id", all.x=TRUE
)
merge id,
keep(master matched)
​Full Join
create table mergeddata as
selectd1.id,
d2.var
fromdset1 d1
full outer joindset2 d2
ond1.id = d2.id;
data mergeddata;
  merge  dset1 (in = d1)
         dset2 (in = d2);
  by     id;
  if     d1 or d2;
run;
mergeddata <- merge(
  dset1,
  dset2,
  by="id", all=TRUE
)
merge id,keep(master
matched using)

Aggregating

A common task when dealing with data stored at multiple levels, for instance multiple admissions per person, is aggregating. In data terms, this refers to collapsing many values associated with a common identifier to a single value.

This typically involves the use of an aggregation function used to reduce many values down to a single value. Typical aggregation functions include numeric calculations like a count, sum, maximum, minimum or average (typically a mean). Calculations resulting in a TRUE or FALSE value can also be used in some programs.

Consider the process of making a patient-level flag for deceased patients.    

The logic to do this would be something like:

1. flag all rows in the episode table that are associated with a sepn_mode of 'died'

idsepn_modeflag
1transfer0
1home0
2home0
2died1
3home0
4res.care0

2. aggregate these flag values within each id group, returning a died flag for id's where any of the original values were flagged (a maximum aggregation function).

    iddied
    10
    21
    30
    40
SQL / SAS Proc SQL
create table deathflag as 
select    id,
          max(sepn_mode = 'died') as died
from      episode
group by  id;

/*
or if you are using Oracle SQL or similar systems that don’t have 
a boolean (i.e. TRUE/FALSE) variable type:
*/
create table deathflag as 
select    id,
          max(case when sepn_mode = 'died' then 1 else 0 end) as died
from      episode
group by  id;
R - Base R
deathflag <- aggregate(
  cbind(died = sepn_mode == 'died') ~ id,
  data=episode,
  FUN=any
)
R - data.table package*
deathflag <- episode[, .(died = any(sepn_mode == 'died')), by=id]
SAS Data Step / Procedures
data episode;
  set  episode;
  died = (sepn_mode = 'died');
run;
proc summary data=episode nway;
  var      died;
  class    id;
  output   out=deathflag (drop = _:)
           max=died;
run;
Stata
gen died = sepn_mode == 'died'
collapse (max) died, by(id) 

* - the data.table package is excellent for dealing with very large datasets in R, but is not strictly necessary.

Last updated: 4 January 2019