Code snippets
Counting rows and unique persons
Counting number of rows in a dataset
| Standard SQL / SAS Proc SQL | SAS Data step | R (Base R) | Stata |
select count(*) from dset; | Use Proc SQL | nrow(dset) | count |
Counting number of unique persons
| Standard SQL / SAS Proc SQL | SAS Data step | R (Base R) | Stata |
select count(distinct id) from dset; | Use Proc SQL | length(unique(dset$id)) #or length(unique(dset[["id"]])) | by id, sort: gen nvals = _n == 1 count if nvals |
Basic Joins
| Type of Join | Standard SQL / SAS Proc SQL | SAS Data step | R (Base R) | Stata |
| Inner Join | create table mergeddata as
select d1.id,
d2.var
from dset1 d1
inner join dset2 d2
on d1.id = d2.id;
/* or */
create table mergeddata as
select d1.id,
d2.var
from dset1 d1,
dset2 d2
where d1.id = d2.id; | data mergeddata;
merge dset1 (in = d1)
dset2 (in = d2);
by id;
if d1 and d2;
run;
| mergeddata <- merge( dset1, dset2, by="id" ) | merge 1:1 id using "c:\data\dset2.dta" |
| Left Join | create table mergeddata as
select d1.id,
d2.var
from dset1 d1
left join dset2 d2
on d1.id = d2.id;
| data mergeddata;
merge dset1 (in = d1)
dset2 (in = d2);
by id;
if d1;
run;
| mergeddata <- merge( dset1, dset2, by="id", all.x=TRUE ) | merge id, keep(master matched) |
| Full Join | create table mergeddata as selectd1.id, d2.var fromdset1 d1 full outer joindset2 d2 ond1.id = d2.id; | data mergeddata;
merge dset1 (in = d1)
dset2 (in = d2);
by id;
if d1 or d2;
run;
| mergeddata <- merge( dset1, dset2, by="id", all=TRUE ) | merge id,keep(master matched using) |
Aggregating
A common task when dealing with data stored at multiple levels, for instance multiple admissions per person, is aggregating. In data terms, this refers to collapsing many values associated with a common identifier to a single value.
This typically involves the use of an aggregation function used to reduce many values down to a single value. Typical aggregation functions include numeric calculations like a count, sum, maximum, minimum or average (typically a mean). Calculations resulting in a TRUE or FALSE value can also be used in some programs.
Consider the process of making a patient-level flag for deceased patients.
The logic to do this would be something like:
1. flag all rows in the episode table that are associated with a sepn_mode of 'died'
| id | sepn_mode | flag |
| 1 | transfer | 0 |
| 1 | home | 0 |
| 2 | home | 0 |
| 2 | died | 1 |
| 3 | home | 0 |
| 4 | res.care | 0 |
2. aggregate these flag values within each id group, returning a died flag for id's where any of the original values were flagged (a maximum aggregation function).
| id | died |
| 1 | 0 |
| 2 | 1 |
| 3 | 0 |
| 4 | 0 |
| SQL / SAS Proc SQL |
create table deathflag as
select id,
max(sepn_mode = 'died') as died
from episode
group by id;
/*
or if you are using Oracle SQL or similar systems that don’t have
a boolean (i.e. TRUE/FALSE) variable type:
*/
create table deathflag as
select id,
max(case when sepn_mode = 'died' then 1 else 0 end) as died
from episode
group by id;
|
| R - Base R |
deathflag <- aggregate( cbind(died = sepn_mode == 'died') ~ id, data=episode, FUN=any ) |
| R - data.table package* |
deathflag <- episode[, .(died = any(sepn_mode == 'died')), by=id] |
| SAS Data Step / Procedures |
data episode;
set episode;
died = (sepn_mode = 'died');
run;
proc summary data=episode nway;
var died;
class id;
output out=deathflag (drop = _:)
max=died;
run;
|
| Stata |
gen died = sepn_mode == 'died' collapse (max) died, by(id) |
* - the data.table package is excellent for dealing with very large datasets in R, but is not strictly necessary.