This linking of datasets forms the basis of relational data.
But, instead of producing one dataset that is re-used for every analysis, you can extract and combine the relevant details for each analysis from multiple datasets. This reduces the scope of data that you have to process or analyse at any one time, and allows you to better track each level of your data. Looking at the hypothetical example below, you can see that there is a cohort dataset, two hospital admissions tables, and one table for emergency department presentations. Each of these tables is linked by the key variable representing each person, 'id', with the hospital admissions tables also linking by 'admit_ep', an indicator of the admission episode.
The 'Cohort' table provides a method of linking between a researcher's own person identifier and the identifier provided in the linkage data.
The 'Episode' table records details about a patient's time in the admitting facility, including the facility's 3-letter identifier, the patient's dates of admission and separation, and their final mode of separation from the facility.
It also typically includes person-level details recorded by each admitting facility at the time of the admission, such as the patient's self-reported date of birth and the postcode of their usual residence.
The 'Diagnosis' table contains a row for each diagnosis recorded by medical staff during the patient's care. While there is always a 'principal diagnosis', a varying number of 'other diagnoses' (OD) and 'external causes' (EX) can also be assigned against any admission. Each of the diagnoses is recorded using the standard ICD-10-AM classification system.
Finally, the ED 'Presentation' table provides details of presentations to emergency departments, such as the facility, presentation date and episode end status. Not all hospitals have emergency departments, and in practice most presentations do not link to an admission.
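The linkage described above can be sketched in code. The following is a minimal example using hypothetical toy data and illustrative column names (e.g. 'study_id', 'facility', 'icd_code' are assumptions, not the actual linkage data dictionary); only the key variables 'id' and 'admit_ep' come from the description above.

```python
import pandas as pd

# Hypothetical toy versions of the linked tables described above.
cohort = pd.DataFrame({
    "study_id": ["A01", "A02"],         # researcher's own person identifier
    "id": [1, 2],                       # identifier supplied in the linkage data
})

episode = pd.DataFrame({
    "id": [1, 1, 2],
    "admit_ep": [1, 2, 1],              # admission episode indicator
    "facility": ["RPA", "RPA", "STV"],  # 3-letter facility identifier
    "admit_date": ["2020-01-05", "2020-03-10", "2020-02-01"],
    "sep_date": ["2020-01-09", "2020-03-12", "2020-02-03"],
})

diagnosis = pd.DataFrame({
    "id": [1, 1, 2],
    "admit_ep": [1, 1, 1],
    "diag_type": ["P", "OD", "P"],      # principal / other diagnosis
    "icd_code": ["I21.0", "E11.9", "J18.9"],
})

# Link the cohort to its admissions on 'id', then attach diagnoses
# on both 'id' and 'admit_ep' so each code stays with its episode.
admissions = cohort.merge(episode, on="id", how="inner")
linked = admissions.merge(diagnosis, on=["id", "admit_ep"], how="left")
print(linked[["study_id", "admit_ep", "facility", "diag_type", "icd_code"]])
```

Note that the left join keeps admissions with no recorded diagnosis, and an admission with several diagnoses contributes several rows: extracting only the tables and columns a given analysis needs is exactly the workflow described above.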
This type of long, relational data has more recently been described by data researcher Hadley Wickham (2014) as 'tidy data'.
'Tidy data' is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:
- Each variable forms a column
- Each observation forms a row
- Each type of observational unit forms a table. (Wickham, 2014, p. 4)
Storing data consistently in this way means you can use a small, consistent set of operations to deal with data of any size, recorded at multiple levels. This approach to data analysis also underpins other introductory data training courses, such as the Coursera 'Getting and Cleaning Data' course run by the Johns Hopkins Bloomberg School of Public Health - https://www.coursera.org/learn/data-cleaning