Format, coding and efficiency
'Wide' versus 'Long' data
You may frequently encounter data that is not stored in a tidy, relational structure, but rather collapsed into a single, 'wide' dataset. These datasets can be useful in certain contexts, but come with several issues that make them error-prone and difficult to work with when analysing linked data. See below for a visual representation of some of the hospital admissions data as a single 'wide' dataset, retaining all information present in the long files.
Admission data (selected variables) + Diagnosis data in 'wide' format
An intuitive way to present data is not, however, necessarily an appropriate format for analysing it. Paul Murrell (2013) in the 'Bad Data Handbook' notes the pitfalls of 'data intended for human consumption, not machine consumption'. While intuitively it might make sense to have all the details relating to a single hospital admission on a single line of data, this complicates interacting with the data programmatically:
Data is typically provided this way in order to allow a human to extract a particular message from the data. The problem is that we inevitably end up wanting to do more with the data, which means working with the data using software, which means explaining the format of the data to the software, which in turn means that we end up wishing that the data were formatted for consumption by a computer, not human eyeballs (Murrell, 2013, p31).
Most statistics programs rely on columns of data as the principal inputs to analysis. Consider a basic crosstabulation, and how this will typically require 2 columns, one to represent the row categories, and one to represent the column categories. As the wide file has split what is conceptually a single variable - 'diagnosis' - into multiple variables, these basic workflows are now disrupted (White et al., 2013).
Coding and logical efficiency
Keeping your analysis as simple and easily understandable as possible should be a primary aim when dealing with complex data - and linked data is almost always complex.
Let's use some pseudocode to describe the logic for analysing the diagnosis data in long and wide forms. A representative task is getting the count of diagnoses at different levels - per admission, or per person.
In a long file you could do:
...and this will always work because diagnoses are stored in a single column in a dedicated table. You can change the level of analysis at will, because the diagnoses can be rolled up to admission or person level by simply changing what your groups are.
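A minimal sketch of this long-format grouping logic in Python (stdlib only; the record layout and the field names 'person_id', 'admission_id' and 'diagnosis' are illustrative assumptions, not the actual linked data):

```python
from collections import Counter

# Hypothetical long-format diagnosis records: one row per diagnosis.
diagnoses = [
    {"person_id": "P1", "admission_id": "A1", "diagnosis": "E1172"},
    {"person_id": "P1", "admission_id": "A1", "diagnosis": "I10"},
    {"person_id": "P1", "admission_id": "A2", "diagnosis": "J45"},
    {"person_id": "P2", "admission_id": "A3", "diagnosis": "E1172"},
]

# Count diagnoses per admission: group by the admission identifier.
per_admission = Counter(row["admission_id"] for row in diagnoses)

# Count diagnoses per person: the only change is the grouping key.
per_person = Counter(row["person_id"] for row in diagnoses)

print(per_admission)  # Counter({'A1': 2, 'A2': 1, 'A3': 1})
print(per_person)     # Counter({'P1': 3, 'P2': 1})
```

Note that moving between levels changes a single grouping key; the counting logic itself is untouched.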
The underlying logic of the coding process remains simple throughout: the grouping code is unchanged between the two examples, and only the grouping variables differ.
In a wide file, however, your diagnoses are now spread over many variables ('diagnosis_1' to 'diagnosis_5'), and are hardcoded to a storage level different from their data collection level (admission level vs. diagnosis level). To repeat the same analyses as above, you would need to do something like:
As you can see, this logic relies on multiple concepts: selecting ranges of variables and looping to access column-wise diagnosis data stored at admission level, then separate grouping logic to take the newly created admission-level data to person level. Contrast this with the consistent use of grouping logic when dealing with diagnosis data stored at diagnosis level, as the lowest form is hierarchically nested below both the admission and person levels.
Also, instead of being able to reference a single diagnosis variable (in this particular case, no variable selection was required at all!), you now have a hardcoded range from 'diagnosis_1' to 'diagnosis_5'. This means the code will only work for this particular dataset, as the number of diagnosis columns depends on the maximum number of diagnoses for any one episode. You might suggest solving this by selecting all variables sharing a common stem - 'diagnosis_' in this case. This is risky, however, because a variable like the 'diagnosis_history' column included in the 'Episode' dataset would also be inadvertently selected. This method of analysis is therefore both non-transferable and fragile, because it is built on an inconsistent underlying structure.
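The wide-format equivalent can be sketched as follows (again with illustrative, assumed field names; the hardcoded 'diagnosis_1' to 'diagnosis_5' range mirrors the fragility described above):

```python
from collections import Counter

# Hypothetical wide-format admissions: one row per admission, with
# diagnoses spread across hardcoded columns 'diagnosis_1'..'diagnosis_5'.
admissions = [
    {"person_id": "P1", "admission_id": "A1",
     "diagnosis_1": "E1172", "diagnosis_2": "I10", "diagnosis_3": None,
     "diagnosis_4": None, "diagnosis_5": None},
    {"person_id": "P1", "admission_id": "A2",
     "diagnosis_1": "J45", "diagnosis_2": None, "diagnosis_3": None,
     "diagnosis_4": None, "diagnosis_5": None},
]

# Step 1: loop over the hardcoded column range to count per admission.
DIAG_COLS = [f"diagnosis_{i}" for i in range(1, 6)]  # fragile: fixed width
per_admission = {
    row["admission_id"]: sum(row[c] is not None for c in DIAG_COLS)
    for row in admissions
}

# Step 2: separate grouping logic to roll admission counts up to person.
per_person = Counter()
for row in admissions:
    per_person[row["person_id"]] += sum(row[c] is not None for c in DIAG_COLS)

print(per_admission)  # {'A1': 2, 'A2': 1}
print(per_person)     # Counter({'P1': 3})
```

Two distinct mechanisms (column looping, then grouping) are now needed where the long format needed only one.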
Storage efficiency for larger datasets
The first thing you may have noticed about the wide file is that it now contains a large number of empty cells representing missing values, whereas the tidy data format has none. In the long file, the association between a diagnosis ('PD' or 'OD') and an external cause ('EX') can be determined because the 'EX' code immediately follows a run of diagnosis codes; the wide file instead requires an extra column for each diagnosis to make this relationship apparent. This is inefficient, as 3 external cause ('EX') values are now represented by 30 cells.
On a very small file like this example, this overhead may not be significant. It is not uncommon, however, to be provided with many cases, at which point these files become very large and unwieldy.
Using a real example of large data that Queensland Health has provided, the tidy format used around 760 MB of memory when loaded. The wide format used 3,980 MB (~4 GB). The limitations of this format were obvious: the client reported that their statistical software crashed as it tried to handle the 2 billion (note: billion, not million) extra blank cells the wide format required. Wide format data does not scale well from small problems to the large problems that many health linkage projects are now requesting.
Using a tidy format, you would still have 5 columns in your 'diagnosis' file, with no empty cells, and this will always be the case because the data is stored at the same detail level as it is collected at (Borer et al., 2009).
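The empty-cell overhead can be quantified with some back-of-envelope arithmetic. The shapes below are assumed purely for illustration (they are not the Queensland Health figures):

```python
# Hypothetical shapes: 1,000,000 admissions, an average of 2 diagnoses
# each, padded out to a maximum of 50 diagnosis columns in wide format.
n_admissions = 1_000_000
avg_diagnoses = 2
max_diagnoses = 50

long_cells = n_admissions * avg_diagnoses   # one row per actual diagnosis
wide_cells = n_admissions * max_diagnoses   # every row padded to the maximum
empty_cells = wide_cells - long_cells

print(f"{empty_cells:,} empty cells")  # 48,000,000 empty cells
```

Because the wide file pads every admission to the widest case, its cost grows with the maximum, not the average, number of diagnoses.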
Perhaps you don't care about the amount of memory your data takes up (though you should!), but you almost certainly care about how easily you can inspect your data.
I'll set you a simple task to demonstrate. I would like you to identify an episode with a particular diagnosis by manually scanning through a file - 'E1172' perhaps? This is easy to find in your long file because you only have one column to scan down, and you always will!
In the small wide file, you have 10 columns to scan, which is already more complex. Now imagine you are working with a larger dataset of 150+ columns; it is unlikely you would even be able to view the full block of diagnoses for an episode on a single monitor.
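The same lookup expressed in code (with illustrative records and field names) shows why the long file stays simple while the wide file does not:

```python
# Hypothetical records in both layouts.
long_rows = [
    {"admission_id": "A1", "diagnosis": "I10"},
    {"admission_id": "A2", "diagnosis": "E1172"},
]
wide_rows = [
    {"admission_id": "A1", "diagnosis_1": "I10", "diagnosis_2": None},
    {"admission_id": "A2", "diagnosis_1": "E1172", "diagnosis_2": None},
]

# Long format: one column to check, regardless of how big the file grows.
hits_long = [r["admission_id"] for r in long_rows if r["diagnosis"] == "E1172"]

# Wide format: every diagnosis column must be checked, and the column
# list must be kept in sync with the file's actual width.
diag_cols = ["diagnosis_1", "diagnosis_2"]
hits_wide = [r["admission_id"] for r in wide_rows
             if any(r[c] == "E1172" for c in diag_cols)]

print(hits_long)  # ['A2']
print(hits_wide)  # ['A2']
```

Both searches find the same episode, but only the wide version breaks if the number of diagnosis columns changes.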