Linkage Stories

The Cohort Identifier

The quality and accuracy of identifiers in data files provided for linkage are essential to successful and timely linkage.

The Statistical Services Branch (SSB) received a request to link data from three different sources for a research project. Two were internal Queensland Health data sources and one was external to Queensland Health.

The external dataset provided to SSB contained 50,000+ records. One of the main identifiers (first and last name) in the provided dataset was formatted in a way that made matching the 50,000+ records challenging. The names were combined in one column rather than being separated into two columns (one for first name and one for last name).

This format made it difficult for the linkage team to determine patients’ correct first name and last name. All possible variations of first and last name had to be analysed and reviewed to find matches between the external dataset and the other two datasets.
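To illustrate why a combined name column multiplies the work, the following is a minimal sketch of how candidate first and last name pairs might be generated from a single field. It is not the linkage team's actual script. The function name candidate_name_splits and the rules it applies (a comma signals "Last, First"; otherwise every word boundary is tried in both orders) are assumptions made for illustration only.

```python
import re


def candidate_name_splits(full_name):
    """Return plausible (first_name, last_name) pairs for a combined name.

    Hypothetical helper: the splitting rules are illustrative assumptions,
    not the rules used by SSB.
    """
    cleaned = full_name.strip()

    # Assumption: a comma means the value is stored as "Last, First".
    if "," in cleaned:
        last, first = (part.strip() for part in cleaned.split(",", 1))
        return [(first, last)] if first and last else []

    # Otherwise split on whitespace and try every boundary between words.
    tokens = re.split(r"\s+", cleaned)
    if len(tokens) < 2:
        return []  # a single word cannot be split into two names

    splits = []
    for i in range(1, len(tokens)):
        first = " ".join(tokens[:i])
        last = " ".join(tokens[i:])
        splits.append((first, last))
        # The surname may have been entered first, so keep the reverse too.
        splits.append((last, first))
    return splits


if __name__ == "__main__":
    for pair in candidate_name_splits("Mary Anne Smith"):
        print(pair)
```

Even a three-word name produces four candidate pairs under these rules, and each one has to be checked against the other datasets. Supplying first and last name in separate columns removes this step entirely.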

Example

The linkage team spent 4 weeks applying regular expressions, testing custom scripts against the SSB Master Linkage File, and performing extensive grey area checking across all possible name variations. These custom scripts, created by the linkage team, helped identify incorrect and lower ranked matches, including any duplicate records. Through extensive teamwork and outside-the-box thinking, the linkage team successfully linked more than 95% of the 50,000+ records in the cohort file.
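As a rough illustration of the kind of checking described above, the sketch below compares candidate name pairs with a small reference list and sorts each cohort record into "matched", "grey_area" (more than one possible match, needing review) or "unmatched". It is a simplified assumption and not the SSB method: it matches on names only, the reference list is a stand-in and not the real Master Linkage File, and the function names normalise, classify_matches and find_duplicates are hypothetical.

```python
from collections import defaultdict


def normalise(name):
    # Lower-case and keep letters only, so "O'Brien" and "obrien" compare equal.
    return "".join(ch for ch in name.lower() if ch.isalpha())


def classify_matches(candidates_by_record, master_names):
    """Classify each cohort record by how many reference entries it matches.

    candidates_by_record: {record_id: [(first, last), ...]}
    master_names: {master_id: (first, last)}  (hypothetical reference list)
    """
    # Index the reference list by normalised (first, last) for fast lookup.
    index = defaultdict(set)
    for master_id, (first, last) in master_names.items():
        index[(normalise(first), normalise(last))].add(master_id)

    results = {}
    for record_id, candidates in candidates_by_record.items():
        hits = set()
        for first, last in candidates:
            hits |= index.get((normalise(first), normalise(last)), set())
        if not hits:
            results[record_id] = ("unmatched", [])
        elif len(hits) == 1:
            results[record_id] = ("matched", sorted(hits))
        else:
            # Several possible matches: flag for manual review.
            results[record_id] = ("grey_area", sorted(hits))
    return results


def find_duplicates(results):
    """Return reference ids that more than one cohort record matched."""
    by_master = defaultdict(list)
    for record_id, (status, master_ids) in results.items():
        if status == "matched":
            by_master[master_ids[0]].append(record_id)
    return {m: ids for m, ids in by_master.items() if len(ids) > 1}


if __name__ == "__main__":
    master = {"M1": ("John", "Smith"), "M2": ("Smith", "John")}
    cohort = {"r1": [("John", "Smith"), ("Smith", "John")]}
    print(classify_matches(cohort, master))
```

In this toy example the record is flagged as a grey area because both name orders exist in the reference list, which is the sort of ambiguity a combined name column creates.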

Lessons

Supplying the cohort in the preferred format, with separate columns for first and last name, would have significantly reduced the complexity of the linkage and the time needed to complete the linkage and extraction.

Linkage time frame

A breakdown of the linkage steps and time taken to complete this linkage:

  1. Receive and clean cohort data (2 weeks)
    • Breaking up the combined identifiers
  2. Link (2 weeks)
    • Grey area checking
  3. Preparation of content data request (1 week)
    • Compile the linked data and organise transfer of linked data file to client
  4. Output (1 day)
    • Linkage report - summarises the project, explains methodology, lists data issues

This breakdown shows the impact of data quality on the time it takes to complete a linkage request. Supplying data that is formatted as required and that is as clean as possible can dramatically reduce the time the linkage team takes to prepare and link data for a request.

See dataformat.pdf (health.qld.gov.au) for more information about formatting a cohort file for linkage purposes in SSB.

Last updated: 25 August 2023