Heart Failure Clinical Records Synthetic Data Project

Author

Linus Chirchir

Published

7 September 2025

Show code

# Load necessary libraries
# --- Project and Performance Utilities ---
library(here)                 # Manage project-relative file paths
library(doParallel)           # Enable parallel processing for faster computations
library(tidyverse)            # Core data science toolkit (dplyr, tidyr, readr, purrr, etc.)
library(readxl)               # Import Excel (.xlsx, .xls) files into R

# --- Data Exploration and Summaries ---
library(psych)                # Produce descriptive statistics and psychometric analyses
library(knitr)                # Format dynamic tables and reports in Markdown/Quarto

# --- Visualisation ---
library(ggplot2)              # Build customizable, high-quality data visualizations
library(patchwork)            # Combine multiple ggplot2 plots into unified layouts
library(corrplot)             # Visualize correlation matrices with heatmaps and plots
library(GGally)               # Extend ggplot2 with correlation plots, pairwise plots, ggpairs()
library(VIM)                  # Visualize/impute missing values with advanced methods (e.g., matrix plots)
library(scales)

# --- Data Imputation and Synthesis ---
library(mice)                 # Perform multiple imputation by chained equations for missing data
library(synthpop)             # Generate, evaluate, and compare synthetic datasets

# --- Statistical and Information-Theoretic Measures ---
library(transport)            # Compute optimal transport distances (e.g., Wasserstein) for distribution comparison
library(infotheo)             # Calculate information-theoretic metrics (entropy, mutual information)

# --- Modeling and Machine Learning ---
library(caret)                # Streamline machine learning workflows (training, tuning, evaluation)
library(xgboost)              # Implement gradient boosting for classification/regression
library(SHAPforxgboost)       # Explain XGBoost predictions using SHAP values
library(FNN)                  # Perform fast k-nearest neighbor searches and distance calculations

# Set up parallel processing
num_cores <- detectCores() - 1  # Use one less than the total number of cores
cl <- makeCluster(num_cores)
registerDoParallel(cl)

1 Introduction

This project evaluates the quality of synthetic datasets derived from the Heart Failure Clinical Records data (299 patients, 13 variables). Our goal is to determine how well different generation strategies reproduce the statistical properties and analytic value of the original data while protecting privacy. We generate synthetic data using four approaches: (1) parametric imputation (MICE), (2) non-parametric imputation (CART via MICE), (3) distribution-driven synthesis (synthpop), and (4) metadata-guided rules. We then compare each synthetic dataset with the real data across three dimensions:

Fidelity – univariate and multivariate similarity (distributions, ranges, correlations), histogram similarity, and mutual information.
Utility – model transportability using XGBoost (TRTR vs. TSTR), feature-importance agreement, and SHAP-based behaviour.
Privacy / Disclosure risk – exact record matches, neighbour-proximity checks, and membership-inference sensitivity.

Results are presented through aligned tables and plots (structure checks, categorical level retention, density overlays, correlation matrices, SHAP summaries) with concise scores for each method. The report concludes with a comparative summary to guide method selection given different fidelity–utility–privacy trade-offs.

2 Heart Failure Clinical Records Dataset and Initial Exploration

The Heart Failure Clinical Records dataset, sourced from the UCI Machine Learning Repository, contains records of 299 patients with heart failure collected during their follow-up period. Each record includes 13 clinical features covering demographics, clinical conditions, and laboratory measures:

Variable	Description	Unit / Levels
age	Age of the patient	Years
anaemia	Reduction in red blood cells	No / Yes
creatinine_phosphokinase	Enzyme level in the blood	mcg/L
diabetes	Diabetes status	No / Yes
ejection_fraction	% of blood leaving the heart with each contraction	Percentage (%)
platelets	Platelet count	Kiloplatelets/mL
serum_creatinine	Serum creatinine level	mg/dL
serum_sodium	Serum sodium level	mEq/L
sex	Gender of the patient	Female / Male
smoking	Smoking status	No / Yes
hypertension	Hypertension status	No / Yes
deceased	Survival status during follow-up	No / Yes
follow_up	Duration of the follow-up period	Days

A preview of the first 10 rows of the dataset is shown below.

Show code

# Set seed for reproducibility
set.seed(123)

# Load the heart failure dataset from the local directory
heart_failure <- read.csv(here("data", "heart_failure.csv"))

# Let's preview the heart failure dataset
knitr::kable(
  head(heart_failure, 10),
  caption = "Preview of the first 10 rows of the Heart Failure dataset",
  align = rep("l", ncol(heart_failure))  # left align all columns
)

Preview of the first 10 rows of the Heart Failure dataset
age	anaemia	creatinine_phosphokinase	diabetes	ejection_fraction	platelets	serum_creatinine	serum_sodium	sex	smoking	hypertension	deceased	follow_up
75	No	582	No	20	265000	1.9	130	Male	No	Yes	Yes	4
55	No	7861	No	NA	263358	1.1	NA	Male	No	No	Yes	6
65	No	NA	No	20	162000	NA	129	Male	Yes	No	Yes	7
NA	Yes	111	No	NA	NA	1.9	NA	Male	No	No	Yes	7
65	Yes	160	Yes	20	327000	2.7	116	Female	No	No	Yes	8
90	Yes	47	No	40	204000	NA	132	Male	Yes	Yes	Yes	8
75	Yes	246	No	15	NA	1.2	NA	Male	No	No	Yes	10
60	Yes	315	Yes	60	454000	1.1	131	Male	Yes	No	Yes	10
NA	No	NA	No	65	263358	NA	NA	Female	No	No	Yes	10
80	Yes	123	No	35	388000	9.4	133	Male	Yes	Yes	Yes	10

Here’s the preview of the structure of the heart failure dataset.

Show code

df <- data.frame(
  Variable = names(heart_failure),
  Type   = unname(sapply(heart_failure, \(x) paste(class(x), collapse = ", "))),
  Values = unname(sapply(heart_failure, \(x) paste(head(unique(x), 10), collapse = ", "))),
  row.names = NULL,
  check.names = FALSE
)

knitr::kable(
  df,
  caption = "Structure of the Heart Failure Dataset: Variables, Types, and Values",
  align = c("l","l","l"),
  row.names = FALSE
)

Structure of the Heart Failure Dataset: Variables, Types, and Values
Variable	Type	Values
age	numeric	75, 55, 65, NA, 90, 60, 80, 62, 45, 50
anaemia	character	No, Yes
creatinine_phosphokinase	integer	582, 7861, NA, 111, 160, 47, 246, 315, 123, 81
diabetes	character	No, Yes
ejection_fraction	integer	20, NA, 40, 15, 60, 65, 35, 38, 25, 30
platelets	numeric	265000, 263358.03, 162000, NA, 327000, 204000, 454000, 388000, 368000, 253000
serum_creatinine	numeric	1.9, 1.1, NA, 2.7, 1.2, 9.4, 4, 0.9, 1, 1.3
serum_sodium	integer	130, NA, 129, 116, 132, 131, 133, 140, 137, 138
sex	character	Male, Female
smoking	character	No, Yes
hypertension	character	Yes, No
deceased	character	Yes, No
follow_up	integer	4, 6, 7, 8, 10, 11, 12, 13, 14, 15

Let’s perform some basic data cleaning and preprocessing steps. Convert the following columns to factors: sex, anaemia, diabetes, smoking, hypertension and deceased. Here’s the new structure of the dataset after the basic cleaning.

Show code

# Convert the specified columns to factors
heart_failure$sex <- as.factor(heart_failure$sex)
heart_failure$anaemia <- as.factor(heart_failure$anaemia)
heart_failure$diabetes <- as.factor(heart_failure$diabetes)
heart_failure$smoking <- as.factor(heart_failure$smoking)
heart_failure$hypertension <- as.factor(heart_failure$hypertension)
heart_failure$deceased <- as.factor(heart_failure$deceased)

Here are the summary statistics of the heart failure dataset.

Show code

# Generate descriptive statistics and display as a single table
describe(heart_failure) |>
  kable(
    caption = "Summary Statistics of the Heart Failure Dataset",
    digits = 2,  # round to 2 decimal places
    align = "lrrrrrrrrrr"  # left for variable, right for numbers
  )

Summary Statistics of the Heart Failure Dataset
	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
age	1	284	60.80	12.08	60.0	60.15	14.83	40.0	95.0	55.0	0.42	-0.28	0.72
anaemia*	2	299	1.43	0.50	1.0	1.41	0.00	1.0	2.0	1.0	0.28	-1.93	0.03
creatinine_phosphokinase	3	285	578.81	981.99	249.0	358.95	268.35	23.0	7861.0	7838.0	4.45	24.52	58.17
diabetes*	4	299	1.42	0.49	1.0	1.40	0.00	1.0	2.0	1.0	0.33	-1.90	0.03
ejection_fraction	5	287	38.00	11.92	38.0	37.31	11.86	14.0	80.0	66.0	0.58	0.02	0.70
platelets	6	285	265163.83	97508.63	263358.0	259212.20	65765.22	25100.0	850000.0	824900.0	1.43	6.24	5775.91
serum_creatinine	7	281	1.39	1.05	1.1	1.18	0.30	0.5	9.4	8.9	4.44	25.03	0.06
serum_sodium	8	279	136.59	4.49	137.0	136.80	4.45	113.0	148.0	35.0	-1.05	3.88	0.27
sex*	9	299	1.65	0.48	2.0	1.68	0.00	1.0	2.0	1.0	-0.62	-1.62	0.03
smoking*	10	299	1.32	0.47	1.0	1.28	0.00	1.0	2.0	1.0	0.76	-1.42	0.03
hypertension*	11	299	1.35	0.48	1.0	1.32	0.00	1.0	2.0	1.0	0.62	-1.62	0.03
deceased*	12	299	1.32	0.47	1.0	1.28	0.00	1.0	2.0	1.0	0.76	-1.42	0.03
follow_up	13	299	130.26	77.61	115.0	129.28	105.26	4.0	285.0	281.0	0.13	-1.22	4.49

3 Generating Synthetic Data

The process of generating synthetic data is essential for tasks such as privacy preservation, testing machine learning models, and conducting simulations. In this section, we employ various techniques to generate synthetic data based on the real heart failure dataset. Each method serves a different purpose and offers distinct advantages, depending on the use case and the type of missing data or privacy concerns.

We explore four key approaches for generating synthetic data:

3.1 Parametric Imputation Using MICE

This method uses the Multivariate Imputation by Chained Equations (MICE) framework, with parametric imputation based on a normal distribution. MICE allows for handling missing data through multiple imputations and is widely used in data science and healthcare for its flexibility and statistical rigor. Here’s the preview of the first 10 rows of the synthetic data generated.

Show code

# Function to compute the 5th and 95th percentiles for numeric columns
compute_percentiles <- function(data, lower_percentile = 0.05, upper_percentile = 0.95) {
  percentiles <- lapply(data, function(column) {
    if (is.numeric(column)) {
      lower <- quantile(column, probs = lower_percentile, na.rm = TRUE)
      upper <- quantile(column, probs = upper_percentile, na.rm = TRUE)
      return(c(lower, upper))
    }
    return(c(NA, NA))
  })
  names(percentiles) <- names(data)
  return(percentiles)
}

# Function to scale numeric values based on the 5th and 95th percentiles
scale_to_percentiles <- function(data, percentiles) {
  scaled_data <- data
  for (col in colnames(data)) {
    if (is.numeric(data[[col]])) {
      # Apply scaling only if percentiles are available for this column
      if (!is.na(percentiles[[col]][1]) && !is.na(percentiles[[col]][2])) {
        min_val <- percentiles[[col]][1]
        max_val <- percentiles[[col]][2]
        # Clip values to lie within the 5th and 95th percentile range
        scaled_data[[col]] <- pmin(pmax(data[[col]], min_val), max_val)
      }
    }
  }
  return(scaled_data)
}

# Function to generate missingness in a dataset based on the real data's pattern
generate_missingness_based_on_real <- function(real_data, synthetic_data) {
  # Ensure the synthetic dataset has the same structure as the real
  if (!all(colnames(real_data) == colnames(synthetic_data))) {
    stop("Column names in real and synthetic data must match")
  }
  
  # Copy the real missingness pattern to the synthetic dataset
  synthetic_data_with_missingness <- synthetic_data
  for (col in colnames(real_data)) {
    # Apply missingness where it was present in the real data
    missing_indices <- is.na(real_data[[col]])
    synthetic_data_with_missingness[[col]][missing_indices] <- NA
  }
  
  return(synthetic_data_with_missingness)
}

# Compute the 5th and 95th percentiles for the real data
percentiles <- compute_percentiles(heart_failure)

# Generate synthetic data using MICE
where <- make.where(heart_failure, "all")
method <- make.method(heart_failure, where = where)
method[method == "pmm"] <- "norm"

# Perform multiple imputation (10 datasets)
syn_param <- mice(heart_failure, m = 10, maxit = 1, method = method, where = where, printFlag = FALSE)

# Extract the first synthetic dataset
syn_data_1 <- complete(syn_param, 1)

# Scale synthetic data to the 5th and 95th percentile ranges
syn_data_1 <- scale_to_percentiles(syn_data_1, percentiles)

# Apply the real missingness pattern to the synthetic dataset
syn_data_1 <- generate_missingness_based_on_real(heart_failure, syn_data_1)

# Round off all numeric columns to 0 decimal places
syn_data_1 <- syn_data_1 %>%
  mutate(across(where(is.numeric), round, digits = 0))

# Add a prefix 'synth_' to all synthetic dataset variable names
colnames(syn_data_1) <- paste("synth", colnames(syn_data_1), sep = "_")

# Display the structure of the dataset (fix double variable issue)
df_syn <- data.frame(
  Variable = names(syn_data_1),
  Type   = unname(sapply(syn_data_1, \(x) paste(class(x), collapse = ", "))),
  Values = unname(sapply(syn_data_1, \(x) paste(head(unique(x), 10), collapse = ", "))),
  row.names = NULL,
  check.names = FALSE
)

# Remove duplicated variables if any
df_syn <- df_syn[!duplicated(df_syn$Variable), ]

knitr::kable(
  df_syn,
  caption = "Structure of the Heart Failure Dataset: Variables, Types, and Values",
  align = c("l","l","l"),
  row.names = FALSE
)

Structure of the Heart Failure Dataset: Variables, Types, and Values
Variable	Type	Values
synth_age	numeric	56, 42, 57, NA, 69, 53, 72, 59, 82, 73
synth_anaemia	factor	No, Yes
synth_creatinine_phosphokinase	numeric	184, 1540, NA, 559, 941, 1926, 570, 1380, 59, 1531
synth_diabetes	factor	No, Yes
synth_ejection_fraction	numeric	40, NA, 52, 48, 49, 27, 33, 51, 30, 39
synth_platelets	numeric	284572, 235177, 175379, NA, 352540, 132200, 349886, 421200, 246142, 265262
synth_serum_creatinine	numeric	2, 1, NA, 3
synth_serum_sodium	numeric	138, NA, 131, 130, 137, 135, 142, 134, 141, 144
synth_sex	factor	Male, Female
synth_smoking	factor	Yes, No
synth_hypertension	factor	Yes, No
synth_deceased	factor	Yes, No
synth_follow_up	numeric	90, 13, 88, 40, 31, 129, 91, 51, 29, 118

3.2 Non-Parametric Imputation Using CART

The Classification and Regression Trees (CART) method, a non-parametric approach, is useful for imputation when the relationships between variables are complex or non-linear. By leveraging decision trees, CART is capable of capturing intricate patterns in the data without making strong assumptions about the underlying distributions. Here’s the preview of the first 10 rows of the synthetic data generated.

Show code

# Compute the 5th and 95th percentiles for the real data
percentiles <- compute_percentiles(heart_failure)

# Generate synthetic data using CART imputation with MICE
where <- make.where(heart_failure, "all")
method <- make.method(heart_failure, where = where)
method[method == "pmm"] <- "cart"  # Set the method to "cart" for non-parametric imputation

# Perform multiple imputation (10 datasets)
syn_cart <- mice(heart_failure, m = 10, maxit = 1, method = method, where = where, printFlag = FALSE)

# Extract the first synthetic dataset
syn_cart_1 <- complete(syn_cart, 1)

# Scale synthetic data to the 5th and 95th percentile ranges
syn_cart_1 <- scale_to_percentiles(syn_cart_1, percentiles)

# Apply the real missingness pattern to the synthetic dataset
syn_cart_1 <- generate_missingness_based_on_real(heart_failure, syn_cart_1)

# Round off all numeric columns to 0 decimal places
syn_cart_1 <- syn_cart_1 %>%
  mutate(across(where(is.numeric), round, digits = 0))

# Add a prefix 'synth_' to all synthetic dataset variable names
colnames(syn_cart_1) <- paste("synth", colnames(syn_cart_1), sep = "_")

# Display the structure of the dataset (fix double variable issue)
df_cart <- data.frame(
  Variable = names(syn_cart_1),
  Type   = unname(sapply(syn_cart_1, \(x) paste(class(x), collapse = ", "))),
  Values = unname(sapply(syn_cart_1, \(x) paste(head(unique(x), 10), collapse = ", "))),
  row.names = NULL,
  check.names = FALSE
)

# Remove duplicated variables if any
df_cart <- df_cart[!duplicated(df_cart$Variable), ]

knitr::kable(
  df_cart,
  caption = "Structure of the Heart Failure Dataset: Variables, Types, and Values (CART Imputation)",
  align = c("l","l","l"),
  row.names = FALSE
)

Structure of the Heart Failure Dataset: Variables, Types, and Values (CART Imputation)
Variable	Type	Values
synth_age	numeric	60, 50, 65, NA, 82, 53, 75, 69, 70, 45
synth_anaemia	factor	No, Yes
synth_creatinine_phosphokinase	numeric	582, 161, NA, 203, 112, 371, 246, 60, 59, 2221
synth_diabetes	factor	No, Yes
synth_ejection_fraction	numeric	30, NA, 38, 25, 50, 60, 35, 20, 45, 40
synth_platelets	numeric	132200, 263358, NA, 421200, 351000, 219000, 327000, 213000, 395000, 282000
synth_serum_creatinine	numeric	3, 1, NA, 2
synth_serum_sodium	numeric	133, NA, 130, 140, 138, 139, 137, 136, 134, 135
synth_sex	factor	Female, Male
synth_smoking	factor	Yes, No
synth_hypertension	factor	No, Yes
synth_deceased	factor	Yes, No
synth_follow_up	numeric	59, 30, 154, 24, 66, 13, 28, 65, 129, 72

3.3 Generating Synthetic Data Using Synthpop

The Synthpop package is designed for creating synthetic data based on the distribution of the real dataset. It is particularly useful for privacy-preserving data sharing, where the aim is to create a synthetic dataset that mimics the real data without exposing sensitive information. Here’s the preview of the first 10 rows of the synthetic data generated.

Show code

# Compute the 5th and 95th percentiles for the real data
percentiles <- compute_percentiles(heart_failure)

# Generate low-fidelity synthetic data using random sampling
syn_data_low_fidelity <- syn(heart_failure, method = "sample", seed = 123)


Synthesis
-----------
 age anaemia creatinine_phosphokinase diabetes ejection_fraction platelets serum_creatinine serum_sodium sex smoking
 hypertension deceased follow_up

Show code

# Convert the synthetic dataset to a data frame
syn_data_low_fidelity_synthpop <- syn_data_low_fidelity$syn

# Scale the synthetic data to the 5th and 95th percentile ranges
syn_data_low_fidelity_synthpop <- scale_to_percentiles(syn_data_low_fidelity_synthpop, percentiles)

# Apply the real missingness pattern to the synthetic dataset
syn_data_low_fidelity_synthpop <- generate_missingness_based_on_real(heart_failure, syn_data_low_fidelity_synthpop)

# Round off all numeric columns to 0 decimal places
syn_data_low_fidelity_synthpop <- syn_data_low_fidelity_synthpop %>%
  mutate(across(where(is.numeric), round, digits = 0))

# Add a prefix 'synth_' to all synthetic dataset variable names
colnames(syn_data_low_fidelity_synthpop) <- paste("synth", colnames(syn_data_low_fidelity_synthpop), sep = "_")

# Display the structure of the dataset (fix double variable issue)
df_synthpop <- data.frame(
  Variable = names(syn_data_low_fidelity_synthpop),
  Type   = unname(sapply(syn_data_low_fidelity_synthpop, \(x) paste(class(x), collapse = ", "))),
  Values = unname(sapply(syn_data_low_fidelity_synthpop, \(x) paste(head(unique(x), 10), collapse = ", "))),
  row.names = NULL,
  check.names = FALSE
)

# Remove duplicated variables if any
df_synthpop <- df_synthpop[!duplicated(df_synthpop$Variable), ]

knitr::kable(
  df_synthpop,
  caption = "Structure of the Heart Failure Dataset: Variables, Types, and Values (Synthpop)",
  align = c("l","l","l"),
  row.names = FALSE
)

Structure of the Heart Failure Dataset: Variables, Types, and Values (Synthpop)
Variable	Type	Values
synth_age	numeric	63, 50, 45, NA, 73, 57, 70, 52, 65, 80
synth_anaemia	factor	Yes, No
synth_creatinine_phosphokinase	numeric	427, 335, NA, 249, 64, 1610, 2221, 224, 910, 260
synth_diabetes	factor	No, Yes
synth_ejection_fraction	numeric	25, NA, 20, 40, 60, 30, 35, 38, 45, 50
synth_platelets	numeric	NA, 229000, 385000, 277000, 268000, 271000, 274000, 189000, 236000, 329000
synth_serum_creatinine	numeric	1, NA, 2, 3
synth_serum_sodium	numeric	142, NA, 139, 136, 137, 134, 140, 130, 144, 135
synth_sex	factor	Male, Female
synth_smoking	factor	Yes, No
synth_hypertension	factor	Yes, No
synth_deceased	factor	Yes, No
synth_follow_up	numeric	250, 13, 91, 121, 20, 113, 186, 85, 74, 50

3.4 Generating Synthetic Data Using Metadada

This method uses a data dictionary (metadata) to drive the generation of synthetic data. The metadata defines the structure and types of variables, which ensures that the generated synthetic data conforms to expected formats, such as ranges for continuous variables and categories for binary or categorical variables. Here’s the preview of the first 10 rows of the synthetic data generated.

Show code

# Compute the 5th and 95th percentiles for the real data
percentiles <- compute_percentiles(heart_failure)

# Generate synthetic data using metadata

# Load the metadata from the Excel file
heart_failure_metadata <- read_excel(here("data", "heart_failure_metadata.xlsx"))

# Create an empty data frame based on the metadata structure
n_rows <- nrow(heart_failure)  # Number of synthetic records to generate
variable_names <- heart_failure_metadata$`Variable Name`
data_types <- heart_failure_metadata$Type

syn_data_metadata <- data.frame(matrix(ncol = length(variable_names), nrow = n_rows))
colnames(syn_data_metadata) <- variable_names

# Function to generate data based on metadata
generate_data <- function(variable, real_data, type, n) {
  if (type == "Continuous" || type == "Integer") {
    if (is.numeric(real_data[[variable]])) {
      # Compute 5th and 95th percentiles for numeric variables
      p5 <- quantile(real_data[[variable]], probs = 0.05, na.rm = TRUE)
      p95 <- quantile(real_data[[variable]], probs = 0.95, na.rm = TRUE)
      
      if (type == "Continuous") {
        return(runif(n, min = p5, max = p95))
      } else if (type == "Integer") {
        return(sample(floor(p5):ceiling(p95), n, replace = TRUE))
      }
    }
  } else if (is.factor(real_data[[variable]])) {
    return(sample(levels(real_data[[variable]]), n, replace = TRUE))
  } else if (variable %in% c("anaemia", "diabetes", "hypertension", "smoking", "sex")) {
    return(sample(unique(real_data[[variable]]), n, replace = TRUE))
  } else {
    return(rep(NA, n))
  }
}

# Generate synthetic data based on the metadata
for (i in seq_along(variable_names)) {
  syn_data_metadata[[i]] <- generate_data(variable_names[i], heart_failure, data_types[i], n_rows)
}

# Scale the synthetic data to the 5th and 95th percentile ranges
syn_data_metadata <- scale_to_percentiles(syn_data_metadata, percentiles)

# Apply the real missingness pattern to the synthetic dataset
syn_data_metadata <- generate_missingness_based_on_real(heart_failure, syn_data_metadata)

# Round off all numeric columns to 0 decimal places
syn_data_metadata <- syn_data_metadata %>%
  mutate(across(where(is.numeric), round, digits = 0))

# Add a prefix 'synth_' to all synthetic dataset variable names
colnames(syn_data_metadata) <- paste("synth", colnames(syn_data_metadata), sep = "_")

# Display the structure of the dataset (fix double variable issue)
df_metadata <- data.frame(
  Variable = names(syn_data_metadata),
  Type   = unname(sapply(syn_data_metadata, \(x) paste(class(x), collapse = ", "))),
  Values = unname(sapply(syn_data_metadata, \(x) paste(head(unique(x), 10), collapse = ", "))),
  row.names = NULL,
  check.names = FALSE
)

# Remove duplicated variables if any
df_metadata <- df_metadata[!duplicated(df_metadata$Variable), ]

knitr::kable(
  df_metadata,
  caption = "Structure of the Heart Failure Dataset: Variables, Types, and Values (Metadata)",
  align = c("l","l","l"),
  row.names = FALSE
)

Structure of the Heart Failure Dataset: Variables, Types, and Values (Metadata)
Variable	Type	Values
synth_age	numeric	45, 51, 82, NA, 80, 66, 59, 71, 47, 57
synth_anaemia	character	Yes, No
synth_creatinine_phosphokinase	numeric	180, 1663, NA, 191, 571, 1228, 395, 1553, 812, 335
synth_diabetes	character	Yes, No
synth_ejection_fraction	numeric	31, NA, 58, 60, 29, 26, 37, 33, 24, 28
synth_platelets	numeric	210137, 203132, 257570, NA, 395609, 331643, 171789, 279683, 413390, 185403
synth_serum_creatinine	numeric	3, 1, NA, 2
synth_serum_sodium	numeric	136, NA, 138, 131, 130, 135, 134, 137, 143, 142
synth_sex	character	Female, Male
synth_smoking	character	Yes, No
synth_hypertension	character	Yes, No
synth_deceased	character	No, Yes
synth_follow_up	numeric	152, 231, 199, 46, 188, 117, 57, 248, 90, 237

4 Synthetic Data Identification

In this section, we generate tables to identify which columns in the synthetic datasets are synthetic (prefixed with “synth_”) and which remain unchanged from the real dataset. This helps ensure clarity on which variables have been altered during the synthetic data generation process.

For each dataset, we produce a table displaying the column names and whether the column is synthetic. A column is marked “Yes” if it was generated synthetically and “No” if it was retained from the real data.

Show code

# Function to create column info with dataset name
create_column_info <- function(dataset, dataset_name) {
  data.frame(
    column_name = colnames(dataset),
    dataset = dataset_name,
    is_synthetic = ifelse(grepl("^synth_", colnames(dataset)), "Yes", "No"),
    stringsAsFactors = FALSE
  )
}

# Collect column info for all datasets
all_column_info <- bind_rows(
  create_column_info(syn_data_1, "Parametric MICE"),
  create_column_info(syn_cart_1, "CART Imputation"),
  create_column_info(syn_data_low_fidelity_synthpop, "Synthpop (Low Fidelity)"),
  create_column_info(syn_data_metadata, "Metadata-Based")
)

# Reshape into wide format so variables appear once
wide_column_info <- all_column_info %>%
  pivot_wider(
    names_from = dataset,
    values_from = is_synthetic
  )

# Display the combined wide table
kable(
  wide_column_info,
  caption = "Synthetic Indicators for Each Variable Across Datasets"
)

Synthetic Indicators for Each Variable Across Datasets
column_name	Parametric MICE	CART Imputation	Synthpop (Low Fidelity)	Metadata-Based
synth_age	Yes	Yes	Yes	Yes
synth_anaemia	Yes	Yes	Yes	Yes
synth_creatinine_phosphokinase	Yes	Yes	Yes	Yes
synth_diabetes	Yes	Yes	Yes	Yes
synth_ejection_fraction	Yes	Yes	Yes	Yes
synth_platelets	Yes	Yes	Yes	Yes
synth_serum_creatinine	Yes	Yes	Yes	Yes
synth_serum_sodium	Yes	Yes	Yes	Yes
synth_sex	Yes	Yes	Yes	Yes
synth_smoking	Yes	Yes	Yes	Yes
synth_hypertension	Yes	Yes	Yes	Yes
synth_deceased	Yes	Yes	Yes	Yes
synth_follow_up	Yes	Yes	Yes	Yes

5 Synthetic Dataset Structure

5.1 Dataset Size and Structure Comparison

This step evaluates whether the synthetic datasets match the real dataset in terms of the number of rows, columns, and feature types. Consistency in these dimensions is essential for the synthetic data to be a reliable stand-in for the real.

Number of Rows: Indicates if the synthetic data captures the same number of records as the real. Any mismatch can impact comparability.
Number of Columns: Ensures all variables are retained. Missing or extra columns suggest structural issues.
Feature Types: Confirms that categorical, numerical, and other variable types remain unchanged, preserving analytical consistency.

Show code

# Function to extract structure info
get_structure_info <- function(real_data, synthetic_data, dataset_name) {
  tibble(
    Dataset = dataset_name,
    Rows_Real = nrow(real_data),
    Cols_Real = ncol(real_data),
    Rows_Synthetic = nrow(synthetic_data),
    Cols_Synthetic = ncol(synthetic_data),
    Structure_Match = ifelse(
      nrow(real_data) == nrow(synthetic_data) & ncol(real_data) == ncol(synthetic_data),
      "✅ Yes", "❌ No"
    )
  )
}

# Collect results
structure_comparison <- bind_rows(
  get_structure_info(heart_failure, syn_data_1, "Parametric MICE"),
  get_structure_info(heart_failure, syn_cart_1, "CART Imputation"),
  get_structure_info(heart_failure, syn_data_low_fidelity_synthpop, "Synthpop (Low Fidelity)"),
  get_structure_info(heart_failure, syn_data_metadata, "Metadata-Based")
)

# Display as a neat table
kable(
  structure_comparison,
  caption = "Comparison of Real vs Synthetic Dataset Sizes and Structures"
)

Comparison of Real vs Synthetic Dataset Sizes and Structures
Dataset	Rows_Real	Cols_Real	Rows_Synthetic	Cols_Synthetic	Structure_Match
Parametric MICE	299	13	299	13	✅ Yes
CART Imputation	299	13	299	13	✅ Yes
Synthpop (Low Fidelity)	299	13	299	13	✅ Yes
Metadata-Based	299	13	299	13	✅ Yes

5.2 Categorical Variables: All Levels Maintained and Comparison of Distribution

This section evaluates how well the synthetic datasets replicate the categorical variables in the real dataset, focusing on the preservation of all levels and their distributions.

Level Matching: For each categorical variable, the synthetic dataset should retain all levels (categories) present in the real dataset. Missing levels suggest that the synthetic data may be incomplete, while additional levels could indicate errors in data generation.
Distribution Comparison: The frequency of each categorical level in the synthetic data should closely mirror that of the real dataset. Significant deviations in category frequencies suggest that the synthetic data does not fully capture the true distribution of the categorical features.

Show code

# Function to collect categorical level info
get_cat_levels <- function(real_data, synthetic_data, dataset_name) {
  cat_vars <- colnames(real_data)[sapply(real_data, is.factor)]
  
  do.call(rbind, lapply(cat_vars, function(var) {
    levels_real <- unique(real_data[[var]])
    levels_synth <- unique(synthetic_data[[paste("synth", var, sep = "_")]])
    
    data.frame(
      Variable = var,
      Dataset = dataset_name,
      Real_Levels = paste(levels_real, collapse = ", "),
      Synthetic_Levels = paste(levels_synth, collapse = ", "),
      Match = ifelse(all(levels_real %in% levels_synth), "✅ Yes", "❌ No"),
      stringsAsFactors = FALSE
    )
  }))
}

# Collect results
cat_comparison <- bind_rows(
  get_cat_levels(heart_failure, syn_data_1, "Parametric MICE"),
  get_cat_levels(heart_failure, syn_cart_1, "CART Imputation"),
  get_cat_levels(heart_failure, syn_data_low_fidelity_synthpop, "Synthpop (Low Fidelity)"),
  get_cat_levels(heart_failure, syn_data_metadata, "Metadata-Based")
)

# Display table
kable(cat_comparison,
      caption = "Comparison of Categorical Variable Levels in Real vs Synthetic Datasets")

Comparison of Categorical Variable Levels in Real vs Synthetic Datasets
Variable	Dataset	Real_Levels	Synthetic_Levels	Match
anaemia	Parametric MICE	No, Yes	No, Yes	✅ Yes
diabetes	Parametric MICE	No, Yes	No, Yes	✅ Yes
sex	Parametric MICE	Male, Female	Male, Female	✅ Yes
smoking	Parametric MICE	No, Yes	Yes, No	✅ Yes
hypertension	Parametric MICE	Yes, No	Yes, No	✅ Yes
deceased	Parametric MICE	Yes, No	Yes, No	✅ Yes
anaemia	CART Imputation	No, Yes	No, Yes	✅ Yes
diabetes	CART Imputation	No, Yes	No, Yes	✅ Yes
sex	CART Imputation	Male, Female	Female, Male	✅ Yes
smoking	CART Imputation	No, Yes	Yes, No	✅ Yes
hypertension	CART Imputation	Yes, No	No, Yes	✅ Yes
deceased	CART Imputation	Yes, No	Yes, No	✅ Yes
anaemia	Synthpop (Low Fidelity)	No, Yes	Yes, No	✅ Yes
diabetes	Synthpop (Low Fidelity)	No, Yes	No, Yes	✅ Yes
sex	Synthpop (Low Fidelity)	Male, Female	Male, Female	✅ Yes
smoking	Synthpop (Low Fidelity)	No, Yes	Yes, No	✅ Yes
hypertension	Synthpop (Low Fidelity)	Yes, No	Yes, No	✅ Yes
deceased	Synthpop (Low Fidelity)	Yes, No	Yes, No	✅ Yes
anaemia	Metadata-Based	No, Yes	Yes, No	✅ Yes
diabetes	Metadata-Based	No, Yes	Yes, No	✅ Yes
sex	Metadata-Based	Male, Female	Female, Male	✅ Yes
smoking	Metadata-Based	No, Yes	Yes, No	✅ Yes
hypertension	Metadata-Based	Yes, No	Yes, No	✅ Yes
deceased	Metadata-Based	Yes, No	No, Yes	✅ Yes

5.3 Numeric Variables: Range and Distribution Comparison

This analysis evaluates how well the numeric variables in the synthetic datasets replicate the real data in terms of range and distribution.

Range: For each numeric variable, the synthetic data should maintain values within the range of the real dataset. Any deviations, especially outliers, could indicate inaccuracies in the synthetic generation process.
Distribution Comparison: Density plots are used to compare the shape of the distributions between the real and synthetic datasets. Ideally, the synthetic data should closely mirror the real distribution, indicating that the underlying data generation model has captured the true variability of the numeric features.

Show code

# Compute per-variable stats for a (real, synthetic) pair
numeric_summary <- function(real, synth, dataset_name) {
  num_vars <- names(real)[sapply(real, is.numeric)]
  # real percentiles
  pctl <- map(num_vars, ~quantile(real[[.x]], probs = c(.05, .95), na.rm = TRUE)) |>
    setNames(num_vars)

  tibble(
    Variable = num_vars,
    Real_Min = map_dbl(num_vars, ~min(real[[.x]], na.rm = TRUE)),
    Real_P5  = map_dbl(num_vars, ~pctl[[.x]][1]),
    Real_P95 = map_dbl(num_vars, ~pctl[[.x]][2]),
    Real_Max = map_dbl(num_vars, ~max(real[[.x]], na.rm = TRUE)),
    Synth_Min = map_dbl(num_vars, ~min(synth[[paste0("synth_", .x)]], na.rm = TRUE)),
    Synth_Max = map_dbl(num_vars, ~max(synth[[paste0("synth_", .x)]], na.rm = TRUE))
  ) |>
    mutate(
      Within_P5_P95 = ifelse(Synth_Min >= Real_P5 & Synth_Max <= Real_P95, "✅ Yes", "❌ No"),
      Dataset = dataset_name,
      .before = 1
    )
}

# Build one continuous table for all synthetic datasets
num_table <- bind_rows(
  numeric_summary(heart_failure, syn_data_1, "Parametric MICE"),
  numeric_summary(heart_failure, syn_cart_1, "CART Imputation"),
  numeric_summary(heart_failure, syn_data_low_fidelity_synthpop, "Synthpop (Low Fidelity)"),
  numeric_summary(heart_failure, syn_data_metadata, "Metadata-Based")
)

kable(
  num_table,
  caption = "Numeric variables: real ranges & percentiles vs. synthetic ranges (with P5–P95 containment check)"
)

Numeric variables: real ranges & percentiles vs. synthetic ranges (with P5–P95 containment check)
Within_P5_P95	Dataset	Variable	Real_Min	Real_P5	Real_P95	Real_Max	Synth_Min	Synth_Max
❌ No	Parametric MICE	age	40.0	42.15	82.0	95.0	42	82
❌ No	Parametric MICE	creatinine_phosphokinase	23.0	59.20	2220.8	7861.0	59	2221
✅ Yes	Parametric MICE	ejection_fraction	14.0	20.00	60.0	80.0	20	60
✅ Yes	Parametric MICE	platelets	25100.0	132200.00	421200.0	850000.0	132200	421200
❌ No	Parametric MICE	serum_creatinine	0.5	0.70	2.9	9.4	1	3
✅ Yes	Parametric MICE	serum_sodium	113.0	130.00	144.0	148.0	130	144
✅ Yes	Parametric MICE	follow_up	4.0	12.90	250.0	285.0	13	250
❌ No	CART Imputation	age	40.0	42.15	82.0	95.0	42	82
❌ No	CART Imputation	creatinine_phosphokinase	23.0	59.20	2220.8	7861.0	59	2221
✅ Yes	CART Imputation	ejection_fraction	14.0	20.00	60.0	80.0	20	60
✅ Yes	CART Imputation	platelets	25100.0	132200.00	421200.0	850000.0	132200	421200
❌ No	CART Imputation	serum_creatinine	0.5	0.70	2.9	9.4	1	3
✅ Yes	CART Imputation	serum_sodium	113.0	130.00	144.0	148.0	130	144
✅ Yes	CART Imputation	follow_up	4.0	12.90	250.0	285.0	13	250
❌ No	Synthpop (Low Fidelity)	age	40.0	42.15	82.0	95.0	42	82
❌ No	Synthpop (Low Fidelity)	creatinine_phosphokinase	23.0	59.20	2220.8	7861.0	59	2221
✅ Yes	Synthpop (Low Fidelity)	ejection_fraction	14.0	20.00	60.0	80.0	20	60
✅ Yes	Synthpop (Low Fidelity)	platelets	25100.0	132200.00	421200.0	850000.0	132200	421200
❌ No	Synthpop (Low Fidelity)	serum_creatinine	0.5	0.70	2.9	9.4	1	3
✅ Yes	Synthpop (Low Fidelity)	serum_sodium	113.0	130.00	144.0	148.0	130	144
✅ Yes	Synthpop (Low Fidelity)	follow_up	4.0	12.90	250.0	285.0	13	250
❌ No	Metadata-Based	age	40.0	42.15	82.0	95.0	42	82
✅ Yes	Metadata-Based	creatinine_phosphokinase	23.0	59.20	2220.8	7861.0	67	2216
✅ Yes	Metadata-Based	ejection_fraction	14.0	20.00	60.0	80.0	20	60
✅ Yes	Metadata-Based	platelets	25100.0	132200.00	421200.0	850000.0	134054	421118
❌ No	Metadata-Based	serum_creatinine	0.5	0.70	2.9	9.4	1	3
✅ Yes	Metadata-Based	serum_sodium	113.0	130.00	144.0	148.0	130	144
✅ Yes	Metadata-Based	follow_up	4.0	12.90	250.0	285.0	13	249

5.4 Missingness Comparison

In this section, we compare the proportions of missing values between the real dataset and each synthetic dataset. This comparison helps determine whether the synthetic datasets accurately represent the patterns of missingness in the real data.

Missingness Proportions: For each dataset, we calculate the proportion of missing values for every column and compare the results across the real and synthetic datasets.
Comparison Outcome: Ideally, the synthetic datasets should exhibit similar missingness proportions to the real data. Any deviations may indicate differences in how missing values were handled during the synthetic data generation process.

5.4.1 Introduction Plots

Show code

# Custom palette for the 5 metrics
metric_palette <- c(
  "Discrete Columns"     = "#B0D99B",
  "Continuous Columns"   = "#528AA8",
  "All Missing Columns"  = "#FFB6DB",
  "Complete Rows"        = "#264653",
  "Missing Observations" = "#2A9D8F"
)

# Strip 'synth_' for fair visuals
strip_synth_prefix <- function(df) { 
  names(df) <- sub("^synth_", "", names(df)) 
  df 
}

# Datasets to compare
datasets <- list(
  "Real Data"               = heart_failure,
  "Parametric MICE"         = strip_synth_prefix(syn_data_1),
  "CART Imputation"         = strip_synth_prefix(syn_cart_1),
  "Synthpop (Low Fidelity)" = strip_synth_prefix(syn_data_low_fidelity_synthpop),
  "Metadata-Based"          = strip_synth_prefix(syn_data_metadata)
)

# Function: horizontal bar chart for the 5 metrics
make_intro_plot <- function(df, title) {
  n_rows <- nrow(df)
  n_cols <- ncol(df)

  pct_discrete   <- 100 * sum(vapply(df, is.factor,  logical(1))) / n_cols
  pct_continuous <- 100 * sum(vapply(df, is.numeric, logical(1))) / n_cols
  pct_allmisscol <- 100 * sum(colSums(is.na(df)) == n_rows) / n_cols
  pct_completer  <- 100 * sum(complete.cases(df)) / n_rows
  pct_missobs    <- 100 * sum(is.na(df)) / (n_rows * n_cols)

  plot_df <- tibble::tibble(
    Metric = factor(
      c("Discrete Columns", "Continuous Columns", "All Missing Columns",
        "Complete Rows", "Missing Observations"),
      levels = rev(c("Discrete Columns", "Continuous Columns", "All Missing Columns",
                     "Complete Rows", "Missing Observations")) # reverse for nicer order
    ),
    Percentage = c(pct_discrete, pct_continuous, pct_allmisscol, pct_completer, pct_missobs)
  )

  ggplot(plot_df, aes(x = Percentage, y = Metric, fill = Metric)) +
    geom_col(alpha = 0.9, show.legend = FALSE) +
    geom_text(aes(label = paste0(round(Percentage, 1), "%")),
              hjust = -0.1, size = 3.5) +
    scale_x_continuous(
      labels = scales::percent_format(scale = 1),
      limits = c(0, 100),
      expand = expansion(mult = c(0, 0.05)) # headroom for labels
    ) +
    scale_fill_manual(values = metric_palette) +
    labs(title = title, x = "Percentage", y = NULL) +
    theme_light() +
    theme(
      axis.text.y  = element_text(size = 9),
      plot.title   = element_text(face = "bold")
    )
}

# Build plots and combine in one grid
wrap_plots(
  imap(datasets, ~ make_intro_plot(.x, .y)),
  ncol = 2
) +
  plot_annotation(
    title = "Dataset Overview: Discrete/Continuous Columns, All-Missing Columns, Complete Rows, Missing Observations",
    subtitle = "Percentages shown per dataset; axes standardized to 0–100%",
    theme = theme(plot.title = element_text(face = "bold", size = 12))
  )

5.4.2 Missingness Proportions and Comparison Outcome

Show code

# Remove 'synth_' prefix so names match the real dataset
strip_synth_prefix <- function(df) {
  names(df) <- sub("^synth_", "", names(df))
  df
}

# Compare against the real dataset's variable set
core_vars <- names(heart_failure)

# Function to compute missingness proportions (2 d.p.) for one dataset
get_missingness <- function(df, label) {
  df2 <- df %>% strip_synth_prefix()
  # keep only variables present in the real data (prevents mismatches)
  df2 <- df2 %>% select(any_of(core_vars))
  tibble(
    Variable   = names(df2),
    Proportion = round(colSums(is.na(df2)) / nrow(df2), 2),
    Dataset    = label
  )
}

# Build tidy table
missing_tbl <- bind_rows(
  get_missingness(heart_failure, "Real Data"),
  get_missingness(syn_data_1, "Parametric MICE"),
  get_missingness(syn_cart_1, "CART Imputation"),
  get_missingness(syn_data_low_fidelity_synthpop, "Synthpop (Low Fidelity)"),
  get_missingness(syn_data_metadata, "Metadata-Based")
)

# Pivot wider for side-by-side comparison
missing_tbl_wide <- missing_tbl %>%
  pivot_wider(names_from = Dataset, values_from = Proportion) %>%
  arrange(Variable)

# Optional: add quick match flags (✅ if identical to Real Data)
# missing_tbl_wide <- missing_tbl_wide %>%
#   mutate(
#     `Parametric MICE Match` = ifelse(`Parametric MICE` == `Real Data`, "✅", "❌"),
#     `CART Imputation Match` = ifelse(`CART Imputation` == `Real Data`, "✅", "❌"),
#     `Synthpop (Low Fidelity) Match` = ifelse(`Synthpop (Low Fidelity)` == `Real Data`, "✅", "❌"),
#     `Metadata-Based Match` = ifelse(`Metadata-Based` == `Real Data`, "✅", "❌")
#   )

kable(
  missing_tbl_wide,
  caption = "Missingness Proportions by Variable (rounded to 2 d.p.)"
)

Missingness Proportions by Variable (rounded to 2 d.p.)
Variable	Real Data	Parametric MICE	CART Imputation	Synthpop (Low Fidelity)	Metadata-Based
age	0.05	0.05	0.05	0.09	0.05
anaemia	0.00	0.00	0.00	0.00	0.00
creatinine_phosphokinase	0.05	0.05	0.05	0.10	0.05
deceased	0.00	0.00	0.00	0.00	0.00
diabetes	0.00	0.00	0.00	0.00	0.00
ejection_fraction	0.04	0.04	0.04	0.05	0.04
follow_up	0.00	0.00	0.00	0.00	0.00
hypertension	0.00	0.00	0.00	0.00	0.00
platelets	0.05	0.05	0.05	0.10	0.05
serum_creatinine	0.06	0.06	0.06	0.12	0.06
serum_sodium	0.07	0.07	0.07	0.11	0.07
sex	0.00	0.00	0.00	0.00	0.00
smoking	0.00	0.00	0.00	0.00	0.00

5.4.3 Missingness Maps

Show code

# Generate missingness map for the real dataset
real_missing_map <- aggr(heart_failure, col = c("#B0D99B", "#528AA8"),
                             numbers = TRUE, sortVars = TRUE,
                             labels = names(heart_failure), cex.axis = .7,
                             gap = 3, ylab = c("Missing Data", "Pattern - real Dataset"))


 Variables sorted by number of missings: 
                 Variable      Count
             serum_sodium 0.06688963
         serum_creatinine 0.06020067
                      age 0.05016722
 creatinine_phosphokinase 0.04682274
                platelets 0.04682274
        ejection_fraction 0.04013378
                  anaemia 0.00000000
                 diabetes 0.00000000
                      sex 0.00000000
                  smoking 0.00000000
             hypertension 0.00000000
                 deceased 0.00000000
                follow_up 0.00000000

Show code

# Generate missingness map for Parametric MICE synthetic dataset
parametric_mice_missing_map <- aggr(syn_data_1, col = c("#B0D99B", "#528AA8"),
                                    numbers = TRUE, sortVars = TRUE,
                                    labels = names(syn_data_1), cex.axis = .7,
                                    gap = 3, ylab = c("Missing Data", "Pattern - Parametric MICE Dataset"))


 Variables sorted by number of missings: 
                       Variable      Count
             synth_serum_sodium 0.06688963
         synth_serum_creatinine 0.06020067
                      synth_age 0.05016722
 synth_creatinine_phosphokinase 0.04682274
                synth_platelets 0.04682274
        synth_ejection_fraction 0.04013378
                  synth_anaemia 0.00000000
                 synth_diabetes 0.00000000
                      synth_sex 0.00000000
                  synth_smoking 0.00000000
             synth_hypertension 0.00000000
                 synth_deceased 0.00000000
                synth_follow_up 0.00000000

Show code

# Generate missingness map for CART synthetic dataset
cart_missing_map <- aggr(syn_cart_1, col = c("#B0D99B", "#528AA8"),
                         numbers = TRUE, sortVars = TRUE,
                         labels = names(syn_cart_1), cex.axis = .7,
                         gap = 3, ylab = c("Missing Data", "Pattern - CART Dataset"))


 Variables sorted by number of missings: 
                       Variable      Count
             synth_serum_sodium 0.06688963
         synth_serum_creatinine 0.06020067
                      synth_age 0.05016722
 synth_creatinine_phosphokinase 0.04682274
                synth_platelets 0.04682274
        synth_ejection_fraction 0.04013378
                  synth_anaemia 0.00000000
                 synth_diabetes 0.00000000
                      synth_sex 0.00000000
                  synth_smoking 0.00000000
             synth_hypertension 0.00000000
                 synth_deceased 0.00000000
                synth_follow_up 0.00000000

Show code

# Generate missingness map for Synthpop synthetic dataset
synthpop_missing_map <- aggr(syn_data_low_fidelity_synthpop, col = c("#B0D99B", "#528AA8"),
                             numbers = TRUE, sortVars = TRUE,
                             labels = names(syn_data_low_fidelity_synthpop), cex.axis = .7,
                             gap = 3, ylab = c("Missing Data", "Pattern - Synthpop Dataset"))


 Variables sorted by number of missings: 
                       Variable      Count
         synth_serum_creatinine 0.11705686
             synth_serum_sodium 0.11371237
                synth_platelets 0.10367893
 synth_creatinine_phosphokinase 0.10033445
                      synth_age 0.09364548
        synth_ejection_fraction 0.04682274
                  synth_anaemia 0.00000000
                 synth_diabetes 0.00000000
                      synth_sex 0.00000000
                  synth_smoking 0.00000000
             synth_hypertension 0.00000000
                 synth_deceased 0.00000000
                synth_follow_up 0.00000000

Show code

# Generate missingness map for Metadata-based synthetic dataset
metadata_missing_map <- aggr(syn_data_metadata, col = c("#B0D99B", "#528AA8"),
                             numbers = TRUE, sortVars = TRUE,
                             labels = names(syn_data_metadata), cex.axis = .7,
                             gap = 3, ylab = c("Missing Data", "Pattern - Metadata-Based Dataset"))


 Variables sorted by number of missings: 
                       Variable      Count
             synth_serum_sodium 0.06688963
         synth_serum_creatinine 0.06020067
                      synth_age 0.05016722
 synth_creatinine_phosphokinase 0.04682274
                synth_platelets 0.04682274
        synth_ejection_fraction 0.04013378
                  synth_anaemia 0.00000000
                 synth_diabetes 0.00000000
                      synth_sex 0.00000000
                  synth_smoking 0.00000000
             synth_hypertension 0.00000000
                 synth_deceased 0.00000000
                synth_follow_up 0.00000000

5.5 Correlation Matrices Comparison

In this section, we assess how well the correlations between numeric variables are preserved in the synthetic datasets compared to the real dataset. By comparing the correlation matrices, we evaluate whether the relationships between variables in the real data are reflected in the synthetic versions. This assessment is essential because maintaining the real data’s correlation structure ensures that any statistical or machine learning models trained on synthetic data will behave similarly to those trained on real data.

Correlation Matrix: A correlation matrix quantifies the strength and direction of relationships between pairs of numeric variables. Each cell in the matrix represents the correlation coefficient between two variables, which can range from -1 (perfect negative correlation) to +1 (perfect positive correlation). It is crucial that synthetic data retains these relationships to ensure that any analyses based on these dependencies remain valid.
Comparison Process: The correlation matrices for both the real and synthetic datasets are computed and Visualised. Ideally, the structure and strength of correlations in the synthetic datasets should closely match those in the real dataset. Differences in correlation strength or direction indicate that the synthetic data may not fully capture the underlying relationships, which can affect the validity of any conclusions drawn from analyses using the synthetic data.

Maintaining similar correlation structures is particularly important for use cases where relationships between variables play a crucial role, such as in predictive modeling, feature selection, and understanding variable dependencies. The following sections outline different methods used to compare correlation matrices between the real and synthetic datasets.

5.5.1 Correlation Matrices Comparison using psych

The psych package provides tools for analyzing and visualizing correlation matrices. The corr.test() function is particularly useful for computing correlation coefficients while handling missing data and providing additional statistics, such as confidence intervals.

Show code

# Strip 'synth_' so names line up with real
strip_synth_prefix <- function(df) {
  names(df) <- sub("^synth_", "", names(df))
  df
}

# Build a tidy table of pairwise correlations for real vs one synthetic dataset
corr_compare_one <- function(real, synth, label) {
  real_num  <- real |> select(where(is.numeric))
  synth_num <- synth |> strip_synth_prefix() |> select(where(is.numeric))

  common <- intersect(names(real_num), names(synth_num))
  if (length(common) < 2) {
    return(tibble(Dataset = label, Var1 = character(), Var2 = character(),
                  Real = numeric(), Synthetic = numeric(), Diff = numeric(), AbsDiff = numeric()))
  }

  # Compute Pearson correlations with pairwise complete observations
  rc <- cor(real_num[common],  use = "pairwise.complete.obs", method = "pearson")
  sc <- cor(synth_num[common], use = "pairwise.complete.obs", method = "pearson")

  # Keep only upper triangle (unique pairs)
  ut <- upper.tri(rc, diag = FALSE)
  pairs <- which(ut, arr.ind = TRUE)
  tibble(
    Var1 = colnames(rc)[pairs[, 1]],
    Var2 = colnames(rc)[pairs[, 2]],
    Real = rc[pairs],
    Synthetic = sc[pairs]
  ) |>
    mutate(
      Dataset = label,
      Diff = Synthetic - Real,
      AbsDiff = abs(Diff)
    ) |>
    select(Dataset, Var1, Var2, Real, Synthetic, Diff, AbsDiff)
}

# Build one combined table for all synthetic datasets
corr_cmp_tbl <- bind_rows(
  corr_compare_one(heart_failure, syn_data_1, "Parametric MICE"),
  corr_compare_one(heart_failure, syn_cart_1, "CART Imputation"),
  corr_compare_one(heart_failure, syn_data_low_fidelity_synthpop, "Synthpop (Low Fidelity)"),
  corr_compare_one(heart_failure, syn_data_metadata, "Metadata-Based")
) |>
  arrange(Dataset, desc(AbsDiff)) |>
  mutate(
    Real = round(Real, 2),
    Synthetic = round(Synthetic, 2),
    Diff = round(Diff, 2),
    AbsDiff = round(AbsDiff, 2)
  )

knitr::kable(
  corr_cmp_tbl,
  caption = "Pairwise Correlations: Real vs. Synthetic (sorted by largest absolute difference per dataset)",
  align = "lllrrrr"
)

Pairwise Correlations: Real vs. Synthetic (sorted by largest absolute difference per dataset)
Dataset	Var1	Var2	Real	Synthetic	Diff	AbsDiff
CART Imputation	creatinine_phosphokinase	platelets	0.01	-0.12	-0.13	0.13
CART Imputation	ejection_fraction	serum_creatinine	-0.01	-0.11	-0.10	0.10
CART Imputation	creatinine_phosphokinase	serum_sodium	0.06	-0.04	-0.10	0.10
CART Imputation	age	follow_up	-0.24	-0.15	0.09	0.09
CART Imputation	ejection_fraction	platelets	0.06	-0.02	-0.08	0.08
CART Imputation	age	serum_creatinine	0.17	0.25	0.08	0.08
CART Imputation	platelets	serum_creatinine	-0.05	0.03	0.08	0.08
CART Imputation	ejection_fraction	follow_up	0.03	-0.05	-0.08	0.08
CART Imputation	serum_sodium	follow_up	0.11	0.03	-0.07	0.07
CART Imputation	age	creatinine_phosphokinase	-0.08	-0.15	-0.07	0.07
CART Imputation	platelets	serum_sodium	0.06	-0.01	-0.07	0.07
CART Imputation	creatinine_phosphokinase	follow_up	-0.03	0.03	0.06	0.06
CART Imputation	platelets	follow_up	-0.01	0.05	0.06	0.06
CART Imputation	age	platelets	-0.05	-0.01	0.04	0.04
CART Imputation	age	serum_sodium	-0.04	-0.08	-0.04	0.04
CART Imputation	age	ejection_fraction	0.04	0.00	-0.04	0.04
CART Imputation	creatinine_phosphokinase	ejection_fraction	-0.05	-0.02	0.03	0.03
CART Imputation	creatinine_phosphokinase	serum_creatinine	0.00	-0.02	-0.02	0.02
CART Imputation	serum_creatinine	serum_sodium	-0.21	-0.23	-0.02	0.02
CART Imputation	ejection_fraction	serum_sodium	0.17	0.18	0.02	0.02
CART Imputation	serum_creatinine	follow_up	-0.15	-0.14	0.01	0.01
Metadata-Based	age	follow_up	-0.24	0.12	0.36	0.36
Metadata-Based	platelets	serum_sodium	0.06	-0.16	-0.23	0.23
Metadata-Based	serum_creatinine	serum_sodium	-0.21	0.01	0.21	0.21
Metadata-Based	creatinine_phosphokinase	ejection_fraction	-0.05	0.11	0.16	0.16
Metadata-Based	ejection_fraction	serum_sodium	0.17	0.03	-0.13	0.13
Metadata-Based	creatinine_phosphokinase	serum_creatinine	0.00	-0.12	-0.13	0.13
Metadata-Based	creatinine_phosphokinase	follow_up	-0.03	0.09	0.12	0.12
Metadata-Based	age	serum_creatinine	0.17	0.07	-0.10	0.10
Metadata-Based	serum_sodium	follow_up	0.11	0.00	-0.10	0.10
Metadata-Based	age	serum_sodium	-0.04	0.06	0.10	0.10
Metadata-Based	serum_creatinine	follow_up	-0.15	-0.05	0.09	0.09
Metadata-Based	creatinine_phosphokinase	platelets	0.01	-0.08	-0.09	0.09
Metadata-Based	ejection_fraction	platelets	0.06	-0.02	-0.08	0.08
Metadata-Based	age	platelets	-0.05	0.02	0.07	0.07
Metadata-Based	age	ejection_fraction	0.04	-0.03	-0.07	0.07
Metadata-Based	ejection_fraction	follow_up	0.03	0.08	0.05	0.05
Metadata-Based	creatinine_phosphokinase	serum_sodium	0.06	0.11	0.05	0.05
Metadata-Based	age	creatinine_phosphokinase	-0.08	-0.05	0.03	0.03
Metadata-Based	ejection_fraction	serum_creatinine	-0.01	0.01	0.02	0.02
Metadata-Based	platelets	follow_up	-0.01	-0.02	-0.01	0.01
Metadata-Based	platelets	serum_creatinine	-0.05	-0.06	-0.01	0.01
Parametric MICE	serum_creatinine	serum_sodium	-0.21	0.02	0.23	0.23
Parametric MICE	ejection_fraction	follow_up	0.03	0.20	0.17	0.17
Parametric MICE	age	follow_up	-0.24	-0.08	0.16	0.16
Parametric MICE	ejection_fraction	serum_sodium	0.17	0.03	-0.13	0.13
Parametric MICE	serum_sodium	follow_up	0.11	-0.01	-0.11	0.11
Parametric MICE	platelets	serum_creatinine	-0.05	0.06	0.11	0.11
Parametric MICE	creatinine_phosphokinase	serum_creatinine	0.00	-0.11	-0.11	0.11
Parametric MICE	age	serum_creatinine	0.17	0.08	-0.09	0.09
Parametric MICE	creatinine_phosphokinase	platelets	0.01	-0.07	-0.08	0.08
Parametric MICE	age	serum_sodium	-0.04	-0.11	-0.07	0.07
Parametric MICE	age	creatinine_phosphokinase	-0.08	-0.02	0.06	0.06
Parametric MICE	creatinine_phosphokinase	follow_up	-0.03	0.03	0.06	0.06
Parametric MICE	platelets	follow_up	-0.01	0.04	0.05	0.05
Parametric MICE	creatinine_phosphokinase	ejection_fraction	-0.05	-0.01	0.04	0.04
Parametric MICE	serum_creatinine	follow_up	-0.15	-0.11	0.04	0.04
Parametric MICE	platelets	serum_sodium	0.06	0.03	-0.03	0.03
Parametric MICE	creatinine_phosphokinase	serum_sodium	0.06	0.03	-0.03	0.03
Parametric MICE	ejection_fraction	serum_creatinine	-0.01	-0.04	-0.03	0.03
Parametric MICE	age	platelets	-0.05	-0.03	0.02	0.02
Parametric MICE	age	ejection_fraction	0.04	0.05	0.01	0.01
Parametric MICE	ejection_fraction	platelets	0.06	0.06	0.00	0.00
Synthpop (Low Fidelity)	age	follow_up	-0.24	0.04	0.28	0.28
Synthpop (Low Fidelity)	serum_creatinine	serum_sodium	-0.21	0.00	0.20	0.20
Synthpop (Low Fidelity)	age	serum_creatinine	0.17	-0.03	-0.20	0.20
Synthpop (Low Fidelity)	ejection_fraction	serum_sodium	0.17	-0.02	-0.19	0.19
Synthpop (Low Fidelity)	age	serum_sodium	-0.04	0.11	0.15	0.15
Synthpop (Low Fidelity)	serum_creatinine	follow_up	-0.15	0.00	0.15	0.15
Synthpop (Low Fidelity)	serum_sodium	follow_up	0.11	-0.03	-0.14	0.14
Synthpop (Low Fidelity)	age	creatinine_phosphokinase	-0.08	0.03	0.12	0.12
Synthpop (Low Fidelity)	creatinine_phosphokinase	ejection_fraction	-0.05	0.05	0.10	0.10
Synthpop (Low Fidelity)	ejection_fraction	platelets	0.06	0.15	0.09	0.09
Synthpop (Low Fidelity)	platelets	follow_up	-0.01	0.08	0.09	0.09
Synthpop (Low Fidelity)	ejection_fraction	follow_up	0.03	0.09	0.06	0.06
Synthpop (Low Fidelity)	creatinine_phosphokinase	platelets	0.01	-0.05	-0.06	0.06
Synthpop (Low Fidelity)	age	ejection_fraction	0.04	-0.02	-0.06	0.06
Synthpop (Low Fidelity)	ejection_fraction	serum_creatinine	-0.01	-0.06	-0.05	0.05
Synthpop (Low Fidelity)	creatinine_phosphokinase	serum_sodium	0.06	0.01	-0.05	0.05
Synthpop (Low Fidelity)	creatinine_phosphokinase	follow_up	-0.03	0.02	0.04	0.04
Synthpop (Low Fidelity)	platelets	serum_creatinine	-0.05	-0.08	-0.03	0.03
Synthpop (Low Fidelity)	creatinine_phosphokinase	serum_creatinine	0.00	-0.02	-0.03	0.03
Synthpop (Low Fidelity)	platelets	serum_sodium	0.06	0.08	0.02	0.02
Synthpop (Low Fidelity)	age	platelets	-0.05	-0.04	0.01	0.01

5.5.2 Correlation Matrices Comparison using Corrplot

The corrplot package is widely used for visualizing correlation matrices. It provides a variety of options for customization, making it easy to identify patterns and relationships between variables. This method is particularly effective for visual comparisons because it highlights differences in correlation strength and direction.

Show code

# Remove 'synth_' so names match the real dataset
strip_synth_prefix <- function(df) { names(df) <- sub("^synth_", "", names(df)); df }

# Side-by-side correlation matrices (full width)
compare_correlation_matrices <- function(real_data, synthetic_data, dataset_name) {
  # Numeric-only
  real_num  <- real_data[sapply(real_data, is.numeric)]
  synth_num <- strip_synth_prefix(synthetic_data)[sapply(strip_synth_prefix(synthetic_data), is.numeric)]

  # Use common numeric variables in same order
  common <- intersect(names(real_num), names(synth_num))
  if (length(common) < 2) {
    message(dataset_name, ": fewer than 2 common numeric columns — skipping.")
    return(invisible(NULL))
  }
  real_corr  <- cor(real_num[common],  use = "complete.obs")
  synth_corr <- cor(synth_num[common], use = "complete.obs")

  # Make two panels across the page
  op <- par(no.readonly = TRUE); on.exit(par(op), add = TRUE)
  par(mfrow = c(1, 2), mar = c(4, 4, 4, 2))

  corrplot(real_corr,  tl.cex = 0.8, mar = c(0,0,2,0))
  title(paste(dataset_name, "- Real Data"), line = 1)

  corrplot(synth_corr, tl.cex = 0.8, mar = c(0,0,2,0))
  title(paste(dataset_name, "- Synthetic Data"), line = 1)
}

# Run for each synthetic dataset
compare_correlation_matrices(heart_failure, syn_data_1, "Parametric MICE")

Show code

compare_correlation_matrices(heart_failure, syn_cart_1, "CART Imputation")

Show code

compare_correlation_matrices(heart_failure, syn_data_low_fidelity_synthpop, "Synthpop (Low Fidelity)")

Show code

compare_correlation_matrices(heart_failure, syn_data_metadata, "Metadata-Based")

5.5.3 Correlation Matrices Comparison using GGally

The GGally package extends ggplot2 to create pairwise visualizations, including correlation matrix plots. Using ggpairs(), you can generate a matrix of scatterplots, histograms, and correlation coefficients, which provides a detailed view of the relationships between numeric variables in the real and synthetic datasets.

Show code

compare_correlation_matrices_ggally <- function(real_data, synthetic_data, dataset_name) {

  # Select only numeric columns from both datasets
  real_numeric <- real_data %>% select(where(is.numeric))

  # Remove prefix 'synth_' from synthetic data column names
  names(synthetic_data) <- gsub("^synth_", "", names(synthetic_data))
  synthetic_numeric <- synthetic_data %>% select(where(is.numeric))

  # Ensure that both datasets have the same columns
  common_columns <- intersect(names(real_numeric), names(synthetic_numeric))
  
  if (length(common_columns) == 0) {
    cat("\nNo common numeric columns found for comparison in", dataset_name, "\n")
    return()
  }

  real_numeric <- real_numeric[, common_columns, drop = FALSE]
  synthetic_numeric <- synthetic_numeric[, common_columns, drop = FALSE]

  # Add a column to differentiate between real and synthetic data
  real_numeric$Dataset <- "Real"
  synthetic_numeric$Dataset <- "Synth"

  # Combine the datasets
  combined_data <- rbind(real_numeric, synthetic_numeric)

  # Create correlation matrix plot using GGally if there are valid columns to plot
  if (ncol(combined_data) > 1) {
    p <- ggpairs(
      combined_data, 
      aes(color = Dataset, alpha = 0.5), 
      title = paste(dataset_name, "- Correlation Matrix Comparison"),
      columns = 1:(ncol(combined_data) - 1),  # Exclude the 'Dataset' column from the plot
      upper = list(continuous = wrap("cor", size = 3, alignPercent = 0.5)),  # Smaller correlation labels
      lower = list(continuous = wrap("smooth", alpha = 0.4, size = 0.5)),  # Smoother plots for readability
      diag = list(continuous = wrap("densityDiag", alpha = 0.5))  # Density plot with improved transparency
    ) +
      theme_minimal() +  # Use a cleaner theme
      theme(
        strip.text = element_text(size = 8, face = "bold"),  # Smaller strip labels
        axis.text.x = element_text(angle = 45, hjust = 1, size = 7),  # Rotate x-axis labels for better readability
        axis.text.y = element_text(size = 7),
        legend.position = "top",  # Move legend to a better location
        legend.title = element_blank()  # Remove unnecessary legend title
      )
    print(p)
  } else {
    cat("\nNot enough numeric columns to generate a correlation matrix plot for", dataset_name, "\n")
  }
}

# Apply the function to each synthetic dataset
compare_correlation_matrices_ggally(heart_failure, syn_data_1, "Parametric MICE")

Show code

compare_correlation_matrices_ggally(heart_failure, syn_cart_1, "CART Imputation")

Show code

compare_correlation_matrices_ggally(heart_failure, syn_data_low_fidelity_synthpop, "Synthpop Low Fidelity")

Show code

compare_correlation_matrices_ggally(heart_failure, syn_data_metadata, "Metadata-Based Synthetic Data")

6 Fidelity Assessment

6.1 Descriptive Statistics

This section provides a comparative analysis of the descriptive statistics between the real dataset and the synthetic datasets generated using different methods.

6.1.1 Real Dataset

The descriptive statistics for the real heart failure dataset include metrics such as mean, median, standard deviation, and range. This serves as the baseline for comparison with the synthetic datasets.

Show code

# Generate descriptive stats
desc_tbl <- describe(heart_failure) %>%
  as.data.frame() %>%
  select(n, mean, sd, min, median, max) %>%
  round(2) %>%
  tibble::rownames_to_column("Variable")

# Display as a neat table
kable(
  desc_tbl,
  caption = "Descriptive Statistics for the Heart Failure Dataset",
  align = "lrrrrrr"
)

Descriptive Statistics for the Heart Failure Dataset
Variable	n	mean	sd	min	median	max
age	284	60.80	12.08	40.0	60.0	95.0
anaemia*	299	1.43	0.50	1.0	1.0	2.0
creatinine_phosphokinase	285	578.81	981.99	23.0	249.0	7861.0
diabetes*	299	1.42	0.49	1.0	1.0	2.0
ejection_fraction	287	38.00	11.92	14.0	38.0	80.0
platelets	285	265163.83	97508.63	25100.0	263358.0	850000.0
serum_creatinine	281	1.39	1.05	0.5	1.1	9.4
serum_sodium	279	136.59	4.49	113.0	137.0	148.0
sex*	299	1.65	0.48	1.0	2.0	2.0
smoking*	299	1.32	0.47	1.0	1.0	2.0
hypertension*	299	1.35	0.48	1.0	1.0	2.0
deceased*	299	1.32	0.47	1.0	1.0	2.0
follow_up	299	130.26	77.61	4.0	115.0	285.0

6.1.2 Parametric Imputation (MICE)

The synthetic data generated using the MICE framework is analyzed to compare its summary statistics with the real dataset. This will help evaluate how well the parametric method captures the key characteristics of the data.

Show code

# Descriptive stats for Parametric MICE synthetic dataset
desc_mice <- describe(syn_data_1) %>%
  as.data.frame() %>%
  select(n, mean, sd, min, median, max) %>%
  round(2) %>%
  tibble::rownames_to_column("Variable")

# Display table
kable(
  desc_mice,
  caption = "Descriptive Statistics for Parametric MICE Synthetic Dataset",
  align = "lrrrrrr"
)

Descriptive Statistics for Parametric MICE Synthetic Dataset
Variable	n	mean	sd	min	median	max
synth_age	284	61.01	11.40	42	60	82
synth_anaemia*	299	1.45	0.50	1	1	2
synth_creatinine_phosphokinase	285	735.72	687.98	59	540	2221
synth_diabetes*	299	1.42	0.49	1	1	2
synth_ejection_fraction	287	38.44	10.70	20	38	60
synth_platelets	285	257523.78	86505.14	132200	254087	421200
synth_serum_creatinine	281	1.70	0.77	1	2	3
synth_serum_sodium	279	135.78	3.96	130	135	144
synth_sex*	299	1.64	0.48	1	2	2
synth_smoking*	299	1.29	0.45	1	1	2
synth_hypertension*	299	1.42	0.49	1	1	2
synth_deceased*	299	1.31	0.47	1	1	2
synth_follow_up	299	132.74	71.11	13	132	250

6.1.3 Non-Parametric Imputation (CART)

The CART method is a non-parametric technique that handles non-linear relationships. The descriptive statistics of the CART-imputed synthetic data provide insight into how this method captures the dataset’s distribution.

Show code

# Descriptive stats for CART synthetic dataset
desc_cart <- describe(syn_cart_1) %>%
  as.data.frame() %>%
  select(n, mean, sd, min, median, max) %>%
  round(2) %>%
  tibble::rownames_to_column("Variable")

# Display table
kable(
  desc_cart,
  caption = "Descriptive Statistics for CART Synthetic Dataset",
  align = "lrrrrrr"
)

Descriptive Statistics for CART Synthetic Dataset
Variable	n	mean	sd	min	median	max
synth_age	284	60.81	11.88	42	60	82
synth_anaemia*	299	1.42	0.49	1	1	2
synth_creatinine_phosphokinase	285	469.20	547.58	59	231	2221
synth_diabetes*	299	1.46	0.50	1	1	2
synth_ejection_fraction	287	38.56	11.85	20	35	60
synth_platelets	285	265126.32	76944.75	132200	263358	421200
synth_serum_creatinine	281	1.28	0.56	1	1	3
synth_serum_sodium	279	136.79	3.67	130	137	144
synth_sex*	299	1.67	0.47	1	2	2
synth_smoking*	299	1.37	0.48	1	1	2
synth_hypertension*	299	1.38	0.49	1	1	2
synth_deceased*	299	1.33	0.47	1	1	2
synth_follow_up	299	132.53	75.17	13	117	250

6.1.4 Synthpop-Based Synthetic Data

The Synthpop package generates synthetic data aimed at preserving data privacy. Descriptive statistics for this dataset help assess how well the key attributes of the real data are preserved.

Show code

# Descriptive stats for Synthpop synthetic dataset
desc_synthpop <- describe(syn_data_low_fidelity_synthpop) %>%
  as.data.frame() %>%
  select(n, mean, sd, min, median, max) %>%
  round(2) %>%
  tibble::rownames_to_column("Variable")

# Display as a clean table
kable(
  desc_synthpop,
  caption = "Descriptive Statistics for Synthpop Synthetic Dataset",
  align = "lrrrrrr"
)

Descriptive Statistics for Synthpop Synthetic Dataset
Variable	n	mean	sd	min	median	max
synth_age	271	61.15	11.13	42	60	82
synth_anaemia*	299	1.47	0.50	1	1	2
synth_creatinine_phosphokinase	269	435.35	498.44	59	232	2221
synth_diabetes*	299	1.37	0.48	1	1	2
synth_ejection_fraction	285	36.73	10.93	20	35	60
synth_platelets	268	269756.34	77738.41	132200	263358	421200
synth_serum_creatinine	264	1.33	0.63	1	1	3
synth_serum_sodium	265	136.92	3.69	130	137	144
synth_sex*	299	1.65	0.48	1	2	2
synth_smoking*	299	1.28	0.45	1	1	2
synth_hypertension*	299	1.37	0.48	1	1	2
synth_deceased*	299	1.31	0.46	1	1	2
synth_follow_up	299	121.22	74.28	13	108	250

6.1.5 Metadata-Based Synthetic Data

The synthetic data generated using metadata ensures that variable structures conform to the data dictionary. The summary statistics here show how well the dataset reflects the real’s attributes based on predefined metadata.

Show code

# Descriptive stats for Metadata-Based synthetic dataset
desc_metadata <- describe(syn_data_metadata) %>%
  as.data.frame() %>%
  select(n, mean, sd, min, median, max) %>%
  round(2) %>%
  tibble::rownames_to_column("Variable")

# Display as a clean table
kable(
  desc_metadata,
  caption = "Descriptive Statistics for Metadata-Based Synthetic Dataset",
  align = "lrrrrrr"
)

Descriptive Statistics for Metadata-Based Synthetic Dataset
Variable	n	mean	sd	min	median	max
synth_age	284	62.57	11.98	42	62.5	82
synth_anaemia*	299	1.52	0.50	1	2.0	2
synth_creatinine_phosphokinase	285	1129.11	594.90	67	1185.0	2216
synth_diabetes*	299	1.48	0.50	1	1.0	2
synth_ejection_fraction	287	39.81	12.08	20	38.0	60
synth_platelets	285	275686.74	79876.34	134054	280442.0	421118
synth_serum_creatinine	281	1.79	0.71	1	2.0	3
synth_serum_sodium	279	136.69	4.20	130	137.0	144
synth_sex*	299	1.49	0.50	1	1.0	2
synth_smoking*	299	1.51	0.50	1	2.0	2
synth_hypertension*	299	1.51	0.50	1	2.0	2
synth_deceased*	299	1.47	0.50	1	1.0	2
synth_follow_up	299	136.77	70.26	13	139.0	249

By comparing the descriptive statistics, you can evaluate how closely the synthetic datasets resemble the real dataset across key summary metrics. This is a critical step in assessing the quality and usability of synthetic data. Histogram Similarity Score ### Variable Exploration

6.1.6 Parametric Imputation (MICE)

The parametric imputation using the MICE framework generates synthetic data by imputing missing values based on multivariate normal distributions.

6.1.6.1 Bar Plots for Categorical Variables

To evaluate the preservation of categorical variable distributions in the synthetic data, we generate bar plots comparing the proportions of categorical variables between the real and synthetic datasets.

Show code

# Combine real and parametric MICE synthetic datasets
heart_failure$dataset <- "real"
syn_data_1$dataset <- "synthetic"

# Adjust the column names for synthetic data by removing the 'synth_' prefix
colnames(syn_data_1) <- gsub("synth_", "", colnames(syn_data_1))

# Combine both datasets (real and synthetic)
combined_data_param <- bind_rows(heart_failure, syn_data_1)

# Bar plots for categorical variables
bar_plots <- colnames(heart_failure)[map_lgl(heart_failure, is.factor)] %>%
  map(~ ggplot(combined_data_param, aes_string(.x, fill = 'dataset', group = 'dataset')) +
        geom_bar(aes(y = ..prop..), position = position_dodge2(), stat = "count") +
        scale_fill_manual(values = c("real" = "#B0D99B", "synthetic" = "#528AA8")) +
        labs(x = .x, y = "Proportion")) %>%
  patchwork::wrap_plots() +
  plot_annotation(title = "Comparison of Categorical Variables Between real and Synthetic Datasets",
                  theme = theme(plot.title = element_text(hjust = 0.5)))

# Print the combined plot with one main title
print(bar_plots)

6.1.6.2 Density Plots for Numeric Variables

The density plots illustrate the distribution of numeric variables between the real dataset and the parametric MICE synthetic data. The goal is to ensure that the synthetic data closely follows the real data’s numeric distribution.

Show code

# Density plots for numeric variables (parametric imputed data)
density_plots_param <- colnames(heart_failure)[map_lgl(heart_failure, is.numeric)] %>%
  map(~ ggplot(combined_data_param, aes_string(.x, fill = 'dataset', group = 'dataset')) +
        geom_density(alpha = 0.5) +
        scale_fill_manual(values = c("real" = "#B0D99B", "synthetic" = "#528AA8")) +
        labs(x = .x, y = "Density")) %>%
  patchwork::wrap_plots(ncol = 2) +  # Arrange plots in 2 columns
  plot_annotation(title = "Comparison of Numeric Variables Between real and Parametric Synthetic Data",
                  theme = theme(plot.title = element_text(hjust = 0.5)))

# Print the combined density plots with one main title
print(density_plots_param)

6.1.7 Non-Parametric Imputation (CART)

CART is a non-parametric approach for generating synthetic data that captures complex relationships between variables without making distributional assumptions.

6.1.7.1 Bar Plots for Categorical Variables

The bar plots compare the distribution of categorical variables in the real and synthetic datasets generated using the CART method.

Show code

# Combine real and non-parametric CART synthetic datasets
syn_cart_1$dataset <- "synthetic"

# Adjust the column names for synthetic data by removing the 'synth_' prefix
colnames(syn_cart_1) <- gsub("synth_", "", colnames(syn_cart_1))

# Combine both datasets (real and synthetic)
combined_data_cart <- bind_rows(heart_failure, syn_cart_1)

# Bar plots for categorical variables (CART-imputed data)
bar_plots_cart <- colnames(heart_failure)[map_lgl(heart_failure, is.factor)] %>%
  map(~ ggplot(combined_data_cart, aes_string(.x, fill = 'dataset', group = 'dataset')) +
        geom_bar(aes(y = ..prop..), position = position_dodge2(), stat = "count") +
        scale_fill_manual(values = c("real" = "#B0D99B", "synthetic" = "#528AA8")) +
        labs(x = .x, y = "Proportion")) %>%
  patchwork::wrap_plots() +
  plot_annotation(title = "Comparison of Categorical Variables Between real and Synthetic Data",
                  theme = theme(plot.title = element_text(hjust = 0.5)))

# Print the combined plot with one main title
print(bar_plots_cart)

6.1.7.2 Density Plots for Numeric Variables

The density plots display how well the numeric variables in the CART-imputed synthetic data align with the real dataset.

Show code

# Density plots for numeric variables (CART-imputed data)
density_plots_cart <- colnames(heart_failure)[map_lgl(heart_failure, is.numeric)] %>%
  map(~ ggplot(combined_data_cart, aes_string(.x, fill = 'dataset', group = 'dataset')) +
        geom_density(alpha = 0.5) +
        scale_fill_manual(values = c("real" = "#B0D99B", "synthetic" = "#528AA8")) +
        labs(x = .x, y = "Density")) %>%
  patchwork::wrap_plots(ncol = 2) +  # Arrange plots in 2 columns
  plot_annotation(title = "Comparison of Numeric Variables Between real and Synthetic Data",
                  theme = theme(plot.title = element_text(hjust = 0.5)))

# Print the combined density plots with one main title
print(density_plots_cart)

6.1.8 Synthpop-Based Synthetic Data

Synthpop is used to generate synthetic datasets that closely follow the distributions of the real data for privacy-preserving data sharing.

6.1.8.1 Bar Plots for Categorical Variables

These bar plots assess how well Synthpop-based synthetic data maintains the categorical distributions of the real dataset.

Show code

# Combine real and Synthpop-based synthetic datasets
heart_failure$dataset <- "real"
syn_data_low_fidelity_synthpop$dataset <- "synthetic"

# Adjust the column names for synthetic data by removing the 'synth_' prefix
colnames(syn_data_low_fidelity_synthpop) <- gsub("synth_", "", colnames(syn_data_low_fidelity_synthpop))

# Combine both datasets (real and synthetic)
combined_data_synthpop <- bind_rows(heart_failure, syn_data_low_fidelity_synthpop)

# Bar plots for categorical variables (Synthpop-based synthetic data)
bar_plots_synthpop <- colnames(heart_failure)[map_lgl(heart_failure, is.factor)] %>%
  map(~ ggplot(combined_data_synthpop, aes_string(.x, fill = 'dataset', group = 'dataset')) +
        geom_bar(aes(y = ..prop..), position = position_dodge2(), stat = "count") +
        scale_fill_manual(values = c("real" = "#B0D99B", "synthetic" = "#528AA8")) +
        labs(x = .x, y = "Proportion")) %>%
  patchwork::wrap_plots() +
  plot_annotation(title = "Comparison of Categorical Variables Between real and Synthetic Data",
                  theme = theme(plot.title = element_text(hjust = 0.5)))

# Print the combined plot with one main title
print(bar_plots_synthpop)

6.1.8.2 Density Plots for Numeric Variables

The density plots illustrate how closely Synthpop-based synthetic data matches the real dataset for numeric variables.

Show code

# Combine real and Synthpop-based synthetic datasets
heart_failure$dataset <- "real"
syn_data_low_fidelity_synthpop$dataset <- "synthetic"
combined_data_synthpop <- bind_rows(heart_failure, syn_data_low_fidelity_synthpop)

# Density plots for numeric variables (Synthpop-based synthetic data)
density_plots_synthpop <- colnames(heart_failure)[map_lgl(heart_failure, is.numeric)] %>%
  map(~ ggplot(combined_data_synthpop, aes_string(.x, fill = 'dataset', group = 'dataset')) +
        geom_density(alpha = 0.5) +
        scale_fill_manual(values = c("real" = "#B0D99B", "synthetic" = "#528AA8")) +
        labs(x = .x, y = "Density")) %>%
  patchwork::wrap_plots(ncol = 2) +  # Arrange plots in 2 columns
  plot_annotation(title = "Comparison of Numeric Variables Between real and Synthetic Data",
                  theme = theme(plot.title = element_text(hjust = 0.5)))

# Print the combined density plots with one main title
print(density_plots_synthpop)

6.1.9 Metadata-Based Synthetic Data

Metadata-based synthetic data generation uses a predefined data dictionary to ensure that the synthetic data follows the correct structure and distribution.

6.1.9.1 Bar Plots for Categorical Variables

We compare the categorical variables of the real and metadata-based synthetic datasets to assess distributional alignment.

Show code

# Combine real and Metadata-based synthetic datasets
heart_failure$dataset <- "real"
syn_data_metadata$dataset <- "synthetic"

# Adjust the column names for synthetic data by removing the 'synth_' prefix
colnames(syn_data_metadata) <- gsub("synth_", "", colnames(syn_data_metadata))

# Combine both datasets (real and synthetic)
combined_data_metadata <- bind_rows(heart_failure, syn_data_metadata)

# Bar plots for categorical variables (Metadata-based synthetic data)
bar_plots_metadata <- colnames(heart_failure)[map_lgl(heart_failure, is.factor)] %>%
  map(~ ggplot(combined_data_metadata, aes_string(.x, fill = 'dataset', group = 'dataset')) +
        geom_bar(aes(y = ..prop..), position = position_dodge2(), stat = "count") +
        scale_fill_manual(values = c("real" = "#B0D99B", "synthetic" = "#528AA8")) +
        labs(x = .x, y = "Proportion")) %>%
  patchwork::wrap_plots() +
  plot_annotation(title = "Comparison of Categorical Variables Between real and Synthetic Datasets",
                  theme = theme(plot.title = element_text(hjust = 0.5)))

# Print the combined plot with one main title
print(bar_plots_metadata)

6.1.9.2 Density Plots for Numeric Variables

The density plots for numeric variables compare the distribution of numeric values between the real and metadata-based synthetic datasets.

Show code

# Combine real and Metadata-based synthetic datasets
heart_failure$dataset <- "real"
syn_data_metadata$dataset <- "synthetic"
combined_data_metadata <- bind_rows(heart_failure, syn_data_metadata)

# Density plots for numeric variables (Metadata-based synthetic data)
density_plots_metadata <- colnames(heart_failure)[map_lgl(heart_failure, is.numeric)] %>%
  map(~ ggplot(combined_data_metadata, aes_string(.x, fill = 'dataset', group = 'dataset')) +
        geom_density(alpha = 0.5) +
        scale_fill_manual(values = c("real" = "#B0D99B", "synthetic" = "#528AA8")) +
        labs(x = .x, y = "Density")) %>%
  patchwork::wrap_plots(ncol = 2) +  # Arrange plots in 2 columns
  plot_annotation(title = "Comparison of Numeric Variables Between real and Synthetic Data",
                  theme = theme(plot.title = element_text(hjust = 0.5)))

# Print the combined density plots with one main title
print(density_plots_metadata)

6.2 Histogram Similarity Score

The Histogram Similarity Score measures how closely the marginal distributions of the synthetic data match the real data. Marginal distributions are critical for assessing whether key data characteristics such as spread, central tendency (mean/median), and shape (skewness/kurtosis) are preserved.

Ideal Ranges for Histogram Similarity:

Perfect Match: 1 (100%) indicates a perfect match between the distributions.
Good Match: 0.85–1 (85% to 100%) shows strong similarity.
Moderate Match: 0.65–0.85 suggests some differences that may need addressing.
Poor Match: Below 0.65 indicates significant deviation from the real data.

Show code

# Custom palette for datasets
dataset_palette <- c(
  "Parametric MICE"       = "#B0D99B",
  "CART Imputation"       = "#528AA8",
  "Synthpop (Low Fidelity)" = "#FFB6DB",
  "Metadata-Based"        = "#264653"
)

# Define function once
calculate_histogram_similarity <- function(real_data, synthetic_data) {
  real_numeric <- real_data[sapply(real_data, is.numeric)]
  synthetic_numeric <- synthetic_data[sapply(synthetic_data, is.numeric)]
  
  similarity_scores <- sapply(names(real_numeric), function(var) {
    real_hist <- hist(real_numeric[[var]], plot = FALSE)
    synthetic_hist <- hist(synthetic_numeric[[var]], plot = FALSE)
    
    # Wasserstein distance between histogram midpoints
    wasserstein_distance <- transport::wasserstein1d(real_hist$mids, synthetic_hist$mids)
    1 / (1 + wasserstein_distance)  # similarity score
  })
  
  mean(similarity_scores)
}

# Collect results in a tibble (convert to percentages)
hist_sim_tbl <- tibble::tibble(
  Dataset = c("Parametric MICE", "CART Imputation", "Synthpop (Low Fidelity)", "Metadata-Based"),
  Histogram_Similarity = c(
    calculate_histogram_similarity(heart_failure, syn_data_1),
    calculate_histogram_similarity(heart_failure, syn_cart_1),
    calculate_histogram_similarity(heart_failure, syn_data_low_fidelity_synthpop),
    calculate_histogram_similarity(heart_failure, syn_data_metadata)
  )
) %>%
  mutate(Histogram_Similarity = round(Histogram_Similarity * 100, 1))  # percentage

# Display as table
kable(
  hist_sim_tbl,
  caption = "Histogram Similarity Scores (%): higher = more similar to real data",
  align = "lr"
)

Histogram Similarity Scores (%): higher = more similar to real data
Dataset	Histogram_Similarity
Parametric MICE	9.5
CART Imputation	9.5
Synthpop (Low Fidelity)	9.5
Metadata-Based	9.5

Show code

# Horizontal bar chart with custom colours
ggplot(hist_sim_tbl, aes(y = Dataset, x = Histogram_Similarity, fill = Dataset)) +
  geom_col(alpha = 0.9, show.legend = FALSE) +
  geom_text(aes(label = paste0(Histogram_Similarity, "%")), 
            hjust = -0.1, size = 3.3) +
  scale_fill_manual(values = dataset_palette) +
  scale_x_continuous(labels = scales::percent_format(scale = 1), limits = c(0, 100)) +
  labs(
    title = "Histogram Similarity by Synthetic Method",
    subtitle = "Based on Wasserstein distance of variable histograms",
    x = "Similarity (%)", y = NULL
  ) +
  theme_minimal() +
  theme(axis.text.y = element_text(size = 10))

6.3 Mutual Information Score

The Mutual Information (MI) Score measures the shared information between pairs of features in the real and synthetic datasets, assessing how well relationships between variables are preserved.

Ideal Ranges for Mutual Information:

Perfect Alignment: MI scores for synthetic data should ideally match the real within 10-20% for critical feature pairs.
Lower MI: Consistently lower MI scores suggest that noise has been added, weakening relationships.
Overfitting Risk: Higher MI scores than the real may indicate overfitting, increasing re-identification risk.

Show code

# Custom palette for datasets
dataset_palette <- c(
  "Parametric MICE"         = "#B0D99B",
  "CART Imputation"         = "#528AA8",
  "Synthpop (Low Fidelity)" = "#FFB6DB",
  "Metadata-Based"          = "#264653"
)

# ------- Helpers -------
# Strip 'synth_' so variable names align with the real dataset
strip_synth_prefix <- function(df) {
  names(df) <- sub("^synth_", "", names(df))
  df
}

# Compute pairwise MI matrix for a data.frame of discretized columns
compute_mi_matrix <- function(df_disc) {
  p <- ncol(df_disc)
  if (p < 2) return(matrix(numeric(0), nrow = p, ncol = p,
                           dimnames = list(colnames(df_disc), colnames(df_disc))))
  M <- matrix(0, nrow = p, ncol = p,
              dimnames = list(colnames(df_disc), colnames(df_disc)))
  for (i in seq_len(p)) {
    for (j in i:p) {
      # safe MI (skip NA rows; zero if not enough variation)
      ok <- is.finite(df_disc[[i]]) & is.finite(df_disc[[j]])
      xi <- df_disc[[i]][ok]; xj <- df_disc[[j]][ok]
      mi <- if (length(unique(xi)) < 2 || length(unique(xj)) < 2) 0 else
        suppressWarnings(infotheo::mutinformation(xi, xj))
      M[i, j] <- mi
      M[j, i] <- mi
    }
  }
  M
}

# Compare two MI matrices (upper triangle only) -> 0–1 similarity score (higher = closer)
mi_similarity_score <- function(M_real, M_synth) {
  if (length(M_real) == 0 || length(M_synth) == 0) return(NA_real_)
  common <- intersect(colnames(M_real), colnames(M_synth))
  if (length(common) < 2) return(NA_real_)
  R <- M_real[common, common, drop = FALSE]
  S <- M_synth[common, common, drop = FALSE]
  ut <- upper.tri(R, diag = FALSE)
  mean_abs_diff <- mean(abs(R[ut] - S[ut]))
  1 / (1 + mean_abs_diff)
}

# Discretize numeric columns (equal-frequency bins); keep only shared numeric vars
Mutual_Information <- function(real_df, synth_df, nbins = 10) {
  real_num  <- dplyr::select(real_df,  where(is.numeric))
  synth_num <- strip_synth_prefix(synth_df) |> dplyr::select(where(is.numeric))

  common <- intersect(names(real_num), names(synth_num))
  if (length(common) < 2) return(NA_real_)

  real_num  <- real_num[common]
  synth_num <- synth_num[common]

  # Discretize numerics (returns integer codes)
  real_disc  <- as.data.frame(lapply(real_num,  \(x) infotheo::discretize(x, disc = "equalfreq", nbins = nbins)))
  synth_disc <- as.data.frame(lapply(synth_num, \(x) infotheo::discretize(x, disc = "equalfreq", nbins = nbins)))

  # MI matrices and similarity score
  M_real  <- compute_mi_matrix(real_disc)
  M_synth <- compute_mi_matrix(synth_disc)
  mi_similarity_score(M_real, M_synth)
}

# ------- Compute scores and present nicely (percentages) -------
mi_results <- tibble::tibble(
  Dataset = c("Parametric MICE", "CART Imputation", "Synthpop (Low Fidelity)", "Metadata-Based"),
  Mutual_Information = c(
    Mutual_Information(heart_failure, syn_data_1),
    Mutual_Information(heart_failure, syn_cart_1),
    Mutual_Information(heart_failure, syn_data_low_fidelity_synthpop),
    Mutual_Information(heart_failure, syn_data_metadata)
  )
) |>
  mutate(Mutual_Information = round(Mutual_Information * 100, 1))  # convert to %

# User-friendly table
knitr::kable(
  mi_results,
  caption = "Mutual Information Similarity (%): higher = synthetic preserves dependency structure more closely",
  align = "lr"
)

Mutual Information Similarity (%): higher = synthetic preserves dependency structure more closely
Dataset	Mutual_Information
Parametric MICE	93.3
CART Imputation	94.3
Synthpop (Low Fidelity)	91.5
Metadata-Based	93.2

Show code

# Horizontal bar chart with custom colours
ggplot(mi_results, aes(y = Dataset, x = Mutual_Information, fill = Dataset)) +
  geom_col(alpha = 0.9, show.legend = FALSE) +
  geom_text(aes(label = paste0(Mutual_Information, "%")), 
            hjust = -0.1, size = 3.3) +
  scale_fill_manual(values = dataset_palette) +
  scale_x_continuous(labels = scales::percent_format(scale = 1), limits = c(0, 100)) +
  labs(
    title = "Mutual Information Similarity by Synthetic Method",
    subtitle = "Based on pairwise MI matrices of discretized numeric variables (equal-frequency bins)",
    x = "Similarity (%)", y = NULL
  ) +
  theme_minimal() +
  theme(axis.text.y = element_text(size = 10))

6.4 Correlation Score

The Correlation Score measures how well the relationships (correlations) between variables are preserved in the synthetic dataset compared to the real.

Ideal Ranges for Correlation Score:

Perfect Alignment: Correlations in the synthetic data should match the real closely, with deviations of 5-10% being acceptable.
Weaker Correlations: If synthetic correlations are weaker, it may indicate loss of fidelity due to noise.
Stronger Correlations: Stronger correlations suggest overfitting, which may increase privacy risks.

Show code

# Custom palette for datasets
dataset_palette <- c(
  "Parametric MICE"         = "#B0D99B",
  "CART Imputation"         = "#528AA8",
  "Synthpop (Low Fidelity)" = "#FFB6DB",
  "Metadata-Based"          = "#264653"
)

# --- Strip 'synth_' so names align
strip_synth_prefix <- function(df) {
  names(df) <- sub("^synth_", "", names(df))
  df
}

# --- Function to calculate Correlation Similarity Score
calculate_correlation_score <- function(real_data, synthetic_data) {
  # Keep numeric columns only
  real_num  <- real_data[sapply(real_data, is.numeric)]
  synth_num <- strip_synth_prefix(synthetic_data)[sapply(strip_synth_prefix(synthetic_data), is.numeric)]
  
  # Use only common variables
  common <- intersect(names(real_num), names(synth_num))
  if (length(common) < 2) return(NA_real_)
  
  real_corr  <- cor(real_num[common],  use = "complete.obs")
  synth_corr <- cor(synth_num[common], use = "complete.obs")
  
  # Frobenius norm difference → similarity score in [0,1]
  corr_diff <- norm(real_corr - synth_corr, type = "F")
  1 / (1 + corr_diff)
}

# --- Compute scores for all synthetic datasets
corr_scores <- tibble::tibble(
  Dataset = c("Parametric MICE", "CART Imputation", "Synthpop (Low Fidelity)", "Metadata-Based"),
  Correlation_Score = c(
    calculate_correlation_score(heart_failure, syn_data_1),
    calculate_correlation_score(heart_failure, syn_cart_1),
    calculate_correlation_score(heart_failure, syn_data_low_fidelity_synthpop),
    calculate_correlation_score(heart_failure, syn_data_metadata)
  )
) %>%
  mutate(Correlation_Score = round(Correlation_Score * 100, 1))   # convert to %

# --- User-friendly table
kable(
  corr_scores,
  caption = "Correlation Similarity Scores (%): higher = synthetic preserves correlation structure more closely",
  align = "lr"
)

Correlation Similarity Scores (%): higher = synthetic preserves correlation structure more closely
Dataset	Correlation_Score
Parametric MICE	58.3
CART Imputation	67.2
Synthpop (Low Fidelity)	53.4
Metadata-Based	55.5

Show code

# --- Horizontal bar chart with custom colours
ggplot(corr_scores, aes(y = Dataset, x = Correlation_Score, fill = Dataset)) +
  geom_col(alpha = 0.9, show.legend = FALSE) +
  geom_text(aes(label = paste0(Correlation_Score, "%")), 
            hjust = -0.1, size = 3.3) +
  scale_fill_manual(values = dataset_palette) +
  scale_x_continuous(labels = scales::percent_format(scale = 1), limits = c(0, 100)) +
  labs(
    title = "Correlation Similarity by Synthetic Method",
    subtitle = "Based on Frobenius norm difference between real and synthetic correlation matrices",
    x = "Similarity (%)", y = NULL
  ) +
  theme_minimal() +
  theme(axis.text.y = element_text(size = 10))

7 Utility Assessment

7.1 Feature Importance Consistency Assessment

The Feature Importance Consistency Assessment evaluates how well the synthetic datasets retain the predictive significance of features compared to the real dataset. This assessment is essential for ensuring that the synthetic data accurately reflects the underlying relationships that contribute to the model’s predictive performance.

Feature importance measures each variable’s contribution to model predictions. Consistency in feature importance rankings between the real and synthetic datasets serves as a crucial indicator of data quality. Significant discrepancies may indicate deficiencies in the synthetic data generation process, potentially affecting the performance of models trained on synthetic data.

Ideal Ranges for Feature Importance Consistency Assessment:

Perfect Match: Feature importance scores in the synthetic dataset should closely align with those in the real dataset, particularly for highly predictive features.
Acceptable Deviation: A deviation of 10-15% is generally acceptable for most features. For variables identified as highly influential, stricter thresholds may be needed to maintain predictive accuracy.

Types of Deviations

Underrepresentation: If an important feature has lower importance in the synthetic dataset, it may suggest that critical predictive relationships are not well-preserved.
Overrepresentation: If a feature’s importance is higher in the synthetic data, this could indicate noise or overfitting, potentially distorting the model’s perception of feature relationships.

The following visualisation techniques will be used to compare feature importance:

Bar Plots for Feature Importance Comparison: Bar plots will compare feature importance scores between the real and synthetic datasets. This visual comparison can reveal significant deviations.
SHAP Summary Plots: SHAP (SHapley Additive exPlanations) summary plots provide a comprehensive view of the distribution of SHAP values for each feature, allowing for a direct comparison between the real and synthetic datasets.

Show code

# Prepare the data for XGBoost
set.seed(123)  # For reproducibility
train_indices <- sample(seq_len(nrow(heart_failure)), size = 0.7 * nrow(heart_failure))
train_real <- heart_failure[train_indices, ]
test_real <- heart_failure[-train_indices, ]

train_real_matrix <- model.matrix(deceased ~ age + sex + anaemia + creatinine_phosphokinase + diabetes + ejection_fraction + platelets + serum_creatinine + serum_sodium + smoking + hypertension + follow_up - 1, data = train_real)
train_real_label <- train_real[rownames(train_real_matrix), "deceased"]
train_real_label <- as.numeric(train_real_label) - 1  # Convert factor to 0 and 1

train_synth <- syn_data_1[train_indices, ]
train_synth_matrix <- model.matrix(deceased ~ age + sex + anaemia + creatinine_phosphokinase + diabetes + ejection_fraction + platelets + serum_creatinine + serum_sodium + smoking + hypertension + follow_up - 1, data = train_synth)
train_synth_label <- train_synth[rownames(train_synth_matrix), "deceased"]
train_synth_label <- as.numeric(train_synth_label) - 1  # Convert factor to 0 and 1

# Convert to DMatrix format, which is required for XGBoost
dtrain_real <- xgb.DMatrix(data = train_real_matrix, label = train_real_label)
dtrain_synth <- xgb.DMatrix(data = train_synth_matrix, label = train_synth_label)

# Set up parameters for XGBoost
params <- list(
  objective = "binary:logistic",
  eval_metric = "auc",  # Use AUC as an evaluation metric
  max_depth = 3,
  eta = 0.1
)

# Tune hyperparameters using cross-validation (with parallel processing)
cv_real <- xgb.cv(
  params = params,
  data = dtrain_real,
  nrounds = 100,
  nfold = 5,
  verbose = 0,
  early_stopping_rounds = 10,
  nthread = num_cores  # Number of threads for parallel processing
)

best_nrounds_real <- cv_real$best_iteration

cv_synth <- xgb.cv(
  params = params,
  data = dtrain_synth,
  nrounds = 100,
  nfold = 5,
  verbose = 0,
  early_stopping_rounds = 10,
  nthread = num_cores  # Number of threads for parallel processing
)

best_nrounds_synth <- cv_synth$best_iteration

# Train XGBoost on real data (TRTR)
xgb_model_trtr <- xgboost(params = params, data = dtrain_real, nrounds = best_nrounds_real, verbose = 0, nthread = num_cores)

# Train XGBoost on synthetic data (TSTR)
xgb_model_tstr <- xgboost(params = params, data = dtrain_synth, nrounds = best_nrounds_synth, verbose = 0, nthread = num_cores)

# Get feature importance for TRTR
importance_trtr <- xgb.importance(feature_names = colnames(train_real_matrix), model = xgb_model_trtr)
cat("Feature Importance for TRTR:\n")

Feature Importance for TRTR:

Show code

print(importance_trtr)

                    Feature        Gain      Cover  Frequency
                     <char>       <num>      <num>      <num>
1:                follow_up 0.550035467 0.35582680 0.26470588
2:         serum_creatinine 0.179450673 0.25780255 0.22794118
3:        ejection_fraction 0.107840968 0.17061583 0.12500000
4: creatinine_phosphokinase 0.100830923 0.12847956 0.20588235
5:             serum_sodium 0.032740748 0.04233723 0.08823529
6:                      age 0.022510598 0.03446459 0.04411765
7:                platelets 0.006590623 0.01047345 0.04411765

Show code

# Get feature importance for TSTR
importance_tstr <- xgb.importance(feature_names = colnames(train_synth_matrix), model = xgb_model_tstr)
cat("\nFeature Importance for TSTR:\n")


Feature Importance for TSTR:

Show code

print(importance_tstr)

                    Feature       Gain      Cover Frequency
                     <char>      <num>      <num>     <num>
1:                follow_up 0.37991328 0.37325270      0.20
2:        ejection_fraction 0.22999025 0.21347069      0.32
3:             serum_sodium 0.20146359 0.17350782      0.16
4:                platelets 0.13419963 0.10862788      0.12
5:         serum_creatinine 0.04007483 0.12057396      0.16
6: creatinine_phosphokinase 0.01435841 0.01056696      0.04

Show code

# Visualise feature importance for TRTR and TSTR
importance_trtr$Dataset <- "Real Data"
importance_tstr$Dataset <- "Synthetic Data"

importance_combined <- rbind(importance_trtr, importance_tstr)

# Plot feature importance comparison
ggplot(importance_combined, aes(x = reorder(Feature, Gain), y = Gain, fill = Dataset)) +
  geom_bar(stat = "identity", position = "dodge") +
  coord_flip() +
  labs(title = "Feature Importance Comparison (TRTR vs TSTR)",
       x = "Feature",
       y = "Gain") +
  scale_fill_manual(values = c("Real Data" = "#B0D99B", "Synthetic Data" = "#528AA8")) +
  theme_minimal()

Show code

# Correlation analysis between feature importance rankings

importance_trtr <- importance_trtr %>% mutate(rank = rank(-Gain))
importance_tstr <- importance_tstr %>% mutate(rank = rank(-Gain))

importance_ranking <- inner_join(importance_trtr, importance_tstr, by = "Feature", suffix = c("_trtr", "_tstr"))

correlation <- cor(importance_ranking$rank_trtr, importance_ranking$rank_tstr, method = "spearman")
cat("\nSpearman correlation Between TRTR and TSTR Feature Importance Rankings:\n")


Spearman correlation Between TRTR and TSTR Feature Importance Rankings:

Show code

print(correlation)

[1] 0.3714286

Show code

# Calculate SHAP values for better interpretability

# SHAP values for TRTR
shap_values_trtr <- shap.values(xgb_model_trtr, dtrain_real)
shap_long_trtr <- shap.prep(shap_contrib = shap_values_trtr$shap_score, X_train = train_real_matrix)

cat("\nSHAP Summary for TRTR:\n")


SHAP Summary for TRTR:

Show code

shap_values_trtr$mean_shap_score %>% print()

               follow_up         serum_creatinine        ejection_fraction 
              0.92293409               0.52780225               0.39932806 
creatinine_phosphokinase             serum_sodium                      age 
              0.18850937               0.10067811               0.06711965 
               platelets                sexFemale                  sexMale 
              0.02991459               0.00000000               0.00000000 
              anaemiaYes              diabetesYes               smokingYes 
              0.00000000               0.00000000               0.00000000 
         hypertensionYes 
              0.00000000

Show code

# SHAP values for TSTR
shap_values_tstr <- shap.values(xgb_model_tstr, dtrain_synth)
shap_long_tstr <- shap.prep(shap_contrib = shap_values_tstr$shap_score, X_train = train_synth_matrix)

cat("\nSHAP Summary for TSTR:\n")


SHAP Summary for TSTR:

Show code

shap_values_tstr$mean_shap_score %>% print()

               follow_up        ejection_fraction                platelets 
             0.210816300              0.069274587              0.067060756 
            serum_sodium         serum_creatinine creatinine_phosphokinase 
             0.058853411              0.035087248              0.006947092 
                     age                sexFemale                  sexMale 
             0.000000000              0.000000000              0.000000000 
              anaemiaYes              diabetesYes               smokingYes 
             0.000000000              0.000000000              0.000000000 
         hypertensionYes 
             0.000000000

Show code

# SHAP visualisation
shap.plot.summary(shap_long_trtr)

Show code

shap.plot.summary(shap_long_tstr)

7.2 Prediction Score

The Prediction Score assesses synthetic data utility by comparing models trained on real data (Train Real Test Real (TRTR)) versus synthetic data (Train Synthetic Test Real (TSTR)) using two metrics:

Accuracy: The proportion of correct predictions made by the model.
pMSE (Prediction Mean Squared Error): Measures the average of the squared differences between predicted and actual values.

Ideal Ranges for Prediction Score:

Accuracy: TSTR accuracy should be within 5-10% of TRTR accuracy. Lower accuracy signals that synthetic data isn’t fully representative of the real dataset.
pMSE: TSTR pMSE should be close to TRTR pMSE. A 5-10% deviation is acceptable; a higher pMSE suggests synthetic data may not be accurately reflecting the true relationships in the real data.

This section assesses the utility of synthetic data generated through various methods, using XGBoost to calculate both Accuracy and pMSE for TRTR and TSTR.

Show code

set.seed(123)

# ---- helpers ----
strip_synth_prefix <- function(df) { names(df) <- sub("^synth_", "", names(df)); df }

align_target_levels <- function(df, levels_ref = NULL, positive = NULL) {
  if (!is.factor(df$deceased)) {
    if (is.numeric(df$deceased)) df$deceased <- factor(as.character(df$deceased)) else df$deceased <- factor(df$deceased)
  }
  if (!is.null(levels_ref)) df$deceased <- factor(df$deceased, levels = levels_ref)
  lev <- levels(df$deceased)
  if (length(lev) < 2) stop("Target 'deceased' must have two levels.")
  pos <- if (!is.null(positive) && positive %in% lev) positive else if ("1" %in% lev) "1" else lev[2]
  df$deceased <- stats::relevel(df$deceased, ref = pos)
  list(data = df, levels = levels(df$deceased), positive = pos)
}

evaluate_model <- function(train_data, test_data, label, tune_grid, train_control, seed = 123) {
  test_aln  <- align_target_levels(test_data)
  train_aln <- align_target_levels(train_data, levels_ref = test_aln$levels, positive = test_aln$positive)
  train_data <- train_aln$data
  test_data  <- test_aln$data

  set.seed(seed)
  model <- caret::train(
    deceased ~ follow_up + serum_creatinine + ejection_fraction +
      creatinine_phosphokinase + serum_sodium + age + platelets,
    data      = train_data,
    method    = "xgbTree",
    trControl = train_control,
    tuneGrid  = tune_grid,
    na.action = na.pass
  )

  pred <- predict(model, test_data)
  acc  <- mean(pred == test_data$deceased, na.rm = TRUE)
  yhat <- as.numeric(pred)
  y    <- as.numeric(test_data$deceased)
  pMSE <- mean((yhat - y)^2, na.rm = TRUE)

  tibble::tibble(Dataset = label, Accuracy = acc, pMSE = pMSE)
}

# ---- CV setup ----
train_control <- caret::trainControl(method = "cv", number = 5)
tune_grid <- expand.grid(
  nrounds = 50,
  max_depth = 3,
  eta = 0.1,
  gamma = 0,
  colsample_bytree = 0.8,
  min_child_weight = 1,
  subsample = 0.7
)

# ---- prepare synthetic splits ----
syn_mice      <- strip_synth_prefix(syn_data_1)
syn_cart      <- strip_synth_prefix(syn_cart_1)
syn_synthpop  <- strip_synth_prefix(syn_data_low_fidelity_synthpop)
syn_metadata  <- strip_synth_prefix(syn_data_metadata)

train_synth_mice     <- syn_mice[train_indices, , drop = FALSE]
train_synth_cart     <- syn_cart[train_indices, , drop = FALSE]
train_synth_synthpop <- syn_synthpop[train_indices, , drop = FALSE]
train_synth_metadata <- syn_metadata[train_indices, , drop = FALSE]

# ---- results table ----
results <- dplyr::bind_rows(
  evaluate_model(train_real,               test_real, "TRTR (Real→Real)",     tune_grid, train_control),
  evaluate_model(train_synth_mice,         test_real, "TSTR (MICE→Real)",     tune_grid, train_control),
  evaluate_model(train_synth_cart,         test_real, "TSTR (CART→Real)",     tune_grid, train_control),
  evaluate_model(train_synth_synthpop,     test_real, "TSTR (Synthpop→Real)", tune_grid, train_control),
  evaluate_model(train_synth_metadata,     test_real, "TSTR (Metadata→Real)", tune_grid, train_control)
) %>%
  mutate(
    Accuracy = scales::percent(Accuracy, accuracy = 0.1),
    pMSE     = round(pMSE, 3)
  )

knitr::kable(
  results,
  caption = "Prediction Score Summary (XGBoost, 5-fold CV): TRTR vs TSTR",
  align = "lrr"
)

Prediction Score Summary (XGBoost, 5-fold CV): TRTR vs TSTR
Dataset	Accuracy	pMSE
TRTR (Real→Real)	67.8%	0.322
TSTR (MICE→Real)	70.0%	0.300
TSTR (CART→Real)	70.0%	0.300
TSTR (Synthpop→Real)	66.7%	0.333
TSTR (Metadata→Real)	56.7%	0.433

Show code

# ---- horizontal Accuracy bar chart ----
palette5 <- c("#B0D99B", "#528AA8", "#FFB6DB", "#264653", "#2A9D8F")

plot_df <- results %>%
  mutate(Accuracy_num = readr::parse_number(Accuracy) / 100)

ggplot(plot_df, aes(y = Dataset, x = Accuracy_num, fill = Dataset)) +
  geom_col(alpha = 0.9, show.legend = FALSE) +
  geom_text(aes(label = Accuracy), hjust = -0.1, size = 3.5) +
  scale_fill_manual(values = setNames(palette5, unique(plot_df$Dataset))) +
  scale_x_continuous(labels = scales::percent_format(accuracy = 1), limits = c(0, 1)) +
  labs(title = "Model Accuracy: TRTR vs TSTR", x = "Accuracy", y = NULL) +
  theme_minimal()

7.3 Quality Score (QScore)

The QScore provides a comprehensive evaluation of synthetic data by aggregating multiple metrics, such as distribution similarity, feature importance, and correlation preservation, into a single score. This offers a clear, overall assessment of how well the synthetic data replicates the real dataset.

Ideal Ranges for QScore:

Excellent Quality (0.8 - 1.0): Indicates the synthetic data closely mirrors the real and can be confidently used for most analyses.
Good Quality (0.6 - 0.8): Reflects good alignment, with some minor deviations. Suitable for most use cases, but some caution is advised.
Moderate Quality (0.4 - 0.6): The synthetic data may be useful for exploratory analysis but shows significant deviations in key metrics.
Poor Quality (< 0.4): The synthetic data poorly aligns with the real and is likely unsuitable for most analytic or modeling tasks.

Show code

# ---- QScore helper ----
calculate_qscore <- function(real_data, synthetic_data, num_queries = 10) {
  numeric_real_vars  <- names(real_data)[sapply(real_data, is.numeric)]
  numeric_synth_vars <- names(synthetic_data)[sapply(synthetic_data, is.numeric)]
  common_vars <- intersect(numeric_real_vars, numeric_synth_vars)
  if (length(common_vars) == 0) return(NA_real_)

  qscores <- replicate(num_queries, {
    var_name <- sample(common_vars, 1)
    fun_name <- sample(c("mean", "sum"), 1)
    fun      <- match.fun(fun_name)

    real_val  <- fun(real_data[[var_name]],  na.rm = TRUE)
    synth_val <- fun(synthetic_data[[var_name]], na.rm = TRUE)

    if (!is.na(real_val) && real_val != 0 && !is.na(synth_val)) {
      abs(real_val - synth_val) / abs(real_val)  # proportion difference (0–1)
    } else {
      0
    }
  })

  mean(qscores, na.rm = TRUE)
}

# ---- Compute and present ----
qscore_tbl <- tibble::tibble(
  Dataset = c("Parametric MICE", "CART Imputation",
              "Synthpop (Low Fidelity)", "Metadata-Based"),
  QScore = c(
    calculate_qscore(heart_failure, syn_data_1),
    calculate_qscore(heart_failure, syn_cart_1),
    calculate_qscore(heart_failure, syn_data_low_fidelity_synthpop),
    calculate_qscore(heart_failure, syn_data_metadata)
  )
) %>%
  dplyr::mutate(
    QScore_num = QScore,                                   # keep numeric 0–1
    QScore_pct = scales::percent(QScore_num, accuracy = 0.1) # pretty %
  )

# --- User-friendly table (as percentages) ---
knitr::kable(
  qscore_tbl %>% dplyr::select(Dataset, `QScore (%)` = QScore_pct),
  caption = "Quality Score (QScore) — average % difference in aggregates (lower is better)",
  align = "lr"
)

Quality Score (QScore) — average % difference in aggregates (lower is better)
Dataset	QScore (%)
Parametric MICE	8.8%
CART Imputation	4.7%
Synthpop (Low Fidelity)	10.6%
Metadata-Based	3.0%

Show code

# --- Horizontal bar chart (percent axis) ---
pal <- c("#B0D99B", "#528AA8", "#FFB6DB", "#264653", "#2A9D8F")

ggplot(qscore_tbl, aes(y = Dataset, x = QScore_num, fill = Dataset)) +
  geom_col(alpha = 0.9, show.legend = FALSE) +
  geom_text(aes(label = QScore_pct), hjust = -0.1, size = 3.5) +
  scale_fill_manual(values = setNames(pal[seq_len(nrow(qscore_tbl))], qscore_tbl$Dataset)) +
  scale_x_continuous(labels = scales::percent_format(accuracy = 1), limits = c(0, 1),
                     expand = expansion(mult = c(0, 0.05))) +
  labs(
    title = "QScore Comparison (Lower is Better)",
    subtitle = "Average percentage difference between real and synthetic aggregates across random queries",
    x = "QScore (%)", y = NULL
  ) +
  theme_minimal()

8 Privacy / Disclosure Risk

8.1 Check for Replication of Real Value Combinations

This check identifies if any records in the synthetic data are exact duplicates of those in the real dataset. High replication increases privacy risks, so minimal or no duplication is ideal for preserving privacy.

Show code

# ---- Replicated rows count ----
check_replicated_rows <- function(real_data, synthetic_data) {
  combined <- rbind(real_data, synthetic_data)
  sum(duplicated(combined))
}

# ---- Build results table ----
replication_tbl <- tibble::tibble(
  Dataset = c("Parametric MICE", "Non-Parametric CART", 
              "Synthpop (Low Fidelity)", "Metadata-Based"),
  Replicated_Rows = c(
    check_replicated_rows(heart_failure, syn_data_1),
    check_replicated_rows(heart_failure, syn_cart_1),
    check_replicated_rows(heart_failure, syn_data_low_fidelity_synthpop),
    check_replicated_rows(heart_failure, syn_data_metadata)
  )
) %>%
  mutate(
    Total_Rows = nrow(heart_failure),
    Replication_Percent = round((Replicated_Rows / Total_Rows) * 100, 2),
    Status = ifelse(Replicated_Rows > 0, "⚠️ Replication Found", "✅ No Replication")
  )

# ---- User-friendly table ----
knitr::kable(
  replication_tbl,
  caption = "Replication Check: Exact Matching Rows Between Real and Synthetic Datasets",
  align = "lrrrl"
)

Replication Check: Exact Matching Rows Between Real and Synthetic Datasets
Dataset	Total_Rows	Status
Parametric MICE	299	✅ No Replication
Non-Parametric CART	299	✅ No Replication
Synthpop (Low Fidelity)	299	✅ No Replication
Metadata-Based	299	✅ No Replication

Show code

# ---- Optional: bar chart if any replication > 0 ----
if (any(replication_tbl$Replicated_Rows > 0)) {
  pal <- c("#B0D99B", "#528AA8", "#FFB6DB", "#264653")
  
  ggplot(replication_tbl, aes(y = Dataset, x = Replicated_Rows, fill = Dataset)) +
    geom_col(alpha = 0.9, show.legend = FALSE) +
    geom_text(aes(label = Replicated_Rows), hjust = -0.1, size = 3.5) +
    scale_fill_manual(values = pal[seq_len(nrow(replication_tbl))]) +
    labs(
      title = "Replication of Real Rows in Synthetic Data",
      subtitle = "Counts of exact duplicated rows across real and synthetic datasets",
      x = "Number of Replicated Rows", y = NULL
    ) +
    theme_minimal()
} else {
  cat("✅ No replicated rows were found across any synthetic dataset. No plot is generated.\n")
}

✅ No replicated rows were found across any synthetic dataset. No plot is generated.

8.2 Check for Unique Value Combinations

This analysis compares the number of unique row combinations between the real and synthetic datasets. A high number of unique combinations in the synthetic data suggests that it captures diverse patterns while minimising the replication of sensitive information, enhancing privacy protection.

Show code

# ---- Count unique rows ----
unique_combinations <- function(data) {
  nrow(unique(data))
}

# ---- Build results table ----
unique_tbl <- tibble::tibble(
  Dataset = c("Real Data", "Parametric MICE", 
              "Non-Parametric CART", "Synthpop (Low Fidelity)", 
              "Metadata-Based"),
  Unique_Combinations = c(
    unique_combinations(heart_failure),
    unique_combinations(syn_data_1),
    unique_combinations(syn_cart_1),
    unique_combinations(syn_data_low_fidelity_synthpop),
    unique_combinations(syn_data_metadata)
  )
) %>%
  mutate(
    Total_Rows = c(nrow(heart_failure), rep(nrow(heart_failure), 4)),
    Percent_Unique = round((Unique_Combinations / Total_Rows) * 100, 2)
  )

# ---- User-friendly table ----
knitr::kable(
  unique_tbl,
  caption = "Unique Value Combinations in Real vs. Synthetic Datasets",
  align = "lrr"
)

Unique Value Combinations in Real vs. Synthetic Datasets
Dataset	Unique_Combinations	Total_Rows	Percent_Unique
Real Data	299	299	100
Parametric MICE	299	299	100
Non-Parametric CART	299	299	100
Synthpop (Low Fidelity)	299	299	100
Metadata-Based	299	299	100

Show code

# ---- Plot only if any dataset has > 0 unique rows ----
if (any(unique_tbl$Unique_Combinations > 0, na.rm = TRUE)) {
  pal <- c("#B0D99B", "#528AA8", "#FFB6DB", "#264653", "#2A9D8F")

  ggplot(unique_tbl, aes(y = Dataset, x = Percent_Unique, fill = Dataset)) +
    geom_col(alpha = 0.9, show.legend = FALSE) +
    geom_text(aes(label = paste0(Percent_Unique, "%")), hjust = -0.1, size = 3.5) +
    scale_fill_manual(values = pal[seq_len(nrow(unique_tbl))]) +
    scale_x_continuous(limits = c(0, 100)) +
    labs(
      title = "Percentage of Unique Row Combinations",
      subtitle = "Higher % means fewer duplicates within the dataset",
      x = "Percent of Rows that are Unique", y = NULL
    ) +
    theme_minimal()
} else {
  cat("✅ No unique combinations detected across datasets.  No plot is generated. \n")
}

8.3 Exact Match Score

The Exact Match Score measures how many individual records in the synthetic data are identical to records in the real dataset. Lower scores are ideal for privacy preservation, as high exact matches may pose privacy risks.

Ideal Ranges for Exact Match Score:

0 - 0.1: Strong privacy protection with minimal or no exact matches, ideal for most privacy-preserving synthetic data.
0.1 - 0.3: Moderate exact matches, indicating some privacy risk but generally acceptable for many use cases.
0.3 - 0.5: Higher exact matches, raising privacy concerns. Review is needed to ensure proper data generation.
Above 0.5: Significant privacy risk, as the synthetic data too closely replicates the real.

Show code

# ---- Membership Inference Score ----
calculate_membership_inference_score <- function(real_data, synthetic_data, k = 5, threshold = 0.01) {
  real_num  <- real_data[sapply(real_data, is.numeric)]
  synth_num <- synthetic_data[sapply(synthetic_data, is.numeric)]
  if (ncol(real_num) == 0 || ncol(synth_num) == 0) return(NA_real_)
  common <- intersect(names(real_num), names(synth_num))
  if (length(common) == 0) return(NA_real_)
  real_num  <- real_num[common]
  synth_num <- synth_num[common]
  real_cc  <- real_num[complete.cases(real_num), , drop = FALSE]
  synth_cc <- synth_num[complete.cases(synth_num), , drop = FALSE]
  if (nrow(real_cc) < k || nrow(synth_cc) == 0) return(NA_real_)
  real_scaled  <- scale(real_cc)
  synth_scaled <- scale(synth_cc)
  nn <- FNN::get.knnx(real_scaled, synth_scaled, k)$nn.dist
  risky <- sum(rowMeans(nn) < threshold)
  risky / nrow(synth_scaled)
}

# ---- Results table ----
mis_tbl <- tibble::tibble(
  Dataset = c("Parametric MICE", "Non-Parametric CART",
              "Synthpop (Low Fidelity)", "Metadata-Based"),
  Membership_Inference_Score = c(
    calculate_membership_inference_score(heart_failure, syn_data_1),
    calculate_membership_inference_score(heart_failure, syn_cart_1),
    calculate_membership_inference_score(heart_failure, syn_data_low_fidelity_synthpop),
    calculate_membership_inference_score(heart_failure, syn_data_metadata)
  )
) %>%
  mutate(
    Membership_Inference_Score = round(Membership_Inference_Score, 4),
    Status = dplyr::case_when(
      is.na(Membership_Inference_Score)              ~ "— Insufficient data",
      Membership_Inference_Score == 0                ~ "✅ No risky memberships",
      Membership_Inference_Score < 0.01              ~ "⚠️ Low risk",
      TRUE                                           ~ "❌ High risk"
    )
  )

# ---- User-friendly table ----
knitr::kable(
  mis_tbl,
  caption = "Membership Inference Score: Proportion of synthetic records that are very close to real records (lower is better for privacy)",
  align   = "lrr"
)

Membership Inference Score: Proportion of synthetic records that are very close to real records (lower is better for privacy)
Dataset	Membership_Inference_Score	Status
Parametric MICE	0	✅ No risky memberships
Non-Parametric CART	0	✅ No risky memberships
Synthpop (Low Fidelity)	0	✅ No risky memberships
Metadata-Based	0	✅ No risky memberships

Show code

# ---- Conditional plot ----
if (any(mis_tbl$Membership_Inference_Score > 0, na.rm = TRUE)) {
  pal <- c("#B0D99B", "#528AA8", "#FFB6DB", "#264653")
  ggplot(mis_tbl, aes(y = Dataset, x = Membership_Inference_Score, fill = Dataset)) +
    geom_col(alpha = 0.9, show.legend = FALSE, na.rm = TRUE) +
    geom_text(
      aes(label = ifelse(is.na(Membership_Inference_Score), "NA",
                         scales::percent(Membership_Inference_Score, accuracy = 0.01))),
      hjust = -0.1, size = 3.5
    ) +
    scale_fill_manual(values = pal[seq_len(nrow(mis_tbl))]) +
    scale_x_continuous(labels = scales::percent_format(accuracy = 1)) +
    labs(
      title    = "Membership Inference Score by Synthetic Method",
      subtitle = "Lower values indicate better privacy protection",
      x = "Score (Proportion of risky memberships)",
      y = NULL
    ) +
    theme_minimal()
} else {
  cat("✅ All Membership Inference Scores are 0. No privacy risks detected, so no plot is generated\n")
}

✅ All Membership Inference Scores are 0. No privacy risks detected, so no plot is generated

8.4 Neighbors’ Privacy Score

The Neighbors’ Privacy Score identifies real records that are “too similar” to synthetic records by using a nearest-neighbors search. This metric helps evaluate privacy risk by detecting synthetic records that closely resemble real data, which could compromise privacy.

Ideal Ranges for Neighbors’ Privacy Score:

0 - 0.1: Indicates strong privacy protection, with very few synthetic records too similar to real ones.
0.1 - 0.3: Moderate similarity, where privacy risk exists but remains within acceptable limits for many cases.
Above 0.3: Higher similarity, raising privacy concerns and requiring review to ensure the synthetic data is not too closely mimicking real data.

Show code

# ---- Neighbors' Privacy Score ----
calculate_neighbors_privacy_score <- function(real_data, synthetic_data, k = 5, threshold = 0.01) {
  # Numeric-only columns
  real_num  <- real_data[sapply(real_data, is.numeric)]
  synth_num <- synthetic_data[sapply(synthetic_data, is.numeric)]
  
  if (ncol(real_num) == 0 || ncol(synth_num) == 0) return(NA_real_)
  
  # Keep only common columns
  common <- intersect(names(real_num), names(synth_num))
  if (length(common) == 0) return(NA_real_)
  real_num  <- real_num[common]
  synth_num <- synth_num[common]
  
  # Complete cases
  real_cc  <- real_num[complete.cases(real_num), , drop = FALSE]
  synth_cc <- synth_num[complete.cases(synth_num), , drop = FALSE]
  
  if (nrow(real_cc) < k || nrow(synth_cc) == 0) return(NA_real_)
  
  # Standardise
  real_scaled  <- scale(real_cc)
  synth_scaled <- scale(synth_cc)
  
  # k-NN distances from synthetic → real
  neighbors <- FNN::get.knnx(real_scaled, synth_scaled, k)$nn.dist
  
  # Privacy risk: synthetic records "too close" to real
  close_matches <- sum(rowMeans(neighbors) < threshold)
  score <- close_matches / nrow(synth_scaled)
  
  return(score)
}

# ---- Build results table ----
neighbors_tbl <- tibble::tibble(
  Dataset = c("Parametric MICE", "Non-Parametric CART",
              "Synthpop (Low Fidelity)", "Metadata-Based"),
  Neighbors_Privacy_Score = c(
    calculate_neighbors_privacy_score(heart_failure, syn_data_1),
    calculate_neighbors_privacy_score(heart_failure, syn_cart_1),
    calculate_neighbors_privacy_score(heart_failure, syn_data_low_fidelity_synthpop),
    calculate_neighbors_privacy_score(heart_failure, syn_data_metadata)
  )
) %>%
  mutate(
    Neighbors_Privacy_Score = round(Neighbors_Privacy_Score, 4),
    Status = case_when(
      Neighbors_Privacy_Score == 0 ~ "✅ No risky neighbors",
      Neighbors_Privacy_Score < 0.01 ~ "⚠️ Low risk",
      TRUE ~ "❌ High risk"
    )
  )

# ---- User-friendly table ----
knitr::kable(
  neighbors_tbl,
  caption = "Neighbors' Privacy Score: Proportion of synthetic records too close to real records (lower is better for privacy)",
  align = "lrr"
)

Neighbors’ Privacy Score: Proportion of synthetic records too close to real records (lower is better for privacy)
Dataset	Neighbors_Privacy_Score	Status
Parametric MICE	0	✅ No risky neighbors
Non-Parametric CART	0	✅ No risky neighbors
Synthpop (Low Fidelity)	0	✅ No risky neighbors
Metadata-Based	0	✅ No risky neighbors

Show code

# ---- Plot only if any score > 0 ----
if (any(neighbors_tbl$Neighbors_Privacy_Score > 0)) {
  pal <- c("#B0D99B", "#528AA8", "#FFB6DB", "#264653")
  
  ggplot(neighbors_tbl, aes(y = Dataset, x = Neighbors_Privacy_Score, fill = Dataset)) +
    geom_col(alpha = 0.9, show.legend = FALSE) +
    geom_text(aes(label = Neighbors_Privacy_Score), hjust = -0.1, size = 3.5) +
    scale_fill_manual(values = pal[seq_len(nrow(neighbors_tbl))]) +
    scale_x_continuous(labels = scales::percent_format(accuracy = 0.1)) +
    labs(
      title = "Neighbors' Privacy Score by Synthetic Method",
      subtitle = "Lower values mean synthetic records are less likely to be close replicas of real individuals",
      x = "Privacy Score (Proportion of risky neighbors)", y = NULL
    ) +
    theme_minimal()
} else {
  cat("✅ All Neighbors' Privacy Scores are 0. No risky neighbors detected. No plot is generated.\n")
}

✅ All Neighbors' Privacy Scores are 0. No risky neighbors detected. No plot is generated.

8.5 Membership Inference Score

The Membership Inference Score evaluates the risk of membership inference attacks, which attempt to determine whether a specific record belongs to the real dataset. This metric helps assess the vulnerability of synthetic data to privacy breaches.

Ideal Ranges for Membership Inference Score:

0 - 0.1: Indicates low risk, where it is unlikely that membership inference attacks can accurately identify real records.
0.1 - 0.3: Moderate risk, where some vulnerability to membership inference exists but may still be acceptable for certain use cases.
Above 0.3: High risk, suggesting significant privacy concerns as the synthetic data could reveal membership information about the real records.

Show code

# ---- Membership Inference Score ----
# Interpreted as the proportion of synthetic records that are "too close" to the real data.
# Lower is better (safer).
calculate_membership_inference_score <- function(real_data, synthetic_data, k = 5, threshold = 0.01) {
  # Numeric-only columns
  real_num  <- real_data[sapply(real_data, is.numeric)]
  synth_num <- synthetic_data[sapply(synthetic_data, is.numeric)]

  # Guardrails
  if (ncol(real_num) == 0 || ncol(synth_num) == 0) return(NA_real_)

  # Keep only common numeric variables
  common <- intersect(names(real_num), names(synth_num))
  if (length(common) == 0) return(NA_real_)
  real_num  <- real_num[common]
  synth_num <- synth_num[common]

  # Use complete cases
  real_cc  <- real_num[complete.cases(real_num), , drop = FALSE]
  synth_cc <- synth_num[complete.cases(synth_num), , drop = FALSE]
  if (nrow(real_cc) < k || nrow(synth_cc) == 0) return(NA_real_)

  # Standardize
  real_scaled  <- scale(real_cc)
  synth_scaled <- scale(synth_cc)

  # k-NN distances from synthetic → real
  nn <- FNN::get.knnx(real_scaled, synth_scaled, k)$nn.dist

  # Score: proportion of synthetic records whose average distance to k nearest reals is below threshold
  risky <- sum(rowMeans(nn) < threshold)
  risky / nrow(synth_scaled)
}

# ---- Results table ----
mis_tbl <- tibble::tibble(
  Dataset = c("Parametric MICE", "Non-Parametric CART",
              "Synthpop (Low Fidelity)", "Metadata-Based"),
  Membership_Inference_Score = c(
    calculate_membership_inference_score(heart_failure, syn_data_1),
    calculate_membership_inference_score(heart_failure, syn_cart_1),
    calculate_membership_inference_score(heart_failure, syn_data_low_fidelity_synthpop),
    calculate_membership_inference_score(heart_failure, syn_data_metadata)
  )
) %>%
  mutate(
    Membership_Inference_Score = round(Membership_Inference_Score, 4),
    Status = dplyr::case_when(
      is.na(Membership_Inference_Score)              ~ "— Insufficient data",
      Membership_Inference_Score == 0                ~ "✅ No risky memberships",
      Membership_Inference_Score < 0.01              ~ "⚠️ Low risk",
      TRUE                                           ~ "❌ High risk"
    )
  )

# ---- User-friendly table ----
knitr::kable(
  mis_tbl,
  caption = "Membership Inference Score: Proportion of synthetic records that are very close to real records (lower is better for privacy)",
  align   = "lrr"
)

Membership Inference Score: Proportion of synthetic records that are very close to real records (lower is better for privacy)
Dataset	Membership_Inference_Score	Status
Parametric MICE	0	✅ No risky memberships
Non-Parametric CART	0	✅ No risky memberships
Synthpop (Low Fidelity)	0	✅ No risky memberships
Metadata-Based	0	✅ No risky memberships

Show code

# ---- Plot only if any risky memberships (> 0) ----
if (any(mis_tbl$Membership_Inference_Score > 0, na.rm = TRUE)) {
  pal <- c("#B0D99B", "#528AA8", "#FFB6DB", "#264653")
  
  ggplot(mis_tbl, aes(y = Dataset, x = Membership_Inference_Score, fill = Dataset)) +
    geom_col(alpha = 0.9, show.legend = FALSE, na.rm = TRUE) +
    geom_text(
      aes(label = ifelse(is.na(Membership_Inference_Score), "NA",
                         scales::percent(Membership_Inference_Score, accuracy = 0.01))),
      hjust = -0.1, size = 3.5
    ) +
    scale_fill_manual(values = pal[seq_len(nrow(mis_tbl))]) +
    scale_x_continuous(labels = scales::percent_format(accuracy = 1)) +
    labs(
      title    = "Membership Inference Score by Synthetic Method",
      subtitle = "Lower values indicate better privacy protection",
      x = "Score (Proportion of risky memberships)",
      y = NULL
    ) +
    theme_minimal()
} else {
  cat("✅ All Membership Inference Scores are 0. No risky memberships detected. No plot is generated. \n")
}

✅ All Membership Inference Scores are 0. No risky memberships detected. No plot is generated.

9 Conclusions

This report evaluated the fidelity, utility, privacy, and overall quality of synthetic data generated using parametric MICE, non-parametric CART, Synthpop, and metadata-based methods. Key findings include:

Data Structure Preservation: All methods preserved the structure of the real dataset, though minor variations in variable distributions were observed.
Categorical Variables: Most synthetic datasets retained categorical levels, but frequency distribution discrepancies were particularly notable in Synthpop data.
Numeric Variables: The range and distribution of numeric variables were largely maintained across methods, with slight variations in density plots.
Correlation & Mutual Information: While most methods captured variable relationships, weaker correlations and lower mutual information scores suggested some loss of feature dependencies in certain synthetic datasets.
Utility (Prediction Score): Synthetic data performed similarly to real data in predictive modeling, though slight reductions in accuracy and increased pMSE indicated less precise representation of the real dataset’s predictive capabilities.
Privacy (Exact Match & Neighbors’ Privacy Scores): All synthetic datasets demonstrated strong privacy protection, with zero values for both Exact Match and Neighbors’ Privacy Scores. This indicates that no synthetic records were identical to or highly similar to any real records, affirming robust privacy across all methods.

In conclusion, synthetic data generated through these methods provides a balance between utility and privacy, though method selection should align with specific analytic or privacy needs.

Show code

# Stop the parallel cluster
stopCluster(cl)
registerDoSEQ()