# Load necessary libraries# --- Project and Performance Utilities ---library(here) # Manage project-relative file pathslibrary(doParallel) # Enable parallel processing for faster computationslibrary(tidyverse) # Core data science toolkit (dplyr, tidyr, readr, purrr, etc.)library(readxl) # Import Excel (.xlsx, .xls) files into R# --- Data Exploration and Summaries ---library(psych) # Produce descriptive statistics and psychometric analyseslibrary(knitr) # Format dynamic tables and reports in Markdown/Quarto# --- Visualisation ---library(ggplot2) # Build customizable, high-quality data visualizationslibrary(patchwork) # Combine multiple ggplot2 plots into unified layoutslibrary(corrplot) # Visualize correlation matrices with heatmaps and plotslibrary(GGally) # Extend ggplot2 with correlation plots, pairwise plots, ggpairs()library(VIM) # Visualize/impute missing values with advanced methods (e.g., matrix plots)library(scales)# --- Data Imputation and Synthesis ---library(mice) # Perform multiple imputation by chained equations for missing datalibrary(synthpop) # Generate, evaluate, and compare synthetic datasets# --- Statistical and Information-Theoretic Measures ---library(transport) # Compute optimal transport distances (e.g., Wasserstein) for distribution comparisonlibrary(infotheo) # Calculate information-theoretic metrics (entropy, mutual information)# --- Modeling and Machine Learning ---library(caret) # Streamline machine learning workflows (training, tuning, evaluation)library(xgboost) # Implement gradient boosting for classification/regressionlibrary(SHAPforxgboost) # Explain XGBoost predictions using SHAP valueslibrary(FNN) # Perform fast k-nearest neighbor searches and distance calculations# Set up parallel processingnum_cores <-detectCores() -1# Use one less than the total number of corescl <-makeCluster(num_cores)registerDoParallel(cl)
1 Introduction
This project evaluates the quality of synthetic datasets derived from the Heart Failure Clinical Records data (299 patients, 13 variables). Our goal is to determine how well different generation strategies reproduce the statistical properties and analytic value of the original data while protecting privacy. We generate synthetic data using four approaches: (1) parametric imputation (MICE), (2) non-parametric imputation (CART via MICE), (3) distribution-driven synthesis (synthpop), and (4) metadata-guided rules. We then compare each synthetic dataset with the real data across three dimensions:
Fidelity – univariate and multivariate similarity (distributions, ranges, correlations), histogram similarity, and mutual information.
Utility – model transportability using XGBoost (TRTR vs. TSTR), feature-importance agreement, and SHAP-based behaviour.
Privacy / Disclosure risk – exact record matches, neighbour-proximity checks, and membership-inference sensitivity.
Results are presented through aligned tables and plots (structure checks, categorical level retention, density overlays, correlation matrices, SHAP summaries) with concise scores for each method. The report concludes with a comparative summary to guide method selection given different fidelity–utility–privacy trade-offs.
2 Heart Failure Clinical Records Dataset and Initial Exploration
The Heart Failure Clinical Records dataset, sourced from the UCI Machine Learning Repository, contains records of 299 patients with heart failure collected during their follow-up period. Each record includes 13 clinical features covering demographics, clinical conditions, and laboratory measures:
Variable
Description
Unit / Levels
age
Age of the patient
Years
anaemia
Reduction in red blood cells
No / Yes
creatinine_phosphokinase
Enzyme level in the blood
mcg/L
diabetes
Diabetes status
No / Yes
ejection_fraction
% of blood leaving the heart with each contraction
Percentage (%)
platelets
Platelet count
Kiloplatelets/mL
serum_creatinine
Serum creatinine level
mg/dL
serum_sodium
Serum sodium level
mEq/L
sex
Gender of the patient
Female / Male
smoking
Smoking status
No / Yes
hypertension
Hypertension status
No / Yes
deceased
Survival status during follow-up
No / Yes
follow_up
Duration of the follow-up period
Days
A preview of the first 10 rows of the dataset is shown below.
Show code
# Set seed for reproducibilityset.seed(123)# Load the heart failure dataset from the local directoryheart_failure <-read.csv(here("data", "heart_failure.csv"))# Let's preview the heart failure datasetknitr::kable(head(heart_failure, 10),caption ="Preview of the first 10 rows of the Heart Failure dataset",align =rep("l", ncol(heart_failure)) # left align all columns)
Preview of the first 10 rows of the Heart Failure dataset
age
anaemia
creatinine_phosphokinase
diabetes
ejection_fraction
platelets
serum_creatinine
serum_sodium
sex
smoking
hypertension
deceased
follow_up
75
No
582
No
20
265000
1.9
130
Male
No
Yes
Yes
4
55
No
7861
No
NA
263358
1.1
NA
Male
No
No
Yes
6
65
No
NA
No
20
162000
NA
129
Male
Yes
No
Yes
7
NA
Yes
111
No
NA
NA
1.9
NA
Male
No
No
Yes
7
65
Yes
160
Yes
20
327000
2.7
116
Female
No
No
Yes
8
90
Yes
47
No
40
204000
NA
132
Male
Yes
Yes
Yes
8
75
Yes
246
No
15
NA
1.2
NA
Male
No
No
Yes
10
60
Yes
315
Yes
60
454000
1.1
131
Male
Yes
No
Yes
10
NA
No
NA
No
65
263358
NA
NA
Female
No
No
Yes
10
80
Yes
123
No
35
388000
9.4
133
Male
Yes
Yes
Yes
10
Here’s the preview of the structure of the heart failure dataset.
Let’s perform some basic data cleaning and preprocessing steps. Convert the following columns to factors: sex, anaemia, diabetes, smoking, hypertension and deceased. Here’s the new structure of the dataset after the basic cleaning.
Show code
# Convert the specified columns to factorsheart_failure$sex <-as.factor(heart_failure$sex)heart_failure$anaemia <-as.factor(heart_failure$anaemia)heart_failure$diabetes <-as.factor(heart_failure$diabetes)heart_failure$smoking <-as.factor(heart_failure$smoking)heart_failure$hypertension <-as.factor(heart_failure$hypertension)heart_failure$deceased <-as.factor(heart_failure$deceased)
Here are the summary statistics of the heart failure dataset.
Show code
# Generate descriptive statistics and display as a single tabledescribe(heart_failure) |>kable(caption ="Summary Statistics of the Heart Failure Dataset",digits =2, # round to 2 decimal placesalign ="lrrrrrrrrrr"# left for variable, right for numbers )
Summary Statistics of the Heart Failure Dataset
vars
n
mean
sd
median
trimmed
mad
min
max
range
skew
kurtosis
se
age
1
284
60.80
12.08
60.0
60.15
14.83
40.0
95.0
55.0
0.42
-0.28
0.72
anaemia*
2
299
1.43
0.50
1.0
1.41
0.00
1.0
2.0
1.0
0.28
-1.93
0.03
creatinine_phosphokinase
3
285
578.81
981.99
249.0
358.95
268.35
23.0
7861.0
7838.0
4.45
24.52
58.17
diabetes*
4
299
1.42
0.49
1.0
1.40
0.00
1.0
2.0
1.0
0.33
-1.90
0.03
ejection_fraction
5
287
38.00
11.92
38.0
37.31
11.86
14.0
80.0
66.0
0.58
0.02
0.70
platelets
6
285
265163.83
97508.63
263358.0
259212.20
65765.22
25100.0
850000.0
824900.0
1.43
6.24
5775.91
serum_creatinine
7
281
1.39
1.05
1.1
1.18
0.30
0.5
9.4
8.9
4.44
25.03
0.06
serum_sodium
8
279
136.59
4.49
137.0
136.80
4.45
113.0
148.0
35.0
-1.05
3.88
0.27
sex*
9
299
1.65
0.48
2.0
1.68
0.00
1.0
2.0
1.0
-0.62
-1.62
0.03
smoking*
10
299
1.32
0.47
1.0
1.28
0.00
1.0
2.0
1.0
0.76
-1.42
0.03
hypertension*
11
299
1.35
0.48
1.0
1.32
0.00
1.0
2.0
1.0
0.62
-1.62
0.03
deceased*
12
299
1.32
0.47
1.0
1.28
0.00
1.0
2.0
1.0
0.76
-1.42
0.03
follow_up
13
299
130.26
77.61
115.0
129.28
105.26
4.0
285.0
281.0
0.13
-1.22
4.49
3 Generating Synthetic Data
The process of generating synthetic data is essential for tasks such as privacy preservation, testing machine learning models, and conducting simulations. In this section, we employ various techniques to generate synthetic data based on the real heart failure dataset. Each method serves a different purpose and offers distinct advantages, depending on the use case and the type of missing data or privacy concerns.
We explore four key approaches for generating synthetic data:
3.1 Parametric Imputation Using MICE
This method uses the Multivariate Imputation by Chained Equations (MICE) framework, with parametric imputation based on a normal distribution. MICE allows for handling missing data through multiple imputations and is widely used in data science and healthcare for its flexibility and statistical rigor. Here’s the preview of the first 10 rows of the synthetic data generated.
Show code
# Function to compute the 5th and 95th percentiles for numeric columnscompute_percentiles <-function(data, lower_percentile =0.05, upper_percentile =0.95) { percentiles <-lapply(data, function(column) {if (is.numeric(column)) { lower <-quantile(column, probs = lower_percentile, na.rm =TRUE) upper <-quantile(column, probs = upper_percentile, na.rm =TRUE)return(c(lower, upper)) }return(c(NA, NA)) })names(percentiles) <-names(data)return(percentiles)}# Function to scale numeric values based on the 5th and 95th percentilesscale_to_percentiles <-function(data, percentiles) { scaled_data <- datafor (col incolnames(data)) {if (is.numeric(data[[col]])) {# Apply scaling only if percentiles are available for this columnif (!is.na(percentiles[[col]][1]) &&!is.na(percentiles[[col]][2])) { min_val <- percentiles[[col]][1] max_val <- percentiles[[col]][2]# Clip values to lie within the 5th and 95th percentile range scaled_data[[col]] <-pmin(pmax(data[[col]], min_val), max_val) } } }return(scaled_data)}# Function to generate missingness in a dataset based on the real data's patterngenerate_missingness_based_on_real <-function(real_data, synthetic_data) {# Ensure the synthetic dataset has the same structure as the realif (!all(colnames(real_data) ==colnames(synthetic_data))) {stop("Column names in real and synthetic data must match") }# Copy the real missingness pattern to the synthetic dataset synthetic_data_with_missingness <- synthetic_datafor (col incolnames(real_data)) {# Apply missingness where it was present in the real data missing_indices <-is.na(real_data[[col]]) synthetic_data_with_missingness[[col]][missing_indices] <-NA }return(synthetic_data_with_missingness)}# Compute the 5th and 95th percentiles for the real datapercentiles <-compute_percentiles(heart_failure)# Generate synthetic data using MICEwhere <-make.where(heart_failure, "all")method <-make.method(heart_failure, where = where)method[method =="pmm"] <-"norm"# Perform multiple imputation (10 datasets)syn_param <-mice(heart_failure, m =10, maxit =1, method = method, where = where, printFlag =FALSE)# Extract the first synthetic datasetsyn_data_1 <-complete(syn_param, 1)# Scale synthetic data to the 5th and 95th percentile rangessyn_data_1 <-scale_to_percentiles(syn_data_1, percentiles)# Apply the real missingness pattern to the synthetic datasetsyn_data_1 <-generate_missingness_based_on_real(heart_failure, syn_data_1)# Round off all numeric columns to 0 decimal placessyn_data_1 <- syn_data_1 %>%mutate(across(where(is.numeric), round, digits =0))# Add a prefix 'synth_' to all synthetic dataset variable namescolnames(syn_data_1) <-paste("synth", colnames(syn_data_1), sep ="_")# Display the structure of the dataset (fix double variable issue)df_syn <-data.frame(Variable =names(syn_data_1),Type =unname(sapply(syn_data_1, \(x) paste(class(x), collapse =", "))),Values =unname(sapply(syn_data_1, \(x) paste(head(unique(x), 10), collapse =", "))),row.names =NULL,check.names =FALSE)# Remove duplicated variables if anydf_syn <- df_syn[!duplicated(df_syn$Variable), ]knitr::kable( df_syn,caption ="Structure of the Heart Failure Dataset: Variables, Types, and Values",align =c("l","l","l"),row.names =FALSE)
Structure of the Heart Failure Dataset: Variables, Types, and Values
The Classification and Regression Trees (CART) method, a non-parametric approach, is useful for imputation when the relationships between variables are complex or non-linear. By leveraging decision trees, CART is capable of capturing intricate patterns in the data without making strong assumptions about the underlying distributions. Here’s the preview of the first 10 rows of the synthetic data generated.
Show code
# Compute the 5th and 95th percentiles for the real datapercentiles <-compute_percentiles(heart_failure)# Generate synthetic data using CART imputation with MICEwhere <-make.where(heart_failure, "all")method <-make.method(heart_failure, where = where)method[method =="pmm"] <-"cart"# Set the method to "cart" for non-parametric imputation# Perform multiple imputation (10 datasets)syn_cart <-mice(heart_failure, m =10, maxit =1, method = method, where = where, printFlag =FALSE)# Extract the first synthetic datasetsyn_cart_1 <-complete(syn_cart, 1)# Scale synthetic data to the 5th and 95th percentile rangessyn_cart_1 <-scale_to_percentiles(syn_cart_1, percentiles)# Apply the real missingness pattern to the synthetic datasetsyn_cart_1 <-generate_missingness_based_on_real(heart_failure, syn_cart_1)# Round off all numeric columns to 0 decimal placessyn_cart_1 <- syn_cart_1 %>%mutate(across(where(is.numeric), round, digits =0))# Add a prefix 'synth_' to all synthetic dataset variable namescolnames(syn_cart_1) <-paste("synth", colnames(syn_cart_1), sep ="_")# Display the structure of the dataset (fix double variable issue)df_cart <-data.frame(Variable =names(syn_cart_1),Type =unname(sapply(syn_cart_1, \(x) paste(class(x), collapse =", "))),Values =unname(sapply(syn_cart_1, \(x) paste(head(unique(x), 10), collapse =", "))),row.names =NULL,check.names =FALSE)# Remove duplicated variables if anydf_cart <- df_cart[!duplicated(df_cart$Variable), ]knitr::kable( df_cart,caption ="Structure of the Heart Failure Dataset: Variables, Types, and Values (CART Imputation)",align =c("l","l","l"),row.names =FALSE)
Structure of the Heart Failure Dataset: Variables, Types, and Values (CART Imputation)
The Synthpop package is designed for creating synthetic data based on the distribution of the real dataset. It is particularly useful for privacy-preserving data sharing, where the aim is to create a synthetic dataset that mimics the real data without exposing sensitive information. Here’s the preview of the first 10 rows of the synthetic data generated.
Show code
# Compute the 5th and 95th percentiles for the real datapercentiles <-compute_percentiles(heart_failure)# Generate low-fidelity synthetic data using random samplingsyn_data_low_fidelity <-syn(heart_failure, method ="sample", seed =123)
Synthesis
-----------
age anaemia creatinine_phosphokinase diabetes ejection_fraction platelets serum_creatinine serum_sodium sex smoking
hypertension deceased follow_up
Show code
# Convert the synthetic dataset to a data framesyn_data_low_fidelity_synthpop <- syn_data_low_fidelity$syn# Scale the synthetic data to the 5th and 95th percentile rangessyn_data_low_fidelity_synthpop <-scale_to_percentiles(syn_data_low_fidelity_synthpop, percentiles)# Apply the real missingness pattern to the synthetic datasetsyn_data_low_fidelity_synthpop <-generate_missingness_based_on_real(heart_failure, syn_data_low_fidelity_synthpop)# Round off all numeric columns to 0 decimal placessyn_data_low_fidelity_synthpop <- syn_data_low_fidelity_synthpop %>%mutate(across(where(is.numeric), round, digits =0))# Add a prefix 'synth_' to all synthetic dataset variable namescolnames(syn_data_low_fidelity_synthpop) <-paste("synth", colnames(syn_data_low_fidelity_synthpop), sep ="_")# Display the structure of the dataset (fix double variable issue)df_synthpop <-data.frame(Variable =names(syn_data_low_fidelity_synthpop),Type =unname(sapply(syn_data_low_fidelity_synthpop, \(x) paste(class(x), collapse =", "))),Values =unname(sapply(syn_data_low_fidelity_synthpop, \(x) paste(head(unique(x), 10), collapse =", "))),row.names =NULL,check.names =FALSE)# Remove duplicated variables if anydf_synthpop <- df_synthpop[!duplicated(df_synthpop$Variable), ]knitr::kable( df_synthpop,caption ="Structure of the Heart Failure Dataset: Variables, Types, and Values (Synthpop)",align =c("l","l","l"),row.names =FALSE)
Structure of the Heart Failure Dataset: Variables, Types, and Values (Synthpop)
This method uses a data dictionary (metadata) to drive the generation of synthetic data. The metadata defines the structure and types of variables, which ensures that the generated synthetic data conforms to expected formats, such as ranges for continuous variables and categories for binary or categorical variables. Here’s the preview of the first 10 rows of the synthetic data generated.
Show code
# Compute the 5th and 95th percentiles for the real datapercentiles <-compute_percentiles(heart_failure)# Generate synthetic data using metadata# Load the metadata from the Excel fileheart_failure_metadata <-read_excel(here("data", "heart_failure_metadata.xlsx"))# Create an empty data frame based on the metadata structuren_rows <-nrow(heart_failure) # Number of synthetic records to generatevariable_names <- heart_failure_metadata$`Variable Name`data_types <- heart_failure_metadata$Typesyn_data_metadata <-data.frame(matrix(ncol =length(variable_names), nrow = n_rows))colnames(syn_data_metadata) <- variable_names# Function to generate data based on metadatagenerate_data <-function(variable, real_data, type, n) {if (type =="Continuous"|| type =="Integer") {if (is.numeric(real_data[[variable]])) {# Compute 5th and 95th percentiles for numeric variables p5 <-quantile(real_data[[variable]], probs =0.05, na.rm =TRUE) p95 <-quantile(real_data[[variable]], probs =0.95, na.rm =TRUE)if (type =="Continuous") {return(runif(n, min = p5, max = p95)) } elseif (type =="Integer") {return(sample(floor(p5):ceiling(p95), n, replace =TRUE)) } } } elseif (is.factor(real_data[[variable]])) {return(sample(levels(real_data[[variable]]), n, replace =TRUE)) } elseif (variable %in%c("anaemia", "diabetes", "hypertension", "smoking", "sex")) {return(sample(unique(real_data[[variable]]), n, replace =TRUE)) } else {return(rep(NA, n)) }}# Generate synthetic data based on the metadatafor (i inseq_along(variable_names)) { syn_data_metadata[[i]] <-generate_data(variable_names[i], heart_failure, data_types[i], n_rows)}# Scale the synthetic data to the 5th and 95th percentile rangessyn_data_metadata <-scale_to_percentiles(syn_data_metadata, percentiles)# Apply the real missingness pattern to the synthetic datasetsyn_data_metadata <-generate_missingness_based_on_real(heart_failure, syn_data_metadata)# Round off all numeric columns to 0 decimal placessyn_data_metadata <- syn_data_metadata %>%mutate(across(where(is.numeric), round, digits =0))# Add a prefix 'synth_' to all synthetic dataset variable namescolnames(syn_data_metadata) <-paste("synth", colnames(syn_data_metadata), sep ="_")# Display the structure of the dataset (fix double variable issue)df_metadata <-data.frame(Variable =names(syn_data_metadata),Type =unname(sapply(syn_data_metadata, \(x) paste(class(x), collapse =", "))),Values =unname(sapply(syn_data_metadata, \(x) paste(head(unique(x), 10), collapse =", "))),row.names =NULL,check.names =FALSE)# Remove duplicated variables if anydf_metadata <- df_metadata[!duplicated(df_metadata$Variable), ]knitr::kable( df_metadata,caption ="Structure of the Heart Failure Dataset: Variables, Types, and Values (Metadata)",align =c("l","l","l"),row.names =FALSE)
Structure of the Heart Failure Dataset: Variables, Types, and Values (Metadata)
In this section, we generate tables to identify which columns in the synthetic datasets are synthetic (prefixed with “synth_”) and which remain unchanged from the real dataset. This helps ensure clarity on which variables have been altered during the synthetic data generation process.
For each dataset, we produce a table displaying the column names and whether the column is synthetic. A column is marked “Yes” if it was generated synthetically and “No” if it was retained from the real data.
Show code
# Function to create column info with dataset namecreate_column_info <-function(dataset, dataset_name) {data.frame(column_name =colnames(dataset),dataset = dataset_name,is_synthetic =ifelse(grepl("^synth_", colnames(dataset)), "Yes", "No"),stringsAsFactors =FALSE )}# Collect column info for all datasetsall_column_info <-bind_rows(create_column_info(syn_data_1, "Parametric MICE"),create_column_info(syn_cart_1, "CART Imputation"),create_column_info(syn_data_low_fidelity_synthpop, "Synthpop (Low Fidelity)"),create_column_info(syn_data_metadata, "Metadata-Based"))# Reshape into wide format so variables appear oncewide_column_info <- all_column_info %>%pivot_wider(names_from = dataset,values_from = is_synthetic )# Display the combined wide tablekable( wide_column_info,caption ="Synthetic Indicators for Each Variable Across Datasets")
Synthetic Indicators for Each Variable Across Datasets
column_name
Parametric MICE
CART Imputation
Synthpop (Low Fidelity)
Metadata-Based
synth_age
Yes
Yes
Yes
Yes
synth_anaemia
Yes
Yes
Yes
Yes
synth_creatinine_phosphokinase
Yes
Yes
Yes
Yes
synth_diabetes
Yes
Yes
Yes
Yes
synth_ejection_fraction
Yes
Yes
Yes
Yes
synth_platelets
Yes
Yes
Yes
Yes
synth_serum_creatinine
Yes
Yes
Yes
Yes
synth_serum_sodium
Yes
Yes
Yes
Yes
synth_sex
Yes
Yes
Yes
Yes
synth_smoking
Yes
Yes
Yes
Yes
synth_hypertension
Yes
Yes
Yes
Yes
synth_deceased
Yes
Yes
Yes
Yes
synth_follow_up
Yes
Yes
Yes
Yes
5 Synthetic Dataset Structure
5.1 Dataset Size and Structure Comparison
This step evaluates whether the synthetic datasets match the real dataset in terms of the number of rows, columns, and feature types. Consistency in these dimensions is essential for the synthetic data to be a reliable stand-in for the real.
Number of Rows: Indicates if the synthetic data captures the same number of records as the real. Any mismatch can impact comparability.
Number of Columns: Ensures all variables are retained. Missing or extra columns suggest structural issues.
Feature Types: Confirms that categorical, numerical, and other variable types remain unchanged, preserving analytical consistency.
Show code
# Function to extract structure infoget_structure_info <-function(real_data, synthetic_data, dataset_name) {tibble(Dataset = dataset_name,Rows_Real =nrow(real_data),Cols_Real =ncol(real_data),Rows_Synthetic =nrow(synthetic_data),Cols_Synthetic =ncol(synthetic_data),Structure_Match =ifelse(nrow(real_data) ==nrow(synthetic_data) &ncol(real_data) ==ncol(synthetic_data),"✅ Yes", "❌ No" ) )}# Collect resultsstructure_comparison <-bind_rows(get_structure_info(heart_failure, syn_data_1, "Parametric MICE"),get_structure_info(heart_failure, syn_cart_1, "CART Imputation"),get_structure_info(heart_failure, syn_data_low_fidelity_synthpop, "Synthpop (Low Fidelity)"),get_structure_info(heart_failure, syn_data_metadata, "Metadata-Based"))# Display as a neat tablekable( structure_comparison,caption ="Comparison of Real vs Synthetic Dataset Sizes and Structures")
Comparison of Real vs Synthetic Dataset Sizes and Structures
Dataset
Rows_Real
Cols_Real
Rows_Synthetic
Cols_Synthetic
Structure_Match
Parametric MICE
299
13
299
13
✅ Yes
CART Imputation
299
13
299
13
✅ Yes
Synthpop (Low Fidelity)
299
13
299
13
✅ Yes
Metadata-Based
299
13
299
13
✅ Yes
5.2 Categorical Variables: All Levels Maintained and Comparison of Distribution
This section evaluates how well the synthetic datasets replicate the categorical variables in the real dataset, focusing on the preservation of all levels and their distributions.
Level Matching: For each categorical variable, the synthetic dataset should retain all levels (categories) present in the real dataset. Missing levels suggest that the synthetic data may be incomplete, while additional levels could indicate errors in data generation.
Distribution Comparison: The frequency of each categorical level in the synthetic data should closely mirror that of the real dataset. Significant deviations in category frequencies suggest that the synthetic data does not fully capture the true distribution of the categorical features.
Comparison of Categorical Variable Levels in Real vs Synthetic Datasets
Variable
Dataset
Real_Levels
Synthetic_Levels
Match
anaemia
Parametric MICE
No, Yes
No, Yes
✅ Yes
diabetes
Parametric MICE
No, Yes
No, Yes
✅ Yes
sex
Parametric MICE
Male, Female
Male, Female
✅ Yes
smoking
Parametric MICE
No, Yes
Yes, No
✅ Yes
hypertension
Parametric MICE
Yes, No
Yes, No
✅ Yes
deceased
Parametric MICE
Yes, No
Yes, No
✅ Yes
anaemia
CART Imputation
No, Yes
No, Yes
✅ Yes
diabetes
CART Imputation
No, Yes
No, Yes
✅ Yes
sex
CART Imputation
Male, Female
Female, Male
✅ Yes
smoking
CART Imputation
No, Yes
Yes, No
✅ Yes
hypertension
CART Imputation
Yes, No
No, Yes
✅ Yes
deceased
CART Imputation
Yes, No
Yes, No
✅ Yes
anaemia
Synthpop (Low Fidelity)
No, Yes
Yes, No
✅ Yes
diabetes
Synthpop (Low Fidelity)
No, Yes
No, Yes
✅ Yes
sex
Synthpop (Low Fidelity)
Male, Female
Male, Female
✅ Yes
smoking
Synthpop (Low Fidelity)
No, Yes
Yes, No
✅ Yes
hypertension
Synthpop (Low Fidelity)
Yes, No
Yes, No
✅ Yes
deceased
Synthpop (Low Fidelity)
Yes, No
Yes, No
✅ Yes
anaemia
Metadata-Based
No, Yes
Yes, No
✅ Yes
diabetes
Metadata-Based
No, Yes
Yes, No
✅ Yes
sex
Metadata-Based
Male, Female
Female, Male
✅ Yes
smoking
Metadata-Based
No, Yes
Yes, No
✅ Yes
hypertension
Metadata-Based
Yes, No
Yes, No
✅ Yes
deceased
Metadata-Based
Yes, No
No, Yes
✅ Yes
5.3 Numeric Variables: Range and Distribution Comparison
This analysis evaluates how well the numeric variables in the synthetic datasets replicate the real data in terms of range and distribution.
Range: For each numeric variable, the synthetic data should maintain values within the range of the real dataset. Any deviations, especially outliers, could indicate inaccuracies in the synthetic generation process.
Distribution Comparison: Density plots are used to compare the shape of the distributions between the real and synthetic datasets. Ideally, the synthetic data should closely mirror the real distribution, indicating that the underlying data generation model has captured the true variability of the numeric features.
Numeric variables: real ranges & percentiles vs. synthetic ranges (with P5–P95 containment check)
Within_P5_P95
Dataset
Variable
Real_Min
Real_P5
Real_P95
Real_Max
Synth_Min
Synth_Max
❌ No
Parametric MICE
age
40.0
42.15
82.0
95.0
42
82
❌ No
Parametric MICE
creatinine_phosphokinase
23.0
59.20
2220.8
7861.0
59
2221
✅ Yes
Parametric MICE
ejection_fraction
14.0
20.00
60.0
80.0
20
60
✅ Yes
Parametric MICE
platelets
25100.0
132200.00
421200.0
850000.0
132200
421200
❌ No
Parametric MICE
serum_creatinine
0.5
0.70
2.9
9.4
1
3
✅ Yes
Parametric MICE
serum_sodium
113.0
130.00
144.0
148.0
130
144
✅ Yes
Parametric MICE
follow_up
4.0
12.90
250.0
285.0
13
250
❌ No
CART Imputation
age
40.0
42.15
82.0
95.0
42
82
❌ No
CART Imputation
creatinine_phosphokinase
23.0
59.20
2220.8
7861.0
59
2221
✅ Yes
CART Imputation
ejection_fraction
14.0
20.00
60.0
80.0
20
60
✅ Yes
CART Imputation
platelets
25100.0
132200.00
421200.0
850000.0
132200
421200
❌ No
CART Imputation
serum_creatinine
0.5
0.70
2.9
9.4
1
3
✅ Yes
CART Imputation
serum_sodium
113.0
130.00
144.0
148.0
130
144
✅ Yes
CART Imputation
follow_up
4.0
12.90
250.0
285.0
13
250
❌ No
Synthpop (Low Fidelity)
age
40.0
42.15
82.0
95.0
42
82
❌ No
Synthpop (Low Fidelity)
creatinine_phosphokinase
23.0
59.20
2220.8
7861.0
59
2221
✅ Yes
Synthpop (Low Fidelity)
ejection_fraction
14.0
20.00
60.0
80.0
20
60
✅ Yes
Synthpop (Low Fidelity)
platelets
25100.0
132200.00
421200.0
850000.0
132200
421200
❌ No
Synthpop (Low Fidelity)
serum_creatinine
0.5
0.70
2.9
9.4
1
3
✅ Yes
Synthpop (Low Fidelity)
serum_sodium
113.0
130.00
144.0
148.0
130
144
✅ Yes
Synthpop (Low Fidelity)
follow_up
4.0
12.90
250.0
285.0
13
250
❌ No
Metadata-Based
age
40.0
42.15
82.0
95.0
42
82
✅ Yes
Metadata-Based
creatinine_phosphokinase
23.0
59.20
2220.8
7861.0
67
2216
✅ Yes
Metadata-Based
ejection_fraction
14.0
20.00
60.0
80.0
20
60
✅ Yes
Metadata-Based
platelets
25100.0
132200.00
421200.0
850000.0
134054
421118
❌ No
Metadata-Based
serum_creatinine
0.5
0.70
2.9
9.4
1
3
✅ Yes
Metadata-Based
serum_sodium
113.0
130.00
144.0
148.0
130
144
✅ Yes
Metadata-Based
follow_up
4.0
12.90
250.0
285.0
13
249
5.4 Missingness Comparison
In this section, we compare the proportions of missing values between the real dataset and each synthetic dataset. This comparison helps determine whether the synthetic datasets accurately represent the patterns of missingness in the real data.
Missingness Proportions: For each dataset, we calculate the proportion of missing values for every column and compare the results across the real and synthetic datasets.
Comparison Outcome: Ideally, the synthetic datasets should exhibit similar missingness proportions to the real data. Any deviations may indicate differences in how missing values were handled during the synthetic data generation process.
5.4.1 Introduction Plots
Show code
# Custom palette for the 5 metricsmetric_palette <-c("Discrete Columns"="#B0D99B","Continuous Columns"="#528AA8","All Missing Columns"="#FFB6DB","Complete Rows"="#264653","Missing Observations"="#2A9D8F")# Strip 'synth_' for fair visualsstrip_synth_prefix <-function(df) { names(df) <-sub("^synth_", "", names(df)) df }# Datasets to comparedatasets <-list("Real Data"= heart_failure,"Parametric MICE"=strip_synth_prefix(syn_data_1),"CART Imputation"=strip_synth_prefix(syn_cart_1),"Synthpop (Low Fidelity)"=strip_synth_prefix(syn_data_low_fidelity_synthpop),"Metadata-Based"=strip_synth_prefix(syn_data_metadata))# Function: horizontal bar chart for the 5 metricsmake_intro_plot <-function(df, title) { n_rows <-nrow(df) n_cols <-ncol(df) pct_discrete <-100*sum(vapply(df, is.factor, logical(1))) / n_cols pct_continuous <-100*sum(vapply(df, is.numeric, logical(1))) / n_cols pct_allmisscol <-100*sum(colSums(is.na(df)) == n_rows) / n_cols pct_completer <-100*sum(complete.cases(df)) / n_rows pct_missobs <-100*sum(is.na(df)) / (n_rows * n_cols) plot_df <- tibble::tibble(Metric =factor(c("Discrete Columns", "Continuous Columns", "All Missing Columns","Complete Rows", "Missing Observations"),levels =rev(c("Discrete Columns", "Continuous Columns", "All Missing Columns","Complete Rows", "Missing Observations")) # reverse for nicer order ),Percentage =c(pct_discrete, pct_continuous, pct_allmisscol, pct_completer, pct_missobs) )ggplot(plot_df, aes(x = Percentage, y = Metric, fill = Metric)) +geom_col(alpha =0.9, show.legend =FALSE) +geom_text(aes(label =paste0(round(Percentage, 1), "%")),hjust =-0.1, size =3.5) +scale_x_continuous(labels = scales::percent_format(scale =1),limits =c(0, 100),expand =expansion(mult =c(0, 0.05)) # headroom for labels ) +scale_fill_manual(values = metric_palette) +labs(title = title, x ="Percentage", y =NULL) +theme_light() +theme(axis.text.y =element_text(size =9),plot.title =element_text(face ="bold") )}# Build plots and combine in one gridwrap_plots(imap(datasets, ~make_intro_plot(.x, .y)),ncol =2) +plot_annotation(title ="Dataset Overview: Discrete/Continuous Columns, All-Missing Columns, Complete Rows, Missing Observations",subtitle ="Percentages shown per dataset; axes standardized to 0–100%",theme =theme(plot.title =element_text(face ="bold", size =12)) )
5.4.2 Missingness Proportions and Comparison Outcome
Show code
# Remove 'synth_' prefix so names match the real datasetstrip_synth_prefix <-function(df) {names(df) <-sub("^synth_", "", names(df)) df}# Compare against the real dataset's variable setcore_vars <-names(heart_failure)# Function to compute missingness proportions (2 d.p.) for one datasetget_missingness <-function(df, label) { df2 <- df %>%strip_synth_prefix()# keep only variables present in the real data (prevents mismatches) df2 <- df2 %>%select(any_of(core_vars))tibble(Variable =names(df2),Proportion =round(colSums(is.na(df2)) /nrow(df2), 2),Dataset = label )}# Build tidy tablemissing_tbl <-bind_rows(get_missingness(heart_failure, "Real Data"),get_missingness(syn_data_1, "Parametric MICE"),get_missingness(syn_cart_1, "CART Imputation"),get_missingness(syn_data_low_fidelity_synthpop, "Synthpop (Low Fidelity)"),get_missingness(syn_data_metadata, "Metadata-Based"))# Pivot wider for side-by-side comparisonmissing_tbl_wide <- missing_tbl %>%pivot_wider(names_from = Dataset, values_from = Proportion) %>%arrange(Variable)# Optional: add quick match flags (✅ if identical to Real Data)# missing_tbl_wide <- missing_tbl_wide %>%# mutate(# `Parametric MICE Match` = ifelse(`Parametric MICE` == `Real Data`, "✅", "❌"),# `CART Imputation Match` = ifelse(`CART Imputation` == `Real Data`, "✅", "❌"),# `Synthpop (Low Fidelity) Match` = ifelse(`Synthpop (Low Fidelity)` == `Real Data`, "✅", "❌"),# `Metadata-Based Match` = ifelse(`Metadata-Based` == `Real Data`, "✅", "❌")# )kable( missing_tbl_wide,caption ="Missingness Proportions by Variable (rounded to 2 d.p.)")
Missingness Proportions by Variable (rounded to 2 d.p.)
Variable
Real Data
Parametric MICE
CART Imputation
Synthpop (Low Fidelity)
Metadata-Based
age
0.05
0.05
0.05
0.09
0.05
anaemia
0.00
0.00
0.00
0.00
0.00
creatinine_phosphokinase
0.05
0.05
0.05
0.10
0.05
deceased
0.00
0.00
0.00
0.00
0.00
diabetes
0.00
0.00
0.00
0.00
0.00
ejection_fraction
0.04
0.04
0.04
0.05
0.04
follow_up
0.00
0.00
0.00
0.00
0.00
hypertension
0.00
0.00
0.00
0.00
0.00
platelets
0.05
0.05
0.05
0.10
0.05
serum_creatinine
0.06
0.06
0.06
0.12
0.06
serum_sodium
0.07
0.07
0.07
0.11
0.07
sex
0.00
0.00
0.00
0.00
0.00
smoking
0.00
0.00
0.00
0.00
0.00
5.4.3 Missingness Maps
Show code
# Generate missingness map for the real datasetreal_missing_map <-aggr(heart_failure, col =c("#B0D99B", "#528AA8"),numbers =TRUE, sortVars =TRUE,labels =names(heart_failure), cex.axis = .7,gap =3, ylab =c("Missing Data", "Pattern - real Dataset"))
Variables sorted by number of missings:
Variable Count
serum_sodium 0.06688963
serum_creatinine 0.06020067
age 0.05016722
creatinine_phosphokinase 0.04682274
platelets 0.04682274
ejection_fraction 0.04013378
anaemia 0.00000000
diabetes 0.00000000
sex 0.00000000
smoking 0.00000000
hypertension 0.00000000
deceased 0.00000000
follow_up 0.00000000
In this section, we assess how well the correlations between numeric variables are preserved in the synthetic datasets compared to the real dataset. By comparing the correlation matrices, we evaluate whether the relationships between variables in the real data are reflected in the synthetic versions. This assessment is essential because maintaining the real data’s correlation structure ensures that any statistical or machine learning models trained on synthetic data will behave similarly to those trained on real data.
Correlation Matrix: A correlation matrix quantifies the strength and direction of relationships between pairs of numeric variables. Each cell in the matrix represents the correlation coefficient between two variables, which can range from -1 (perfect negative correlation) to +1 (perfect positive correlation). It is crucial that synthetic data retains these relationships to ensure that any analyses based on these dependencies remain valid.
Comparison Process: The correlation matrices for both the real and synthetic datasets are computed and Visualised. Ideally, the structure and strength of correlations in the synthetic datasets should closely match those in the real dataset. Differences in correlation strength or direction indicate that the synthetic data may not fully capture the underlying relationships, which can affect the validity of any conclusions drawn from analyses using the synthetic data.
Maintaining similar correlation structures is particularly important for use cases where relationships between variables play a crucial role, such as in predictive modeling, feature selection, and understanding variable dependencies. The following sections outline different methods used to compare correlation matrices between the real and synthetic datasets.
5.5.1 Correlation Matrices Comparison using psych
The psych package provides tools for analyzing and visualizing correlation matrices. The corr.test() function is particularly useful for computing correlation coefficients while handling missing data and providing additional statistics, such as confidence intervals.
Show code
# Strip 'synth_' so names line up with realstrip_synth_prefix <-function(df) {names(df) <-sub("^synth_", "", names(df)) df}# Build a tidy table of pairwise correlations for real vs one synthetic datasetcorr_compare_one <-function(real, synth, label) { real_num <- real |>select(where(is.numeric)) synth_num <- synth |>strip_synth_prefix() |>select(where(is.numeric)) common <-intersect(names(real_num), names(synth_num))if (length(common) <2) {return(tibble(Dataset = label, Var1 =character(), Var2 =character(),Real =numeric(), Synthetic =numeric(), Diff =numeric(), AbsDiff =numeric())) }# Compute Pearson correlations with pairwise complete observations rc <-cor(real_num[common], use ="pairwise.complete.obs", method ="pearson") sc <-cor(synth_num[common], use ="pairwise.complete.obs", method ="pearson")# Keep only upper triangle (unique pairs) ut <-upper.tri(rc, diag =FALSE) pairs <-which(ut, arr.ind =TRUE)tibble(Var1 =colnames(rc)[pairs[, 1]],Var2 =colnames(rc)[pairs[, 2]],Real = rc[pairs],Synthetic = sc[pairs] ) |>mutate(Dataset = label,Diff = Synthetic - Real,AbsDiff =abs(Diff) ) |>select(Dataset, Var1, Var2, Real, Synthetic, Diff, AbsDiff)}# Build one combined table for all synthetic datasetscorr_cmp_tbl <-bind_rows(corr_compare_one(heart_failure, syn_data_1, "Parametric MICE"),corr_compare_one(heart_failure, syn_cart_1, "CART Imputation"),corr_compare_one(heart_failure, syn_data_low_fidelity_synthpop, "Synthpop (Low Fidelity)"),corr_compare_one(heart_failure, syn_data_metadata, "Metadata-Based")) |>arrange(Dataset, desc(AbsDiff)) |>mutate(Real =round(Real, 2),Synthetic =round(Synthetic, 2),Diff =round(Diff, 2),AbsDiff =round(AbsDiff, 2) )knitr::kable( corr_cmp_tbl,caption ="Pairwise Correlations: Real vs. Synthetic (sorted by largest absolute difference per dataset)",align ="lllrrrr")
Pairwise Correlations: Real vs. Synthetic (sorted by largest absolute difference per dataset)
Dataset
Var1
Var2
Real
Synthetic
Diff
AbsDiff
CART Imputation
creatinine_phosphokinase
platelets
0.01
-0.12
-0.13
0.13
CART Imputation
ejection_fraction
serum_creatinine
-0.01
-0.11
-0.10
0.10
CART Imputation
creatinine_phosphokinase
serum_sodium
0.06
-0.04
-0.10
0.10
CART Imputation
age
follow_up
-0.24
-0.15
0.09
0.09
CART Imputation
ejection_fraction
platelets
0.06
-0.02
-0.08
0.08
CART Imputation
age
serum_creatinine
0.17
0.25
0.08
0.08
CART Imputation
platelets
serum_creatinine
-0.05
0.03
0.08
0.08
CART Imputation
ejection_fraction
follow_up
0.03
-0.05
-0.08
0.08
CART Imputation
serum_sodium
follow_up
0.11
0.03
-0.07
0.07
CART Imputation
age
creatinine_phosphokinase
-0.08
-0.15
-0.07
0.07
CART Imputation
platelets
serum_sodium
0.06
-0.01
-0.07
0.07
CART Imputation
creatinine_phosphokinase
follow_up
-0.03
0.03
0.06
0.06
CART Imputation
platelets
follow_up
-0.01
0.05
0.06
0.06
CART Imputation
age
platelets
-0.05
-0.01
0.04
0.04
CART Imputation
age
serum_sodium
-0.04
-0.08
-0.04
0.04
CART Imputation
age
ejection_fraction
0.04
0.00
-0.04
0.04
CART Imputation
creatinine_phosphokinase
ejection_fraction
-0.05
-0.02
0.03
0.03
CART Imputation
creatinine_phosphokinase
serum_creatinine
0.00
-0.02
-0.02
0.02
CART Imputation
serum_creatinine
serum_sodium
-0.21
-0.23
-0.02
0.02
CART Imputation
ejection_fraction
serum_sodium
0.17
0.18
0.02
0.02
CART Imputation
serum_creatinine
follow_up
-0.15
-0.14
0.01
0.01
Metadata-Based
age
follow_up
-0.24
0.12
0.36
0.36
Metadata-Based
platelets
serum_sodium
0.06
-0.16
-0.23
0.23
Metadata-Based
serum_creatinine
serum_sodium
-0.21
0.01
0.21
0.21
Metadata-Based
creatinine_phosphokinase
ejection_fraction
-0.05
0.11
0.16
0.16
Metadata-Based
ejection_fraction
serum_sodium
0.17
0.03
-0.13
0.13
Metadata-Based
creatinine_phosphokinase
serum_creatinine
0.00
-0.12
-0.13
0.13
Metadata-Based
creatinine_phosphokinase
follow_up
-0.03
0.09
0.12
0.12
Metadata-Based
age
serum_creatinine
0.17
0.07
-0.10
0.10
Metadata-Based
serum_sodium
follow_up
0.11
0.00
-0.10
0.10
Metadata-Based
age
serum_sodium
-0.04
0.06
0.10
0.10
Metadata-Based
serum_creatinine
follow_up
-0.15
-0.05
0.09
0.09
Metadata-Based
creatinine_phosphokinase
platelets
0.01
-0.08
-0.09
0.09
Metadata-Based
ejection_fraction
platelets
0.06
-0.02
-0.08
0.08
Metadata-Based
age
platelets
-0.05
0.02
0.07
0.07
Metadata-Based
age
ejection_fraction
0.04
-0.03
-0.07
0.07
Metadata-Based
ejection_fraction
follow_up
0.03
0.08
0.05
0.05
Metadata-Based
creatinine_phosphokinase
serum_sodium
0.06
0.11
0.05
0.05
Metadata-Based
age
creatinine_phosphokinase
-0.08
-0.05
0.03
0.03
Metadata-Based
ejection_fraction
serum_creatinine
-0.01
0.01
0.02
0.02
Metadata-Based
platelets
follow_up
-0.01
-0.02
-0.01
0.01
Metadata-Based
platelets
serum_creatinine
-0.05
-0.06
-0.01
0.01
Parametric MICE
serum_creatinine
serum_sodium
-0.21
0.02
0.23
0.23
Parametric MICE
ejection_fraction
follow_up
0.03
0.20
0.17
0.17
Parametric MICE
age
follow_up
-0.24
-0.08
0.16
0.16
Parametric MICE
ejection_fraction
serum_sodium
0.17
0.03
-0.13
0.13
Parametric MICE
serum_sodium
follow_up
0.11
-0.01
-0.11
0.11
Parametric MICE
platelets
serum_creatinine
-0.05
0.06
0.11
0.11
Parametric MICE
creatinine_phosphokinase
serum_creatinine
0.00
-0.11
-0.11
0.11
Parametric MICE
age
serum_creatinine
0.17
0.08
-0.09
0.09
Parametric MICE
creatinine_phosphokinase
platelets
0.01
-0.07
-0.08
0.08
Parametric MICE
age
serum_sodium
-0.04
-0.11
-0.07
0.07
Parametric MICE
age
creatinine_phosphokinase
-0.08
-0.02
0.06
0.06
Parametric MICE
creatinine_phosphokinase
follow_up
-0.03
0.03
0.06
0.06
Parametric MICE
platelets
follow_up
-0.01
0.04
0.05
0.05
Parametric MICE
creatinine_phosphokinase
ejection_fraction
-0.05
-0.01
0.04
0.04
Parametric MICE
serum_creatinine
follow_up
-0.15
-0.11
0.04
0.04
Parametric MICE
platelets
serum_sodium
0.06
0.03
-0.03
0.03
Parametric MICE
creatinine_phosphokinase
serum_sodium
0.06
0.03
-0.03
0.03
Parametric MICE
ejection_fraction
serum_creatinine
-0.01
-0.04
-0.03
0.03
Parametric MICE
age
platelets
-0.05
-0.03
0.02
0.02
Parametric MICE
age
ejection_fraction
0.04
0.05
0.01
0.01
Parametric MICE
ejection_fraction
platelets
0.06
0.06
0.00
0.00
Synthpop (Low Fidelity)
age
follow_up
-0.24
0.04
0.28
0.28
Synthpop (Low Fidelity)
serum_creatinine
serum_sodium
-0.21
0.00
0.20
0.20
Synthpop (Low Fidelity)
age
serum_creatinine
0.17
-0.03
-0.20
0.20
Synthpop (Low Fidelity)
ejection_fraction
serum_sodium
0.17
-0.02
-0.19
0.19
Synthpop (Low Fidelity)
age
serum_sodium
-0.04
0.11
0.15
0.15
Synthpop (Low Fidelity)
serum_creatinine
follow_up
-0.15
0.00
0.15
0.15
Synthpop (Low Fidelity)
serum_sodium
follow_up
0.11
-0.03
-0.14
0.14
Synthpop (Low Fidelity)
age
creatinine_phosphokinase
-0.08
0.03
0.12
0.12
Synthpop (Low Fidelity)
creatinine_phosphokinase
ejection_fraction
-0.05
0.05
0.10
0.10
Synthpop (Low Fidelity)
ejection_fraction
platelets
0.06
0.15
0.09
0.09
Synthpop (Low Fidelity)
platelets
follow_up
-0.01
0.08
0.09
0.09
Synthpop (Low Fidelity)
ejection_fraction
follow_up
0.03
0.09
0.06
0.06
Synthpop (Low Fidelity)
creatinine_phosphokinase
platelets
0.01
-0.05
-0.06
0.06
Synthpop (Low Fidelity)
age
ejection_fraction
0.04
-0.02
-0.06
0.06
Synthpop (Low Fidelity)
ejection_fraction
serum_creatinine
-0.01
-0.06
-0.05
0.05
Synthpop (Low Fidelity)
creatinine_phosphokinase
serum_sodium
0.06
0.01
-0.05
0.05
Synthpop (Low Fidelity)
creatinine_phosphokinase
follow_up
-0.03
0.02
0.04
0.04
Synthpop (Low Fidelity)
platelets
serum_creatinine
-0.05
-0.08
-0.03
0.03
Synthpop (Low Fidelity)
creatinine_phosphokinase
serum_creatinine
0.00
-0.02
-0.03
0.03
Synthpop (Low Fidelity)
platelets
serum_sodium
0.06
0.08
0.02
0.02
Synthpop (Low Fidelity)
age
platelets
-0.05
-0.04
0.01
0.01
5.5.2 Correlation Matrices Comparison using Corrplot
The corrplot package is widely used for visualizing correlation matrices. It provides a variety of options for customization, making it easy to identify patterns and relationships between variables. This method is particularly effective for visual comparisons because it highlights differences in correlation strength and direction.
Show code
# Remove 'synth_' so names match the real datasetstrip_synth_prefix <-function(df) { names(df) <-sub("^synth_", "", names(df)); df }# Side-by-side correlation matrices (full width)compare_correlation_matrices <-function(real_data, synthetic_data, dataset_name) {# Numeric-only real_num <- real_data[sapply(real_data, is.numeric)] synth_num <-strip_synth_prefix(synthetic_data)[sapply(strip_synth_prefix(synthetic_data), is.numeric)]# Use common numeric variables in same order common <-intersect(names(real_num), names(synth_num))if (length(common) <2) {message(dataset_name, ": fewer than 2 common numeric columns — skipping.")return(invisible(NULL)) } real_corr <-cor(real_num[common], use ="complete.obs") synth_corr <-cor(synth_num[common], use ="complete.obs")# Make two panels across the page op <-par(no.readonly =TRUE); on.exit(par(op), add =TRUE)par(mfrow =c(1, 2), mar =c(4, 4, 4, 2))corrplot(real_corr, tl.cex =0.8, mar =c(0,0,2,0))title(paste(dataset_name, "- Real Data"), line =1)corrplot(synth_corr, tl.cex =0.8, mar =c(0,0,2,0))title(paste(dataset_name, "- Synthetic Data"), line =1)}# Run for each synthetic datasetcompare_correlation_matrices(heart_failure, syn_data_1, "Parametric MICE")
5.5.3 Correlation Matrices Comparison using GGally
The GGally package extends ggplot2 to create pairwise visualizations, including correlation matrix plots. Using ggpairs(), you can generate a matrix of scatterplots, histograms, and correlation coefficients, which provides a detailed view of the relationships between numeric variables in the real and synthetic datasets.
Show code
compare_correlation_matrices_ggally <-function(real_data, synthetic_data, dataset_name) {# Select only numeric columns from both datasets real_numeric <- real_data %>%select(where(is.numeric))# Remove prefix 'synth_' from synthetic data column namesnames(synthetic_data) <-gsub("^synth_", "", names(synthetic_data)) synthetic_numeric <- synthetic_data %>%select(where(is.numeric))# Ensure that both datasets have the same columns common_columns <-intersect(names(real_numeric), names(synthetic_numeric))if (length(common_columns) ==0) {cat("\nNo common numeric columns found for comparison in", dataset_name, "\n")return() } real_numeric <- real_numeric[, common_columns, drop =FALSE] synthetic_numeric <- synthetic_numeric[, common_columns, drop =FALSE]# Add a column to differentiate between real and synthetic data real_numeric$Dataset <-"Real" synthetic_numeric$Dataset <-"Synth"# Combine the datasets combined_data <-rbind(real_numeric, synthetic_numeric)# Create correlation matrix plot using GGally if there are valid columns to plotif (ncol(combined_data) >1) { p <-ggpairs( combined_data, aes(color = Dataset, alpha =0.5), title =paste(dataset_name, "- Correlation Matrix Comparison"),columns =1:(ncol(combined_data) -1), # Exclude the 'Dataset' column from the plotupper =list(continuous =wrap("cor", size =3, alignPercent =0.5)), # Smaller correlation labelslower =list(continuous =wrap("smooth", alpha =0.4, size =0.5)), # Smoother plots for readabilitydiag =list(continuous =wrap("densityDiag", alpha =0.5)) # Density plot with improved transparency ) +theme_minimal() +# Use a cleaner themetheme(strip.text =element_text(size =8, face ="bold"), # Smaller strip labelsaxis.text.x =element_text(angle =45, hjust =1, size =7), # Rotate x-axis labels for better readabilityaxis.text.y =element_text(size =7),legend.position ="top", # Move legend to a better locationlegend.title =element_blank() # Remove unnecessary legend title )print(p) } else {cat("\nNot enough numeric columns to generate a correlation matrix plot for", dataset_name, "\n") }}# Apply the function to each synthetic datasetcompare_correlation_matrices_ggally(heart_failure, syn_data_1, "Parametric MICE")
This section provides a comparative analysis of the descriptive statistics between the real dataset and the synthetic datasets generated using different methods.
6.1.1 Real Dataset
The descriptive statistics for the real heart failure dataset include metrics such as mean, median, standard deviation, and range. This serves as the baseline for comparison with the synthetic datasets.
Show code
# Generate descriptive statsdesc_tbl <-describe(heart_failure) %>%as.data.frame() %>%select(n, mean, sd, min, median, max) %>%round(2) %>% tibble::rownames_to_column("Variable")# Display as a neat tablekable( desc_tbl,caption ="Descriptive Statistics for the Heart Failure Dataset",align ="lrrrrrr")
Descriptive Statistics for the Heart Failure Dataset
Variable
n
mean
sd
min
median
max
age
284
60.80
12.08
40.0
60.0
95.0
anaemia*
299
1.43
0.50
1.0
1.0
2.0
creatinine_phosphokinase
285
578.81
981.99
23.0
249.0
7861.0
diabetes*
299
1.42
0.49
1.0
1.0
2.0
ejection_fraction
287
38.00
11.92
14.0
38.0
80.0
platelets
285
265163.83
97508.63
25100.0
263358.0
850000.0
serum_creatinine
281
1.39
1.05
0.5
1.1
9.4
serum_sodium
279
136.59
4.49
113.0
137.0
148.0
sex*
299
1.65
0.48
1.0
2.0
2.0
smoking*
299
1.32
0.47
1.0
1.0
2.0
hypertension*
299
1.35
0.48
1.0
1.0
2.0
deceased*
299
1.32
0.47
1.0
1.0
2.0
follow_up
299
130.26
77.61
4.0
115.0
285.0
6.1.2 Parametric Imputation (MICE)
The synthetic data generated using the MICE framework is analyzed to compare its summary statistics with the real dataset. This will help evaluate how well the parametric method captures the key characteristics of the data.
Descriptive Statistics for Parametric MICE Synthetic Dataset
Variable
n
mean
sd
min
median
max
synth_age
284
61.01
11.40
42
60
82
synth_anaemia*
299
1.45
0.50
1
1
2
synth_creatinine_phosphokinase
285
735.72
687.98
59
540
2221
synth_diabetes*
299
1.42
0.49
1
1
2
synth_ejection_fraction
287
38.44
10.70
20
38
60
synth_platelets
285
257523.78
86505.14
132200
254087
421200
synth_serum_creatinine
281
1.70
0.77
1
2
3
synth_serum_sodium
279
135.78
3.96
130
135
144
synth_sex*
299
1.64
0.48
1
2
2
synth_smoking*
299
1.29
0.45
1
1
2
synth_hypertension*
299
1.42
0.49
1
1
2
synth_deceased*
299
1.31
0.47
1
1
2
synth_follow_up
299
132.74
71.11
13
132
250
6.1.3 Non-Parametric Imputation (CART)
The CART method is a non-parametric technique that handles non-linear relationships. The descriptive statistics of the CART-imputed synthetic data provide insight into how this method captures the dataset’s distribution.
The Synthpop package generates synthetic data aimed at preserving data privacy. Descriptive statistics for this dataset help assess how well the key attributes of the real data are preserved.
Show code
# Descriptive stats for Synthpop synthetic datasetdesc_synthpop <-describe(syn_data_low_fidelity_synthpop) %>%as.data.frame() %>%select(n, mean, sd, min, median, max) %>%round(2) %>% tibble::rownames_to_column("Variable")# Display as a clean tablekable( desc_synthpop,caption ="Descriptive Statistics for Synthpop Synthetic Dataset",align ="lrrrrrr")
Descriptive Statistics for Synthpop Synthetic Dataset
Variable
n
mean
sd
min
median
max
synth_age
271
61.15
11.13
42
60
82
synth_anaemia*
299
1.47
0.50
1
1
2
synth_creatinine_phosphokinase
269
435.35
498.44
59
232
2221
synth_diabetes*
299
1.37
0.48
1
1
2
synth_ejection_fraction
285
36.73
10.93
20
35
60
synth_platelets
268
269756.34
77738.41
132200
263358
421200
synth_serum_creatinine
264
1.33
0.63
1
1
3
synth_serum_sodium
265
136.92
3.69
130
137
144
synth_sex*
299
1.65
0.48
1
2
2
synth_smoking*
299
1.28
0.45
1
1
2
synth_hypertension*
299
1.37
0.48
1
1
2
synth_deceased*
299
1.31
0.46
1
1
2
synth_follow_up
299
121.22
74.28
13
108
250
6.1.5 Metadata-Based Synthetic Data
The synthetic data generated using metadata ensures that variable structures conform to the data dictionary. The summary statistics here show how well the dataset reflects the real’s attributes based on predefined metadata.
Show code
# Descriptive stats for Metadata-Based synthetic datasetdesc_metadata <-describe(syn_data_metadata) %>%as.data.frame() %>%select(n, mean, sd, min, median, max) %>%round(2) %>% tibble::rownames_to_column("Variable")# Display as a clean tablekable( desc_metadata,caption ="Descriptive Statistics for Metadata-Based Synthetic Dataset",align ="lrrrrrr")
Descriptive Statistics for Metadata-Based Synthetic Dataset
Variable
n
mean
sd
min
median
max
synth_age
284
62.57
11.98
42
62.5
82
synth_anaemia*
299
1.52
0.50
1
2.0
2
synth_creatinine_phosphokinase
285
1129.11
594.90
67
1185.0
2216
synth_diabetes*
299
1.48
0.50
1
1.0
2
synth_ejection_fraction
287
39.81
12.08
20
38.0
60
synth_platelets
285
275686.74
79876.34
134054
280442.0
421118
synth_serum_creatinine
281
1.79
0.71
1
2.0
3
synth_serum_sodium
279
136.69
4.20
130
137.0
144
synth_sex*
299
1.49
0.50
1
1.0
2
synth_smoking*
299
1.51
0.50
1
2.0
2
synth_hypertension*
299
1.51
0.50
1
2.0
2
synth_deceased*
299
1.47
0.50
1
1.0
2
synth_follow_up
299
136.77
70.26
13
139.0
249
By comparing the descriptive statistics, you can evaluate how closely the synthetic datasets resemble the real dataset across key summary metrics. This is a critical step in assessing the quality and usability of synthetic data. Histogram Similarity Score ### Variable Exploration
6.1.6 Parametric Imputation (MICE)
The parametric imputation using the MICE framework generates synthetic data by imputing missing values based on multivariate normal distributions.
6.1.6.1 Bar Plots for Categorical Variables
To evaluate the preservation of categorical variable distributions in the synthetic data, we generate bar plots comparing the proportions of categorical variables between the real and synthetic datasets.
Show code
# Combine real and parametric MICE synthetic datasetsheart_failure$dataset <-"real"syn_data_1$dataset <-"synthetic"# Adjust the column names for synthetic data by removing the 'synth_' prefixcolnames(syn_data_1) <-gsub("synth_", "", colnames(syn_data_1))# Combine both datasets (real and synthetic)combined_data_param <-bind_rows(heart_failure, syn_data_1)# Bar plots for categorical variablesbar_plots <-colnames(heart_failure)[map_lgl(heart_failure, is.factor)] %>%map(~ggplot(combined_data_param, aes_string(.x, fill ='dataset', group ='dataset')) +geom_bar(aes(y = ..prop..), position =position_dodge2(), stat ="count") +scale_fill_manual(values =c("real"="#B0D99B", "synthetic"="#528AA8")) +labs(x = .x, y ="Proportion")) %>% patchwork::wrap_plots() +plot_annotation(title ="Comparison of Categorical Variables Between real and Synthetic Datasets",theme =theme(plot.title =element_text(hjust =0.5)))# Print the combined plot with one main titleprint(bar_plots)
6.1.6.2 Density Plots for Numeric Variables
The density plots illustrate the distribution of numeric variables between the real dataset and the parametric MICE synthetic data. The goal is to ensure that the synthetic data closely follows the real data’s numeric distribution.
Show code
# Density plots for numeric variables (parametric imputed data)density_plots_param <-colnames(heart_failure)[map_lgl(heart_failure, is.numeric)] %>%map(~ggplot(combined_data_param, aes_string(.x, fill ='dataset', group ='dataset')) +geom_density(alpha =0.5) +scale_fill_manual(values =c("real"="#B0D99B", "synthetic"="#528AA8")) +labs(x = .x, y ="Density")) %>% patchwork::wrap_plots(ncol =2) +# Arrange plots in 2 columnsplot_annotation(title ="Comparison of Numeric Variables Between real and Parametric Synthetic Data",theme =theme(plot.title =element_text(hjust =0.5)))# Print the combined density plots with one main titleprint(density_plots_param)
6.1.7 Non-Parametric Imputation (CART)
CART is a non-parametric approach for generating synthetic data that captures complex relationships between variables without making distributional assumptions.
6.1.7.1 Bar Plots for Categorical Variables
The bar plots compare the distribution of categorical variables in the real and synthetic datasets generated using the CART method.
Show code
# Combine real and non-parametric CART synthetic datasetssyn_cart_1$dataset <-"synthetic"# Adjust the column names for synthetic data by removing the 'synth_' prefixcolnames(syn_cart_1) <-gsub("synth_", "", colnames(syn_cart_1))# Combine both datasets (real and synthetic)combined_data_cart <-bind_rows(heart_failure, syn_cart_1)# Bar plots for categorical variables (CART-imputed data)bar_plots_cart <-colnames(heart_failure)[map_lgl(heart_failure, is.factor)] %>%map(~ggplot(combined_data_cart, aes_string(.x, fill ='dataset', group ='dataset')) +geom_bar(aes(y = ..prop..), position =position_dodge2(), stat ="count") +scale_fill_manual(values =c("real"="#B0D99B", "synthetic"="#528AA8")) +labs(x = .x, y ="Proportion")) %>% patchwork::wrap_plots() +plot_annotation(title ="Comparison of Categorical Variables Between real and Synthetic Data",theme =theme(plot.title =element_text(hjust =0.5)))# Print the combined plot with one main titleprint(bar_plots_cart)
6.1.7.2 Density Plots for Numeric Variables
The density plots display how well the numeric variables in the CART-imputed synthetic data align with the real dataset.
Show code
# Density plots for numeric variables (CART-imputed data)density_plots_cart <-colnames(heart_failure)[map_lgl(heart_failure, is.numeric)] %>%map(~ggplot(combined_data_cart, aes_string(.x, fill ='dataset', group ='dataset')) +geom_density(alpha =0.5) +scale_fill_manual(values =c("real"="#B0D99B", "synthetic"="#528AA8")) +labs(x = .x, y ="Density")) %>% patchwork::wrap_plots(ncol =2) +# Arrange plots in 2 columnsplot_annotation(title ="Comparison of Numeric Variables Between real and Synthetic Data",theme =theme(plot.title =element_text(hjust =0.5)))# Print the combined density plots with one main titleprint(density_plots_cart)
6.1.8 Synthpop-Based Synthetic Data
Synthpop is used to generate synthetic datasets that closely follow the distributions of the real data for privacy-preserving data sharing.
6.1.8.1 Bar Plots for Categorical Variables
These bar plots assess how well Synthpop-based synthetic data maintains the categorical distributions of the real dataset.
Show code
# Combine real and Synthpop-based synthetic datasetsheart_failure$dataset <-"real"syn_data_low_fidelity_synthpop$dataset <-"synthetic"# Adjust the column names for synthetic data by removing the 'synth_' prefixcolnames(syn_data_low_fidelity_synthpop) <-gsub("synth_", "", colnames(syn_data_low_fidelity_synthpop))# Combine both datasets (real and synthetic)combined_data_synthpop <-bind_rows(heart_failure, syn_data_low_fidelity_synthpop)# Bar plots for categorical variables (Synthpop-based synthetic data)bar_plots_synthpop <-colnames(heart_failure)[map_lgl(heart_failure, is.factor)] %>%map(~ggplot(combined_data_synthpop, aes_string(.x, fill ='dataset', group ='dataset')) +geom_bar(aes(y = ..prop..), position =position_dodge2(), stat ="count") +scale_fill_manual(values =c("real"="#B0D99B", "synthetic"="#528AA8")) +labs(x = .x, y ="Proportion")) %>% patchwork::wrap_plots() +plot_annotation(title ="Comparison of Categorical Variables Between real and Synthetic Data",theme =theme(plot.title =element_text(hjust =0.5)))# Print the combined plot with one main titleprint(bar_plots_synthpop)
6.1.8.2 Density Plots for Numeric Variables
The density plots illustrate how closely Synthpop-based synthetic data matches the real dataset for numeric variables.
Show code
# Combine real and Synthpop-based synthetic datasetsheart_failure$dataset <-"real"syn_data_low_fidelity_synthpop$dataset <-"synthetic"combined_data_synthpop <-bind_rows(heart_failure, syn_data_low_fidelity_synthpop)# Density plots for numeric variables (Synthpop-based synthetic data)density_plots_synthpop <-colnames(heart_failure)[map_lgl(heart_failure, is.numeric)] %>%map(~ggplot(combined_data_synthpop, aes_string(.x, fill ='dataset', group ='dataset')) +geom_density(alpha =0.5) +scale_fill_manual(values =c("real"="#B0D99B", "synthetic"="#528AA8")) +labs(x = .x, y ="Density")) %>% patchwork::wrap_plots(ncol =2) +# Arrange plots in 2 columnsplot_annotation(title ="Comparison of Numeric Variables Between real and Synthetic Data",theme =theme(plot.title =element_text(hjust =0.5)))# Print the combined density plots with one main titleprint(density_plots_synthpop)
6.1.9 Metadata-Based Synthetic Data
Metadata-based synthetic data generation uses a predefined data dictionary to ensure that the synthetic data follows the correct structure and distribution.
6.1.9.1 Bar Plots for Categorical Variables
We compare the categorical variables of the real and metadata-based synthetic datasets to assess distributional alignment.
Show code
# Combine real and Metadata-based synthetic datasetsheart_failure$dataset <-"real"syn_data_metadata$dataset <-"synthetic"# Adjust the column names for synthetic data by removing the 'synth_' prefixcolnames(syn_data_metadata) <-gsub("synth_", "", colnames(syn_data_metadata))# Combine both datasets (real and synthetic)combined_data_metadata <-bind_rows(heart_failure, syn_data_metadata)# Bar plots for categorical variables (Metadata-based synthetic data)bar_plots_metadata <-colnames(heart_failure)[map_lgl(heart_failure, is.factor)] %>%map(~ggplot(combined_data_metadata, aes_string(.x, fill ='dataset', group ='dataset')) +geom_bar(aes(y = ..prop..), position =position_dodge2(), stat ="count") +scale_fill_manual(values =c("real"="#B0D99B", "synthetic"="#528AA8")) +labs(x = .x, y ="Proportion")) %>% patchwork::wrap_plots() +plot_annotation(title ="Comparison of Categorical Variables Between real and Synthetic Datasets",theme =theme(plot.title =element_text(hjust =0.5)))# Print the combined plot with one main titleprint(bar_plots_metadata)
6.1.9.2 Density Plots for Numeric Variables
The density plots for numeric variables compare the distribution of numeric values between the real and metadata-based synthetic datasets.
Show code
# Combine real and Metadata-based synthetic datasetsheart_failure$dataset <-"real"syn_data_metadata$dataset <-"synthetic"combined_data_metadata <-bind_rows(heart_failure, syn_data_metadata)# Density plots for numeric variables (Metadata-based synthetic data)density_plots_metadata <-colnames(heart_failure)[map_lgl(heart_failure, is.numeric)] %>%map(~ggplot(combined_data_metadata, aes_string(.x, fill ='dataset', group ='dataset')) +geom_density(alpha =0.5) +scale_fill_manual(values =c("real"="#B0D99B", "synthetic"="#528AA8")) +labs(x = .x, y ="Density")) %>% patchwork::wrap_plots(ncol =2) +# Arrange plots in 2 columnsplot_annotation(title ="Comparison of Numeric Variables Between real and Synthetic Data",theme =theme(plot.title =element_text(hjust =0.5)))# Print the combined density plots with one main titleprint(density_plots_metadata)
6.2 Histogram Similarity Score
The Histogram Similarity Score measures how closely the marginal distributions of the synthetic data match the real data. Marginal distributions are critical for assessing whether key data characteristics such as spread, central tendency (mean/median), and shape (skewness/kurtosis) are preserved.
Ideal Ranges for Histogram Similarity:
Perfect Match: 1 (100%) indicates a perfect match between the distributions.
Good Match: 0.85–1 (85% to 100%) shows strong similarity.
Moderate Match: 0.65–0.85 suggests some differences that may need addressing.
Poor Match: Below 0.65 indicates significant deviation from the real data.
Show code
# Custom palette for datasetsdataset_palette <-c("Parametric MICE"="#B0D99B","CART Imputation"="#528AA8","Synthpop (Low Fidelity)"="#FFB6DB","Metadata-Based"="#264653")# Define function oncecalculate_histogram_similarity <-function(real_data, synthetic_data) { real_numeric <- real_data[sapply(real_data, is.numeric)] synthetic_numeric <- synthetic_data[sapply(synthetic_data, is.numeric)] similarity_scores <-sapply(names(real_numeric), function(var) { real_hist <-hist(real_numeric[[var]], plot =FALSE) synthetic_hist <-hist(synthetic_numeric[[var]], plot =FALSE)# Wasserstein distance between histogram midpoints wasserstein_distance <- transport::wasserstein1d(real_hist$mids, synthetic_hist$mids)1/ (1+ wasserstein_distance) # similarity score })mean(similarity_scores)}# Collect results in a tibble (convert to percentages)hist_sim_tbl <- tibble::tibble(Dataset =c("Parametric MICE", "CART Imputation", "Synthpop (Low Fidelity)", "Metadata-Based"),Histogram_Similarity =c(calculate_histogram_similarity(heart_failure, syn_data_1),calculate_histogram_similarity(heart_failure, syn_cart_1),calculate_histogram_similarity(heart_failure, syn_data_low_fidelity_synthpop),calculate_histogram_similarity(heart_failure, syn_data_metadata) )) %>%mutate(Histogram_Similarity =round(Histogram_Similarity *100, 1)) # percentage# Display as tablekable( hist_sim_tbl,caption ="Histogram Similarity Scores (%): higher = more similar to real data",align ="lr")
Histogram Similarity Scores (%): higher = more similar to real data
Dataset
Histogram_Similarity
Parametric MICE
9.5
CART Imputation
9.5
Synthpop (Low Fidelity)
9.5
Metadata-Based
9.5
Show code
# Horizontal bar chart with custom coloursggplot(hist_sim_tbl, aes(y = Dataset, x = Histogram_Similarity, fill = Dataset)) +geom_col(alpha =0.9, show.legend =FALSE) +geom_text(aes(label =paste0(Histogram_Similarity, "%")), hjust =-0.1, size =3.3) +scale_fill_manual(values = dataset_palette) +scale_x_continuous(labels = scales::percent_format(scale =1), limits =c(0, 100)) +labs(title ="Histogram Similarity by Synthetic Method",subtitle ="Based on Wasserstein distance of variable histograms",x ="Similarity (%)", y =NULL ) +theme_minimal() +theme(axis.text.y =element_text(size =10))
6.3 Mutual Information Score
The Mutual Information (MI) Score measures the shared information between pairs of features in the real and synthetic datasets, assessing how well relationships between variables are preserved.
Ideal Ranges for Mutual Information:
Perfect Alignment: MI scores for synthetic data should ideally match the real within 10-20% for critical feature pairs.
Lower MI: Consistently lower MI scores suggest that noise has been added, weakening relationships.
Overfitting Risk: Higher MI scores than the real may indicate overfitting, increasing re-identification risk.
Show code
# Custom palette for datasetsdataset_palette <-c("Parametric MICE"="#B0D99B","CART Imputation"="#528AA8","Synthpop (Low Fidelity)"="#FFB6DB","Metadata-Based"="#264653")# ------- Helpers -------# Strip 'synth_' so variable names align with the real datasetstrip_synth_prefix <-function(df) {names(df) <-sub("^synth_", "", names(df)) df}# Compute pairwise MI matrix for a data.frame of discretized columnscompute_mi_matrix <-function(df_disc) { p <-ncol(df_disc)if (p <2) return(matrix(numeric(0), nrow = p, ncol = p,dimnames =list(colnames(df_disc), colnames(df_disc)))) M <-matrix(0, nrow = p, ncol = p,dimnames =list(colnames(df_disc), colnames(df_disc)))for (i inseq_len(p)) {for (j in i:p) {# safe MI (skip NA rows; zero if not enough variation) ok <-is.finite(df_disc[[i]]) &is.finite(df_disc[[j]]) xi <- df_disc[[i]][ok]; xj <- df_disc[[j]][ok] mi <-if (length(unique(xi)) <2||length(unique(xj)) <2) 0elsesuppressWarnings(infotheo::mutinformation(xi, xj)) M[i, j] <- mi M[j, i] <- mi } } M}# Compare two MI matrices (upper triangle only) -> 0–1 similarity score (higher = closer)mi_similarity_score <-function(M_real, M_synth) {if (length(M_real) ==0||length(M_synth) ==0) return(NA_real_) common <-intersect(colnames(M_real), colnames(M_synth))if (length(common) <2) return(NA_real_) R <- M_real[common, common, drop =FALSE] S <- M_synth[common, common, drop =FALSE] ut <-upper.tri(R, diag =FALSE) mean_abs_diff <-mean(abs(R[ut] - S[ut]))1/ (1+ mean_abs_diff)}# Discretize numeric columns (equal-frequency bins); keep only shared numeric varsMutual_Information <-function(real_df, synth_df, nbins =10) { real_num <- dplyr::select(real_df, where(is.numeric)) synth_num <-strip_synth_prefix(synth_df) |> dplyr::select(where(is.numeric)) common <-intersect(names(real_num), names(synth_num))if (length(common) <2) return(NA_real_) real_num <- real_num[common] synth_num <- synth_num[common]# Discretize numerics (returns integer codes) real_disc <-as.data.frame(lapply(real_num, \(x) infotheo::discretize(x, disc ="equalfreq", nbins = nbins))) synth_disc <-as.data.frame(lapply(synth_num, \(x) infotheo::discretize(x, disc ="equalfreq", nbins = nbins)))# MI matrices and similarity score M_real <-compute_mi_matrix(real_disc) M_synth <-compute_mi_matrix(synth_disc)mi_similarity_score(M_real, M_synth)}# ------- Compute scores and present nicely (percentages) -------mi_results <- tibble::tibble(Dataset =c("Parametric MICE", "CART Imputation", "Synthpop (Low Fidelity)", "Metadata-Based"),Mutual_Information =c(Mutual_Information(heart_failure, syn_data_1),Mutual_Information(heart_failure, syn_cart_1),Mutual_Information(heart_failure, syn_data_low_fidelity_synthpop),Mutual_Information(heart_failure, syn_data_metadata) )) |>mutate(Mutual_Information =round(Mutual_Information *100, 1)) # convert to %# User-friendly tableknitr::kable( mi_results,caption ="Mutual Information Similarity (%): higher = synthetic preserves dependency structure more closely",align ="lr")
Mutual Information Similarity (%): higher = synthetic preserves dependency structure more closely
Dataset
Mutual_Information
Parametric MICE
93.3
CART Imputation
94.3
Synthpop (Low Fidelity)
91.5
Metadata-Based
93.2
Show code
# Horizontal bar chart with custom coloursggplot(mi_results, aes(y = Dataset, x = Mutual_Information, fill = Dataset)) +geom_col(alpha =0.9, show.legend =FALSE) +geom_text(aes(label =paste0(Mutual_Information, "%")), hjust =-0.1, size =3.3) +scale_fill_manual(values = dataset_palette) +scale_x_continuous(labels = scales::percent_format(scale =1), limits =c(0, 100)) +labs(title ="Mutual Information Similarity by Synthetic Method",subtitle ="Based on pairwise MI matrices of discretized numeric variables (equal-frequency bins)",x ="Similarity (%)", y =NULL ) +theme_minimal() +theme(axis.text.y =element_text(size =10))
6.4 Correlation Score
The Correlation Score measures how well the relationships (correlations) between variables are preserved in the synthetic dataset compared to the real.
Ideal Ranges for Correlation Score:
Perfect Alignment: Correlations in the synthetic data should match the real closely, with deviations of 5-10% being acceptable.
Weaker Correlations: If synthetic correlations are weaker, it may indicate loss of fidelity due to noise.
Stronger Correlations: Stronger correlations suggest overfitting, which may increase privacy risks.
Show code
# Custom palette for datasetsdataset_palette <-c("Parametric MICE"="#B0D99B","CART Imputation"="#528AA8","Synthpop (Low Fidelity)"="#FFB6DB","Metadata-Based"="#264653")# --- Strip 'synth_' so names alignstrip_synth_prefix <-function(df) {names(df) <-sub("^synth_", "", names(df)) df}# --- Function to calculate Correlation Similarity Scorecalculate_correlation_score <-function(real_data, synthetic_data) {# Keep numeric columns only real_num <- real_data[sapply(real_data, is.numeric)] synth_num <-strip_synth_prefix(synthetic_data)[sapply(strip_synth_prefix(synthetic_data), is.numeric)]# Use only common variables common <-intersect(names(real_num), names(synth_num))if (length(common) <2) return(NA_real_) real_corr <-cor(real_num[common], use ="complete.obs") synth_corr <-cor(synth_num[common], use ="complete.obs")# Frobenius norm difference → similarity score in [0,1] corr_diff <-norm(real_corr - synth_corr, type ="F")1/ (1+ corr_diff)}# --- Compute scores for all synthetic datasetscorr_scores <- tibble::tibble(Dataset =c("Parametric MICE", "CART Imputation", "Synthpop (Low Fidelity)", "Metadata-Based"),Correlation_Score =c(calculate_correlation_score(heart_failure, syn_data_1),calculate_correlation_score(heart_failure, syn_cart_1),calculate_correlation_score(heart_failure, syn_data_low_fidelity_synthpop),calculate_correlation_score(heart_failure, syn_data_metadata) )) %>%mutate(Correlation_Score =round(Correlation_Score *100, 1)) # convert to %# --- User-friendly tablekable( corr_scores,caption ="Correlation Similarity Scores (%): higher = synthetic preserves correlation structure more closely",align ="lr")
# --- Horizontal bar chart with custom coloursggplot(corr_scores, aes(y = Dataset, x = Correlation_Score, fill = Dataset)) +geom_col(alpha =0.9, show.legend =FALSE) +geom_text(aes(label =paste0(Correlation_Score, "%")), hjust =-0.1, size =3.3) +scale_fill_manual(values = dataset_palette) +scale_x_continuous(labels = scales::percent_format(scale =1), limits =c(0, 100)) +labs(title ="Correlation Similarity by Synthetic Method",subtitle ="Based on Frobenius norm difference between real and synthetic correlation matrices",x ="Similarity (%)", y =NULL ) +theme_minimal() +theme(axis.text.y =element_text(size =10))
7 Utility Assessment
7.1 Feature Importance Consistency Assessment
The Feature Importance Consistency Assessment evaluates how well the synthetic datasets retain the predictive significance of features compared to the real dataset. This assessment is essential for ensuring that the synthetic data accurately reflects the underlying relationships that contribute to the model’s predictive performance.
Feature importance measures each variable’s contribution to model predictions. Consistency in feature importance rankings between the real and synthetic datasets serves as a crucial indicator of data quality. Significant discrepancies may indicate deficiencies in the synthetic data generation process, potentially affecting the performance of models trained on synthetic data.
Ideal Ranges for Feature Importance Consistency Assessment:
Perfect Match: Feature importance scores in the synthetic dataset should closely align with those in the real dataset, particularly for highly predictive features.
Acceptable Deviation: A deviation of 10-15% is generally acceptable for most features. For variables identified as highly influential, stricter thresholds may be needed to maintain predictive accuracy.
Types of Deviations
Underrepresentation: If an important feature has lower importance in the synthetic dataset, it may suggest that critical predictive relationships are not well-preserved.
Overrepresentation: If a feature’s importance is higher in the synthetic data, this could indicate noise or overfitting, potentially distorting the model’s perception of feature relationships.
The following visualisation techniques will be used to compare feature importance:
Bar Plots for Feature Importance Comparison: Bar plots will compare feature importance scores between the real and synthetic datasets. This visual comparison can reveal significant deviations.
SHAP Summary Plots: SHAP (SHapley Additive exPlanations) summary plots provide a comprehensive view of the distribution of SHAP values for each feature, allowing for a direct comparison between the real and synthetic datasets.
Show code
# Prepare the data for XGBoostset.seed(123) # For reproducibilitytrain_indices <-sample(seq_len(nrow(heart_failure)), size =0.7*nrow(heart_failure))train_real <- heart_failure[train_indices, ]test_real <- heart_failure[-train_indices, ]train_real_matrix <-model.matrix(deceased ~ age + sex + anaemia + creatinine_phosphokinase + diabetes + ejection_fraction + platelets + serum_creatinine + serum_sodium + smoking + hypertension + follow_up -1, data = train_real)train_real_label <- train_real[rownames(train_real_matrix), "deceased"]train_real_label <-as.numeric(train_real_label) -1# Convert factor to 0 and 1train_synth <- syn_data_1[train_indices, ]train_synth_matrix <-model.matrix(deceased ~ age + sex + anaemia + creatinine_phosphokinase + diabetes + ejection_fraction + platelets + serum_creatinine + serum_sodium + smoking + hypertension + follow_up -1, data = train_synth)train_synth_label <- train_synth[rownames(train_synth_matrix), "deceased"]train_synth_label <-as.numeric(train_synth_label) -1# Convert factor to 0 and 1# Convert to DMatrix format, which is required for XGBoostdtrain_real <-xgb.DMatrix(data = train_real_matrix, label = train_real_label)dtrain_synth <-xgb.DMatrix(data = train_synth_matrix, label = train_synth_label)# Set up parameters for XGBoostparams <-list(objective ="binary:logistic",eval_metric ="auc", # Use AUC as an evaluation metricmax_depth =3,eta =0.1)# Tune hyperparameters using cross-validation (with parallel processing)cv_real <-xgb.cv(params = params,data = dtrain_real,nrounds =100,nfold =5,verbose =0,early_stopping_rounds =10,nthread = num_cores # Number of threads for parallel processing)best_nrounds_real <- cv_real$best_iterationcv_synth <-xgb.cv(params = params,data = dtrain_synth,nrounds =100,nfold =5,verbose =0,early_stopping_rounds =10,nthread = num_cores # Number of threads for parallel processing)best_nrounds_synth <- cv_synth$best_iteration# Train XGBoost on real data (TRTR)xgb_model_trtr <-xgboost(params = params, data = dtrain_real, nrounds = best_nrounds_real, verbose =0, nthread = num_cores)# Train XGBoost on synthetic data (TSTR)xgb_model_tstr <-xgboost(params = params, data = dtrain_synth, nrounds = best_nrounds_synth, verbose =0, nthread = num_cores)# Get feature importance for TRTRimportance_trtr <-xgb.importance(feature_names =colnames(train_real_matrix), model = xgb_model_trtr)cat("Feature Importance for TRTR:\n")
# Get feature importance for TSTRimportance_tstr <-xgb.importance(feature_names =colnames(train_synth_matrix), model = xgb_model_tstr)cat("\nFeature Importance for TSTR:\n")
The Prediction Score assesses synthetic data utility by comparing models trained on real data (Train Real Test Real (TRTR)) versus synthetic data (Train Synthetic Test Real (TSTR)) using two metrics:
Accuracy: The proportion of correct predictions made by the model.
pMSE (Prediction Mean Squared Error): Measures the average of the squared differences between predicted and actual values.
Ideal Ranges for Prediction Score:
Accuracy: TSTR accuracy should be within 5-10% of TRTR accuracy. Lower accuracy signals that synthetic data isn’t fully representative of the real dataset.
pMSE: TSTR pMSE should be close to TRTR pMSE. A 5-10% deviation is acceptable; a higher pMSE suggests synthetic data may not be accurately reflecting the true relationships in the real data.
This section assesses the utility of synthetic data generated through various methods, using XGBoost to calculate both Accuracy and pMSE for TRTR and TSTR.
Prediction Score Summary (XGBoost, 5-fold CV): TRTR vs TSTR
Dataset
Accuracy
pMSE
TRTR (Real→Real)
67.8%
0.322
TSTR (MICE→Real)
70.0%
0.300
TSTR (CART→Real)
70.0%
0.300
TSTR (Synthpop→Real)
66.7%
0.333
TSTR (Metadata→Real)
56.7%
0.433
Show code
# ---- horizontal Accuracy bar chart ----palette5 <-c("#B0D99B", "#528AA8", "#FFB6DB", "#264653", "#2A9D8F")plot_df <- results %>%mutate(Accuracy_num = readr::parse_number(Accuracy) /100)ggplot(plot_df, aes(y = Dataset, x = Accuracy_num, fill = Dataset)) +geom_col(alpha =0.9, show.legend =FALSE) +geom_text(aes(label = Accuracy), hjust =-0.1, size =3.5) +scale_fill_manual(values =setNames(palette5, unique(plot_df$Dataset))) +scale_x_continuous(labels = scales::percent_format(accuracy =1), limits =c(0, 1)) +labs(title ="Model Accuracy: TRTR vs TSTR", x ="Accuracy", y =NULL) +theme_minimal()
7.3 Quality Score (QScore)
The QScore provides a comprehensive evaluation of synthetic data by aggregating multiple metrics, such as distribution similarity, feature importance, and correlation preservation, into a single score. This offers a clear, overall assessment of how well the synthetic data replicates the real dataset.
Ideal Ranges for QScore:
Excellent Quality (0.8 - 1.0): Indicates the synthetic data closely mirrors the real and can be confidently used for most analyses.
Good Quality (0.6 - 0.8): Reflects good alignment, with some minor deviations. Suitable for most use cases, but some caution is advised.
Moderate Quality (0.4 - 0.6): The synthetic data may be useful for exploratory analysis but shows significant deviations in key metrics.
Poor Quality (< 0.4): The synthetic data poorly aligns with the real and is likely unsuitable for most analytic or modeling tasks.
Quality Score (QScore) — average % difference in aggregates (lower is better)
Dataset
QScore (%)
Parametric MICE
8.8%
CART Imputation
4.7%
Synthpop (Low Fidelity)
10.6%
Metadata-Based
3.0%
Show code
# --- Horizontal bar chart (percent axis) ---pal <-c("#B0D99B", "#528AA8", "#FFB6DB", "#264653", "#2A9D8F")ggplot(qscore_tbl, aes(y = Dataset, x = QScore_num, fill = Dataset)) +geom_col(alpha =0.9, show.legend =FALSE) +geom_text(aes(label = QScore_pct), hjust =-0.1, size =3.5) +scale_fill_manual(values =setNames(pal[seq_len(nrow(qscore_tbl))], qscore_tbl$Dataset)) +scale_x_continuous(labels = scales::percent_format(accuracy =1), limits =c(0, 1),expand =expansion(mult =c(0, 0.05))) +labs(title ="QScore Comparison (Lower is Better)",subtitle ="Average percentage difference between real and synthetic aggregates across random queries",x ="QScore (%)", y =NULL ) +theme_minimal()
8 Privacy / Disclosure Risk
8.1 Check for Replication of Real Value Combinations
This check identifies if any records in the synthetic data are exact duplicates of those in the real dataset. High replication increases privacy risks, so minimal or no duplication is ideal for preserving privacy.
Replication Check: Exact Matching Rows Between Real and Synthetic Datasets
Dataset
Replicated_Rows
Total_Rows
Replication_Percent
Status
Parametric MICE
0
299
0
✅ No Replication
Non-Parametric CART
0
299
0
✅ No Replication
Synthpop (Low Fidelity)
0
299
0
✅ No Replication
Metadata-Based
0
299
0
✅ No Replication
Show code
# ---- Optional: bar chart if any replication > 0 ----if (any(replication_tbl$Replicated_Rows >0)) { pal <-c("#B0D99B", "#528AA8", "#FFB6DB", "#264653")ggplot(replication_tbl, aes(y = Dataset, x = Replicated_Rows, fill = Dataset)) +geom_col(alpha =0.9, show.legend =FALSE) +geom_text(aes(label = Replicated_Rows), hjust =-0.1, size =3.5) +scale_fill_manual(values = pal[seq_len(nrow(replication_tbl))]) +labs(title ="Replication of Real Rows in Synthetic Data",subtitle ="Counts of exact duplicated rows across real and synthetic datasets",x ="Number of Replicated Rows", y =NULL ) +theme_minimal()} else {cat("✅ No replicated rows were found across any synthetic dataset. No plot is generated.\n")}
✅ No replicated rows were found across any synthetic dataset. No plot is generated.
8.2 Check for Unique Value Combinations
This analysis compares the number of unique row combinations between the real and synthetic datasets. A high number of unique combinations in the synthetic data suggests that it captures diverse patterns while minimising the replication of sensitive information, enhancing privacy protection.
Unique Value Combinations in Real vs. Synthetic Datasets
Dataset
Unique_Combinations
Total_Rows
Percent_Unique
Real Data
299
299
100
Parametric MICE
299
299
100
Non-Parametric CART
299
299
100
Synthpop (Low Fidelity)
299
299
100
Metadata-Based
299
299
100
Show code
# ---- Plot only if any dataset has > 0 unique rows ----if (any(unique_tbl$Unique_Combinations >0, na.rm =TRUE)) { pal <-c("#B0D99B", "#528AA8", "#FFB6DB", "#264653", "#2A9D8F")ggplot(unique_tbl, aes(y = Dataset, x = Percent_Unique, fill = Dataset)) +geom_col(alpha =0.9, show.legend =FALSE) +geom_text(aes(label =paste0(Percent_Unique, "%")), hjust =-0.1, size =3.5) +scale_fill_manual(values = pal[seq_len(nrow(unique_tbl))]) +scale_x_continuous(limits =c(0, 100)) +labs(title ="Percentage of Unique Row Combinations",subtitle ="Higher % means fewer duplicates within the dataset",x ="Percent of Rows that are Unique", y =NULL ) +theme_minimal()} else {cat("✅ No unique combinations detected across datasets. No plot is generated. \n")}
8.3 Exact Match Score
The Exact Match Score measures how many individual records in the synthetic data are identical to records in the real dataset. Lower scores are ideal for privacy preservation, as high exact matches may pose privacy risks.
Ideal Ranges for Exact Match Score:
0 - 0.1: Strong privacy protection with minimal or no exact matches, ideal for most privacy-preserving synthetic data.
0.1 - 0.3: Moderate exact matches, indicating some privacy risk but generally acceptable for many use cases.
0.3 - 0.5: Higher exact matches, raising privacy concerns. Review is needed to ensure proper data generation.
Above 0.5: Significant privacy risk, as the synthetic data too closely replicates the real.
Show code
# ---- Membership Inference Score ----calculate_membership_inference_score <-function(real_data, synthetic_data, k =5, threshold =0.01) { real_num <- real_data[sapply(real_data, is.numeric)] synth_num <- synthetic_data[sapply(synthetic_data, is.numeric)]if (ncol(real_num) ==0||ncol(synth_num) ==0) return(NA_real_) common <-intersect(names(real_num), names(synth_num))if (length(common) ==0) return(NA_real_) real_num <- real_num[common] synth_num <- synth_num[common] real_cc <- real_num[complete.cases(real_num), , drop =FALSE] synth_cc <- synth_num[complete.cases(synth_num), , drop =FALSE]if (nrow(real_cc) < k ||nrow(synth_cc) ==0) return(NA_real_) real_scaled <-scale(real_cc) synth_scaled <-scale(synth_cc) nn <- FNN::get.knnx(real_scaled, synth_scaled, k)$nn.dist risky <-sum(rowMeans(nn) < threshold) risky /nrow(synth_scaled)}# ---- Results table ----mis_tbl <- tibble::tibble(Dataset =c("Parametric MICE", "Non-Parametric CART","Synthpop (Low Fidelity)", "Metadata-Based"),Membership_Inference_Score =c(calculate_membership_inference_score(heart_failure, syn_data_1),calculate_membership_inference_score(heart_failure, syn_cart_1),calculate_membership_inference_score(heart_failure, syn_data_low_fidelity_synthpop),calculate_membership_inference_score(heart_failure, syn_data_metadata) )) %>%mutate(Membership_Inference_Score =round(Membership_Inference_Score, 4),Status = dplyr::case_when(is.na(Membership_Inference_Score) ~"— Insufficient data", Membership_Inference_Score ==0~"✅ No risky memberships", Membership_Inference_Score <0.01~"⚠️ Low risk",TRUE~"❌ High risk" ) )# ---- User-friendly table ----knitr::kable( mis_tbl,caption ="Membership Inference Score: Proportion of synthetic records that are very close to real records (lower is better for privacy)",align ="lrr")
Membership Inference Score: Proportion of synthetic records that are very close to real records (lower is better for privacy)
Dataset
Membership_Inference_Score
Status
Parametric MICE
0
✅ No risky memberships
Non-Parametric CART
0
✅ No risky memberships
Synthpop (Low Fidelity)
0
✅ No risky memberships
Metadata-Based
0
✅ No risky memberships
Show code
# ---- Conditional plot ----if (any(mis_tbl$Membership_Inference_Score >0, na.rm =TRUE)) { pal <-c("#B0D99B", "#528AA8", "#FFB6DB", "#264653")ggplot(mis_tbl, aes(y = Dataset, x = Membership_Inference_Score, fill = Dataset)) +geom_col(alpha =0.9, show.legend =FALSE, na.rm =TRUE) +geom_text(aes(label =ifelse(is.na(Membership_Inference_Score), "NA", scales::percent(Membership_Inference_Score, accuracy =0.01))),hjust =-0.1, size =3.5 ) +scale_fill_manual(values = pal[seq_len(nrow(mis_tbl))]) +scale_x_continuous(labels = scales::percent_format(accuracy =1)) +labs(title ="Membership Inference Score by Synthetic Method",subtitle ="Lower values indicate better privacy protection",x ="Score (Proportion of risky memberships)",y =NULL ) +theme_minimal()} else {cat("✅ All Membership Inference Scores are 0. No privacy risks detected, so no plot is generated\n")}
✅ All Membership Inference Scores are 0. No privacy risks detected, so no plot is generated
8.4 Neighbors’ Privacy Score
The Neighbors’ Privacy Score identifies real records that are “too similar” to synthetic records by using a nearest-neighbors search. This metric helps evaluate privacy risk by detecting synthetic records that closely resemble real data, which could compromise privacy.
Ideal Ranges for Neighbors’ Privacy Score:
0 - 0.1: Indicates strong privacy protection, with very few synthetic records too similar to real ones.
0.1 - 0.3: Moderate similarity, where privacy risk exists but remains within acceptable limits for many cases.
Above 0.3: Higher similarity, raising privacy concerns and requiring review to ensure the synthetic data is not too closely mimicking real data.
Show code
# ---- Neighbors' Privacy Score ----calculate_neighbors_privacy_score <-function(real_data, synthetic_data, k =5, threshold =0.01) {# Numeric-only columns real_num <- real_data[sapply(real_data, is.numeric)] synth_num <- synthetic_data[sapply(synthetic_data, is.numeric)]if (ncol(real_num) ==0||ncol(synth_num) ==0) return(NA_real_)# Keep only common columns common <-intersect(names(real_num), names(synth_num))if (length(common) ==0) return(NA_real_) real_num <- real_num[common] synth_num <- synth_num[common]# Complete cases real_cc <- real_num[complete.cases(real_num), , drop =FALSE] synth_cc <- synth_num[complete.cases(synth_num), , drop =FALSE]if (nrow(real_cc) < k ||nrow(synth_cc) ==0) return(NA_real_)# Standardise real_scaled <-scale(real_cc) synth_scaled <-scale(synth_cc)# k-NN distances from synthetic → real neighbors <- FNN::get.knnx(real_scaled, synth_scaled, k)$nn.dist# Privacy risk: synthetic records "too close" to real close_matches <-sum(rowMeans(neighbors) < threshold) score <- close_matches /nrow(synth_scaled)return(score)}# ---- Build results table ----neighbors_tbl <- tibble::tibble(Dataset =c("Parametric MICE", "Non-Parametric CART","Synthpop (Low Fidelity)", "Metadata-Based"),Neighbors_Privacy_Score =c(calculate_neighbors_privacy_score(heart_failure, syn_data_1),calculate_neighbors_privacy_score(heart_failure, syn_cart_1),calculate_neighbors_privacy_score(heart_failure, syn_data_low_fidelity_synthpop),calculate_neighbors_privacy_score(heart_failure, syn_data_metadata) )) %>%mutate(Neighbors_Privacy_Score =round(Neighbors_Privacy_Score, 4),Status =case_when( Neighbors_Privacy_Score ==0~"✅ No risky neighbors", Neighbors_Privacy_Score <0.01~"⚠️ Low risk",TRUE~"❌ High risk" ) )# ---- User-friendly table ----knitr::kable( neighbors_tbl,caption ="Neighbors' Privacy Score: Proportion of synthetic records too close to real records (lower is better for privacy)",align ="lrr")
Neighbors’ Privacy Score: Proportion of synthetic records too close to real records (lower is better for privacy)
Dataset
Neighbors_Privacy_Score
Status
Parametric MICE
0
✅ No risky neighbors
Non-Parametric CART
0
✅ No risky neighbors
Synthpop (Low Fidelity)
0
✅ No risky neighbors
Metadata-Based
0
✅ No risky neighbors
Show code
# ---- Plot only if any score > 0 ----if (any(neighbors_tbl$Neighbors_Privacy_Score >0)) { pal <-c("#B0D99B", "#528AA8", "#FFB6DB", "#264653")ggplot(neighbors_tbl, aes(y = Dataset, x = Neighbors_Privacy_Score, fill = Dataset)) +geom_col(alpha =0.9, show.legend =FALSE) +geom_text(aes(label = Neighbors_Privacy_Score), hjust =-0.1, size =3.5) +scale_fill_manual(values = pal[seq_len(nrow(neighbors_tbl))]) +scale_x_continuous(labels = scales::percent_format(accuracy =0.1)) +labs(title ="Neighbors' Privacy Score by Synthetic Method",subtitle ="Lower values mean synthetic records are less likely to be close replicas of real individuals",x ="Privacy Score (Proportion of risky neighbors)", y =NULL ) +theme_minimal()} else {cat("✅ All Neighbors' Privacy Scores are 0. No risky neighbors detected. No plot is generated.\n")}
✅ All Neighbors' Privacy Scores are 0. No risky neighbors detected. No plot is generated.
8.5 Membership Inference Score
The Membership Inference Score evaluates the risk of membership inference attacks, which attempt to determine whether a specific record belongs to the real dataset. This metric helps assess the vulnerability of synthetic data to privacy breaches.
Ideal Ranges for Membership Inference Score:
0 - 0.1: Indicates low risk, where it is unlikely that membership inference attacks can accurately identify real records.
0.1 - 0.3: Moderate risk, where some vulnerability to membership inference exists but may still be acceptable for certain use cases.
Above 0.3: High risk, suggesting significant privacy concerns as the synthetic data could reveal membership information about the real records.
Show code
# ---- Membership Inference Score ----# Interpreted as the proportion of synthetic records that are "too close" to the real data.# Lower is better (safer).calculate_membership_inference_score <-function(real_data, synthetic_data, k =5, threshold =0.01) {# Numeric-only columns real_num <- real_data[sapply(real_data, is.numeric)] synth_num <- synthetic_data[sapply(synthetic_data, is.numeric)]# Guardrailsif (ncol(real_num) ==0||ncol(synth_num) ==0) return(NA_real_)# Keep only common numeric variables common <-intersect(names(real_num), names(synth_num))if (length(common) ==0) return(NA_real_) real_num <- real_num[common] synth_num <- synth_num[common]# Use complete cases real_cc <- real_num[complete.cases(real_num), , drop =FALSE] synth_cc <- synth_num[complete.cases(synth_num), , drop =FALSE]if (nrow(real_cc) < k ||nrow(synth_cc) ==0) return(NA_real_)# Standardize real_scaled <-scale(real_cc) synth_scaled <-scale(synth_cc)# k-NN distances from synthetic → real nn <- FNN::get.knnx(real_scaled, synth_scaled, k)$nn.dist# Score: proportion of synthetic records whose average distance to k nearest reals is below threshold risky <-sum(rowMeans(nn) < threshold) risky /nrow(synth_scaled)}# ---- Results table ----mis_tbl <- tibble::tibble(Dataset =c("Parametric MICE", "Non-Parametric CART","Synthpop (Low Fidelity)", "Metadata-Based"),Membership_Inference_Score =c(calculate_membership_inference_score(heart_failure, syn_data_1),calculate_membership_inference_score(heart_failure, syn_cart_1),calculate_membership_inference_score(heart_failure, syn_data_low_fidelity_synthpop),calculate_membership_inference_score(heart_failure, syn_data_metadata) )) %>%mutate(Membership_Inference_Score =round(Membership_Inference_Score, 4),Status = dplyr::case_when(is.na(Membership_Inference_Score) ~"— Insufficient data", Membership_Inference_Score ==0~"✅ No risky memberships", Membership_Inference_Score <0.01~"⚠️ Low risk",TRUE~"❌ High risk" ) )# ---- User-friendly table ----knitr::kable( mis_tbl,caption ="Membership Inference Score: Proportion of synthetic records that are very close to real records (lower is better for privacy)",align ="lrr")
Membership Inference Score: Proportion of synthetic records that are very close to real records (lower is better for privacy)
Dataset
Membership_Inference_Score
Status
Parametric MICE
0
✅ No risky memberships
Non-Parametric CART
0
✅ No risky memberships
Synthpop (Low Fidelity)
0
✅ No risky memberships
Metadata-Based
0
✅ No risky memberships
Show code
# ---- Plot only if any risky memberships (> 0) ----if (any(mis_tbl$Membership_Inference_Score >0, na.rm =TRUE)) { pal <-c("#B0D99B", "#528AA8", "#FFB6DB", "#264653")ggplot(mis_tbl, aes(y = Dataset, x = Membership_Inference_Score, fill = Dataset)) +geom_col(alpha =0.9, show.legend =FALSE, na.rm =TRUE) +geom_text(aes(label =ifelse(is.na(Membership_Inference_Score), "NA", scales::percent(Membership_Inference_Score, accuracy =0.01))),hjust =-0.1, size =3.5 ) +scale_fill_manual(values = pal[seq_len(nrow(mis_tbl))]) +scale_x_continuous(labels = scales::percent_format(accuracy =1)) +labs(title ="Membership Inference Score by Synthetic Method",subtitle ="Lower values indicate better privacy protection",x ="Score (Proportion of risky memberships)",y =NULL ) +theme_minimal()} else {cat("✅ All Membership Inference Scores are 0. No risky memberships detected. No plot is generated. \n")}
✅ All Membership Inference Scores are 0. No risky memberships detected. No plot is generated.
9 Conclusions
This report evaluated the fidelity, utility, privacy, and overall quality of synthetic data generated using parametric MICE, non-parametric CART, Synthpop, and metadata-based methods. Key findings include:
Data Structure Preservation: All methods preserved the structure of the real dataset, though minor variations in variable distributions were observed.
Categorical Variables: Most synthetic datasets retained categorical levels, but frequency distribution discrepancies were particularly notable in Synthpop data.
Numeric Variables: The range and distribution of numeric variables were largely maintained across methods, with slight variations in density plots.
Correlation & Mutual Information: While most methods captured variable relationships, weaker correlations and lower mutual information scores suggested some loss of feature dependencies in certain synthetic datasets.
Utility (Prediction Score): Synthetic data performed similarly to real data in predictive modeling, though slight reductions in accuracy and increased pMSE indicated less precise representation of the real dataset’s predictive capabilities.
Privacy (Exact Match & Neighbors’ Privacy Scores): All synthetic datasets demonstrated strong privacy protection, with zero values for both Exact Match and Neighbors’ Privacy Scores. This indicates that no synthetic records were identical to or highly similar to any real records, affirming robust privacy across all methods.
In conclusion, synthetic data generated through these methods provides a balance between utility and privacy, though method selection should align with specific analytic or privacy needs.
Show code
# Stop the parallel clusterstopCluster(cl)registerDoSEQ()
Source Code
---title: Heart Failure Clinical Records Synthetic Data Projectauthor: Linus Chirchirdate: 7 September 2025date-format: "D MMMM YYYY"page-layout: fullformat: html: toc: true toc-depth: 4 toc-location: left toc-float: true number-sections: true toc-title: "Contents" css: styles/styles.css code-fold: true code-summary: "Show code" code-tools: true code-overflow: wrap docx: toc: true toc-depth: 5 toc-location: left toc-float: true number-sections: true toc-title: "Contents" css: styles.cssexecute: warning: false message: falsefreeze: auto---```{r setup}# Load necessary libraries# --- Project and Performance Utilities ---library(here) # Manage project-relative file pathslibrary(doParallel) # Enable parallel processing for faster computationslibrary(tidyverse) # Core data science toolkit (dplyr, tidyr, readr, purrr, etc.)library(readxl) # Import Excel (.xlsx, .xls) files into R# --- Data Exploration and Summaries ---library(psych) # Produce descriptive statistics and psychometric analyseslibrary(knitr) # Format dynamic tables and reports in Markdown/Quarto# --- Visualisation ---library(ggplot2) # Build customizable, high-quality data visualizationslibrary(patchwork) # Combine multiple ggplot2 plots into unified layoutslibrary(corrplot) # Visualize correlation matrices with heatmaps and plotslibrary(GGally) # Extend ggplot2 with correlation plots, pairwise plots, ggpairs()library(VIM) # Visualize/impute missing values with advanced methods (e.g., matrix plots)library(scales)# --- Data Imputation and Synthesis ---library(mice) # Perform multiple imputation by chained equations for missing datalibrary(synthpop) # Generate, evaluate, and compare synthetic datasets# --- Statistical and Information-Theoretic Measures ---library(transport) # Compute optimal transport distances (e.g., Wasserstein) for distribution comparisonlibrary(infotheo) # Calculate information-theoretic metrics (entropy, mutual information)# --- Modeling and Machine Learning ---library(caret) # Streamline machine learning workflows (training, tuning, evaluation)library(xgboost) # Implement gradient boosting for classification/regressionlibrary(SHAPforxgboost) # Explain XGBoost predictions using SHAP valueslibrary(FNN) # Perform fast k-nearest neighbor searches and distance calculations# Set up parallel processingnum_cores <- detectCores() - 1 # Use one less than the total number of corescl <- makeCluster(num_cores)registerDoParallel(cl)```\newpage## IntroductionThis project evaluates the quality of synthetic datasets derived from the Heart Failure Clinical Records data (299 patients, 13 variables). Our goal is to determine how well different generation strategies reproduce the statistical properties and analytic value of the original data while protecting privacy.We generate synthetic data using four approaches: (1) parametric imputation (MICE), (2) non-parametric imputation (CART via MICE), (3) distribution-driven synthesis (synthpop), and (4) metadata-guided rules. We then compare each synthetic dataset with the real data across three dimensions:- Fidelity – univariate and multivariate similarity (distributions, ranges, correlations), histogram similarity, and mutual information.- Utility – model transportability using XGBoost (TRTR vs. TSTR), feature-importance agreement, and SHAP-based behaviour.- Privacy / Disclosure risk – exact record matches, neighbour-proximity checks, and membership-inference sensitivity.Results are presented through aligned tables and plots (structure checks, categorical level retention, density overlays, correlation matrices, SHAP summaries) with concise scores for each method. The report concludes with a comparative summary to guide method selection given different fidelity–utility–privacy trade-offs.## Heart Failure Clinical Records Dataset and Initial Exploration The **Heart Failure Clinical Records dataset**, sourced from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/519/heart+failure+clinical+records), contains records of **299 patients** with heart failure collected during their follow-up period. Each record includes **13 clinical features** covering demographics, clinical conditions, and laboratory measures: | Variable | Description | Unit / Levels ||-----------------------------|------------------------------------------------------|----------------------|| **age** | Age of the patient | Years || **anaemia** | Reduction in red blood cells | No / Yes || **creatinine_phosphokinase**| Enzyme level in the blood | mcg/L || **diabetes** | Diabetes status | No / Yes || **ejection_fraction** | % of blood leaving the heart with each contraction | Percentage (%) || **platelets** | Platelet count | Kiloplatelets/mL || **serum_creatinine** | Serum creatinine level | mg/dL || **serum_sodium** | Serum sodium level | mEq/L || **sex** | Gender of the patient | Female / Male || **smoking** | Smoking status | No / Yes || **hypertension** | Hypertension status | No / Yes || **deceased** | Survival status during follow-up | No / Yes || **follow_up** | Duration of the follow-up period | Days |<br>A preview of the first 10 rows of the dataset is shown below.```{r load the heart failure dataset}# Set seed for reproducibilityset.seed(123)# Load the heart failure dataset from the local directoryheart_failure <- read.csv(here("data", "heart_failure.csv"))# Let's preview the heart failure datasetknitr::kable( head(heart_failure, 10), caption = "Preview of the first 10 rows of the Heart Failure dataset", align = rep("l", ncol(heart_failure)) # left align all columns)```<br>Here's the preview of the structure of the heart failure dataset.```{r display the structure of the dataset}df <- data.frame( Variable = names(heart_failure), Type = unname(sapply(heart_failure, \(x) paste(class(x), collapse = ", "))), Values = unname(sapply(heart_failure, \(x) paste(head(unique(x), 10), collapse = ", "))), row.names = NULL, check.names = FALSE)knitr::kable( df, caption = "Structure of the Heart Failure Dataset: Variables, Types, and Values", align = c("l","l","l"), row.names = FALSE)```<br>Let's perform some basic data cleaning and preprocessing steps. Convert the following columns to factors: sex, anaemia, diabetes, smoking, hypertension and deceased. Here's the new structure of the dataset after the basic cleaning.```{r modify structure of the dataset}# Convert the specified columns to factorsheart_failure$sex <- as.factor(heart_failure$sex)heart_failure$anaemia <- as.factor(heart_failure$anaemia)heart_failure$diabetes <- as.factor(heart_failure$diabetes)heart_failure$smoking <- as.factor(heart_failure$smoking)heart_failure$hypertension <- as.factor(heart_failure$hypertension)heart_failure$deceased <- as.factor(heart_failure$deceased)```Here are the summary statistics of the heart failure dataset.```{r summary statistics of the heart failure dataset}# Generate descriptive statistics and display as a single tabledescribe(heart_failure) |> kable( caption = "Summary Statistics of the Heart Failure Dataset", digits = 2, # round to 2 decimal places align = "lrrrrrrrrrr" # left for variable, right for numbers )```## Generating Synthetic DataThe process of generating synthetic data is essential for tasks such as privacy preservation, testing machine learning models, and conducting simulations. In this section, we employ various techniques to generate synthetic data based on the real heart failure dataset. Each method serves a different purpose and offers distinct advantages, depending on the use case and the type of missing data or privacy concerns.We explore four key approaches for generating synthetic data:### Parametric Imputation Using MICEThis method uses the Multivariate Imputation by Chained Equations (MICE) framework, with parametric imputation based on a normal distribution. MICE allows for handling missing data through multiple imputations and is widely used in data science and healthcare for its flexibility and statistical rigor. Here's the preview of the first 10 rows of the synthetic data generated.```{r Parametric Imputation using MICE}# Function to compute the 5th and 95th percentiles for numeric columnscompute_percentiles <- function(data, lower_percentile = 0.05, upper_percentile = 0.95) { percentiles <- lapply(data, function(column) { if (is.numeric(column)) { lower <- quantile(column, probs = lower_percentile, na.rm = TRUE) upper <- quantile(column, probs = upper_percentile, na.rm = TRUE) return(c(lower, upper)) } return(c(NA, NA)) }) names(percentiles) <- names(data) return(percentiles)}# Function to scale numeric values based on the 5th and 95th percentilesscale_to_percentiles <- function(data, percentiles) { scaled_data <- data for (col in colnames(data)) { if (is.numeric(data[[col]])) { # Apply scaling only if percentiles are available for this column if (!is.na(percentiles[[col]][1]) && !is.na(percentiles[[col]][2])) { min_val <- percentiles[[col]][1] max_val <- percentiles[[col]][2] # Clip values to lie within the 5th and 95th percentile range scaled_data[[col]] <- pmin(pmax(data[[col]], min_val), max_val) } } } return(scaled_data)}# Function to generate missingness in a dataset based on the real data's patterngenerate_missingness_based_on_real <- function(real_data, synthetic_data) { # Ensure the synthetic dataset has the same structure as the real if (!all(colnames(real_data) == colnames(synthetic_data))) { stop("Column names in real and synthetic data must match") } # Copy the real missingness pattern to the synthetic dataset synthetic_data_with_missingness <- synthetic_data for (col in colnames(real_data)) { # Apply missingness where it was present in the real data missing_indices <- is.na(real_data[[col]]) synthetic_data_with_missingness[[col]][missing_indices] <- NA } return(synthetic_data_with_missingness)}# Compute the 5th and 95th percentiles for the real datapercentiles <- compute_percentiles(heart_failure)# Generate synthetic data using MICEwhere <- make.where(heart_failure, "all")method <- make.method(heart_failure, where = where)method[method == "pmm"] <- "norm"# Perform multiple imputation (10 datasets)syn_param <- mice(heart_failure, m = 10, maxit = 1, method = method, where = where, printFlag = FALSE)# Extract the first synthetic datasetsyn_data_1 <- complete(syn_param, 1)# Scale synthetic data to the 5th and 95th percentile rangessyn_data_1 <- scale_to_percentiles(syn_data_1, percentiles)# Apply the real missingness pattern to the synthetic datasetsyn_data_1 <- generate_missingness_based_on_real(heart_failure, syn_data_1)# Round off all numeric columns to 0 decimal placessyn_data_1 <- syn_data_1 %>% mutate(across(where(is.numeric), round, digits = 0))# Add a prefix 'synth_' to all synthetic dataset variable namescolnames(syn_data_1) <- paste("synth", colnames(syn_data_1), sep = "_")# Display the structure of the dataset (fix double variable issue)df_syn <- data.frame( Variable = names(syn_data_1), Type = unname(sapply(syn_data_1, \(x) paste(class(x), collapse = ", "))), Values = unname(sapply(syn_data_1, \(x) paste(head(unique(x), 10), collapse = ", "))), row.names = NULL, check.names = FALSE)# Remove duplicated variables if anydf_syn <- df_syn[!duplicated(df_syn$Variable), ]knitr::kable( df_syn, caption = "Structure of the Heart Failure Dataset: Variables, Types, and Values", align = c("l","l","l"), row.names = FALSE)```### Non-Parametric Imputation Using CARTThe Classification and Regression Trees (CART) method, a non-parametric approach, is useful for imputation when the relationships between variables are complex or non-linear. By leveraging decision trees, CART is capable of capturing intricate patterns in the data without making strong assumptions about the underlying distributions. Here's the preview of the first 10 rows of the synthetic data generated.```{r Non-Parametric Imputation using CART}# Compute the 5th and 95th percentiles for the real datapercentiles <- compute_percentiles(heart_failure)# Generate synthetic data using CART imputation with MICEwhere <- make.where(heart_failure, "all")method <- make.method(heart_failure, where = where)method[method == "pmm"] <- "cart" # Set the method to "cart" for non-parametric imputation# Perform multiple imputation (10 datasets)syn_cart <- mice(heart_failure, m = 10, maxit = 1, method = method, where = where, printFlag = FALSE)# Extract the first synthetic datasetsyn_cart_1 <- complete(syn_cart, 1)# Scale synthetic data to the 5th and 95th percentile rangessyn_cart_1 <- scale_to_percentiles(syn_cart_1, percentiles)# Apply the real missingness pattern to the synthetic datasetsyn_cart_1 <- generate_missingness_based_on_real(heart_failure, syn_cart_1)# Round off all numeric columns to 0 decimal placessyn_cart_1 <- syn_cart_1 %>% mutate(across(where(is.numeric), round, digits = 0))# Add a prefix 'synth_' to all synthetic dataset variable namescolnames(syn_cart_1) <- paste("synth", colnames(syn_cart_1), sep = "_")# Display the structure of the dataset (fix double variable issue)df_cart <- data.frame( Variable = names(syn_cart_1), Type = unname(sapply(syn_cart_1, \(x) paste(class(x), collapse = ", "))), Values = unname(sapply(syn_cart_1, \(x) paste(head(unique(x), 10), collapse = ", "))), row.names = NULL, check.names = FALSE)# Remove duplicated variables if anydf_cart <- df_cart[!duplicated(df_cart$Variable), ]knitr::kable( df_cart, caption = "Structure of the Heart Failure Dataset: Variables, Types, and Values (CART Imputation)", align = c("l","l","l"), row.names = FALSE)```### Generating Synthetic Data Using SynthpopThe Synthpop package is designed for creating synthetic data based on the distribution of the real dataset. It is particularly useful for privacy-preserving data sharing, where the aim is to create a synthetic dataset that mimics the real data without exposing sensitive information. Here's the preview of the first 10 rows of the synthetic data generated.```{r Generating Synthetic Data using Synthpop}# Compute the 5th and 95th percentiles for the real datapercentiles <- compute_percentiles(heart_failure)# Generate low-fidelity synthetic data using random samplingsyn_data_low_fidelity <- syn(heart_failure, method = "sample", seed = 123)# Convert the synthetic dataset to a data framesyn_data_low_fidelity_synthpop <- syn_data_low_fidelity$syn# Scale the synthetic data to the 5th and 95th percentile rangessyn_data_low_fidelity_synthpop <- scale_to_percentiles(syn_data_low_fidelity_synthpop, percentiles)# Apply the real missingness pattern to the synthetic datasetsyn_data_low_fidelity_synthpop <- generate_missingness_based_on_real(heart_failure, syn_data_low_fidelity_synthpop)# Round off all numeric columns to 0 decimal placessyn_data_low_fidelity_synthpop <- syn_data_low_fidelity_synthpop %>% mutate(across(where(is.numeric), round, digits = 0))# Add a prefix 'synth_' to all synthetic dataset variable namescolnames(syn_data_low_fidelity_synthpop) <- paste("synth", colnames(syn_data_low_fidelity_synthpop), sep = "_")# Display the structure of the dataset (fix double variable issue)df_synthpop <- data.frame( Variable = names(syn_data_low_fidelity_synthpop), Type = unname(sapply(syn_data_low_fidelity_synthpop, \(x) paste(class(x), collapse = ", "))), Values = unname(sapply(syn_data_low_fidelity_synthpop, \(x) paste(head(unique(x), 10), collapse = ", "))), row.names = NULL, check.names = FALSE)# Remove duplicated variables if anydf_synthpop <- df_synthpop[!duplicated(df_synthpop$Variable), ]knitr::kable( df_synthpop, caption = "Structure of the Heart Failure Dataset: Variables, Types, and Values (Synthpop)", align = c("l","l","l"), row.names = FALSE)```### Generating Synthetic Data Using MetadadaThis method uses a data dictionary (metadata) to drive the generation of synthetic data. The metadata defines the structure and types of variables, which ensures that the generated synthetic data conforms to expected formats, such as ranges for continuous variables and categories for binary or categorical variables. Here's the preview of the first 10 rows of the synthetic data generated.```{r Generating Synthetic Data using Metadata}# Compute the 5th and 95th percentiles for the real datapercentiles <- compute_percentiles(heart_failure)# Generate synthetic data using metadata# Load the metadata from the Excel fileheart_failure_metadata <- read_excel(here("data", "heart_failure_metadata.xlsx"))# Create an empty data frame based on the metadata structuren_rows <- nrow(heart_failure) # Number of synthetic records to generatevariable_names <- heart_failure_metadata$`Variable Name`data_types <- heart_failure_metadata$Typesyn_data_metadata <- data.frame(matrix(ncol = length(variable_names), nrow = n_rows))colnames(syn_data_metadata) <- variable_names# Function to generate data based on metadatagenerate_data <- function(variable, real_data, type, n) { if (type == "Continuous" || type == "Integer") { if (is.numeric(real_data[[variable]])) { # Compute 5th and 95th percentiles for numeric variables p5 <- quantile(real_data[[variable]], probs = 0.05, na.rm = TRUE) p95 <- quantile(real_data[[variable]], probs = 0.95, na.rm = TRUE) if (type == "Continuous") { return(runif(n, min = p5, max = p95)) } else if (type == "Integer") { return(sample(floor(p5):ceiling(p95), n, replace = TRUE)) } } } else if (is.factor(real_data[[variable]])) { return(sample(levels(real_data[[variable]]), n, replace = TRUE)) } else if (variable %in% c("anaemia", "diabetes", "hypertension", "smoking", "sex")) { return(sample(unique(real_data[[variable]]), n, replace = TRUE)) } else { return(rep(NA, n)) }}# Generate synthetic data based on the metadatafor (i in seq_along(variable_names)) { syn_data_metadata[[i]] <- generate_data(variable_names[i], heart_failure, data_types[i], n_rows)}# Scale the synthetic data to the 5th and 95th percentile rangessyn_data_metadata <- scale_to_percentiles(syn_data_metadata, percentiles)# Apply the real missingness pattern to the synthetic datasetsyn_data_metadata <- generate_missingness_based_on_real(heart_failure, syn_data_metadata)# Round off all numeric columns to 0 decimal placessyn_data_metadata <- syn_data_metadata %>% mutate(across(where(is.numeric), round, digits = 0))# Add a prefix 'synth_' to all synthetic dataset variable namescolnames(syn_data_metadata) <- paste("synth", colnames(syn_data_metadata), sep = "_")# Display the structure of the dataset (fix double variable issue)df_metadata <- data.frame( Variable = names(syn_data_metadata), Type = unname(sapply(syn_data_metadata, \(x) paste(class(x), collapse = ", "))), Values = unname(sapply(syn_data_metadata, \(x) paste(head(unique(x), 10), collapse = ", "))), row.names = NULL, check.names = FALSE)# Remove duplicated variables if anydf_metadata <- df_metadata[!duplicated(df_metadata$Variable), ]knitr::kable( df_metadata, caption = "Structure of the Heart Failure Dataset: Variables, Types, and Values (Metadata)", align = c("l","l","l"), row.names = FALSE)```## Synthetic Data IdentificationIn this section, we generate tables to identify which columns in the synthetic datasets are synthetic (prefixed with "synth_") and which remain unchanged from the real dataset. This helps ensure clarity on which variables have been altered during the synthetic data generation process.For each dataset, we produce a table displaying the column names and whether the column is synthetic. A column is marked "Yes" if it was generated synthetically and "No" if it was retained from the real data.```{r Synthetic Data Identification and Column Names Table}# Function to create column info with dataset namecreate_column_info <- function(dataset, dataset_name) { data.frame( column_name = colnames(dataset), dataset = dataset_name, is_synthetic = ifelse(grepl("^synth_", colnames(dataset)), "Yes", "No"), stringsAsFactors = FALSE )}# Collect column info for all datasetsall_column_info <- bind_rows( create_column_info(syn_data_1, "Parametric MICE"), create_column_info(syn_cart_1, "CART Imputation"), create_column_info(syn_data_low_fidelity_synthpop, "Synthpop (Low Fidelity)"), create_column_info(syn_data_metadata, "Metadata-Based"))# Reshape into wide format so variables appear oncewide_column_info <- all_column_info %>% pivot_wider( names_from = dataset, values_from = is_synthetic )# Display the combined wide tablekable( wide_column_info, caption = "Synthetic Indicators for Each Variable Across Datasets")```## Synthetic Dataset Structure### Dataset Size and Structure ComparisonThis step evaluates whether the synthetic datasets match the real dataset in terms of the number of rows, columns, and feature types. Consistency in these dimensions is essential for the synthetic data to be a reliable stand-in for the real.- **Number of Rows**: Indicates if the synthetic data captures the same number of records as the real. Any mismatch can impact comparability.- **Number of Columns**: Ensures all variables are retained. Missing or extra columns suggest structural issues.- **Feature Types**: Confirms that categorical, numerical, and other variable types remain unchanged, preserving analytical consistency.```{r Dataset Size and Structure Comparison}# Function to extract structure infoget_structure_info <- function(real_data, synthetic_data, dataset_name) { tibble( Dataset = dataset_name, Rows_Real = nrow(real_data), Cols_Real = ncol(real_data), Rows_Synthetic = nrow(synthetic_data), Cols_Synthetic = ncol(synthetic_data), Structure_Match = ifelse( nrow(real_data) == nrow(synthetic_data) & ncol(real_data) == ncol(synthetic_data), "✅ Yes", "❌ No" ) )}# Collect resultsstructure_comparison <- bind_rows( get_structure_info(heart_failure, syn_data_1, "Parametric MICE"), get_structure_info(heart_failure, syn_cart_1, "CART Imputation"), get_structure_info(heart_failure, syn_data_low_fidelity_synthpop, "Synthpop (Low Fidelity)"), get_structure_info(heart_failure, syn_data_metadata, "Metadata-Based"))# Display as a neat tablekable( structure_comparison, caption = "Comparison of Real vs Synthetic Dataset Sizes and Structures")```### Categorical Variables: All Levels Maintained and Comparison of DistributionThis section evaluates how well the synthetic datasets replicate the categorical variables in the real dataset, focusing on the preservation of all levels and their distributions.- **Level Matching**: For each categorical variable, the synthetic dataset should retain all levels (categories) present in the real dataset. Missing levels suggest that the synthetic data may be incomplete, while additional levels could indicate errors in data generation.- **Distribution Comparison**: The frequency of each categorical level in the synthetic data should closely mirror that of the real dataset. Significant deviations in category frequencies suggest that the synthetic data does not fully capture the true distribution of the categorical features.```{r Categorical Variables: All Levels Maintained and Comparison of Distribution}# Function to collect categorical level infoget_cat_levels <- function(real_data, synthetic_data, dataset_name) { cat_vars <- colnames(real_data)[sapply(real_data, is.factor)] do.call(rbind, lapply(cat_vars, function(var) { levels_real <- unique(real_data[[var]]) levels_synth <- unique(synthetic_data[[paste("synth", var, sep = "_")]]) data.frame( Variable = var, Dataset = dataset_name, Real_Levels = paste(levels_real, collapse = ", "), Synthetic_Levels = paste(levels_synth, collapse = ", "), Match = ifelse(all(levels_real %in% levels_synth), "✅ Yes", "❌ No"), stringsAsFactors = FALSE ) }))}# Collect resultscat_comparison <- bind_rows( get_cat_levels(heart_failure, syn_data_1, "Parametric MICE"), get_cat_levels(heart_failure, syn_cart_1, "CART Imputation"), get_cat_levels(heart_failure, syn_data_low_fidelity_synthpop, "Synthpop (Low Fidelity)"), get_cat_levels(heart_failure, syn_data_metadata, "Metadata-Based"))# Display tablekable(cat_comparison, caption = "Comparison of Categorical Variable Levels in Real vs Synthetic Datasets")```### Numeric Variables: Range and Distribution ComparisonThis analysis evaluates how well the numeric variables in the synthetic datasets replicate the real data in terms of range and distribution.- **Range**: For each numeric variable, the synthetic data should maintain values within the range of the real dataset. Any deviations, especially outliers, could indicate inaccuracies in the synthetic generation process.- **Distribution Comparison**: Density plots are used to compare the shape of the distributions between the real and synthetic datasets. Ideally, the synthetic data should closely mirror the real distribution, indicating that the underlying data generation model has captured the true variability of the numeric features.```{r Numeric Variables: Range and Distribution Comparison}# Compute per-variable stats for a (real, synthetic) pairnumeric_summary <- function(real, synth, dataset_name) { num_vars <- names(real)[sapply(real, is.numeric)] # real percentiles pctl <- map(num_vars, ~quantile(real[[.x]], probs = c(.05, .95), na.rm = TRUE)) |> setNames(num_vars) tibble( Variable = num_vars, Real_Min = map_dbl(num_vars, ~min(real[[.x]], na.rm = TRUE)), Real_P5 = map_dbl(num_vars, ~pctl[[.x]][1]), Real_P95 = map_dbl(num_vars, ~pctl[[.x]][2]), Real_Max = map_dbl(num_vars, ~max(real[[.x]], na.rm = TRUE)), Synth_Min = map_dbl(num_vars, ~min(synth[[paste0("synth_", .x)]], na.rm = TRUE)), Synth_Max = map_dbl(num_vars, ~max(synth[[paste0("synth_", .x)]], na.rm = TRUE)) ) |> mutate( Within_P5_P95 = ifelse(Synth_Min >= Real_P5 & Synth_Max <= Real_P95, "✅ Yes", "❌ No"), Dataset = dataset_name, .before = 1 )}# Build one continuous table for all synthetic datasetsnum_table <- bind_rows( numeric_summary(heart_failure, syn_data_1, "Parametric MICE"), numeric_summary(heart_failure, syn_cart_1, "CART Imputation"), numeric_summary(heart_failure, syn_data_low_fidelity_synthpop, "Synthpop (Low Fidelity)"), numeric_summary(heart_failure, syn_data_metadata, "Metadata-Based"))kable( num_table, caption = "Numeric variables: real ranges & percentiles vs. synthetic ranges (with P5–P95 containment check)")```### Missingness ComparisonIn this section, we compare the proportions of missing values between the real dataset and each synthetic dataset. This comparison helps determine whether the synthetic datasets accurately represent the patterns of missingness in the real data.- **Missingness Proportions**: For each dataset, we calculate the proportion of missing values for every column and compare the results across the real and synthetic datasets.- **Comparison Outcome**: Ideally, the synthetic datasets should exhibit similar missingness proportions to the real data. Any deviations may indicate differences in how missing values were handled during the synthetic data generation process.#### Introduction Plots```{r intro-plots, fig.width=10, fig.height=8, fig.align='center', out.width='100%'}# Custom palette for the 5 metricsmetric_palette <- c( "Discrete Columns" = "#B0D99B", "Continuous Columns" = "#528AA8", "All Missing Columns" = "#FFB6DB", "Complete Rows" = "#264653", "Missing Observations" = "#2A9D8F")# Strip 'synth_' for fair visualsstrip_synth_prefix <- function(df) { names(df) <- sub("^synth_", "", names(df)) df }# Datasets to comparedatasets <- list( "Real Data" = heart_failure, "Parametric MICE" = strip_synth_prefix(syn_data_1), "CART Imputation" = strip_synth_prefix(syn_cart_1), "Synthpop (Low Fidelity)" = strip_synth_prefix(syn_data_low_fidelity_synthpop), "Metadata-Based" = strip_synth_prefix(syn_data_metadata))# Function: horizontal bar chart for the 5 metricsmake_intro_plot <- function(df, title) { n_rows <- nrow(df) n_cols <- ncol(df) pct_discrete <- 100 * sum(vapply(df, is.factor, logical(1))) / n_cols pct_continuous <- 100 * sum(vapply(df, is.numeric, logical(1))) / n_cols pct_allmisscol <- 100 * sum(colSums(is.na(df)) == n_rows) / n_cols pct_completer <- 100 * sum(complete.cases(df)) / n_rows pct_missobs <- 100 * sum(is.na(df)) / (n_rows * n_cols) plot_df <- tibble::tibble( Metric = factor( c("Discrete Columns", "Continuous Columns", "All Missing Columns", "Complete Rows", "Missing Observations"), levels = rev(c("Discrete Columns", "Continuous Columns", "All Missing Columns", "Complete Rows", "Missing Observations")) # reverse for nicer order ), Percentage = c(pct_discrete, pct_continuous, pct_allmisscol, pct_completer, pct_missobs) ) ggplot(plot_df, aes(x = Percentage, y = Metric, fill = Metric)) + geom_col(alpha = 0.9, show.legend = FALSE) + geom_text(aes(label = paste0(round(Percentage, 1), "%")), hjust = -0.1, size = 3.5) + scale_x_continuous( labels = scales::percent_format(scale = 1), limits = c(0, 100), expand = expansion(mult = c(0, 0.05)) # headroom for labels ) + scale_fill_manual(values = metric_palette) + labs(title = title, x = "Percentage", y = NULL) + theme_light() + theme( axis.text.y = element_text(size = 9), plot.title = element_text(face = "bold") )}# Build plots and combine in one gridwrap_plots( imap(datasets, ~ make_intro_plot(.x, .y)), ncol = 2) + plot_annotation( title = "Dataset Overview: Discrete/Continuous Columns, All-Missing Columns, Complete Rows, Missing Observations", subtitle = "Percentages shown per dataset; axes standardized to 0–100%", theme = theme(plot.title = element_text(face = "bold", size = 12)) )```#### Missingness Proportions and Comparison Outcome```{r Missingness Proportions and Comparison Outcome}# Remove 'synth_' prefix so names match the real datasetstrip_synth_prefix <- function(df) { names(df) <- sub("^synth_", "", names(df)) df}# Compare against the real dataset's variable setcore_vars <- names(heart_failure)# Function to compute missingness proportions (2 d.p.) for one datasetget_missingness <- function(df, label) { df2 <- df %>% strip_synth_prefix() # keep only variables present in the real data (prevents mismatches) df2 <- df2 %>% select(any_of(core_vars)) tibble( Variable = names(df2), Proportion = round(colSums(is.na(df2)) / nrow(df2), 2), Dataset = label )}# Build tidy tablemissing_tbl <- bind_rows( get_missingness(heart_failure, "Real Data"), get_missingness(syn_data_1, "Parametric MICE"), get_missingness(syn_cart_1, "CART Imputation"), get_missingness(syn_data_low_fidelity_synthpop, "Synthpop (Low Fidelity)"), get_missingness(syn_data_metadata, "Metadata-Based"))# Pivot wider for side-by-side comparisonmissing_tbl_wide <- missing_tbl %>% pivot_wider(names_from = Dataset, values_from = Proportion) %>% arrange(Variable)# Optional: add quick match flags (✅ if identical to Real Data)# missing_tbl_wide <- missing_tbl_wide %>%# mutate(# `Parametric MICE Match` = ifelse(`Parametric MICE` == `Real Data`, "✅", "❌"),# `CART Imputation Match` = ifelse(`CART Imputation` == `Real Data`, "✅", "❌"),# `Synthpop (Low Fidelity) Match` = ifelse(`Synthpop (Low Fidelity)` == `Real Data`, "✅", "❌"),# `Metadata-Based Match` = ifelse(`Metadata-Based` == `Real Data`, "✅", "❌")# )kable( missing_tbl_wide, caption = "Missingness Proportions by Variable (rounded to 2 d.p.)")```#### Missingness Maps```{r Missingness Maps, fig.width=14, fig.height=6, fig.align='center', out.width='100%'}# Generate missingness map for the real datasetreal_missing_map <- aggr(heart_failure, col = c("#B0D99B", "#528AA8"), numbers = TRUE, sortVars = TRUE, labels = names(heart_failure), cex.axis = .7, gap = 3, ylab = c("Missing Data", "Pattern - real Dataset"))# Generate missingness map for Parametric MICE synthetic datasetparametric_mice_missing_map <- aggr(syn_data_1, col = c("#B0D99B", "#528AA8"), numbers = TRUE, sortVars = TRUE, labels = names(syn_data_1), cex.axis = .7, gap = 3, ylab = c("Missing Data", "Pattern - Parametric MICE Dataset"))# Generate missingness map for CART synthetic datasetcart_missing_map <- aggr(syn_cart_1, col = c("#B0D99B", "#528AA8"), numbers = TRUE, sortVars = TRUE, labels = names(syn_cart_1), cex.axis = .7, gap = 3, ylab = c("Missing Data", "Pattern - CART Dataset"))# Generate missingness map for Synthpop synthetic datasetsynthpop_missing_map <- aggr(syn_data_low_fidelity_synthpop, col = c("#B0D99B", "#528AA8"), numbers = TRUE, sortVars = TRUE, labels = names(syn_data_low_fidelity_synthpop), cex.axis = .7, gap = 3, ylab = c("Missing Data", "Pattern - Synthpop Dataset"))# Generate missingness map for Metadata-based synthetic datasetmetadata_missing_map <- aggr(syn_data_metadata, col = c("#B0D99B", "#528AA8"), numbers = TRUE, sortVars = TRUE, labels = names(syn_data_metadata), cex.axis = .7, gap = 3, ylab = c("Missing Data", "Pattern - Metadata-Based Dataset"))```### Correlation Matrices ComparisonIn this section, we assess how well the correlations between numeric variables are preserved in the synthetic datasets compared to the real dataset. By comparing the correlation matrices, we evaluate whether the relationships between variables in the real data are reflected in the synthetic versions. This assessment is essential because maintaining the real data's correlation structure ensures that any statistical or machine learning models trained on synthetic data will behave similarly to those trained on real data.- **Correlation Matrix**: A correlation matrix quantifies the strength and direction of relationships between pairs of numeric variables. Each cell in the matrix represents the correlation coefficient between two variables, which can range from -1 (perfect negative correlation) to +1 (perfect positive correlation). It is crucial that synthetic data retains these relationships to ensure that any analyses based on these dependencies remain valid.- **Comparison Process**: The correlation matrices for both the real and synthetic datasets are computed and Visualised. Ideally, the structure and strength of correlations in the synthetic datasets should closely match those in the real dataset. Differences in correlation strength or direction indicate that the synthetic data may not fully capture the underlying relationships, which can affect the validity of any conclusions drawn from analyses using the synthetic data.Maintaining similar correlation structures is particularly important for use cases where relationships between variables play a crucial role, such as in predictive modeling, feature selection, and understanding variable dependencies. The following sections outline different methods used to compare correlation matrices between the real and synthetic datasets.#### Correlation Matrices Comparison using psychThe `psych` package provides tools for analyzing and visualizing correlation matrices. The `corr.test()` function is particularly useful for computing correlation coefficients while handling missing data and providing additional statistics, such as confidence intervals.```{r correlation-matrices-comparison-table}# Strip 'synth_' so names line up with realstrip_synth_prefix <- function(df) { names(df) <- sub("^synth_", "", names(df)) df}# Build a tidy table of pairwise correlations for real vs one synthetic datasetcorr_compare_one <- function(real, synth, label) { real_num <- real |> select(where(is.numeric)) synth_num <- synth |> strip_synth_prefix() |> select(where(is.numeric)) common <- intersect(names(real_num), names(synth_num)) if (length(common) < 2) { return(tibble(Dataset = label, Var1 = character(), Var2 = character(), Real = numeric(), Synthetic = numeric(), Diff = numeric(), AbsDiff = numeric())) } # Compute Pearson correlations with pairwise complete observations rc <- cor(real_num[common], use = "pairwise.complete.obs", method = "pearson") sc <- cor(synth_num[common], use = "pairwise.complete.obs", method = "pearson") # Keep only upper triangle (unique pairs) ut <- upper.tri(rc, diag = FALSE) pairs <- which(ut, arr.ind = TRUE) tibble( Var1 = colnames(rc)[pairs[, 1]], Var2 = colnames(rc)[pairs[, 2]], Real = rc[pairs], Synthetic = sc[pairs] ) |> mutate( Dataset = label, Diff = Synthetic - Real, AbsDiff = abs(Diff) ) |> select(Dataset, Var1, Var2, Real, Synthetic, Diff, AbsDiff)}# Build one combined table for all synthetic datasetscorr_cmp_tbl <- bind_rows( corr_compare_one(heart_failure, syn_data_1, "Parametric MICE"), corr_compare_one(heart_failure, syn_cart_1, "CART Imputation"), corr_compare_one(heart_failure, syn_data_low_fidelity_synthpop, "Synthpop (Low Fidelity)"), corr_compare_one(heart_failure, syn_data_metadata, "Metadata-Based")) |> arrange(Dataset, desc(AbsDiff)) |> mutate( Real = round(Real, 2), Synthetic = round(Synthetic, 2), Diff = round(Diff, 2), AbsDiff = round(AbsDiff, 2) )knitr::kable( corr_cmp_tbl, caption = "Pairwise Correlations: Real vs. Synthetic (sorted by largest absolute difference per dataset)", align = "lllrrrr")```#### Correlation Matrices Comparison using CorrplotThe `corrplot` package is widely used for visualizing correlation matrices. It provides a variety of options for customization, making it easy to identify patterns and relationships between variables. This method is particularly effective for visual comparisons because it highlights differences in correlation strength and direction.```{r Correlation Matrices Comparison using Corrplot, fig.width=14, fig.height=6, fig.align='center', out.width='100%'}# Remove 'synth_' so names match the real datasetstrip_synth_prefix <- function(df) { names(df) <- sub("^synth_", "", names(df)); df }# Side-by-side correlation matrices (full width)compare_correlation_matrices <- function(real_data, synthetic_data, dataset_name) { # Numeric-only real_num <- real_data[sapply(real_data, is.numeric)] synth_num <- strip_synth_prefix(synthetic_data)[sapply(strip_synth_prefix(synthetic_data), is.numeric)] # Use common numeric variables in same order common <- intersect(names(real_num), names(synth_num)) if (length(common) < 2) { message(dataset_name, ": fewer than 2 common numeric columns — skipping.") return(invisible(NULL)) } real_corr <- cor(real_num[common], use = "complete.obs") synth_corr <- cor(synth_num[common], use = "complete.obs") # Make two panels across the page op <- par(no.readonly = TRUE); on.exit(par(op), add = TRUE) par(mfrow = c(1, 2), mar = c(4, 4, 4, 2)) corrplot(real_corr, tl.cex = 0.8, mar = c(0,0,2,0)) title(paste(dataset_name, "- Real Data"), line = 1) corrplot(synth_corr, tl.cex = 0.8, mar = c(0,0,2,0)) title(paste(dataset_name, "- Synthetic Data"), line = 1)}# Run for each synthetic datasetcompare_correlation_matrices(heart_failure, syn_data_1, "Parametric MICE")compare_correlation_matrices(heart_failure, syn_cart_1, "CART Imputation")compare_correlation_matrices(heart_failure, syn_data_low_fidelity_synthpop, "Synthpop (Low Fidelity)")compare_correlation_matrices(heart_failure, syn_data_metadata, "Metadata-Based")```#### Correlation Matrices Comparison using GGallyThe `GGally` package extends `ggplot2` to create pairwise visualizations, including correlation matrix plots. Using `ggpairs()`, you can generate a matrix of scatterplots, histograms, and correlation coefficients, which provides a detailed view of the relationships between numeric variables in the real and synthetic datasets.```{r Correlation Matrices Comparison using GGally, fig.width=11, fig.height=11, fig.align='center', out.width='100%'}compare_correlation_matrices_ggally <- function(real_data, synthetic_data, dataset_name) { # Select only numeric columns from both datasets real_numeric <- real_data %>% select(where(is.numeric)) # Remove prefix 'synth_' from synthetic data column names names(synthetic_data) <- gsub("^synth_", "", names(synthetic_data)) synthetic_numeric <- synthetic_data %>% select(where(is.numeric)) # Ensure that both datasets have the same columns common_columns <- intersect(names(real_numeric), names(synthetic_numeric)) if (length(common_columns) == 0) { cat("\nNo common numeric columns found for comparison in", dataset_name, "\n") return() } real_numeric <- real_numeric[, common_columns, drop = FALSE] synthetic_numeric <- synthetic_numeric[, common_columns, drop = FALSE] # Add a column to differentiate between real and synthetic data real_numeric$Dataset <- "Real" synthetic_numeric$Dataset <- "Synth" # Combine the datasets combined_data <- rbind(real_numeric, synthetic_numeric) # Create correlation matrix plot using GGally if there are valid columns to plot if (ncol(combined_data) > 1) { p <- ggpairs( combined_data, aes(color = Dataset, alpha = 0.5), title = paste(dataset_name, "- Correlation Matrix Comparison"), columns = 1:(ncol(combined_data) - 1), # Exclude the 'Dataset' column from the plot upper = list(continuous = wrap("cor", size = 3, alignPercent = 0.5)), # Smaller correlation labels lower = list(continuous = wrap("smooth", alpha = 0.4, size = 0.5)), # Smoother plots for readability diag = list(continuous = wrap("densityDiag", alpha = 0.5)) # Density plot with improved transparency ) + theme_minimal() + # Use a cleaner theme theme( strip.text = element_text(size = 8, face = "bold"), # Smaller strip labels axis.text.x = element_text(angle = 45, hjust = 1, size = 7), # Rotate x-axis labels for better readability axis.text.y = element_text(size = 7), legend.position = "top", # Move legend to a better location legend.title = element_blank() # Remove unnecessary legend title ) print(p) } else { cat("\nNot enough numeric columns to generate a correlation matrix plot for", dataset_name, "\n") }}# Apply the function to each synthetic datasetcompare_correlation_matrices_ggally(heart_failure, syn_data_1, "Parametric MICE")compare_correlation_matrices_ggally(heart_failure, syn_cart_1, "CART Imputation")compare_correlation_matrices_ggally(heart_failure, syn_data_low_fidelity_synthpop, "Synthpop Low Fidelity")compare_correlation_matrices_ggally(heart_failure, syn_data_metadata, "Metadata-Based Synthetic Data")```## Fidelity Assessment### Descriptive StatisticsThis section provides a comparative analysis of the descriptive statistics between the real dataset and the synthetic datasets generated using different methods.#### Real DatasetThe descriptive statistics for the real heart failure dataset include metrics such as mean, median, standard deviation, and range. This serves as the baseline for comparison with the synthetic datasets.```{r Descriptive statistics for for the Real Dataset}# Generate descriptive statsdesc_tbl <- describe(heart_failure) %>% as.data.frame() %>% select(n, mean, sd, min, median, max) %>% round(2) %>% tibble::rownames_to_column("Variable")# Display as a neat tablekable( desc_tbl, caption = "Descriptive Statistics for the Heart Failure Dataset", align = "lrrrrrr")```#### Parametric Imputation (MICE)The synthetic data generated using the MICE framework is analyzed to compare its summary statistics with the real dataset. This will help evaluate how well the parametric method captures the key characteristics of the data.```{r Descriptive statistics for parametric MICE dataset}# Descriptive stats for Parametric MICE synthetic datasetdesc_mice <- describe(syn_data_1) %>% as.data.frame() %>% select(n, mean, sd, min, median, max) %>% round(2) %>% tibble::rownames_to_column("Variable")# Display tablekable( desc_mice, caption = "Descriptive Statistics for Parametric MICE Synthetic Dataset", align = "lrrrrrr")```#### Non-Parametric Imputation (CART)The CART method is a non-parametric technique that handles non-linear relationships. The descriptive statistics of the CART-imputed synthetic data provide insight into how this method captures the dataset's distribution.```{r Descriptive statistics for non-parametric CART dataset}# Descriptive stats for CART synthetic datasetdesc_cart <- describe(syn_cart_1) %>% as.data.frame() %>% select(n, mean, sd, min, median, max) %>% round(2) %>% tibble::rownames_to_column("Variable")# Display tablekable( desc_cart, caption = "Descriptive Statistics for CART Synthetic Dataset", align = "lrrrrrr")```#### Synthpop-Based Synthetic DataThe Synthpop package generates synthetic data aimed at preserving data privacy. Descriptive statistics for this dataset help assess how well the key attributes of the real data are preserved.```{r Descriptive statistics for the synthpop dataset}# Descriptive stats for Synthpop synthetic datasetdesc_synthpop <- describe(syn_data_low_fidelity_synthpop) %>% as.data.frame() %>% select(n, mean, sd, min, median, max) %>% round(2) %>% tibble::rownames_to_column("Variable")# Display as a clean tablekable( desc_synthpop, caption = "Descriptive Statistics for Synthpop Synthetic Dataset", align = "lrrrrrr")```#### Metadata-Based Synthetic DataThe synthetic data generated using metadata ensures that variable structures conform to the data dictionary. The summary statistics here show how well the dataset reflects the real's attributes based on predefined metadata.```{r Descriptive statistics for syn_data_metadata}# Descriptive stats for Metadata-Based synthetic datasetdesc_metadata <- describe(syn_data_metadata) %>% as.data.frame() %>% select(n, mean, sd, min, median, max) %>% round(2) %>% tibble::rownames_to_column("Variable")# Display as a clean tablekable( desc_metadata, caption = "Descriptive Statistics for Metadata-Based Synthetic Dataset", align = "lrrrrrr")```By comparing the descriptive statistics, you can evaluate how closely the synthetic datasets resemble the real dataset across key summary metrics. This is a critical step in assessing the quality and usability of synthetic data.Histogram Similarity Score### Variable Exploration#### Parametric Imputation (MICE)The parametric imputation using the MICE framework generates synthetic data by imputing missing values based on multivariate normal distributions.##### Bar Plots for Categorical VariablesTo evaluate the preservation of categorical variable distributions in the synthetic data, we generate bar plots comparing the proportions of categorical variables between the real and synthetic datasets.```{r Bar Plots for Categorical Variables, fig.width=14, fig.height=6, fig.align='center', out.width='100%'}# Combine real and parametric MICE synthetic datasetsheart_failure$dataset <- "real"syn_data_1$dataset <- "synthetic"# Adjust the column names for synthetic data by removing the 'synth_' prefixcolnames(syn_data_1) <- gsub("synth_", "", colnames(syn_data_1))# Combine both datasets (real and synthetic)combined_data_param <- bind_rows(heart_failure, syn_data_1)# Bar plots for categorical variablesbar_plots <- colnames(heart_failure)[map_lgl(heart_failure, is.factor)] %>% map(~ ggplot(combined_data_param, aes_string(.x, fill = 'dataset', group = 'dataset')) + geom_bar(aes(y = ..prop..), position = position_dodge2(), stat = "count") + scale_fill_manual(values = c("real" = "#B0D99B", "synthetic" = "#528AA8")) + labs(x = .x, y = "Proportion")) %>% patchwork::wrap_plots() + plot_annotation(title = "Comparison of Categorical Variables Between real and Synthetic Datasets", theme = theme(plot.title = element_text(hjust = 0.5)))# Print the combined plot with one main titleprint(bar_plots)```##### Density Plots for Numeric VariablesThe density plots illustrate the distribution of numeric variables between the real dataset and the parametric MICE synthetic data. The goal is to ensure that the synthetic data closely follows the real data’s numeric distribution.```{r Density plots for numeric variables (parametric imputed data), fig.width=14, fig.height=10, fig.align='center', out.width='100%'}# Density plots for numeric variables (parametric imputed data)density_plots_param <- colnames(heart_failure)[map_lgl(heart_failure, is.numeric)] %>% map(~ ggplot(combined_data_param, aes_string(.x, fill = 'dataset', group = 'dataset')) + geom_density(alpha = 0.5) + scale_fill_manual(values = c("real" = "#B0D99B", "synthetic" = "#528AA8")) + labs(x = .x, y = "Density")) %>% patchwork::wrap_plots(ncol = 2) + # Arrange plots in 2 columns plot_annotation(title = "Comparison of Numeric Variables Between real and Parametric Synthetic Data", theme = theme(plot.title = element_text(hjust = 0.5)))# Print the combined density plots with one main titleprint(density_plots_param)```#### Non-Parametric Imputation (CART)CART is a non-parametric approach for generating synthetic data that captures complex relationships between variables without making distributional assumptions.##### Bar Plots for Categorical VariablesThe bar plots compare the distribution of categorical variables in the real and synthetic datasets generated using the CART method.```{r Bar plots for categorical variables (CART-imputed data), fig.width=14, fig.height=6, fig.align='center', out.width='100%'}# Combine real and non-parametric CART synthetic datasetssyn_cart_1$dataset <- "synthetic"# Adjust the column names for synthetic data by removing the 'synth_' prefixcolnames(syn_cart_1) <- gsub("synth_", "", colnames(syn_cart_1))# Combine both datasets (real and synthetic)combined_data_cart <- bind_rows(heart_failure, syn_cart_1)# Bar plots for categorical variables (CART-imputed data)bar_plots_cart <- colnames(heart_failure)[map_lgl(heart_failure, is.factor)] %>% map(~ ggplot(combined_data_cart, aes_string(.x, fill = 'dataset', group = 'dataset')) + geom_bar(aes(y = ..prop..), position = position_dodge2(), stat = "count") + scale_fill_manual(values = c("real" = "#B0D99B", "synthetic" = "#528AA8")) + labs(x = .x, y = "Proportion")) %>% patchwork::wrap_plots() + plot_annotation(title = "Comparison of Categorical Variables Between real and Synthetic Data", theme = theme(plot.title = element_text(hjust = 0.5)))# Print the combined plot with one main titleprint(bar_plots_cart)```##### Density Plots for Numeric VariablesThe density plots display how well the numeric variables in the CART-imputed synthetic data align with the real dataset.```{r Density plots for numeric variables (CART-imputed data), fig.width=14, fig.height=10, fig.align='center', out.width='100%'}# Density plots for numeric variables (CART-imputed data)density_plots_cart <- colnames(heart_failure)[map_lgl(heart_failure, is.numeric)] %>% map(~ ggplot(combined_data_cart, aes_string(.x, fill = 'dataset', group = 'dataset')) + geom_density(alpha = 0.5) + scale_fill_manual(values = c("real" = "#B0D99B", "synthetic" = "#528AA8")) + labs(x = .x, y = "Density")) %>% patchwork::wrap_plots(ncol = 2) + # Arrange plots in 2 columns plot_annotation(title = "Comparison of Numeric Variables Between real and Synthetic Data", theme = theme(plot.title = element_text(hjust = 0.5)))# Print the combined density plots with one main titleprint(density_plots_cart)```#### Synthpop-Based Synthetic DataSynthpop is used to generate synthetic datasets that closely follow the distributions of the real data for privacy-preserving data sharing.##### Bar Plots for Categorical VariablesThese bar plots assess how well Synthpop-based synthetic data maintains the categorical distributions of the real dataset.```{r Bar plots for categorical variables (Synthpop-based synthetic data), fig.width=14, fig.height=6, fig.align='center', out.width='100%'}# Combine real and Synthpop-based synthetic datasetsheart_failure$dataset <- "real"syn_data_low_fidelity_synthpop$dataset <- "synthetic"# Adjust the column names for synthetic data by removing the 'synth_' prefixcolnames(syn_data_low_fidelity_synthpop) <- gsub("synth_", "", colnames(syn_data_low_fidelity_synthpop))# Combine both datasets (real and synthetic)combined_data_synthpop <- bind_rows(heart_failure, syn_data_low_fidelity_synthpop)# Bar plots for categorical variables (Synthpop-based synthetic data)bar_plots_synthpop <- colnames(heart_failure)[map_lgl(heart_failure, is.factor)] %>% map(~ ggplot(combined_data_synthpop, aes_string(.x, fill = 'dataset', group = 'dataset')) + geom_bar(aes(y = ..prop..), position = position_dodge2(), stat = "count") + scale_fill_manual(values = c("real" = "#B0D99B", "synthetic" = "#528AA8")) + labs(x = .x, y = "Proportion")) %>% patchwork::wrap_plots() + plot_annotation(title = "Comparison of Categorical Variables Between real and Synthetic Data", theme = theme(plot.title = element_text(hjust = 0.5)))# Print the combined plot with one main titleprint(bar_plots_synthpop)```##### Density Plots for Numeric VariablesThe density plots illustrate how closely Synthpop-based synthetic data matches the real dataset for numeric variables.```{r Density plots for numeric variables (Synthpop-based synthetic data), fig.width=14, fig.height=10, fig.align='center', out.width='100%'}# Combine real and Synthpop-based synthetic datasetsheart_failure$dataset <- "real"syn_data_low_fidelity_synthpop$dataset <- "synthetic"combined_data_synthpop <- bind_rows(heart_failure, syn_data_low_fidelity_synthpop)# Density plots for numeric variables (Synthpop-based synthetic data)density_plots_synthpop <- colnames(heart_failure)[map_lgl(heart_failure, is.numeric)] %>% map(~ ggplot(combined_data_synthpop, aes_string(.x, fill = 'dataset', group = 'dataset')) + geom_density(alpha = 0.5) + scale_fill_manual(values = c("real" = "#B0D99B", "synthetic" = "#528AA8")) + labs(x = .x, y = "Density")) %>% patchwork::wrap_plots(ncol = 2) + # Arrange plots in 2 columns plot_annotation(title = "Comparison of Numeric Variables Between real and Synthetic Data", theme = theme(plot.title = element_text(hjust = 0.5)))# Print the combined density plots with one main titleprint(density_plots_synthpop)```#### Metadata-Based Synthetic DataMetadata-based synthetic data generation uses a predefined data dictionary to ensure that the synthetic data follows the correct structure and distribution.##### Bar Plots for Categorical VariablesWe compare the categorical variables of the real and metadata-based synthetic datasets to assess distributional alignment.```{r Bar plots for categorical variables (Metadata-based synthetic data), fig.width=14, fig.height=6, fig.align='center', out.width='100%'}# Combine real and Metadata-based synthetic datasetsheart_failure$dataset <- "real"syn_data_metadata$dataset <- "synthetic"# Adjust the column names for synthetic data by removing the 'synth_' prefixcolnames(syn_data_metadata) <- gsub("synth_", "", colnames(syn_data_metadata))# Combine both datasets (real and synthetic)combined_data_metadata <- bind_rows(heart_failure, syn_data_metadata)# Bar plots for categorical variables (Metadata-based synthetic data)bar_plots_metadata <- colnames(heart_failure)[map_lgl(heart_failure, is.factor)] %>% map(~ ggplot(combined_data_metadata, aes_string(.x, fill = 'dataset', group = 'dataset')) + geom_bar(aes(y = ..prop..), position = position_dodge2(), stat = "count") + scale_fill_manual(values = c("real" = "#B0D99B", "synthetic" = "#528AA8")) + labs(x = .x, y = "Proportion")) %>% patchwork::wrap_plots() + plot_annotation(title = "Comparison of Categorical Variables Between real and Synthetic Datasets", theme = theme(plot.title = element_text(hjust = 0.5)))# Print the combined plot with one main titleprint(bar_plots_metadata)```##### Density Plots for Numeric VariablesThe density plots for numeric variables compare the distribution of numeric values between the real and metadata-based synthetic datasets.```{r Density Plots for Numeric Variables, fig.width=14, fig.height=10, fig.align='center', out.width='100%'}# Combine real and Metadata-based synthetic datasetsheart_failure$dataset <- "real"syn_data_metadata$dataset <- "synthetic"combined_data_metadata <- bind_rows(heart_failure, syn_data_metadata)# Density plots for numeric variables (Metadata-based synthetic data)density_plots_metadata <- colnames(heart_failure)[map_lgl(heart_failure, is.numeric)] %>% map(~ ggplot(combined_data_metadata, aes_string(.x, fill = 'dataset', group = 'dataset')) + geom_density(alpha = 0.5) + scale_fill_manual(values = c("real" = "#B0D99B", "synthetic" = "#528AA8")) + labs(x = .x, y = "Density")) %>% patchwork::wrap_plots(ncol = 2) + # Arrange plots in 2 columns plot_annotation(title = "Comparison of Numeric Variables Between real and Synthetic Data", theme = theme(plot.title = element_text(hjust = 0.5)))# Print the combined density plots with one main titleprint(density_plots_metadata)```### Histogram Similarity ScoreThe Histogram Similarity Score measures how closely the marginal distributions of the synthetic data match the real data. Marginal distributions are critical for assessing whether key data characteristics such as spread, central tendency (mean/median), and shape (skewness/kurtosis) are preserved.Ideal Ranges for Histogram Similarity:- **Perfect Match:** 1 (100%) indicates a perfect match between the distributions.- **Good Match:** 0.85–1 (85% to 100%) shows strong similarity.- **Moderate Match:** 0.65–0.85 suggests some differences that may need addressing.- **Poor Match:** Below 0.65 indicates significant deviation from the real data.```{r histogram-similarity-score, fig.width=10, fig.height=4, fig.align='center', out.width='100%'}# Custom palette for datasetsdataset_palette <- c( "Parametric MICE" = "#B0D99B", "CART Imputation" = "#528AA8", "Synthpop (Low Fidelity)" = "#FFB6DB", "Metadata-Based" = "#264653")# Define function oncecalculate_histogram_similarity <- function(real_data, synthetic_data) { real_numeric <- real_data[sapply(real_data, is.numeric)] synthetic_numeric <- synthetic_data[sapply(synthetic_data, is.numeric)] similarity_scores <- sapply(names(real_numeric), function(var) { real_hist <- hist(real_numeric[[var]], plot = FALSE) synthetic_hist <- hist(synthetic_numeric[[var]], plot = FALSE) # Wasserstein distance between histogram midpoints wasserstein_distance <- transport::wasserstein1d(real_hist$mids, synthetic_hist$mids) 1 / (1 + wasserstein_distance) # similarity score }) mean(similarity_scores)}# Collect results in a tibble (convert to percentages)hist_sim_tbl <- tibble::tibble( Dataset = c("Parametric MICE", "CART Imputation", "Synthpop (Low Fidelity)", "Metadata-Based"), Histogram_Similarity = c( calculate_histogram_similarity(heart_failure, syn_data_1), calculate_histogram_similarity(heart_failure, syn_cart_1), calculate_histogram_similarity(heart_failure, syn_data_low_fidelity_synthpop), calculate_histogram_similarity(heart_failure, syn_data_metadata) )) %>% mutate(Histogram_Similarity = round(Histogram_Similarity * 100, 1)) # percentage# Display as tablekable( hist_sim_tbl, caption = "Histogram Similarity Scores (%): higher = more similar to real data", align = "lr")# Horizontal bar chart with custom coloursggplot(hist_sim_tbl, aes(y = Dataset, x = Histogram_Similarity, fill = Dataset)) + geom_col(alpha = 0.9, show.legend = FALSE) + geom_text(aes(label = paste0(Histogram_Similarity, "%")), hjust = -0.1, size = 3.3) + scale_fill_manual(values = dataset_palette) + scale_x_continuous(labels = scales::percent_format(scale = 1), limits = c(0, 100)) + labs( title = "Histogram Similarity by Synthetic Method", subtitle = "Based on Wasserstein distance of variable histograms", x = "Similarity (%)", y = NULL ) + theme_minimal() + theme(axis.text.y = element_text(size = 10))```### Mutual Information ScoreThe Mutual Information (MI) Score measures the shared information between pairs of features in the real and synthetic datasets, assessing how well relationships between variables are preserved.Ideal Ranges for Mutual Information:- **Perfect Alignment:** MI scores for synthetic data should ideally match the real within 10-20% for critical feature pairs.- **Lower MI:** Consistently lower MI scores suggest that noise has been added, weakening relationships.- **Overfitting Risk:** Higher MI scores than the real may indicate overfitting, increasing re-identification risk.```{r mutual-information-score, fig.width=10, fig.height=4, fig.align='center', out.width='100%'}# Custom palette for datasetsdataset_palette <- c( "Parametric MICE" = "#B0D99B", "CART Imputation" = "#528AA8", "Synthpop (Low Fidelity)" = "#FFB6DB", "Metadata-Based" = "#264653")# ------- Helpers -------# Strip 'synth_' so variable names align with the real datasetstrip_synth_prefix <- function(df) { names(df) <- sub("^synth_", "", names(df)) df}# Compute pairwise MI matrix for a data.frame of discretized columnscompute_mi_matrix <- function(df_disc) { p <- ncol(df_disc) if (p < 2) return(matrix(numeric(0), nrow = p, ncol = p, dimnames = list(colnames(df_disc), colnames(df_disc)))) M <- matrix(0, nrow = p, ncol = p, dimnames = list(colnames(df_disc), colnames(df_disc))) for (i in seq_len(p)) { for (j in i:p) { # safe MI (skip NA rows; zero if not enough variation) ok <- is.finite(df_disc[[i]]) & is.finite(df_disc[[j]]) xi <- df_disc[[i]][ok]; xj <- df_disc[[j]][ok] mi <- if (length(unique(xi)) < 2 || length(unique(xj)) < 2) 0 else suppressWarnings(infotheo::mutinformation(xi, xj)) M[i, j] <- mi M[j, i] <- mi } } M}# Compare two MI matrices (upper triangle only) -> 0–1 similarity score (higher = closer)mi_similarity_score <- function(M_real, M_synth) { if (length(M_real) == 0 || length(M_synth) == 0) return(NA_real_) common <- intersect(colnames(M_real), colnames(M_synth)) if (length(common) < 2) return(NA_real_) R <- M_real[common, common, drop = FALSE] S <- M_synth[common, common, drop = FALSE] ut <- upper.tri(R, diag = FALSE) mean_abs_diff <- mean(abs(R[ut] - S[ut])) 1 / (1 + mean_abs_diff)}# Discretize numeric columns (equal-frequency bins); keep only shared numeric varsMutual_Information <- function(real_df, synth_df, nbins = 10) { real_num <- dplyr::select(real_df, where(is.numeric)) synth_num <- strip_synth_prefix(synth_df) |> dplyr::select(where(is.numeric)) common <- intersect(names(real_num), names(synth_num)) if (length(common) < 2) return(NA_real_) real_num <- real_num[common] synth_num <- synth_num[common] # Discretize numerics (returns integer codes) real_disc <- as.data.frame(lapply(real_num, \(x) infotheo::discretize(x, disc = "equalfreq", nbins = nbins))) synth_disc <- as.data.frame(lapply(synth_num, \(x) infotheo::discretize(x, disc = "equalfreq", nbins = nbins))) # MI matrices and similarity score M_real <- compute_mi_matrix(real_disc) M_synth <- compute_mi_matrix(synth_disc) mi_similarity_score(M_real, M_synth)}# ------- Compute scores and present nicely (percentages) -------mi_results <- tibble::tibble( Dataset = c("Parametric MICE", "CART Imputation", "Synthpop (Low Fidelity)", "Metadata-Based"), Mutual_Information = c( Mutual_Information(heart_failure, syn_data_1), Mutual_Information(heart_failure, syn_cart_1), Mutual_Information(heart_failure, syn_data_low_fidelity_synthpop), Mutual_Information(heart_failure, syn_data_metadata) )) |> mutate(Mutual_Information = round(Mutual_Information * 100, 1)) # convert to %# User-friendly tableknitr::kable( mi_results, caption = "Mutual Information Similarity (%): higher = synthetic preserves dependency structure more closely", align = "lr")# Horizontal bar chart with custom coloursggplot(mi_results, aes(y = Dataset, x = Mutual_Information, fill = Dataset)) + geom_col(alpha = 0.9, show.legend = FALSE) + geom_text(aes(label = paste0(Mutual_Information, "%")), hjust = -0.1, size = 3.3) + scale_fill_manual(values = dataset_palette) + scale_x_continuous(labels = scales::percent_format(scale = 1), limits = c(0, 100)) + labs( title = "Mutual Information Similarity by Synthetic Method", subtitle = "Based on pairwise MI matrices of discretized numeric variables (equal-frequency bins)", x = "Similarity (%)", y = NULL ) + theme_minimal() + theme(axis.text.y = element_text(size = 10))```### Correlation ScoreThe Correlation Score measures how well the relationships (correlations) between variables are preserved in the synthetic dataset compared to the real.Ideal Ranges for Correlation Score:- **Perfect Alignment:** Correlations in the synthetic data should match the real closely, with deviations of 5-10% being acceptable.- **Weaker Correlations:** If synthetic correlations are weaker, it may indicate loss of fidelity due to noise.- **Stronger Correlations:** Stronger correlations suggest overfitting, which may increase privacy risks.```{r correlation-score, fig.width=10, fig.height=4, fig.align='center', out.width='100%'}# Custom palette for datasetsdataset_palette <- c( "Parametric MICE" = "#B0D99B", "CART Imputation" = "#528AA8", "Synthpop (Low Fidelity)" = "#FFB6DB", "Metadata-Based" = "#264653")# --- Strip 'synth_' so names alignstrip_synth_prefix <- function(df) { names(df) <- sub("^synth_", "", names(df)) df}# --- Function to calculate Correlation Similarity Scorecalculate_correlation_score <- function(real_data, synthetic_data) { # Keep numeric columns only real_num <- real_data[sapply(real_data, is.numeric)] synth_num <- strip_synth_prefix(synthetic_data)[sapply(strip_synth_prefix(synthetic_data), is.numeric)] # Use only common variables common <- intersect(names(real_num), names(synth_num)) if (length(common) < 2) return(NA_real_) real_corr <- cor(real_num[common], use = "complete.obs") synth_corr <- cor(synth_num[common], use = "complete.obs") # Frobenius norm difference → similarity score in [0,1] corr_diff <- norm(real_corr - synth_corr, type = "F") 1 / (1 + corr_diff)}# --- Compute scores for all synthetic datasetscorr_scores <- tibble::tibble( Dataset = c("Parametric MICE", "CART Imputation", "Synthpop (Low Fidelity)", "Metadata-Based"), Correlation_Score = c( calculate_correlation_score(heart_failure, syn_data_1), calculate_correlation_score(heart_failure, syn_cart_1), calculate_correlation_score(heart_failure, syn_data_low_fidelity_synthpop), calculate_correlation_score(heart_failure, syn_data_metadata) )) %>% mutate(Correlation_Score = round(Correlation_Score * 100, 1)) # convert to %# --- User-friendly tablekable( corr_scores, caption = "Correlation Similarity Scores (%): higher = synthetic preserves correlation structure more closely", align = "lr")# --- Horizontal bar chart with custom coloursggplot(corr_scores, aes(y = Dataset, x = Correlation_Score, fill = Dataset)) + geom_col(alpha = 0.9, show.legend = FALSE) + geom_text(aes(label = paste0(Correlation_Score, "%")), hjust = -0.1, size = 3.3) + scale_fill_manual(values = dataset_palette) + scale_x_continuous(labels = scales::percent_format(scale = 1), limits = c(0, 100)) + labs( title = "Correlation Similarity by Synthetic Method", subtitle = "Based on Frobenius norm difference between real and synthetic correlation matrices", x = "Similarity (%)", y = NULL ) + theme_minimal() + theme(axis.text.y = element_text(size = 10))```## Utility Assessment### Feature Importance Consistency AssessmentThe Feature Importance Consistency Assessment evaluates how well the synthetic datasets retain the predictive significance of features compared to the real dataset. This assessment is essential for ensuring that the synthetic data accurately reflects the underlying relationships that contribute to the model's predictive performance.Feature importance measures each variable's contribution to model predictions. Consistency in feature importance rankings between the real and synthetic datasets serves as a crucial indicator of data quality. Significant discrepancies may indicate deficiencies in the synthetic data generation process, potentially affecting the performance of models trained on synthetic data.Ideal Ranges for Feature Importance Consistency Assessment:- **Perfect Match**: Feature importance scores in the synthetic dataset should closely align with those in the real dataset, particularly for highly predictive features.- **Acceptable Deviation**: A deviation of 10-15% is generally acceptable for most features. For variables identified as highly influential, stricter thresholds may be needed to maintain predictive accuracy.Types of Deviations- **Underrepresentation**: If an important feature has lower importance in the synthetic dataset, it may suggest that critical predictive relationships are not well-preserved.- **Overrepresentation**: If a feature's importance is higher in the synthetic data, this could indicate noise or overfitting, potentially distorting the model’s perception of feature relationships.The following visualisation techniques will be used to compare feature importance:- **Bar Plots for Feature Importance Comparison**: Bar plots will compare feature importance scores between the real and synthetic datasets. This visual comparison can reveal significant deviations.- **SHAP Summary Plots**: SHAP (SHapley Additive exPlanations) summary plots provide a comprehensive view of the distribution of SHAP values for each feature, allowing for a direct comparison between the real and synthetic datasets.```{r Feature Importance Consistency Assessment, fig.width=10, fig.height=4, fig.align='center', out.width='100%'}# Prepare the data for XGBoostset.seed(123) # For reproducibilitytrain_indices <- sample(seq_len(nrow(heart_failure)), size = 0.7 * nrow(heart_failure))train_real <- heart_failure[train_indices, ]test_real <- heart_failure[-train_indices, ]train_real_matrix <- model.matrix(deceased ~ age + sex + anaemia + creatinine_phosphokinase + diabetes + ejection_fraction + platelets + serum_creatinine + serum_sodium + smoking + hypertension + follow_up - 1, data = train_real)train_real_label <- train_real[rownames(train_real_matrix), "deceased"]train_real_label <- as.numeric(train_real_label) - 1 # Convert factor to 0 and 1train_synth <- syn_data_1[train_indices, ]train_synth_matrix <- model.matrix(deceased ~ age + sex + anaemia + creatinine_phosphokinase + diabetes + ejection_fraction + platelets + serum_creatinine + serum_sodium + smoking + hypertension + follow_up - 1, data = train_synth)train_synth_label <- train_synth[rownames(train_synth_matrix), "deceased"]train_synth_label <- as.numeric(train_synth_label) - 1 # Convert factor to 0 and 1# Convert to DMatrix format, which is required for XGBoostdtrain_real <- xgb.DMatrix(data = train_real_matrix, label = train_real_label)dtrain_synth <- xgb.DMatrix(data = train_synth_matrix, label = train_synth_label)# Set up parameters for XGBoostparams <- list( objective = "binary:logistic", eval_metric = "auc", # Use AUC as an evaluation metric max_depth = 3, eta = 0.1)# Tune hyperparameters using cross-validation (with parallel processing)cv_real <- xgb.cv( params = params, data = dtrain_real, nrounds = 100, nfold = 5, verbose = 0, early_stopping_rounds = 10, nthread = num_cores # Number of threads for parallel processing)best_nrounds_real <- cv_real$best_iterationcv_synth <- xgb.cv( params = params, data = dtrain_synth, nrounds = 100, nfold = 5, verbose = 0, early_stopping_rounds = 10, nthread = num_cores # Number of threads for parallel processing)best_nrounds_synth <- cv_synth$best_iteration# Train XGBoost on real data (TRTR)xgb_model_trtr <- xgboost(params = params, data = dtrain_real, nrounds = best_nrounds_real, verbose = 0, nthread = num_cores)# Train XGBoost on synthetic data (TSTR)xgb_model_tstr <- xgboost(params = params, data = dtrain_synth, nrounds = best_nrounds_synth, verbose = 0, nthread = num_cores)# Get feature importance for TRTRimportance_trtr <- xgb.importance(feature_names = colnames(train_real_matrix), model = xgb_model_trtr)cat("Feature Importance for TRTR:\n")print(importance_trtr)# Get feature importance for TSTRimportance_tstr <- xgb.importance(feature_names = colnames(train_synth_matrix), model = xgb_model_tstr)cat("\nFeature Importance for TSTR:\n")print(importance_tstr)# Visualise feature importance for TRTR and TSTRimportance_trtr$Dataset <- "Real Data"importance_tstr$Dataset <- "Synthetic Data"importance_combined <- rbind(importance_trtr, importance_tstr)# Plot feature importance comparisonggplot(importance_combined, aes(x = reorder(Feature, Gain), y = Gain, fill = Dataset)) + geom_bar(stat = "identity", position = "dodge") + coord_flip() + labs(title = "Feature Importance Comparison (TRTR vs TSTR)", x = "Feature", y = "Gain") + scale_fill_manual(values = c("Real Data" = "#B0D99B", "Synthetic Data" = "#528AA8")) + theme_minimal()# Correlation analysis between feature importance rankingsimportance_trtr <- importance_trtr %>% mutate(rank = rank(-Gain))importance_tstr <- importance_tstr %>% mutate(rank = rank(-Gain))importance_ranking <- inner_join(importance_trtr, importance_tstr, by = "Feature", suffix = c("_trtr", "_tstr"))correlation <- cor(importance_ranking$rank_trtr, importance_ranking$rank_tstr, method = "spearman")cat("\nSpearman correlation Between TRTR and TSTR Feature Importance Rankings:\n")print(correlation)# Calculate SHAP values for better interpretability# SHAP values for TRTRshap_values_trtr <- shap.values(xgb_model_trtr, dtrain_real)shap_long_trtr <- shap.prep(shap_contrib = shap_values_trtr$shap_score, X_train = train_real_matrix)cat("\nSHAP Summary for TRTR:\n")shap_values_trtr$mean_shap_score %>% print()# SHAP values for TSTRshap_values_tstr <- shap.values(xgb_model_tstr, dtrain_synth)shap_long_tstr <- shap.prep(shap_contrib = shap_values_tstr$shap_score, X_train = train_synth_matrix)cat("\nSHAP Summary for TSTR:\n")shap_values_tstr$mean_shap_score %>% print()# SHAP visualisationshap.plot.summary(shap_long_trtr)shap.plot.summary(shap_long_tstr)```### Prediction ScoreThe Prediction Score assesses synthetic data utility by comparing models trained on real data (Train Real Test Real (TRTR)) versus synthetic data (Train Synthetic Test Real (TSTR)) using two metrics:- **Accuracy**: The proportion of correct predictions made by the model.- **pMSE (Prediction Mean Squared Error)**: Measures the average of the squared differences between predicted and actual values.Ideal Ranges for Prediction Score:- **Accuracy**: TSTR accuracy should be within 5-10% of TRTR accuracy. Lower accuracy signals that synthetic data isn’t fully representative of the real dataset.- **pMSE**: TSTR pMSE should be close to TRTR pMSE. A 5-10% deviation is acceptable; a higher pMSE suggests synthetic data may not be accurately reflecting the true relationships in the real data.This section assesses the utility of synthetic data generated through various methods, using XGBoost to calculate both Accuracy and pMSE for TRTR and TSTR.```{r Prediction Score, fig.width=10, fig.height=4, fig.align='center', out.width='100%'}set.seed(123)# ---- helpers ----strip_synth_prefix <- function(df) { names(df) <- sub("^synth_", "", names(df)); df }align_target_levels <- function(df, levels_ref = NULL, positive = NULL) { if (!is.factor(df$deceased)) { if (is.numeric(df$deceased)) df$deceased <- factor(as.character(df$deceased)) else df$deceased <- factor(df$deceased) } if (!is.null(levels_ref)) df$deceased <- factor(df$deceased, levels = levels_ref) lev <- levels(df$deceased) if (length(lev) < 2) stop("Target 'deceased' must have two levels.") pos <- if (!is.null(positive) && positive %in% lev) positive else if ("1" %in% lev) "1" else lev[2] df$deceased <- stats::relevel(df$deceased, ref = pos) list(data = df, levels = levels(df$deceased), positive = pos)}evaluate_model <- function(train_data, test_data, label, tune_grid, train_control, seed = 123) { test_aln <- align_target_levels(test_data) train_aln <- align_target_levels(train_data, levels_ref = test_aln$levels, positive = test_aln$positive) train_data <- train_aln$data test_data <- test_aln$data set.seed(seed) model <- caret::train( deceased ~ follow_up + serum_creatinine + ejection_fraction + creatinine_phosphokinase + serum_sodium + age + platelets, data = train_data, method = "xgbTree", trControl = train_control, tuneGrid = tune_grid, na.action = na.pass ) pred <- predict(model, test_data) acc <- mean(pred == test_data$deceased, na.rm = TRUE) yhat <- as.numeric(pred) y <- as.numeric(test_data$deceased) pMSE <- mean((yhat - y)^2, na.rm = TRUE) tibble::tibble(Dataset = label, Accuracy = acc, pMSE = pMSE)}# ---- CV setup ----train_control <- caret::trainControl(method = "cv", number = 5)tune_grid <- expand.grid( nrounds = 50, max_depth = 3, eta = 0.1, gamma = 0, colsample_bytree = 0.8, min_child_weight = 1, subsample = 0.7)# ---- prepare synthetic splits ----syn_mice <- strip_synth_prefix(syn_data_1)syn_cart <- strip_synth_prefix(syn_cart_1)syn_synthpop <- strip_synth_prefix(syn_data_low_fidelity_synthpop)syn_metadata <- strip_synth_prefix(syn_data_metadata)train_synth_mice <- syn_mice[train_indices, , drop = FALSE]train_synth_cart <- syn_cart[train_indices, , drop = FALSE]train_synth_synthpop <- syn_synthpop[train_indices, , drop = FALSE]train_synth_metadata <- syn_metadata[train_indices, , drop = FALSE]# ---- results table ----results <- dplyr::bind_rows( evaluate_model(train_real, test_real, "TRTR (Real→Real)", tune_grid, train_control), evaluate_model(train_synth_mice, test_real, "TSTR (MICE→Real)", tune_grid, train_control), evaluate_model(train_synth_cart, test_real, "TSTR (CART→Real)", tune_grid, train_control), evaluate_model(train_synth_synthpop, test_real, "TSTR (Synthpop→Real)", tune_grid, train_control), evaluate_model(train_synth_metadata, test_real, "TSTR (Metadata→Real)", tune_grid, train_control)) %>% mutate( Accuracy = scales::percent(Accuracy, accuracy = 0.1), pMSE = round(pMSE, 3) )knitr::kable( results, caption = "Prediction Score Summary (XGBoost, 5-fold CV): TRTR vs TSTR", align = "lrr")# ---- horizontal Accuracy bar chart ----palette5 <- c("#B0D99B", "#528AA8", "#FFB6DB", "#264653", "#2A9D8F")plot_df <- results %>% mutate(Accuracy_num = readr::parse_number(Accuracy) / 100)ggplot(plot_df, aes(y = Dataset, x = Accuracy_num, fill = Dataset)) + geom_col(alpha = 0.9, show.legend = FALSE) + geom_text(aes(label = Accuracy), hjust = -0.1, size = 3.5) + scale_fill_manual(values = setNames(palette5, unique(plot_df$Dataset))) + scale_x_continuous(labels = scales::percent_format(accuracy = 1), limits = c(0, 1)) + labs(title = "Model Accuracy: TRTR vs TSTR", x = "Accuracy", y = NULL) + theme_minimal()```### Quality Score (QScore)The QScore provides a comprehensive evaluation of synthetic data by aggregating multiple metrics, such as distribution similarity, feature importance, and correlation preservation, into a single score. This offers a clear, overall assessment of how well the synthetic data replicates the real dataset.Ideal Ranges for QScore:- **Excellent Quality (0.8 - 1.0)**: Indicates the synthetic data closely mirrors the real and can be confidently used for most analyses.- **Good Quality (0.6 - 0.8)**: Reflects good alignment, with some minor deviations. Suitable for most use cases, but some caution is advised.- **Moderate Quality (0.4 - 0.6)**: The synthetic data may be useful for exploratory analysis but shows significant deviations in key metrics.- **Poor Quality (< 0.4)**: The synthetic data poorly aligns with the real and is likely unsuitable for most analytic or modeling tasks.```{r Quality Score, fig.width=10, fig.height=4, fig.align='center', out.width='99%'}# ---- QScore helper ----calculate_qscore <- function(real_data, synthetic_data, num_queries = 10) { numeric_real_vars <- names(real_data)[sapply(real_data, is.numeric)] numeric_synth_vars <- names(synthetic_data)[sapply(synthetic_data, is.numeric)] common_vars <- intersect(numeric_real_vars, numeric_synth_vars) if (length(common_vars) == 0) return(NA_real_) qscores <- replicate(num_queries, { var_name <- sample(common_vars, 1) fun_name <- sample(c("mean", "sum"), 1) fun <- match.fun(fun_name) real_val <- fun(real_data[[var_name]], na.rm = TRUE) synth_val <- fun(synthetic_data[[var_name]], na.rm = TRUE) if (!is.na(real_val) && real_val != 0 && !is.na(synth_val)) { abs(real_val - synth_val) / abs(real_val) # proportion difference (0–1) } else { 0 } }) mean(qscores, na.rm = TRUE)}# ---- Compute and present ----qscore_tbl <- tibble::tibble( Dataset = c("Parametric MICE", "CART Imputation", "Synthpop (Low Fidelity)", "Metadata-Based"), QScore = c( calculate_qscore(heart_failure, syn_data_1), calculate_qscore(heart_failure, syn_cart_1), calculate_qscore(heart_failure, syn_data_low_fidelity_synthpop), calculate_qscore(heart_failure, syn_data_metadata) )) %>% dplyr::mutate( QScore_num = QScore, # keep numeric 0–1 QScore_pct = scales::percent(QScore_num, accuracy = 0.1) # pretty % )# --- User-friendly table (as percentages) ---knitr::kable( qscore_tbl %>% dplyr::select(Dataset, `QScore (%)` = QScore_pct), caption = "Quality Score (QScore) — average % difference in aggregates (lower is better)", align = "lr")# --- Horizontal bar chart (percent axis) ---pal <- c("#B0D99B", "#528AA8", "#FFB6DB", "#264653", "#2A9D8F")ggplot(qscore_tbl, aes(y = Dataset, x = QScore_num, fill = Dataset)) + geom_col(alpha = 0.9, show.legend = FALSE) + geom_text(aes(label = QScore_pct), hjust = -0.1, size = 3.5) + scale_fill_manual(values = setNames(pal[seq_len(nrow(qscore_tbl))], qscore_tbl$Dataset)) + scale_x_continuous(labels = scales::percent_format(accuracy = 1), limits = c(0, 1), expand = expansion(mult = c(0, 0.05))) + labs( title = "QScore Comparison (Lower is Better)", subtitle = "Average percentage difference between real and synthetic aggregates across random queries", x = "QScore (%)", y = NULL ) + theme_minimal()```## Privacy / Disclosure Risk### Check for Replication of Real Value CombinationsThis check identifies if any records in the synthetic data are exact duplicates of those in the real dataset. High replication increases privacy risks, so minimal or no duplication is ideal for preserving privacy.```{r Check for Replication of Real Value Combinations, fig.width=8, fig.height=5, fig.align='center', out.width='99%'}# ---- Replicated rows count ----check_replicated_rows <- function(real_data, synthetic_data) { combined <- rbind(real_data, synthetic_data) sum(duplicated(combined))}# ---- Build results table ----replication_tbl <- tibble::tibble( Dataset = c("Parametric MICE", "Non-Parametric CART", "Synthpop (Low Fidelity)", "Metadata-Based"), Replicated_Rows = c( check_replicated_rows(heart_failure, syn_data_1), check_replicated_rows(heart_failure, syn_cart_1), check_replicated_rows(heart_failure, syn_data_low_fidelity_synthpop), check_replicated_rows(heart_failure, syn_data_metadata) )) %>% mutate( Total_Rows = nrow(heart_failure), Replication_Percent = round((Replicated_Rows / Total_Rows) * 100, 2), Status = ifelse(Replicated_Rows > 0, "⚠️ Replication Found", "✅ No Replication") )# ---- User-friendly table ----knitr::kable( replication_tbl, caption = "Replication Check: Exact Matching Rows Between Real and Synthetic Datasets", align = "lrrrl")# ---- Optional: bar chart if any replication > 0 ----if (any(replication_tbl$Replicated_Rows > 0)) { pal <- c("#B0D99B", "#528AA8", "#FFB6DB", "#264653") ggplot(replication_tbl, aes(y = Dataset, x = Replicated_Rows, fill = Dataset)) + geom_col(alpha = 0.9, show.legend = FALSE) + geom_text(aes(label = Replicated_Rows), hjust = -0.1, size = 3.5) + scale_fill_manual(values = pal[seq_len(nrow(replication_tbl))]) + labs( title = "Replication of Real Rows in Synthetic Data", subtitle = "Counts of exact duplicated rows across real and synthetic datasets", x = "Number of Replicated Rows", y = NULL ) + theme_minimal()} else { cat("✅ No replicated rows were found across any synthetic dataset. No plot is generated.\n")}```### Check for Unique Value CombinationsThis analysis compares the number of unique row combinations between the real and synthetic datasets. A high number of unique combinations in the synthetic data suggests that it captures diverse patterns while minimising the replication of sensitive information, enhancing privacy protection.```{r Check for Unique Value Combinations, fig.width=10, fig.height=4, fig.align='center', out.width='99%'}# ---- Count unique rows ----unique_combinations <- function(data) { nrow(unique(data))}# ---- Build results table ----unique_tbl <- tibble::tibble( Dataset = c("Real Data", "Parametric MICE", "Non-Parametric CART", "Synthpop (Low Fidelity)", "Metadata-Based"), Unique_Combinations = c( unique_combinations(heart_failure), unique_combinations(syn_data_1), unique_combinations(syn_cart_1), unique_combinations(syn_data_low_fidelity_synthpop), unique_combinations(syn_data_metadata) )) %>% mutate( Total_Rows = c(nrow(heart_failure), rep(nrow(heart_failure), 4)), Percent_Unique = round((Unique_Combinations / Total_Rows) * 100, 2) )# ---- User-friendly table ----knitr::kable( unique_tbl, caption = "Unique Value Combinations in Real vs. Synthetic Datasets", align = "lrr")# ---- Plot only if any dataset has > 0 unique rows ----if (any(unique_tbl$Unique_Combinations > 0, na.rm = TRUE)) { pal <- c("#B0D99B", "#528AA8", "#FFB6DB", "#264653", "#2A9D8F") ggplot(unique_tbl, aes(y = Dataset, x = Percent_Unique, fill = Dataset)) + geom_col(alpha = 0.9, show.legend = FALSE) + geom_text(aes(label = paste0(Percent_Unique, "%")), hjust = -0.1, size = 3.5) + scale_fill_manual(values = pal[seq_len(nrow(unique_tbl))]) + scale_x_continuous(limits = c(0, 100)) + labs( title = "Percentage of Unique Row Combinations", subtitle = "Higher % means fewer duplicates within the dataset", x = "Percent of Rows that are Unique", y = NULL ) + theme_minimal()} else { cat("✅ No unique combinations detected across datasets. No plot is generated. \n")}```### Exact Match ScoreThe Exact Match Score measures how many individual records in the synthetic data are identical to records in the real dataset. Lower scores are ideal for privacy preservation, as high exact matches may pose privacy risks.Ideal Ranges for Exact Match Score:- **0 - 0.1**: Strong privacy protection with minimal or no exact matches, ideal for most privacy-preserving synthetic data.- **0.1 - 0.3**: Moderate exact matches, indicating some privacy risk but generally acceptable for many use cases.- **0.3 - 0.5**: Higher exact matches, raising privacy concerns. Review is needed to ensure proper data generation.- **Above 0.5**: Significant privacy risk, as the synthetic data too closely replicates the real.```{r Exact Match Score, fig.width=8, fig.height=5, fig.align='center', out.width='99%'}# ---- Membership Inference Score ----calculate_membership_inference_score <- function(real_data, synthetic_data, k = 5, threshold = 0.01) { real_num <- real_data[sapply(real_data, is.numeric)] synth_num <- synthetic_data[sapply(synthetic_data, is.numeric)] if (ncol(real_num) == 0 || ncol(synth_num) == 0) return(NA_real_) common <- intersect(names(real_num), names(synth_num)) if (length(common) == 0) return(NA_real_) real_num <- real_num[common] synth_num <- synth_num[common] real_cc <- real_num[complete.cases(real_num), , drop = FALSE] synth_cc <- synth_num[complete.cases(synth_num), , drop = FALSE] if (nrow(real_cc) < k || nrow(synth_cc) == 0) return(NA_real_) real_scaled <- scale(real_cc) synth_scaled <- scale(synth_cc) nn <- FNN::get.knnx(real_scaled, synth_scaled, k)$nn.dist risky <- sum(rowMeans(nn) < threshold) risky / nrow(synth_scaled)}# ---- Results table ----mis_tbl <- tibble::tibble( Dataset = c("Parametric MICE", "Non-Parametric CART", "Synthpop (Low Fidelity)", "Metadata-Based"), Membership_Inference_Score = c( calculate_membership_inference_score(heart_failure, syn_data_1), calculate_membership_inference_score(heart_failure, syn_cart_1), calculate_membership_inference_score(heart_failure, syn_data_low_fidelity_synthpop), calculate_membership_inference_score(heart_failure, syn_data_metadata) )) %>% mutate( Membership_Inference_Score = round(Membership_Inference_Score, 4), Status = dplyr::case_when( is.na(Membership_Inference_Score) ~ "— Insufficient data", Membership_Inference_Score == 0 ~ "✅ No risky memberships", Membership_Inference_Score < 0.01 ~ "⚠️ Low risk", TRUE ~ "❌ High risk" ) )# ---- User-friendly table ----knitr::kable( mis_tbl, caption = "Membership Inference Score: Proportion of synthetic records that are very close to real records (lower is better for privacy)", align = "lrr")# ---- Conditional plot ----if (any(mis_tbl$Membership_Inference_Score > 0, na.rm = TRUE)) { pal <- c("#B0D99B", "#528AA8", "#FFB6DB", "#264653") ggplot(mis_tbl, aes(y = Dataset, x = Membership_Inference_Score, fill = Dataset)) + geom_col(alpha = 0.9, show.legend = FALSE, na.rm = TRUE) + geom_text( aes(label = ifelse(is.na(Membership_Inference_Score), "NA", scales::percent(Membership_Inference_Score, accuracy = 0.01))), hjust = -0.1, size = 3.5 ) + scale_fill_manual(values = pal[seq_len(nrow(mis_tbl))]) + scale_x_continuous(labels = scales::percent_format(accuracy = 1)) + labs( title = "Membership Inference Score by Synthetic Method", subtitle = "Lower values indicate better privacy protection", x = "Score (Proportion of risky memberships)", y = NULL ) + theme_minimal()} else { cat("✅ All Membership Inference Scores are 0. No privacy risks detected, so no plot is generated\n")}```### Neighbors' Privacy ScoreThe Neighbors' Privacy Score identifies real records that are "too similar" to synthetic records by using a nearest-neighbors search. This metric helps evaluate privacy risk by detecting synthetic records that closely resemble real data, which could compromise privacy.Ideal Ranges for Neighbors' Privacy Score:- **0 - 0.1**: Indicates strong privacy protection, with very few synthetic records too similar to real ones.- **0.1 - 0.3**: Moderate similarity, where privacy risk exists but remains within acceptable limits for many cases.- **Above 0.3**: Higher similarity, raising privacy concerns and requiring review to ensure the synthetic data is not too closely mimicking real data.```{r Neighbors Privacy Score, fig.width=8, fig.height=5, fig.align='center', out.width='99%'}# ---- Neighbors' Privacy Score ----calculate_neighbors_privacy_score <- function(real_data, synthetic_data, k = 5, threshold = 0.01) { # Numeric-only columns real_num <- real_data[sapply(real_data, is.numeric)] synth_num <- synthetic_data[sapply(synthetic_data, is.numeric)] if (ncol(real_num) == 0 || ncol(synth_num) == 0) return(NA_real_) # Keep only common columns common <- intersect(names(real_num), names(synth_num)) if (length(common) == 0) return(NA_real_) real_num <- real_num[common] synth_num <- synth_num[common] # Complete cases real_cc <- real_num[complete.cases(real_num), , drop = FALSE] synth_cc <- synth_num[complete.cases(synth_num), , drop = FALSE] if (nrow(real_cc) < k || nrow(synth_cc) == 0) return(NA_real_) # Standardise real_scaled <- scale(real_cc) synth_scaled <- scale(synth_cc) # k-NN distances from synthetic → real neighbors <- FNN::get.knnx(real_scaled, synth_scaled, k)$nn.dist # Privacy risk: synthetic records "too close" to real close_matches <- sum(rowMeans(neighbors) < threshold) score <- close_matches / nrow(synth_scaled) return(score)}# ---- Build results table ----neighbors_tbl <- tibble::tibble( Dataset = c("Parametric MICE", "Non-Parametric CART", "Synthpop (Low Fidelity)", "Metadata-Based"), Neighbors_Privacy_Score = c( calculate_neighbors_privacy_score(heart_failure, syn_data_1), calculate_neighbors_privacy_score(heart_failure, syn_cart_1), calculate_neighbors_privacy_score(heart_failure, syn_data_low_fidelity_synthpop), calculate_neighbors_privacy_score(heart_failure, syn_data_metadata) )) %>% mutate( Neighbors_Privacy_Score = round(Neighbors_Privacy_Score, 4), Status = case_when( Neighbors_Privacy_Score == 0 ~ "✅ No risky neighbors", Neighbors_Privacy_Score < 0.01 ~ "⚠️ Low risk", TRUE ~ "❌ High risk" ) )# ---- User-friendly table ----knitr::kable( neighbors_tbl, caption = "Neighbors' Privacy Score: Proportion of synthetic records too close to real records (lower is better for privacy)", align = "lrr")# ---- Plot only if any score > 0 ----if (any(neighbors_tbl$Neighbors_Privacy_Score > 0)) { pal <- c("#B0D99B", "#528AA8", "#FFB6DB", "#264653") ggplot(neighbors_tbl, aes(y = Dataset, x = Neighbors_Privacy_Score, fill = Dataset)) + geom_col(alpha = 0.9, show.legend = FALSE) + geom_text(aes(label = Neighbors_Privacy_Score), hjust = -0.1, size = 3.5) + scale_fill_manual(values = pal[seq_len(nrow(neighbors_tbl))]) + scale_x_continuous(labels = scales::percent_format(accuracy = 0.1)) + labs( title = "Neighbors' Privacy Score by Synthetic Method", subtitle = "Lower values mean synthetic records are less likely to be close replicas of real individuals", x = "Privacy Score (Proportion of risky neighbors)", y = NULL ) + theme_minimal()} else { cat("✅ All Neighbors' Privacy Scores are 0. No risky neighbors detected. No plot is generated.\n")}```### Membership Inference ScoreThe Membership Inference Score evaluates the risk of membership inference attacks, which attempt to determine whether a specific record belongs to the real dataset. This metric helps assess the vulnerability of synthetic data to privacy breaches.Ideal Ranges for Membership Inference Score:- **0 - 0.1**: Indicates low risk, where it is unlikely that membership inference attacks can accurately identify real records.- **0.1 - 0.3**: Moderate risk, where some vulnerability to membership inference exists but may still be acceptable for certain use cases.- **Above 0.3**: High risk, suggesting significant privacy concerns as the synthetic data could reveal membership information about the real records.```{r Membership Inference Score, fig.width=8, fig.height=5, fig.align='center', out.width='99%'}# ---- Membership Inference Score ----# Interpreted as the proportion of synthetic records that are "too close" to the real data.# Lower is better (safer).calculate_membership_inference_score <- function(real_data, synthetic_data, k = 5, threshold = 0.01) { # Numeric-only columns real_num <- real_data[sapply(real_data, is.numeric)] synth_num <- synthetic_data[sapply(synthetic_data, is.numeric)] # Guardrails if (ncol(real_num) == 0 || ncol(synth_num) == 0) return(NA_real_) # Keep only common numeric variables common <- intersect(names(real_num), names(synth_num)) if (length(common) == 0) return(NA_real_) real_num <- real_num[common] synth_num <- synth_num[common] # Use complete cases real_cc <- real_num[complete.cases(real_num), , drop = FALSE] synth_cc <- synth_num[complete.cases(synth_num), , drop = FALSE] if (nrow(real_cc) < k || nrow(synth_cc) == 0) return(NA_real_) # Standardize real_scaled <- scale(real_cc) synth_scaled <- scale(synth_cc) # k-NN distances from synthetic → real nn <- FNN::get.knnx(real_scaled, synth_scaled, k)$nn.dist # Score: proportion of synthetic records whose average distance to k nearest reals is below threshold risky <- sum(rowMeans(nn) < threshold) risky / nrow(synth_scaled)}# ---- Results table ----mis_tbl <- tibble::tibble( Dataset = c("Parametric MICE", "Non-Parametric CART", "Synthpop (Low Fidelity)", "Metadata-Based"), Membership_Inference_Score = c( calculate_membership_inference_score(heart_failure, syn_data_1), calculate_membership_inference_score(heart_failure, syn_cart_1), calculate_membership_inference_score(heart_failure, syn_data_low_fidelity_synthpop), calculate_membership_inference_score(heart_failure, syn_data_metadata) )) %>% mutate( Membership_Inference_Score = round(Membership_Inference_Score, 4), Status = dplyr::case_when( is.na(Membership_Inference_Score) ~ "— Insufficient data", Membership_Inference_Score == 0 ~ "✅ No risky memberships", Membership_Inference_Score < 0.01 ~ "⚠️ Low risk", TRUE ~ "❌ High risk" ) )# ---- User-friendly table ----knitr::kable( mis_tbl, caption = "Membership Inference Score: Proportion of synthetic records that are very close to real records (lower is better for privacy)", align = "lrr")# ---- Plot only if any risky memberships (> 0) ----if (any(mis_tbl$Membership_Inference_Score > 0, na.rm = TRUE)) { pal <- c("#B0D99B", "#528AA8", "#FFB6DB", "#264653") ggplot(mis_tbl, aes(y = Dataset, x = Membership_Inference_Score, fill = Dataset)) + geom_col(alpha = 0.9, show.legend = FALSE, na.rm = TRUE) + geom_text( aes(label = ifelse(is.na(Membership_Inference_Score), "NA", scales::percent(Membership_Inference_Score, accuracy = 0.01))), hjust = -0.1, size = 3.5 ) + scale_fill_manual(values = pal[seq_len(nrow(mis_tbl))]) + scale_x_continuous(labels = scales::percent_format(accuracy = 1)) + labs( title = "Membership Inference Score by Synthetic Method", subtitle = "Lower values indicate better privacy protection", x = "Score (Proportion of risky memberships)", y = NULL ) + theme_minimal()} else { cat("✅ All Membership Inference Scores are 0. No risky memberships detected. No plot is generated. \n")}```## ConclusionsThis report evaluated the fidelity, utility, privacy, and overall quality of synthetic data generated using parametric MICE, non-parametric CART, Synthpop, and metadata-based methods. Key findings include:- **Data Structure Preservation**: All methods preserved the structure of the real dataset, though minor variations in variable distributions were observed.- **Categorical Variables**: Most synthetic datasets retained categorical levels, but frequency distribution discrepancies were particularly notable in Synthpop data.- **Numeric Variables**: The range and distribution of numeric variables were largely maintained across methods, with slight variations in density plots.- **Correlation & Mutual Information**: While most methods captured variable relationships, weaker correlations and lower mutual information scores suggested some loss of feature dependencies in certain synthetic datasets.- **Utility (Prediction Score)**: Synthetic data performed similarly to real data in predictive modeling, though slight reductions in accuracy and increased pMSE indicated less precise representation of the real dataset's predictive capabilities.- **Privacy (Exact Match & Neighbors' Privacy Scores)**: All synthetic datasets demonstrated strong privacy protection, with zero values for both Exact Match and Neighbors' Privacy Scores. This indicates that no synthetic records were identical to or highly similar to any real records, affirming robust privacy across all methods.In conclusion, synthetic data generated through these methods provides a balance between utility and privacy, though method selection should align with specific analytic or privacy needs.```{r Stop the parallel cluster}# Stop the parallel clusterstopCluster(cl)registerDoSEQ()```