Research

Statistical Learning Methods for Multiview Learning (Data Integration) and Prediction of Clinical Outcomes

Multiview learning (or data integration), an approach that integrates vast amounts of diverse but related data (e.g., genomic, phenotypic, imaging, and proteomic data), results in a better understanding of the biological processes involved in complex diseases. We have developed unsupervised statistical learning methods that are both data- and knowledge-driven, with the latter using prior information (e.g., gene pathway information) to inform integrative analysis. These methods perform dimensionality reduction and biomarker selection, both of which are key to obtaining interpretable results from big biomedical data. We have also developed statistical methods that combine supervised and unsupervised learning, thus coupling integrative analysis with clinical outcome prediction. This offers a more powerful approach to identifying biomarkers with the potential to enhance our understanding of the pathobiology of complex diseases. Some key publications include (* denotes mentees):

a. Safo SE, Min EJ, Haine L*. Sparse Linear Discriminant Analysis for Multi-view Structured Data. Biometrics, 2021. PMID: 33739448

b. Chekouo T, Safo S. Bayesian Integrative Analysis and Prediction with Application to Atherosclerosis Cardiovascular Disease. Biostatistics, 2022, PMID: 33969382

c. Palzer EF*, Wendt C, Bowler R, Hersh CP, Safo SE, Lock EF. sJIVE: Supervised Joint and Individual Variation Explained. Computational Statistics and Data Analysis, 2022.

d. Min EJ, Safo SE, Long Q. Penalized co-inertia analysis with applications to -omics data. Bioinformatics. 2019 Mar 15;35(6):1018-1025. PubMed PMID: 30165424; PubMed Central PMCID: PMC6419918.

e. Safo SE, Ahn J, Jeon Y, Jung S. Sparse generalized eigenvalue problem with application to canonical correlation analysis for integrative analysis of methylation and gene expression data. Biometrics. 2018 Dec;74(4):1362-1371. PubMed PMID: 29750830.

f. Safo SE, Li S, Long Q. Integrative analysis of transcriptomic and metabolomic data via sparse canonical correlation analysis with incorporation of biological information. Biometrics. 2018 Mar;74(1):300-312. PubMed PMID: 28482123; PubMed Central PMCID: PMC5677597.

Interpretable Nonlinear Methods, Including Deep Learning, for Multiview Learning (Data Integration)

The rise of machine learning (including kernel methods) and deep learning has enabled the training of models that capture complex, nonlinear relationships and find insights faster and more accurately. While deep learning and kernel-based methods have shown remarkable prediction performance in many applications, a major limitation is that they do not readily produce clinically meaningful findings when the goal is biomarker identification. Our group has developed interpretable kernel and deep learning methods for data integration that are capable of variable selection and ranking. We combine the flexibility of deep learning and kernel methods with classical statistical techniques and regularization methods to promote interpretability. Some key publications include (* denotes mentees):

a. Wang, H*., Lu, H*., Sun, J., & Safo, S. E. (2023). Interpretable deep learning methods for multiview learning. arXiv preprint arXiv:2302.07930.

b. Safo, S. E., & Lu, H*. (2023). Scalable Randomized Kernel Methods for Multiview Data Integration and Prediction. arXiv preprint arXiv:2304.04692.

c. Jain, S*., & Safo, S. E. A deep learning pipeline for cross-sectional and longitudinal multiview data integration. arXiv preprint arXiv:2312.01238.

d. Wang J*, Safo SE. Deep IDA: A Deep Learning Method for Integrative Discriminant Analysis of Multi-View Data with Feature Ranking-An Application to COVID-19 severity. ArXiv [Preprint]. 2021 Nov 18:arXiv:2111.09964v2. PMID: 34815984; PMCID: PMC8609900.

Multiview Learning Methods to Understand Disease Heterogeneity

Epidemiological studies suggest that heterogeneity (e.g., racial and gender disparities) exists in many complex diseases. Further, the current classification of many complex diseases depends on a single measure or a few variables and does not fully capture their complexity and heterogeneity. There are therefore compelling reasons to consider subgroup variation (such as by gender or race) in integrative analysis and feature selection. Our group has developed an integrative analysis and prediction method, HIP, that accounts for data heterogeneity and identifies biomarkers that are common to, and specific to, different subgroups. HIP is applicable to continuous, multi-class, Poisson, and zero-inflated Poisson outcomes and can accommodate covariate adjustment. The heterogeneity of complex diseases has further motivated attempts to classify distinct subgroups to better understand the underlying physiology. Multiview learning offers an exciting opportunity to expand the definition of many complex diseases and to better stratify patients. We have developed a robust biclustering (two-way clustering) method for data from multiple sources that is useful for detecting disease subtypes and the biomarkers characterizing these subtypes. Some key publications include (* denotes mentees):

a. W. Zhang*, C. Wendt, R. Bowler, C. P. Hersh, and S. E. Safo. Robust Integrative Biclustering for Multi-view Data (iSSVD). Statistical Methods in Medical Research, 2022. arXiv preprint.

b. J. Butts*, C. Wendt, R. Bowler, C.P. Hersh, Q. Long, L. Eberly, S. E. Safo. (2023). Extensions of Heterogeneity in Integration and Prediction (HIP) with R Shiny Application. arXiv preprint arXiv:2310.08426.

c. J. Butts*, C. Wendt, R. Bowler, C.P. Hersh, Q. Long, L. Eberly, S. E. Safo. Accounting for data heterogeneity in integrative analysis and prediction methods: An application to Chronic Obstructive Pulmonary Disease (HIP). arXiv preprint, 2021.

Classification, Dimension Reduction and Sample Size Methods for High Dimensional Data

We have developed classification, dimension reduction, and sample size methods for high-dimensional data in general. These include a novel classification method that generalizes existing binary linear discriminant analysis (LDA) solutions to multi-class ones by exploiting the equivalence between the solution space of a multi-class LDA problem and the solution space of separate applications of binary LDA problems with the mean vector replaced by a basis vector (a). We have developed a classification method that exploits the relationship between the original Fisher’s LDA and the generalized eigenvalue problem (b), and extended this formulation to use structural information (network penalties) in LDA (c) and PCA (d) to enhance interpretability. We have also developed a nonparametric sample size method for high-dimensional classification that uses errors-in-variables logistic regression together with bootstrap and cross-validation resampling techniques to estimate the training-set sample size required for classifying cancer patients based on characteristics of their disease (e).

a. Safo, S. E., & Ahn, J. (2016). General sparse multi-class linear discriminant analysis. Computational Statistics and Data Analysis, 99, 81-90. doi: 10.1016/j.csda.2016.01.011

b. Safo, S. E., Ahn, J., Jeon, Y., & Jung, S. (2018). Sparse generalized eigenvalue problem with application to canonical correlation analysis for integrative analysis of methylation and gene expression data. Biometrics, 74(4), 1362-1371. doi: 10.1111/biom.12886

c. Safo, S., & Long, Q. (2018). Sparse linear discriminant analysis in structured covariates space. Statistical Analysis and Data Mining: The ASA Data Science Journal, 772-781. doi: 10.1109/DSAA.2016.66

d. Li, Z*., Safo, S. E., & Long, Q. (2017). Incorporating biological information in sparse principal component analysis with application to genomic data. BMC bioinformatics, 18(1), 332. PubMed Central ID Number: PMC5504598 doi: 10.1186/s12859-017-1740-7

e. Safo, S., Song, X., & Dobbin, K. K. (2015). Sample size determination for training cancer classifiers from microarray and RNA-seq data. Annals of Applied Statistics, 9(2), 1053-1075. doi: 10.1214/15-AOAS825
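As a rough illustration of the relationship between Fisher's LDA and the generalized eigenvalue problem mentioned in the paragraph above, the sketch below computes textbook (non-sparse) discriminant directions by solving S_b v = lambda S_w v. It is not the sparse or network-penalized method from the cited papers; the function name `fisher_lda_directions`, the `ridge` parameter, and the synthetic data are assumptions made for illustration only.

```python
import numpy as np
from scipy.linalg import eigh


def fisher_lda_directions(X, y, n_components=None, ridge=1e-6):
    """Textbook Fisher LDA via the generalized eigenproblem S_b v = lambda S_w v.

    Hypothetical helper for illustration; `ridge` is an assumed small
    diagonal term that keeps S_w positive definite.
    """
    classes = np.unique(y)
    p = X.shape[1]
    overall_mean = X.mean(axis=0)

    Sw = np.zeros((p, p))  # within-class scatter
    Sb = np.zeros((p, p))  # between-class scatter
    for c in classes:
        Xc = X[y == c]
        class_mean = Xc.mean(axis=0)
        centered = Xc - class_mean
        Sw += centered.T @ centered
        diff = (class_mean - overall_mean)[:, None]
        Sb += Xc.shape[0] * (diff @ diff.T)

    Sw += ridge * np.eye(p)

    if n_components is None:
        n_components = len(classes) - 1  # S_b has rank at most K - 1

    # eigh(a, b) solves the symmetric-definite problem a v = lambda b v
    # and returns eigenvalues in ascending order.
    evals, evecs = eigh(Sb, Sw)
    order = np.argsort(evals)[::-1][:n_components]
    return evals[order], evecs[:, order]


if __name__ == "__main__":
    # Synthetic three-class data (assumed, for demonstration only).
    rng = np.random.default_rng(0)
    shifts = np.array([[0, 0, 0, 0, 0], [2, 0, 0, 0, 0], [0, 2, 0, 0, 0]], dtype=float)
    X = np.vstack([rng.normal(size=(50, 5)) + s for s in shifts])
    y = np.repeat([0, 1, 2], 50)
    evals, V = fisher_lda_directions(X, y)
    print(V.shape)  # one column per discriminant direction
```

Sparse variants of this idea typically add a penalty or constraint on the eigenvectors so that only a subset of variables receives nonzero weight, which is what makes the resulting directions usable for variable selection.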

Multiview Learning Packages and Web Portals

We have developed multiple software packages that are publicly available on the code-sharing repository GitHub. We have also made web applications available to facilitate data integration anywhere and on any device. We envision our online application (MultiviewPortal) as a one-stop avenue for data integration, providing capabilities such as dimension reduction, variable selection, outcome prediction, disease subtyping, and visualization, among others. All these efforts demonstrate our commitment to reproducible research and dissemination. Some publications include:

a. Palzer, E. F., & Safo, S. E. (2023). mvlearnR and Shiny App for multiview learning. arXiv preprint arXiv:2311.16181. Bioinformatics Advances (Accepted). Website (an online tool for integrating data from multiple sources): https://multi-viewlearn.shinyapps.io/MultiView_Modeling/

b. Wang, H., Lu, H., Sun, J., & Safo, S. E. (2023). R-package iDeepViewLearn. https://github.com/lasandrall/iDeepViewLearn

c. Butts, J., Wendt, C., Bowler, R., Hersh, C. P., Long, Q., Eberly, L., & Safo, S. E. (2023). Extensions of Heterogeneity in Integration and Prediction (HIP) with R Shiny Application. arXiv preprint arXiv:2310.08426. https://multi-viewlearn.shinyapps.io/HIP_ShinyApp/ .

d. Safo, S. E., & Lu, H. (2023). R-package RandMVLearn for Scalable Randomized Kernel Methods for Multiview Data Integration and Prediction. https://github.com/lasandrall/RandMVLearn

Biomarker Identification and Molecular Risk Scores for COVID-19, CVD, and HIV

We have made strong contributions to translational research in HIV, cardiovascular diseases (CVD), and COVID-19. In our Journal of the American Heart Association (JAHA) paper, we describe novel biomarkers for CVD in persons living with HIV (PLWH) that, when combined with established risk factors, discriminated between CVD cases and controls better than the individual variables and established risk factors. We have identified proteo-genomic predictors and developed proteo-genomic risk scores for CVD in PLWH. We have conducted multi-omics research, identified molecular signatures (metabolites, proteins, lipids), and developed molecular scores for COVID-19 severity and status. We found that signatures related to lipid metabolism, inflammation, and immune activation efficiently discriminated between disease statuses or predicted disease severity. Some key publications include (* denotes mentees):


a. Safo, S., Haine, L*., Baker, J., Reilly, C., Duprez, D., Neaton, J., . . . Staub, T. Derivation of a protein risk score for cardiovascular disease for a multiethnic HIV+ cohort. Journal of the American Heart Association (JAHA). 2023 Jul 4;12(13):e027273. Epub 2023 Jun 22. doi: 10.1161/JAHA.122.027273 PMID: 37345752

b. Mehta, T.*, Haine, L.*, Baker, J., Reilly, C., Duprez, D., Neaton, J., . . . Staub, T. & Safo, S. E. Proteo-genomics predictors and risk scores of cardiovascular diseases in persons living with HIV. Under Review

c. Castro-Pearson*, S., Samorodnitsky, S*., Yang, K*., Lotfi-Emran, S., Ingraham, N. E., Bramante, C., ... , Safo, S.E., and Tignanelli, C. J. (2023). Development of a proteomic signature associated with severe disease for patients with COVID-19 using data from 5 multicenter, randomized, controlled, and prospective studies. Scientific Reports, 13(1), 20315.


d. Lipman D, Safo SE, Chekouo T. Integrative multi-omics approach for identifying molecular signatures and pathways and deriving and validating molecular scores for COVID-19 severity and status. BMC Genomics. 2023 Jun 12;24(1):319. doi: 10.1186/s12864-023-09410-5. PMID: 37308820; PMCID: PMC10259816.


e. Yang, K., Kang, Z., Guan, W., Lotfi-Emran, S., Mayer, Z. J., Guerrero, C. R., ... & Safo, S.E. (2023). Developing A Baseline Metabolomic Signature Associated with COVID-19 Severity: Insights from Prospective Trials Encompassing 13 US Centers. Metabolites, 13(11), 1107.


f. Lipman D, Safo SE, Chekouo T. Multi-omic analysis reveals enriched pathways associated with COVID-19 and COVID-19 severity. PLoS One. 2022;17(4):e0267047. doi: 10.1371/journal.pone.0267047. eCollection 2022. PubMed PMID: 35468151; PubMed Central PMCID: PMC9038205.

HIV, COVID-19, Race/Ethnicity, and Social Determinants of Health

We aim to use the largest national EHR data on COVID-19 (N3C data) to gain a better understanding of the impact of HIV infection on the severity of COVID-19, and to identify individuals and subgroups at increased risk for severe illness. Our specific goals are to i) determine which factors are most strongly associated with severe COVID-19 illness in PLWH and among different race/ethnicity subgroups, ii) investigate the additional contribution of HIV infection to COVID-19 severity, iii) identify which subgroups of PLWH are most likely to develop severe COVID-19 illness, and iv) predict which PLWH with COVID-19 will progress to severe illness. Some key publications include (* denotes mentees):

a. Kunz, M*., Rott, K*., Hurwitz, E., Kunisaki, K., Islam, J., Sun, J., . . . Patel, R., & Safo, S. E. The Intersections of COVID-19, HIV, and race/ethnicity: Machine Learning Methods to Identify and Model Risk Factors for Severe COVID-19 in a Large U.S. National Dataset. AIDS and Behavior, 2024.