Skip to contents
library(MicrobiomeDB, quietly = TRUE)
#> Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
#> 'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment'
library(tidyverse, quietly = TRUE)
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#>  dplyr     1.1.4      readr     2.1.5
#>  forcats   1.0.0      stringr   1.5.1
#>  ggplot2   3.5.1      tibble    3.2.1
#>  lubridate 1.9.3      tidyr     1.3.1
#>  purrr     1.0.2     
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#>  dplyr::filter() masks stats::filter()
#>  dplyr::lag()    masks stats::lag()
#>  Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(igraph, quietly = TRUE)
#> 
#> Attaching package: 'igraph'
#> 
#> The following objects are masked from 'package:lubridate':
#> 
#>     %--%, union
#> 
#> The following objects are masked from 'package:dplyr':
#> 
#>     as_data_frame, groups, union
#> 
#> The following objects are masked from 'package:purrr':
#> 
#>     compose, simplify
#> 
#> The following object is masked from 'package:tidyr':
#> 
#>     crossing
#> 
#> The following object is masked from 'package:tibble':
#> 
#>     as_data_frame
#> 
#> The following objects are masked from 'package:stats':
#> 
#>     decompose, spectrum
#> 
#> The following object is masked from 'package:base':
#> 
#>     union

What is a Correlation Analysis?

Correlations are useful for identifying relationships between variables. They are very helpful for identifying biomarkers and functional associations, among other things.

Biomarkers in microbiome data are microbial taxa or features that exhibit a significant correlation with particular conditions, traits, or variables in sample metadata. Identifying these biomarkers allows researchers to uncover associations between the microbiome and external factors, such as age, disease status, or environmental conditions. Biomarkers serve as indicators of specific biological or clinical phenomena within microbial communities.

Functional associations in microbiome data refer to relationships between taxonomic relative abundances and abundances of specific pathways or other functional features. Understanding these associations allows researchers to explore how the taxonomic composition of microbial communities influences functional capabilities.

Why Care About Biomarker Discovery?

Researchers are interested in biomarker discovery for several reasons:

Diagnostic Insights: Biomarkers can serve as potential diagnostic indicators, helping identify microbial patterns associated with specific conditions.

Predictive Modeling: Understanding biomarker correlations enables predictive modeling of microbial responses to external factors.

Biological Significance: Biomarkers provide insights into the biological significance of microbial community variations in response to different conditions.

Why Care About Functional Associations?

Biological Insights: Explore how changes in taxonomic composition may impact the functional potential of microbial communities.

Pathway-Level Analysis: Understand how specific pathways or functional features correlate with taxonomic abundance, providing pathway-level insights.

Predictive Modeling: Assessing functional associations aids in predicting microbial functional responses to environmental changes or perturbations.

How are Correlations Calculated?

This package employs correlation analysis between microbial taxonomic abundances and sample metadata or abundances of pathways or other functional data using the following approach:

Users can choose either Spearman or Pearson correlation for the analysis. Both will produce a correlation coefficient and a p-value indicating statistical significance.

Spearman Correlation

Use when the relationship between variables is monotonic but not necessarily linear. Suitable for non-linear associations.

## first lets find some interesting data
microbiomeData::getCuratedDatasetNames()
#>  [1] "Anopheles_albimanus"     "BONUS"                  
#>  [3] "Bangladesh"              "DailyBaby"              
#>  [5] "DiabImmune"              "ECAM"                   
#>  [7] "EcoCF"                   "FARMM"                  
#>  [9] "GEMS1"                   "HMP_MGX"                
#> [11] "HMP_V1V3"                "HMP_V3V5"               
#> [13] "Leishmaniasis"           "MALED_2yr"              
#> [15] "MALED_diarrhea"          "MORDOR"                 
#> [17] "Malaysia_helminth"       "NICU_NEC"               
#> [19] "PIH_Uganda"              "PretermInfantResistome1"
#> [21] "PretermInfantResistome2" "UgandaMaternal"

getCollectionNames(microbiomeData::HMP_V3V5)
#>  [1] "16S (V3-V5) Order (Relative taxonomic abundance analysis)"  
#>  [2] "16S (V3-V5) Order (Absolute taxonomic abundance analysis)"  
#>  [3] "16S (V3-V5) Genus (Relative taxonomic abundance analysis)"  
#>  [4] "16S (V3-V5) Genus (Absolute taxonomic abundance analysis)"  
#>  [5] "16S (V3-V5) Family (Relative taxonomic abundance analysis)" 
#>  [6] "16S (V3-V5) Family (Absolute taxonomic abundance analysis)" 
#>  [7] "16S (V3-V5) Class (Relative taxonomic abundance analysis)"  
#>  [8] "16S (V3-V5) Class (Absolute taxonomic abundance analysis)"  
#>  [9] "16S (V3-V5) Phylum (Relative taxonomic abundance analysis)" 
#> [10] "16S (V3-V5) Phylum (Absolute taxonomic abundance analysis)" 
#> [11] "16S (V3-V5) Species (Relative taxonomic abundance analysis)"
#> [12] "16S (V3-V5) Species (Absolute taxonomic abundance analysis)"
#> [13] "16S (V3-V5) Kingdom (Relative taxonomic abundance analysis)"
#> [14] "16S (V3-V5) Kingdom (Absolute taxonomic abundance analysis)"

## grab a collection of interest
HMP_V3V5_species <- getCollection(microbiomeData::HMP_V3V5, "16S (V3-V5) Species (Relative taxonomic abundance analysis)", continuousMetadataOnly = TRUE)

## get a correlation ComputeResult
## this is not necessarily to recommend spearman for metadata. 
## it is simply exemplary. Always look at your data!
species_vs_metadata <- correlation(HMP_V3V5_species, method = 'spearman') 
#> 
#> 2024-06-26 14:48:35.698139 Removed 1 records with no data.
#> 
#> 2024-06-26 14:48:35.733392 Removed 1 records with no data.
#> 
#> 2024-06-26 14:48:35.76831 Removed 1 records with no data.
#> 
#> 2024-06-26 14:48:38.159312 Completed correlation with method=spearman. Formatting results.
#> 
#> 2024-06-26 14:48:38.16027 Received df table with 5461 samples and 137 features with values.
#> 
#> 2024-06-26 14:48:38.164508 Correlation computation completed with parameters recordIdColumn= 16S_rRNA_(V3-V5)_assay_Id , method =  spearman

Pearson Correlation

Use when the relationship between variables is linear. Suitable for assessing linear associations.

## grab two collections of interest, in this case species level data and pathway abundance data
HMP_MGX_species <- getCollection(microbiomeData::HMP_MGX, "Shotgun metagenomics Species (Relative taxonomic abundance analysis)")
HMP_MGX_pathways <- getCollection(microbiomeData::HMP_MGX, "Shotgun metagenomics Metagenome enzyme pathway abundance data" )

## get a correlation ComputeResult
## this is not necessarily to recommend pearson for functional data.
## it is simply exemplary. Always look at your data!
pathway_vs_species <- correlation(HMP_MGX_species, HMP_MGX_pathways, method = 'pearson')
#> 
#> 2024-06-26 14:48:38.666936 Removed 7 records with no data.
#> 
#> 2024-06-26 14:48:38.705565 Removed 7 records with no data.
#> 
#> 2024-06-26 14:48:38.828758 Removed 7 records with no data.
#> 
#> 2024-06-26 14:48:38.839323 Received first df table with 734 samples and 325 features with values.
#> 
#> 2024-06-26 14:48:38.839594 Received second df table with 741 samples and 371 features with values.
#> 
#> 2024-06-26 14:48:38.839841 Found 734 samples in common between data1 and data2. Only these samples will be used.
#> 
#> 2024-06-26 14:48:39.261762 Completed correlation with method=pearson. Formatting results.
#> 
#> 2024-06-26 14:48:39.26772 Correlation computation completed with parameters recordIdColumn= Metagenomic_sequencing_assay_Id , method =  pearson

Interpreting Results

Correlation Coefficients: Assess the strength and direction of correlations. Positive coefficients indicate positive correlations, while negative coefficients indicate negative correlations.

p-values and Adjusted p-values: Identify biomarkers with statistically significant correlations, considering adjustments for multiple testing.

You can extract these metrics and sort and filter results by them:

## you can extract network metrics 
pathway_vs_species.metrics <- as_tibble(
  getComputeResult(
    pathway_vs_species,
    correlationCoefThreshold = 0.8,
    pValueThreshold = 0.05
  )
)

## it's also easy to sort and filter these network metrics
## begin by renaming columns
colnames(pathway_vs_species.metrics) <- c('species', 'pathway', 'correlationCoef', 'pValue')
pathway_vs_species.metrics %>%
  filter(species == "Faecalibacterium prausnitzii") %>%
  filter(correlationCoef > 0.5) %>%
  arrange(desc(correlationCoef))
#> # A tibble: 1 × 4
#>   species                      pathway                    correlationCoef pValue
#>   <chr>                        <fct>                                <dbl>  <dbl>
#> 1 Faecalibacterium prausnitzii PWY-5177: glutaryl-CoA de…           0.901      0

You can also visualize them with custom htmlwidgets:

## now plot the network
## filters can be applied based on correlation coefficient and p-value
## renders an interactive htmlwidget, using a predetermined layout
correlationNetwork(
  pathway_vs_species,
  correlationCoefThreshold = 0.8,
  pValueThreshold = 0.05
)
Escherichia coliStreptococcus parasanguinisFusobacterium periodonticumMonoglobus pectinilyticusFaecalibacterium prausnitziiAnaerotruncus colihominisFlavonifractor plautiiBacteroides thetaiotaomicronCampylobacter concisusPrevotella melaninogenicaBacteroides stercorisCorynebacterium matruchotiiCardiobacterium hominisActinomyces sp. oral taxon 448ARG+POLYAMINE-SYN: superpathway of arginine and polyamine biosynthesisFAO-PWY: fatty acid &beta;-oxidation IFUC-RHAMCAT-PWY: superpathway of fucose and rhamnose degradationFUCCAT-PWY: fucose degradationGALACTARDEG-PWY: D-galactarate degradation IGLUCARDEG-PWY: D-glucarate degradation IGLUCARGALACTSUPER-PWY: superpathway of D-glucarate and D-galactarate degradationGLYCOL-GLYOXDEG-PWY: superpathway of glycol metabolism and degradationGLYOXYLATE-BYPASS: glyoxylate cycleKDO-NAGLIPASYN-PWY: superpathway of (Kdo)2-lipid A biosynthesisKETOGLUCONMET-PWY: ketogluconate metabolismMETHGLYUT-PWY: superpathway of methylglyoxal degradationP105-PWY: TCA cycle IV (2-oxoglutarate decarboxylase)P125-PWY: superpathway of (R,R)-butanediol biosynthesisP163-PWY: L-lysine fermentation to acetate and butanoateP221-PWY: octane oxidationP461-PWY: hexitol fermentation to lactate, formate, ethanol and acetatePOLYAMSYN-PWY: superpathway of polyamine biosynthesis IPWY-4702: phytate degradation IPWY-5101: L-isoleucine biosynthesis IIPWY-5136: fatty acid &beta;-oxidation II (peroxisome)PWY-5138: unsaturated, even numbered fatty acid &beta;-oxidationPWY-5177: glutaryl-CoA degradationPWY-561: superpathway of glyoxylate cycle and fatty acid degradationPWY-5675: nitrate reduction V (assimilatory)PWY-5677: succinate fermentation to butanoatePWY-5723: Rubisco shuntPWY-5845: superpathway of menaquinol-9 biosynthesisPWY-5850: superpathway of menaquinol-6 biosynthesis IPWY-5860: superpathway of demethylmenaquinol-6 biosynthesis IPWY-5862: superpathway of demethylmenaquinol-9 biosynthesisPWY-5896: superpathway of menaquinol-10 biosynthesisPWY-6318: L-phenylalanine degradation IV (mammalian, via side chain)PWY-6353: purine nucleotides degradation II (aerobic)PWY-6478: GDP-D-glycero-&alpha;-D-manno-heptose biosynthesisPWY-6531: mannitol cyclePWY-6606: guanosine nucleotides degradation IIPWY-6731: starch degradation IIIPWY-6748: nitrate reduction VII (denitrification)PWY-6891: thiazole biosynthesis II (Bacillus)PWY-7031: protein N-glycosylation (bacterial)PWY-7046: 4-coumarate degradation (anaerobic)PWY-7200: superpathway of pyrimidine deoxyribonucleoside salvagePWY-7204: pyridoxal 5'-phosphate salvage II (plants)PWY-7209: superpathway of pyrimidine ribonucleosides degradationPWY-7269: NAD/NADP-NADH/NADPH mitochondrial interconversion (yeast)PWY-7312: dTDP-D-&beta;-fucofuranose biosynthesisPWY-7315: dTDP-N-acetylthomosamine biosynthesisPWY-7384: anaerobic energy metabolism (invertebrates, mitochondrial)PWY-7389: superpathway of anaerobic energy metabolism (invertebrates)PWY-7391: isoprene biosynthesis II (engineered)PWY1G-0: mycothiol biosynthesisPWY490-3: nitrate reduction VI (assimilatory)PWY66-391: fatty acid &beta;-oxidation VI (peroxisome)SALVADEHYPOX-PWY: adenosine nucleotides degradation IITCA-GLYOX-BYPASS: superpathway of glyoxylate bypass and TCA

Finally, for more advanced analysis of the network, you can extract it as an igraph object:

## if you extract the network as an igraph object, you can get more detailed metrics
pathway_vs_species.igraph <- getComputeResult(pathway_vs_species, format = 'igraph')
degree <- degree(pathway_vs_species.igraph)
edgeBT <- edge_betweenness(pathway_vs_species.igraph)
pgRank <- page_rank(pathway_vs_species.igraph)
coords <- layout_with_kk(pathway_vs_species.igraph)

#plot(pathway_vs_species.igraph, layout = coords)