Skip to contents
library(MicrobiomeDB, quietly = TRUE)
#> Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
#> 'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment'
library(tidyverse, quietly = TRUE)
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#>  dplyr     1.1.4      readr     2.1.5
#>  forcats   1.0.0      stringr   1.5.1
#>  ggplot2   3.5.1      tibble    3.2.1
#>  lubridate 1.9.3      tidyr     1.3.1
#>  purrr     1.0.2     
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#>  dplyr::filter() masks stats::filter()
#>  dplyr::lag()    masks stats::lag()
#>  Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(igraph, quietly = TRUE)
#> 
#> Attaching package: 'igraph'
#> 
#> The following objects are masked from 'package:lubridate':
#> 
#>     %--%, union
#> 
#> The following objects are masked from 'package:dplyr':
#> 
#>     as_data_frame, groups, union
#> 
#> The following objects are masked from 'package:purrr':
#> 
#>     compose, simplify
#> 
#> The following object is masked from 'package:tidyr':
#> 
#>     crossing
#> 
#> The following object is masked from 'package:tibble':
#> 
#>     as_data_frame
#> 
#> The following objects are masked from 'package:stats':
#> 
#>     decompose, spectrum
#> 
#> The following object is masked from 'package:base':
#> 
#>     union

What is a Correlation Analysis?

Correlations are useful for identifying relationships between variables. They are very helpful for identifying biomarkers and functional associations, among other things.

Biomarkers in microbiome data are microbial taxa or features that exhibit a significant correlation with particular conditions, traits, or variables in sample metadata. Identifying these biomarkers allows researchers to uncover associations between the microbiome and external factors, such as age, disease status, or environmental conditions. Biomarkers serve as indicators of specific biological or clinical phenomena within microbial communities.

Functional associations in microbiome data refer to relationships between taxonomic relative abundances and abundances of specific pathways or other functional features. Understanding these associations allows researchers to explore how the taxonomic composition of microbial communities influences functional capabilities.

Why Care About Biomarker Discovery?

Researchers are interested in biomarker discovery for several reasons:

Diagnostic Insights: Biomarkers can serve as potential diagnostic indicators, helping identify microbial patterns associated with specific conditions.

Predictive Modeling: Understanding biomarker correlations enables predictive modeling of microbial responses to external factors.

Biological Significance: Biomarkers provide insights into the biological significance of microbial community variations in response to different conditions.

Why Care About Functional Associations?

Biological Insights: Explore how changes in taxonomic composition may impact the functional potential of microbial communities.

Pathway-Level Analysis: Understand how specific pathways or functional features correlate with taxonomic abundance, providing pathway-level insights.

Predictive Modeling: Assessing functional associations aids in predicting microbial functional responses to environmental changes or perturbations.

How are Correlations Calculated?

This package employs correlation analysis between microbial taxonomic abundances and sample metadata or abundances of pathways or other functional data using the following approach:

Users can choose either Spearman or Pearson correlation for the analysis. Both will produce a correlation coefficient and a p-value indicating statistical significance.

Spearman Correlation

Use when the relationship between variables is monotonic but not necessarily linear. Suitable for non-linear associations.

## first lets find some interesting data
microbiomeData::getCuratedDatasetNames()
#>  [1] "Anopheles_albimanus"     "BONUS"                  
#>  [3] "Bangladesh"              "DailyBaby"              
#>  [5] "DiabImmune"              "ECAM"                   
#>  [7] "EcoCF"                   "FARMM"                  
#>  [9] "GEMS1"                   "HMP_MGX"                
#> [11] "HMP_V1V3"                "HMP_V3V5"               
#> [13] "Leishmaniasis"           "MALED_2yr"              
#> [15] "MALED_diarrhea"          "MORDOR"                 
#> [17] "Malaysia_helminth"       "NICU_NEC"               
#> [19] "PIH_Uganda"              "PretermInfantResistome1"
#> [21] "PretermInfantResistome2" "UgandaMaternal"

getCollectionNames(microbiomeData::HMP_V3V5)
#>  [1] "16S (V3-V5) Order (Relative taxonomic abundance analysis)"  
#>  [2] "16S (V3-V5) Order (Absolute taxonomic abundance analysis)"  
#>  [3] "16S (V3-V5) Genus (Relative taxonomic abundance analysis)"  
#>  [4] "16S (V3-V5) Genus (Absolute taxonomic abundance analysis)"  
#>  [5] "16S (V3-V5) Family (Relative taxonomic abundance analysis)" 
#>  [6] "16S (V3-V5) Family (Absolute taxonomic abundance analysis)" 
#>  [7] "16S (V3-V5) Class (Relative taxonomic abundance analysis)"  
#>  [8] "16S (V3-V5) Class (Absolute taxonomic abundance analysis)"  
#>  [9] "16S (V3-V5) Phylum (Relative taxonomic abundance analysis)" 
#> [10] "16S (V3-V5) Phylum (Absolute taxonomic abundance analysis)" 
#> [11] "16S (V3-V5) Species (Relative taxonomic abundance analysis)"
#> [12] "16S (V3-V5) Species (Absolute taxonomic abundance analysis)"
#> [13] "16S (V3-V5) Kingdom (Relative taxonomic abundance analysis)"
#> [14] "16S (V3-V5) Kingdom (Absolute taxonomic abundance analysis)"

## grab a collection of interest
HMP_V3V5_species <- getCollection(microbiomeData::HMP_V3V5, "16S (V3-V5) Species (Relative taxonomic abundance analysis)", continuousMetadataOnly = TRUE)

## get a correlation ComputeResult
## this is not necessarily to recommend spearman for metadata. 
## it is simply exemplary. Always look at your data!
species_vs_metadata <- correlation(HMP_V3V5_species, method = 'spearman') 
#> 
#> 2024-06-26 14:48:35.698139 Removed 1 records with no data.
#> 
#> 2024-06-26 14:48:35.733392 Removed 1 records with no data.
#> 
#> 2024-06-26 14:48:35.76831 Removed 1 records with no data.
#> 
#> 2024-06-26 14:48:38.159312 Completed correlation with method=spearman. Formatting results.
#> 
#> 2024-06-26 14:48:38.16027 Received df table with 5461 samples and 137 features with values.
#> 
#> 2024-06-26 14:48:38.164508 Correlation computation completed with parameters recordIdColumn= 16S_rRNA_(V3-V5)_assay_Id , method =  spearman

Pearson Correlation

Use when the relationship between variables is linear. Suitable for assessing linear associations.

## grab two collections of interest, in this case species level data and pathway abundance data
HMP_MGX_species <- getCollection(microbiomeData::HMP_MGX, "Shotgun metagenomics Species (Relative taxonomic abundance analysis)")
HMP_MGX_pathways <- getCollection(microbiomeData::HMP_MGX, "Shotgun metagenomics Metagenome enzyme pathway abundance data" )

## get a correlation ComputeResult
## this is not necessarily to recommend pearson for functional data.
## it is simply exemplary. Always look at your data!
pathway_vs_species <- correlation(HMP_MGX_species, HMP_MGX_pathways, method = 'pearson')
#> 
#> 2024-06-26 14:48:38.666936 Removed 7 records with no data.
#> 
#> 2024-06-26 14:48:38.705565 Removed 7 records with no data.
#> 
#> 2024-06-26 14:48:38.828758 Removed 7 records with no data.
#> 
#> 2024-06-26 14:48:38.839323 Received first df table with 734 samples and 325 features with values.
#> 
#> 2024-06-26 14:48:38.839594 Received second df table with 741 samples and 371 features with values.
#> 
#> 2024-06-26 14:48:38.839841 Found 734 samples in common between data1 and data2. Only these samples will be used.
#> 
#> 2024-06-26 14:48:39.261762 Completed correlation with method=pearson. Formatting results.
#> 
#> 2024-06-26 14:48:39.26772 Correlation computation completed with parameters recordIdColumn= Metagenomic_sequencing_assay_Id , method =  pearson

Interpreting Results

Correlation Coefficients: Assess the strength and direction of correlations. Positive coefficients indicate positive correlations, while negative coefficients indicate negative correlations.

p-values and Adjusted p-values: Identify biomarkers with statistically significant correlations, considering adjustments for multiple testing.

You can extract these metrics and sort and filter results by them:

## you can extract network metrics 
pathway_vs_species.metrics <- as_tibble(
  getComputeResult(
    pathway_vs_species,
    correlationCoefThreshold = 0.8,
    pValueThreshold = 0.05
  )
)

## it's also easy to sort and filter these network metrics
## begin by renaming columns
colnames(pathway_vs_species.metrics) <- c('species', 'pathway', 'correlationCoef', 'pValue')
pathway_vs_species.metrics %>%
  filter(species == "Faecalibacterium prausnitzii") %>%
  filter(correlationCoef > 0.5) %>%
  arrange(desc(correlationCoef))
#> # A tibble: 1 × 4
#>   species                      pathway                    correlationCoef pValue
#>   <chr>                        <fct>                                <dbl>  <dbl>
#> 1 Faecalibacterium prausnitzii PWY-5177: glutaryl-CoA de…           0.901      0

You can also visualize them with custom htmlwidgets:

## now plot the network
## filters can be applied based on correlation coefficient and p-value
## renders an interactive htmlwidget, using a predetermined layout
correlationNetwork(
  pathway_vs_species,
  correlationCoefThreshold = 0.8,
  pValueThreshold = 0.05
)

Finally, for more advanced analysis of the network, you can extract it as an igraph object:

## if you extract the network as an igraph object, you can get more detailed metrics
pathway_vs_species.igraph <- getComputeResult(pathway_vs_species, format = 'igraph')
degree <- degree(pathway_vs_species.igraph)
edgeBT <- edge_betweenness(pathway_vs_species.igraph)
pgRank <- page_rank(pathway_vs_species.igraph)
coords <- layout_with_kk(pathway_vs_species.igraph)

#plot(pathway_vs_species.igraph, layout = coords)