Self Correlation • MicrobiomeDB

library(MicrobiomeDB, quietly = TRUE)
#> Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
#> 'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment'

library(tidyverse, quietly = TRUE)
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr     1.1.4     ✔ readr     2.1.5
#> ✔ forcats   1.0.0     ✔ stringr   1.5.1
#> ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
#> ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
#> ✔ purrr     1.0.2     
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(igraph, quietly = TRUE)
#> 
#> Attaching package: 'igraph'
#> 
#> The following objects are masked from 'package:lubridate':
#> 
#>     %--%, union
#> 
#> The following objects are masked from 'package:dplyr':
#> 
#>     as_data_frame, groups, union
#> 
#> The following objects are masked from 'package:purrr':
#> 
#>     compose, simplify
#> 
#> The following object is masked from 'package:tidyr':
#> 
#>     crossing
#> 
#> The following object is masked from 'package:tibble':
#> 
#>     as_data_frame
#> 
#> The following objects are masked from 'package:stats':
#> 
#>     decompose, spectrum
#> 
#> The following object is masked from 'package:base':
#> 
#>     union

What is a Correlation Analysis?

Correlations are useful for identifying relationships between variables. When you correlate taxonomic abundance variables against themselves, it essentially produces a microbial network. Producing and analyzing these networks is complicated by the fact that microbiome data is inherently compositional. If not properly accounted for, that can result in a lot of spurious negative correlations.

Why Care About Microbial Network Analysis?

Scientists delve into microbial network analysis for a large number of reasons:

Unraveling Complex Interactions: Microbial network analysis unveils intricate relationships between microorganisms, shedding light on their interconnectedness within ecosystems.

Diagnostic Potential: Identifying key nodes in microbial networks can unveil potential diagnostic markers, offering a glimpse into microbial dynamics associated with various diseases or environmental conditions.

Insights into Functionality: Microbial network analysis elucidates functional roles within microbial communities, offering valuable insights into their ecological and physiological significance.

From unraveling ecological mysteries to informing medical diagnoses, microbial network analysis proves indispensable in understanding the intricate world of microorganisms.

How are Correlations Calculated?

This package employs SPARCC for correlation analysis between microbial taxa. SPARCC calculates correlations that account for the compositional nature of microbiome data.

## first lets find some interesting data
microbiomeData::getCuratedDatasetNames()
#>  [1] "Anopheles_albimanus"     "BONUS"                  
#>  [3] "Bangladesh"              "DailyBaby"              
#>  [5] "DiabImmune"              "ECAM"                   
#>  [7] "EcoCF"                   "FARMM"                  
#>  [9] "GEMS1"                   "HMP_MGX"                
#> [11] "HMP_V1V3"                "HMP_V3V5"               
#> [13] "Leishmaniasis"           "MALED_2yr"              
#> [15] "MALED_diarrhea"          "MORDOR"                 
#> [17] "Malaysia_helminth"       "NICU_NEC"               
#> [19] "PIH_Uganda"              "PretermInfantResistome1"
#> [21] "PretermInfantResistome2" "UgandaMaternal"


getCollectionNames(microbiomeData::FARMM)
#>  [1] "Metabolomics Mass spectrometry assay"                                     
#>  [2] "Shotgun metagenomics 4th level EC metagenome abundance data"              
#>  [3] "Shotgun metagenomics Metagenome enzyme pathway abundance data"            
#>  [4] "Shotgun metagenomics Metagenome enzyme pathway coverage data"             
#>  [5] "Shotgun metagenomics Genus (Relative taxonomic abundance analysis)"       
#>  [6] "Shotgun metagenomics Species (Relative taxonomic abundance analysis)"     
#>  [7] "Shotgun metagenomics Family (Relative taxonomic abundance analysis)"      
#>  [8] "Shotgun metagenomics Order (Relative taxonomic abundance analysis)"       
#>  [9] "Shotgun metagenomics Phylum (Relative taxonomic abundance analysis)"      
#> [10] "Shotgun metagenomics Class (Relative taxonomic abundance analysis)"       
#> [11] "Shotgun metagenomics Normalized number of taxon-specific sequence matches"
#> [12] "Shotgun metagenomics Kingdom (Relative taxonomic abundance analysis)"


## grab a collection of interest
## these methods can be used with other data types as well
farmm_metabolomics <- getCollection(microbiomeData::FARMM, "Metabolomics Mass spectrometry assay")
#> 
#> 2024-06-26 14:49:08.134849 Integer values detected. Converting collection to AbsoluteAbundanceData


## get a self correlation ComputeResult
## methods spearman and pearson are options here as well, for data known to not be compositional
selfCorrelation_metabolomics <- selfCorrelation(farmm_metabolomics, method='spearman')
#> 
#> 2024-06-26 14:49:09.492237 Completed correlation with method=spearman. Formatting results.
#> 
#> 2024-06-26 14:49:09.612335 Received df table with 150 samples and 479 features with abundances.
#> 
#> 2024-06-26 14:49:09.616744 Correlation computation completed with parameters recordIdColumn= Mass_spectrometry_assay_Id , method =  spearman

Interpreting Results

Correlation Coefficients: Assess the strength and direction of correlations. Positive coefficients indicate positive correlations, while negative coefficients indicate negative correlations.

p-values and Adjusted p-values: Identify biomarkers with statistically significant correlations, considering adjustments for multiple testing.

You can extract these metrics and sort and filter results by them:

## you can extract network metrics 
selfCorrelation_metabolomics.metrics <- as_tibble(
    getComputeResult(
        selfCorrelation_metabolomics,
        correlationCoefThreshold = 0.5,
        pValueThreshold = 0.05
    )
)

## it's also easy to sort and filter these network metrics
## begin by renaming columns
colnames(selfCorrelation_metabolomics.metrics) <- c('metabolite1', 'metabolite2', 'correlationCoef', 'pValue')
selfCorrelation_metabolomics.metrics %>%
    filter(correlationCoef > 0.5) %>%
    arrange(desc(correlationCoef))
#> # A tibble: 11,027 × 4
#>    metabolite1                       metabolite2          correlationCoef pValue
#>    <fct>                             <fct>                          <dbl>  <dbl>
#>  1 C46:3 TAG +NH4:C8-pos: Metabolite C46:3 TAG:C8-pos: M…           0.999      0
#>  2 C44:1 TAG +NH4:C8-pos: Metabolite C44:1 TAG:C8-pos: M…           0.999      0
#>  3 C16:0 LPC:C8-pos: Metabolite      C20:3 LPC:C8-pos: M…           0.999      0
#>  4 C44:2 TAG +NH4:C8-pos: Metabolite C44:2 TAG:C8-pos: M…           0.999      0
#>  5 C46:2 TAG +NH4:C8-pos: Metabolite C46:2 TAG:C8-pos: M…           0.998      0
#>  6 C52:0 TAG +NH4:C8-pos: Metabolite C52:0 TAG:C8-pos: M…           0.997      0
#>  7 C52:1 TAG +NH4:C8-pos: Metabolite C52:1 TAG:C8-pos: M…           0.997      0
#>  8 C50:0 TAG +NH4:C8-pos: Metabolite C50:0 TAG:C8-pos: M…           0.996      0
#>  9 C42:0 TAG +NH4:C8-pos: Metabolite C42:0 TAG:C8-pos: M…           0.995      0
#> 10 C48:3 TAG +NH4:C8-pos: Metabolite C48:3 TAG:C8-pos: M…           0.995      0
#> # ℹ 11,017 more rows

You can also visualize them with custom htmlwidgets:

## now plot the network
## filters can be applied based on correlation coefficient and p-value
## renders an interactive htmlwidget, using a predetermined layout
correlationNetwork(
    selfCorrelation_metabolomics,
    correlationCoefThreshold = 0.5,
    pValueThreshold = 0.05
)

Finally, for more advanced analysis of the network, you can extract it as an igraph object:

## if you extract the network as an igraph object, you can get more detailed metrics
selfCorrelation_metabolomics.igraph <- getComputeResult(selfCorrelation_metabolomics, format = 'igraph')
degree <- degree(selfCorrelation_metabolomics.igraph)
edgeBT <- edge_betweenness(selfCorrelation_metabolomics.igraph)
pgRank <- page_rank(selfCorrelation_metabolomics.igraph)
coords <- layout_with_kk(selfCorrelation_metabolomics.igraph)

#plot(selfCorrelation_metabolomics.igraph, coords)