Integrating scATAC-seq data using SCALEX

The following tutorial demonstrates how to use SCALEX for integrating scATAC-seq data.

There are two parts of this tutorial:

  • Seeing the batch effect. This part will show the batch effects of two adult mouse brain datasets from single nucleus ATAC-seq (snATAC) and droplet-based platform (Mouse Brain 10X) that used in SCALEX manuscript.

  • Integrating data using SCALEX. This part will show you how to perform batch correction using SCALEX function in SCALEX.

[1]:
import scalex
from scalex.function import SCALEX
from scalex.plot import embedding
import scanpy as sc
import pandas as pd
import numpy as np
import matplotlib
from matplotlib import pyplot as plt
import seaborn as sns
import episcanpy as epi
[2]:
sc.settings.verbosity = 3
sc.settings.set_figure_params(dpi=80, facecolor='white',figsize=(3,3),frameon=True)
sc.logging.print_header()
plt.rcParams['axes.unicode_minus']=False
scanpy==1.6.1 anndata==0.7.5 umap==0.4.6 numpy==1.20.1 scipy==1.6.1 pandas==1.1.3 scikit-learn==0.23.2 statsmodels==0.12.0 python-igraph==0.8.3 louvain==0.7.0 leidenalg==0.8.3
[3]:
sns.__version__
[3]:
'0.10.1'
[4]:
scalex.__version__
[4]:
'0.2.0'

Seeing the batch effect

The following data has been used in the SnapATAC paper, has been used here, and can be downloaded from here.

On a unix system, you can uncomment and run the following to download the count matrix in its anndata format.

[5]:
# ! wget http://zhanglab.net/scalex-tutorial/mouse_brain_atac.h5ad
[6]:
adata_raw=sc.read('mouse_brain_atac.h5ad')
adata_raw
[6]:
AnnData object with n_obs × n_vars = 13746 × 479127
    obs: 'batch'

Inspect the batches contained in the dataset.

[7]:
adata_raw.obs.batch.value_counts()
[7]:
snATAC    9646
10X       4100
Name: batch, dtype: int64

The data processing procedure is according to the epiScanpy tutorial [Buenrostro_PBMC_data_processing].

[8]:
epi.pp.filter_cells(adata_raw, min_features=1)
epi.pp.filter_features(adata_raw, min_cells=5)
adata_raw.raw = adata_raw
adata_raw = epi.pp.select_var_feature(adata_raw,
                                      nb_features=30000,
                                      show=False,copy=True)
adata_raw.layers['binary'] = adata_raw.X.copy()
epi.pp.normalize_total(adata_raw)
adata_raw.layers['normalised'] = adata_raw.X.copy()
epi.pp.log1p(adata_raw)
epi.pp.lazy(adata_raw)
epi.tl.leiden(adata_raw)
filtered out 11140 genes that are detected in less than 5 cells
normalizing counts per cell
    finished (0:00:00)
computing PCA
    with n_comps=50
    finished (0:01:46)
computing neighbors
    using 'X_pca' with n_pcs = 50
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:14)
computing tSNE
    using 'X_pca' with n_pcs = 50
    using the 'MulticoreTSNE' package by Ulyanov (2017)
    finished: added
    'X_tsne', tSNE coordinates (adata.obsm) (0:01:35)
computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm) (0:00:09)
running Leiden clustering
    finished: found 25 clusters and added
    'leiden', the cluster labels (adata.obs, categorical) (0:00:01)

We observe a batch effect.

[9]:
sc.pl.umap(adata_raw,color=['leiden'],legend_fontsize=10)
../_images/tutorial_Integration_scATAC-seq_16_0.png
[10]:
sc.pl.umap(adata_raw,color=['batch'],legend_fontsize=10)
../_images/tutorial_Integration_scATAC-seq_17_0.png
[11]:
adata_raw
[11]:
AnnData object with n_obs × n_vars = 13746 × 30076
    obs: 'batch', 'nb_features', 'leiden'
    var: 'n_cells', 'prop_shared_cells', 'variability_score'
    uns: 'log1p', 'pca', 'neighbors', 'umap', 'leiden', 'leiden_colors', 'batch_colors'
    obsm: 'X_pca', 'X_tsne', 'X_umap'
    varm: 'PCs'
    layers: 'binary', 'normalised'
    obsp: 'distances', 'connectivities'

Integrating data using SCALEX

The batch effects can be well-resolved using SCALEX.

Note

Here we use GPU to speed up the calculation process, however, you can get the same level of performance only using cpu.

[12]:
adata=SCALEX('mouse_brain_atac.h5ad',batch_name='batch',profile='ATAC',
            min_features=1, min_cells=5, n_top_features=30000,outdir='ATAC_output/',show=False,gpu=8)
2021-03-30 20:21:45,161 - root - INFO - Raw dataset shape: (13746, 479127)
2021-03-30 20:21:45,166 - root - INFO - Preprocessing
2021-03-30 20:21:45,639 - root - INFO - Filtering cells
2021-03-30 20:21:47,224 - root - INFO - Filtering features
filtered out 11140 genes that are detected in less than 5 cells
2021-03-30 20:21:49,461 - root - INFO - Finding variable features
2021-03-30 20:21:51,582 - root - INFO - Normalizing total per cell
normalizing counts per cell
    finished (0:00:00)
2021-03-30 20:21:51,690 - root - INFO - Batch specific maxabs scaling
2021-03-30 20:22:16,926 - root - INFO - Processed dataset shape: (13746, 30076)
2021-03-30 20:22:17,234 - root - INFO - model
VAE(
  (encoder): Encoder(
    (enc): NN(
      (net): ModuleList(
        (0): Block(
          (fc): Linear(in_features=30076, out_features=1024, bias=True)
          (norm): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (act): ReLU()
        )
      )
    )
    (mu_enc): NN(
      (net): ModuleList(
        (0): Block(
          (fc): Linear(in_features=1024, out_features=10, bias=True)
        )
      )
    )
    (var_enc): NN(
      (net): ModuleList(
        (0): Block(
          (fc): Linear(in_features=1024, out_features=10, bias=True)
        )
      )
    )
  )
  (decoder): NN(
    (net): ModuleList(
      (0): Block(
        (fc): Linear(in_features=10, out_features=30076, bias=True)
        (norm): DSBatchNorm(
          (bns): ModuleList(
            (0): BatchNorm1d(30076, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (1): BatchNorm1d(30076, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          )
        )
        (act): Sigmoid()
      )
    )
  )
)
Epochs: 100%|██████████| 141/141 [10:53<00:00,  4.63s/it, recon_loss=992.783,kl_loss=3.643]
2021-03-30 20:33:17,866 - root - INFO - Output dir: ATAC_output//
2021-03-30 20:33:22,489 - root - INFO - Plot umap
computing neighbors
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:03)
computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm) (0:00:09)
running Leiden clustering
    finished: found 16 clusters and added
    'leiden', the cluster labels (adata.obs, categorical) (0:00:01)
WARNING: saving figure to file ATAC_output/umap.pdf
[13]:
adata
[13]:
AnnData object with n_obs × n_vars = 13746 × 30076
    obs: 'batch', 'n_genes', 'leiden'
    var: 'n_cells', 'prop_shared_cells', 'variability_score'
    uns: 'neighbors', 'umap', 'leiden', 'batch_colors', 'leiden_colors'
    obsm: 'latent', 'X_umap'
    obsp: 'distances', 'connectivities'

While there seems to be some strong batch-effect in all cell types, SCALEX can integrate them homogeneously.

[14]:
sc.settings.set_figure_params(dpi=80, facecolor='white',figsize=(3,3),frameon=True)
[15]:
sc.pl.umap(adata,color=['leiden'],legend_fontsize=10)
../_images/tutorial_Integration_scATAC-seq_26_0.png
[16]:
sc.pl.umap(adata,color=['batch'],legend_fontsize=10)
../_images/tutorial_Integration_scATAC-seq_27_0.png

The integrated data is stored as adata.h5ad in the output directory assigned by outdir parameter in SCALE function.