Stars PyPI PyPIDownloads Docs

Online single-cell data integration through projecting heterogeneous datasets into a common cell-embedding space

Tutorial

Integration

before integration

_images/pbmc_before_integration.png

after SCALEX integration

_images/pbmc_integration.png

Projection

Map new data to the embeddings of reference

A pancreas reference was created by integrating eight batches.

Here, map pancreas_gse81547, pancreas_gse83139 and pancreas_gse114297 to the embeddings of pancreas reference.

_images/pancreas_projection.png

Label transfer

Annotate cells in new data through label transfer

Label transfer tabula muris data and mouse kidney data from mouse atlas reference

mouse atlas reference

_images/mouse_atlas_reference.png

query tabula muris aging and query mouse kidney

_images/query_tabula_muris_aging.png _images/query_kidney.png

Integration scATAC-seq data

_images/scATAC_integration.png

Integration cross-modality data

Integrate scRNA-seq and scATAC-seq dataset

_images/cross-modality_integration.png

Spatial data (To be updated)

Integrating spatial data with scRNA-seq

Examples

Integrating PBMC data using SCALEX

The following tutorial demonstrates how to use SCALEX for integrating PBMC data.

There are two parts of this tutorial:

  • Seeing the batch effect. This part will show the batch effects of two PBMC datasets from single cell 3’ and 5’ gene expression libraries that used in SCALEX manuscript.

  • Integrating data using SCALEX. This part will show you how to perform batch correction using SCALEX function in SCALEX.

[1]:
import scalex
from scalex.function import SCALEX
from scalex.plot import embedding
import scanpy as sc
import pandas as pd
import numpy as np
import matplotlib
from matplotlib import pyplot as plt
import seaborn as sns
[2]:
sc.settings.verbosity = 3
sc.settings.set_figure_params(dpi=80, facecolor='white',figsize=(3,3),frameon=True)
sc.logging.print_header()
plt.rcParams['axes.unicode_minus']=False
scanpy==1.6.1 anndata==0.7.5 umap==0.4.6 numpy==1.20.1 scipy==1.6.1 pandas==1.1.3 scikit-learn==0.23.2 statsmodels==0.12.0 python-igraph==0.8.3 louvain==0.7.0 leidenalg==0.8.3
[3]:
sns.__version__
[3]:
'0.10.1'
[4]:
scalex.__version__
[4]:
'0.2.0'
Integrating data using SCALEX

The batch effects can be well-resolved using SCALEX.

Note

Here we use GPU to speed up the calculation process, however, you can get the same level of performance only using cpu.

[12]:
# ! wget http://zhanglab.net/scalex-tutorial/pbmc.h5ad
adata=SCALEX('pbmc.h5ad', batch_name='batch', outdir='pbmc_output/')
2021-03-30 20:16:38,586 - root - INFO - Raw dataset shape: (15476, 33694)
2021-03-30 20:16:38,588 - root - INFO - Preprocessing
2021-03-30 20:16:38,616 - root - INFO - Filtering cells
Trying to set attribute `.obs` of view, copying.
2021-03-30 20:16:39,462 - root - INFO - Filtering features
filtered out 13316 genes that are detected in less than 3 cells
2021-03-30 20:16:39,952 - root - INFO - Normalizing total per cell
normalizing counts per cell
    finished (0:00:00)
2021-03-30 20:16:40,076 - root - INFO - Log1p transforming
2021-03-30 20:16:40,493 - root - INFO - Finding variable features
If you pass `n_top_genes`, all cutoffs are ignored.
extracting highly variable genes
    finished (0:00:01)
--> added
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    'dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)
2021-03-30 20:16:42,776 - root - INFO - Batch specific maxabs scaling
2021-03-30 20:16:44,173 - root - INFO - Processed dataset shape: (15476, 2000)
2021-03-30 20:16:44,192 - root - INFO - model
VAE(
  (encoder): Encoder(
    (enc): NN(
      (net): ModuleList(
        (0): Block(
          (fc): Linear(in_features=2000, out_features=1024, bias=True)
          (norm): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (act): ReLU()
        )
      )
    )
    (mu_enc): NN(
      (net): ModuleList(
        (0): Block(
          (fc): Linear(in_features=1024, out_features=10, bias=True)
        )
      )
    )
    (var_enc): NN(
      (net): ModuleList(
        (0): Block(
          (fc): Linear(in_features=1024, out_features=10, bias=True)
        )
      )
    )
  )
  (decoder): NN(
    (net): ModuleList(
      (0): Block(
        (fc): Linear(in_features=10, out_features=2000, bias=True)
        (norm): DSBatchNorm(
          (bns): ModuleList(
            (0): BatchNorm1d(2000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (1): BatchNorm1d(2000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          )
        )
        (act): Sigmoid()
      )
    )
  )
)
Epochs: 100%|██████████| 125/125 [06:11<00:00,  2.97s/it, recon_loss=182.388,kl_loss=3.988]
2021-03-30 20:23:03,050 - root - INFO - Output dir: pbmc_output//
2021-03-30 20:23:07,564 - root - INFO - Plot umap
computing neighbors
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:03)
computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm) (0:00:12)
running Leiden clustering
    finished: found 15 clusters and added
    'leiden', the cluster labels (adata.obs, categorical) (0:00:05)
WARNING: saving figure to file pbmc_output/umap.pdf
[13]:
adata
[13]:
AnnData object with n_obs × n_vars = 15476 × 2000
    obs: 'batch', 'celltype', 'protocol', 'celltype0', 'n_genes', 'leiden'
    var: 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'
    uns: 'log1p', 'hvg', 'neighbors', 'umap', 'leiden', 'batch_colors', 'celltype_colors', 'leiden_colors'
    obsm: 'latent', 'X_umap'
    obsp: 'distances', 'connectivities'

While there seems to be some strong batch-effect in all cell types, SCALEX can integrate them homogeneously.

[14]:
sc.settings.set_figure_params(dpi=80, facecolor='white',figsize=(3,3),frameon=True)
[15]:
sc.pl.umap(adata,color=['celltype'],legend_fontsize=10)
_images/tutorial_Integration_PBMC_13_0.png
[16]:
sc.pl.umap(adata,color=['batch'],legend_fontsize=10)
_images/tutorial_Integration_PBMC_14_0.png

The integrated data is stored as adata.h5ad in the output directory assigned by outdir parameter in SCALEX function.

Projection pancreas data using SCALEX

The following tutorial demonstrates how to use SCALEX for integrating data and projection new data adata_query onto an annotated reference adata_ref.

There are five parts of this tutorial:

  • Seeing the batch effect. This part will show the batch effects of eight pancreas datasets that used in SCALEX manuscript.

  • Integrating data using SCALEX. This part will show you how to perform batch correction and construct a reference batch adata_ref using SCALEX function in SCALEX.

  • Mapping onto a reference batch using projection function. The third part will describe the usage of projection function in SCALEX to map three batches query dataset adata_query onto the reference batch adata_ref you construetd in part two.

  • Visualizing distributions across batches. Often, batches correspond to experiments that one wants to compare. SCALEX v2 offers embedding function to convenient visualize for this.

  • Label transfer. SCALEX offers label_transfer function to conveniently transfer labels from reference datasets to query datasets.

[1]:
import scalex
from scalex import SCALEX, label_transfer
from scalex.plot import embedding
import scanpy as sc
import pandas as pd
import numpy as np
import matplotlib
from matplotlib import pyplot as plt
import seaborn as sns
[2]:
sc.settings.verbosity = 3
sc.settings.set_figure_params(dpi=80, facecolor='white',figsize=(3,3),frameon=True)
sc.logging.print_header()
plt.rcParams['axes.unicode_minus']=False
scanpy==1.6.1 anndata==0.7.5 umap==0.4.6 numpy==1.20.1 scipy==1.6.1 pandas==1.1.3 scikit-learn==0.23.2 statsmodels==0.12.0 python-igraph==0.8.3 louvain==0.7.0 leidenalg==0.8.3
[3]:
sns.__version__
[3]:
'0.10.1'
[4]:
scalex.__version__
[4]:
'0.2.0'
Seeing the batch effect

The pancreas data has been used in the Seurat v3 and Harmony paper.

On a unix system, you can uncomment and run the following to download the count matrix in its anndata format.

[5]:
# ! wget http://zhanglab.net/scalex-tutorial/pancreas.h5ad
# ! wget http://zhanglab.net/scalex-tutorial/pancreas_query.h5ad
[6]:
adata_raw=sc.read('pancreas.h5ad')
adata_raw
[6]:
AnnData object with n_obs × n_vars = 16401 × 14895
    obs: 'batch', 'celltype', 'disease', 'donor', 'library', 'protocol'

Inspect the batches contained in the dataset.

[7]:
adata_raw.obs.batch.value_counts()
[7]:
pancreas_indrop3       3605
pancreas_celseq2       3072
pancreas_smartseq2     2394
pancreas_indrop1       1937
pancreas_celseq        1728
pancreas_indrop2       1724
pancreas_indrop4       1303
pancreas_fluidigmc1     638
Name: batch, dtype: int64

The data processing procedure is according to the scanpy tutorial [Preprocessing and clustering 3k PBMCs].

[8]:
sc.pp.filter_cells(adata_raw, min_genes=600)
sc.pp.filter_genes(adata_raw, min_cells=3)
adata_raw = adata_raw[:, [gene for gene in adata_raw.var_names if not str(gene).startswith(tuple(['ERCC', 'MT-', 'mt-']))]]
sc.pp.normalize_total(adata_raw, target_sum=1e4)
sc.pp.log1p(adata_raw)
sc.pp.highly_variable_genes(adata_raw, min_mean=0.0125, max_mean=3, min_disp=0.5)
adata_raw.raw = adata_raw
adata_raw = adata_raw[:, adata_raw.var.highly_variable]
sc.pp.scale(adata_raw, max_value=10)
sc.pp.pca(adata_raw)
sc.pp.neighbors(adata_raw)
sc.tl.umap(adata_raw)
filtered out 1124 cells that have less than 600 genes expressed
normalizing counts per cell
    finished (0:00:00)
extracting highly variable genes
    finished (0:00:01)
--> added
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    'dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)
... as `zero_center=True`, sparse input is densified and may lead to large memory consumption
computing PCA
    on highly variable genes
    with n_comps=50
    finished (0:00:03)
computing neighbors
    using 'X_pca' with n_pcs = 50
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:15)
computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm) (0:00:09)

We observe a batch effect.

[9]:
sc.pl.umap(adata_raw,color=['celltype'],legend_fontsize=10)
_images/tutorial_Projection_pancreas_16_0.png
[10]:
sc.pl.umap(adata_raw,color=['batch'],legend_fontsize=10)
_images/tutorial_Projection_pancreas_17_0.png
[11]:
adata_raw
[11]:
AnnData object with n_obs × n_vars = 15277 × 2086
    obs: 'batch', 'celltype', 'disease', 'donor', 'library', 'protocol', 'n_genes'
    var: 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'log1p', 'hvg', 'pca', 'neighbors', 'umap', 'celltype_colors', 'batch_colors'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'
Integrating data using SCALEX

The batch effects can be well-resolved using SCALEX.

Note

Here we use GPU to speed up the calculation process, however, you can get the same level of performance only using cpu.

[12]:
adata_ref=SCALEX('pancreas.h5ad',batch_name='batch',min_features=600, min_cells=3, outdir='pancreas_output/',show=False,gpu=7)
2021-03-30 20:21:06,757 - root - INFO - Raw dataset shape: (16401, 14895)
2021-03-30 20:21:06,759 - root - INFO - Preprocessing
2021-03-30 20:21:06,785 - root - INFO - Filtering cells
filtered out 1124 cells that have less than 600 genes expressed
Trying to set attribute `.obs` of view, copying.
2021-03-30 20:21:09,107 - root - INFO - Filtering features
2021-03-30 20:21:10,607 - root - INFO - Normalizing total per cell
normalizing counts per cell
    finished (0:00:00)
2021-03-30 20:21:10,850 - root - INFO - Log1p transforming
2021-03-30 20:21:11,759 - root - INFO - Finding variable features
If you pass `n_top_genes`, all cutoffs are ignored.
extracting highly variable genes
    finished (0:00:04)
--> added
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    'dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)
2021-03-30 20:21:17,003 - root - INFO - Batch specific maxabs scaling
2021-03-30 20:21:19,528 - root - INFO - Processed dataset shape: (15277, 2000)
2021-03-30 20:21:19,566 - root - INFO - model
VAE(
  (encoder): Encoder(
    (enc): NN(
      (net): ModuleList(
        (0): Block(
          (fc): Linear(in_features=2000, out_features=1024, bias=True)
          (norm): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (act): ReLU()
        )
      )
    )
    (mu_enc): NN(
      (net): ModuleList(
        (0): Block(
          (fc): Linear(in_features=1024, out_features=10, bias=True)
        )
      )
    )
    (var_enc): NN(
      (net): ModuleList(
        (0): Block(
          (fc): Linear(in_features=1024, out_features=10, bias=True)
        )
      )
    )
  )
  (decoder): NN(
    (net): ModuleList(
      (0): Block(
        (fc): Linear(in_features=10, out_features=2000, bias=True)
        (norm): DSBatchNorm(
          (bns): ModuleList(
            (0): BatchNorm1d(2000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (1): BatchNorm1d(2000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (2): BatchNorm1d(2000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (3): BatchNorm1d(2000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (4): BatchNorm1d(2000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (5): BatchNorm1d(2000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (6): BatchNorm1d(2000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (7): BatchNorm1d(2000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          )
        )
        (act): Sigmoid()
      )
    )
  )
)
Epochs: 100%|██████████| 127/127 [09:11<00:00,  4.34s/it, recon_loss=266.572,kl_loss=4.667]
2021-03-30 20:30:38,361 - root - INFO - Output dir: pancreas_output//
2021-03-30 20:30:47,018 - root - INFO - Plot umap
computing neighbors
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:03)
computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm) (0:00:10)
running Leiden clustering
    finished: found 13 clusters and added
    'leiden', the cluster labels (adata.obs, categorical) (0:00:02)
WARNING: saving figure to file pancreas_output/umap.pdf
[13]:
adata_ref
[13]:
AnnData object with n_obs × n_vars = 15277 × 2000
    obs: 'batch', 'celltype', 'disease', 'donor', 'library', 'protocol', 'n_genes', 'leiden'
    var: 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'
    uns: 'log1p', 'hvg', 'neighbors', 'umap', 'leiden', 'batch_colors', 'celltype_colors', 'leiden_colors'
    obsm: 'latent', 'X_umap'
    obsp: 'distances', 'connectivities'

While there seems to be some strong batch-effect in all cell types, SCALEX can integrate them homogeneously.

[14]:
sc.settings.set_figure_params(dpi=80, facecolor='white',figsize=(3,3),frameon=True)
[15]:
sc.pl.umap(adata_ref,color=['celltype'],legend_fontsize=10)
_images/tutorial_Projection_pancreas_26_0.png
[16]:
sc.pl.umap(adata_ref,color=['batch'],legend_fontsize=10)
_images/tutorial_Projection_pancreas_27_0.png

The integrated data is stored as adata.h5ad in the output directory assigned by outdir parameter in SCALEX function.

Mapping onto a reference batch using projection function

The pancreas query data are available through the Gene Expression Omnibus under accession GSE114297, GSE81547 and GSE83139.

[17]:
adata_query=sc.read('pancreas_query.h5ad')
adata_query
[17]:
AnnData object with n_obs × n_vars = 23963 × 31884
    obs: 'batch', 'celltype', 'disease', 'donor', 'protocol'

Inspect the batches contained in adata_query.

[18]:
adata_query.obs.batch.value_counts()
[18]:
pancreas_gse114297    20784
pancreas_gse81547      2544
pancreas_gse83139       635
Name: batch, dtype: int64

SCALEX provides a projection function for mapping new data adata_query onto the reference batch adata_ref.

[19]:
adata=SCALEX('pancreas_query.h5ad',batch_name='batch',min_features=600, min_cells=3,
             outdir='pancreas_projection/', projection='pancreas_output/',show=False,gpu=7)
2021-03-30 20:31:47,177 - root - INFO - Raw dataset shape: (23963, 31884)
2021-03-30 20:31:47,177 - root - INFO - Raw dataset shape: (23963, 31884)
2021-03-30 20:31:47,180 - root - INFO - Preprocessing
2021-03-30 20:31:47,180 - root - INFO - Preprocessing
2021-03-30 20:31:47,236 - root - INFO - Filtering cells
2021-03-30 20:31:47,236 - root - INFO - Filtering cells
filtered out 219 cells that have less than 600 genes expressed
Trying to set attribute `.obs` of view, copying.
2021-03-30 20:31:49,744 - root - INFO - Filtering features
2021-03-30 20:31:49,744 - root - INFO - Filtering features
2021-03-30 20:31:51,094 - root - INFO - Normalizing total per cell
2021-03-30 20:31:51,094 - root - INFO - Normalizing total per cell
normalizing counts per cell
    finished (0:00:00)
2021-03-30 20:31:51,430 - root - INFO - Log1p transforming
2021-03-30 20:31:51,430 - root - INFO - Log1p transforming
2021-03-30 20:31:52,607 - root - INFO - Finding variable features
2021-03-30 20:31:52,607 - root - INFO - Finding variable features
There are 2000 gene in selected genes
2021-03-30 20:31:56,488 - root - INFO - Batch specific maxabs scaling
2021-03-30 20:31:56,488 - root - INFO - Batch specific maxabs scaling
2021-03-30 20:31:59,442 - root - INFO - Processed dataset shape: (23744, 2000)
2021-03-30 20:31:59,442 - root - INFO - Processed dataset shape: (23744, 2000)
2021-03-30 20:32:04,207 - root - INFO - Output dir: pancreas_projection//
2021-03-30 20:32:04,207 - root - INFO - Output dir: pancreas_projection//
... storing 'batch' as categorical
... storing 'celltype' as categorical
... storing 'disease' as categorical
... storing 'donor' as categorical
... storing 'library' as categorical
... storing 'protocol' as categorical
... storing 'leiden' as categorical
2021-03-30 20:32:08,411 - root - INFO - Plot umap
2021-03-30 20:32:08,411 - root - INFO - Plot umap
computing neighbors
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:08)
computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm) (0:00:26)
running Leiden clustering
    finished: found 13 clusters and added
    'leiden', the cluster labels (adata.obs, categorical) (0:00:10)
WARNING: saving figure to file pancreas_projection/X_umap_reference.pdf
WARNING: saving figure to file pancreas_projection/X_umap_query.pdf

Load integrated data adata that contained adata_ref and adata_query.

[20]:
sc.settings.set_figure_params(dpi=80, facecolor='white',figsize=(3,3),frameon=True)
[21]:
embedding(adata,groupby='projection')
_images/tutorial_Projection_pancreas_38_0.png
_images/tutorial_Projection_pancreas_38_1.png
[22]:
adata
[22]:
AnnData object with n_obs × n_vars = 39021 × 2000
    obs: 'batch', 'celltype', 'disease', 'donor', 'library', 'protocol', 'n_genes', 'leiden', 'projection'
    var: 'n_cells-reference', 'highly_variable-reference', 'means-reference', 'dispersions-reference', 'dispersions_norm-reference', 'highly_variable_nbatches-reference', 'highly_variable_intersection-reference'
    uns: 'neighbors', 'umap', 'leiden'
    obsm: 'latent', 'X_umap'
    obsp: 'distances', 'connectivities'

Inspect the batches contained in adata.

[23]:
adata.obs.batch.value_counts()
[23]:
pancreas_gse114297     20573
pancreas_indrop3        3605
pancreas_gse81547       2536
pancreas_celseq2        2440
pancreas_smartseq2      2394
pancreas_indrop1        1937
pancreas_indrop2        1724
pancreas_indrop4        1303
pancreas_celseq         1236
pancreas_fluidigmc1      638
pancreas_gse83139        635
Name: batch, dtype: int64
Visualizing distributions across batches
[24]:
sc.pl.umap(adata[adata.obs.batch.isin(['pancreas_gse114297','pancreas_gse81547','pancreas_gse83139'])],color='batch',legend_fontsize=10)
embedding(adata[adata.obs.batch.isin(['pancreas_gse114297','pancreas_gse81547','pancreas_gse83139'])],legend_fontsize=10)
Trying to set attribute `.uns` of view, copying.
_images/tutorial_Projection_pancreas_43_1.png
Trying to set attribute `.obs` of view, copying.
_images/tutorial_Projection_pancreas_43_3.png
_images/tutorial_Projection_pancreas_43_4.png
_images/tutorial_Projection_pancreas_43_5.png

The projection results is stored as adata.h5ad in the output directory assigned by outdir parameter in SCALE function.

Label transfer

We can also use SCALEX to transfer data from one dataset to another. Here, we demonstrate data transfer between two scRNA-seq datasets by transferring the cell type label from the adata_ref and the adata_query.

[25]:
adata_query=adata[adata.obs.projection=='query']
adata_query
[25]:
View of AnnData object with n_obs × n_vars = 23744 × 2000
    obs: 'batch', 'celltype', 'disease', 'donor', 'library', 'protocol', 'n_genes', 'leiden', 'projection'
    var: 'n_cells-reference', 'highly_variable-reference', 'means-reference', 'dispersions-reference', 'dispersions_norm-reference', 'highly_variable_nbatches-reference', 'highly_variable_intersection-reference'
    uns: 'neighbors', 'umap', 'leiden'
    obsm: 'latent', 'X_umap'
    obsp: 'distances', 'connectivities'
[26]:
adata_ref=adata[adata.obs.projection=='reference']
adata_ref
[26]:
View of AnnData object with n_obs × n_vars = 15277 × 2000
    obs: 'batch', 'celltype', 'disease', 'donor', 'library', 'protocol', 'n_genes', 'leiden', 'projection'
    var: 'n_cells-reference', 'highly_variable-reference', 'means-reference', 'dispersions-reference', 'dispersions_norm-reference', 'highly_variable_nbatches-reference', 'highly_variable_intersection-reference'
    uns: 'neighbors', 'umap', 'leiden'
    obsm: 'latent', 'X_umap'
    obsp: 'distances', 'connectivities'
[27]:
adata_query.obs['celltype_transfer']=label_transfer(adata_ref, adata_query, rep='latent', label='celltype')
Trying to set attribute `.obs` of view, copying.
[28]:
sc.pl.umap(adata_query,color=['celltype_transfer'])
... storing 'celltype_transfer' as categorical
_images/tutorial_Projection_pancreas_50_1.png
[29]:
sc.pl.umap(adata_query,color=['celltype'])
_images/tutorial_Projection_pancreas_51_0.png

Let us first focus on cell types that are conserved with the reference.

[30]:
obs_query = adata_query.obs
conserved_categories = obs_query.celltype.cat.categories.intersection(obs_query.celltype_transfer.cat.categories)  # intersected categories
obs_query_conserved = obs_query.loc[obs_query.celltype.isin(conserved_categories) & obs_query.celltype_transfer.isin(conserved_categories)]  # intersect categories
obs_query_conserved.celltype.cat.remove_unused_categories(inplace=True)  # remove unused categoriyes
obs_query_conserved.celltype_transfer.cat.remove_unused_categories(inplace=True)  # remove unused categoriyes
obs_query_conserved.celltype_transfer.cat.reorder_categories(obs_query_conserved.celltype.cat.categories, inplace=True)  # fix category ordering
pd.crosstab(obs_query_conserved.celltype, obs_query_conserved.celltype_transfer)
[30]:
celltype_transfer acinar activated_stellate alpha alpha_er beta beta_er delta ductal endothelial epsilon gamma macrophage mast quiescent_stellate schwann
celltype
acinar 1346 0 3 0 3 0 0 2 0 0 0 0 0 1 0
activated_stellate 0 1033 0 0 2 0 0 0 0 0 0 0 0 119 1
alpha 2 0 7527 83 13 0 3 0 0 0 11 0 1 0 0
alpha_er 0 0 159 105 2 1 0 0 0 0 0 0 0 0 0
beta 3 0 119 4 8683 34 17 1 0 0 3 0 0 0 0
beta_er 0 0 0 0 41 56 0 1 0 0 0 0 0 0 0
delta 1 0 8 0 34 0 1099 0 0 1 6 0 0 0 0
ductal 2 0 3 0 2 0 3 1889 0 0 0 0 0 0 0
endothelial 0 1 0 0 1 0 1 0 456 0 0 0 0 1 0
epsilon 1 0 0 0 0 0 3 0 0 10 3 0 0 0 0
gamma 0 0 7 2 3 0 1 0 0 2 578 0 0 0 0
macrophage 0 0 0 0 0 0 0 0 0 0 0 58 14 0 0
mast 1 0 0 0 1 0 0 0 0 0 0 3 6 0 0
quiescent_stellate 0 53 0 0 2 0 0 0 0 0 0 0 0 68 1
schwann 1 0 2 0 0 1 0 0 0 0 0 0 0 0 34

Let us now move on to look at all cell types.

[31]:
pd.crosstab(adata_query.obs.celltype, adata_query.obs.celltype_transfer)
[31]:
celltype_transfer acinar activated_stellate alpha alpha_er beta beta_er delta ductal endothelial epithelial epsilon gamma macrophage mast quiescent_stellate schwann
celltype
acinar 1346 0 3 0 3 0 0 2 0 0 0 0 0 0 1 0
activated_stellate 0 1033 0 0 2 0 0 0 0 0 0 0 0 0 119 1
alpha 2 0 7527 83 13 0 3 0 0 1 0 11 0 1 0 0
alpha_er 0 0 159 105 2 1 0 0 0 0 0 0 0 0 0 0
beta 3 0 119 4 8683 34 17 1 0 0 0 3 0 0 0 0
beta_er 0 0 0 0 41 56 0 1 0 0 0 0 0 0 0 0
delta 1 0 8 0 34 0 1099 0 0 0 1 6 0 0 0 0
ductal 2 0 3 0 2 0 3 1889 0 0 0 0 0 0 0 0
endothelial 0 1 0 0 1 0 1 0 456 0 0 0 0 0 1 0
epsilon 1 0 0 0 0 0 3 0 0 0 10 3 0 0 0 0
gamma 0 0 7 2 3 0 1 0 0 0 2 578 0 0 0 0
macrophage 0 0 0 0 0 0 0 0 0 0 0 0 58 14 0 0
mast 1 0 0 0 1 0 0 0 0 1 0 0 3 6 0 0
quiescent_stellate 0 53 0 0 2 0 0 0 0 0 0 0 0 0 68 1
schwann 1 0 2 0 0 1 0 0 0 0 0 0 0 0 0 34

Integrating scATAC-seq data using SCALEX

The following tutorial demonstrates how to use SCALEX for integrating scATAC-seq data.

There are two parts of this tutorial:

  • Seeing the batch effect. This part will show the batch effects of two adult mouse brain datasets from single nucleus ATAC-seq (snATAC) and droplet-based platform (Mouse Brain 10X) that used in SCALEX manuscript.

  • Integrating data using SCALEX. This part will show you how to perform batch correction using SCALEX function in SCALEX.

[1]:
import scalex
from scalex.function import SCALEX
from scalex.plot import embedding
import scanpy as sc
import pandas as pd
import numpy as np
import matplotlib
from matplotlib import pyplot as plt
import seaborn as sns
import episcanpy as epi
[2]:
sc.settings.verbosity = 3
sc.settings.set_figure_params(dpi=80, facecolor='white',figsize=(3,3),frameon=True)
sc.logging.print_header()
plt.rcParams['axes.unicode_minus']=False
scanpy==1.6.1 anndata==0.7.5 umap==0.4.6 numpy==1.20.1 scipy==1.6.1 pandas==1.1.3 scikit-learn==0.23.2 statsmodels==0.12.0 python-igraph==0.8.3 louvain==0.7.0 leidenalg==0.8.3
[3]:
sns.__version__
[3]:
'0.10.1'
[4]:
scalex.__version__
[4]:
'0.2.0'
Seeing the batch effect

The following data has been used in the SnapATAC paper, has been used here, and can be downloaded from here.

On a unix system, you can uncomment and run the following to download the count matrix in its anndata format.

[5]:
# ! wget http://zhanglab.net/scalex-tutorial/mouse_brain_atac.h5ad
[6]:
adata_raw=sc.read('mouse_brain_atac.h5ad')
adata_raw
[6]:
AnnData object with n_obs × n_vars = 13746 × 479127
    obs: 'batch'

Inspect the batches contained in the dataset.

[7]:
adata_raw.obs.batch.value_counts()
[7]:
snATAC    9646
10X       4100
Name: batch, dtype: int64

The data processing procedure is according to the epiScanpy tutorial [Buenrostro_PBMC_data_processing].

[8]:
epi.pp.filter_cells(adata_raw, min_features=1)
epi.pp.filter_features(adata_raw, min_cells=5)
adata_raw.raw = adata_raw
adata_raw = epi.pp.select_var_feature(adata_raw,
                                      nb_features=30000,
                                      show=False,copy=True)
adata_raw.layers['binary'] = adata_raw.X.copy()
epi.pp.normalize_total(adata_raw)
adata_raw.layers['normalised'] = adata_raw.X.copy()
epi.pp.log1p(adata_raw)
epi.pp.lazy(adata_raw)
epi.tl.leiden(adata_raw)
filtered out 11140 genes that are detected in less than 5 cells
normalizing counts per cell
    finished (0:00:00)
computing PCA
    with n_comps=50
    finished (0:01:46)
computing neighbors
    using 'X_pca' with n_pcs = 50
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:14)
computing tSNE
    using 'X_pca' with n_pcs = 50
    using the 'MulticoreTSNE' package by Ulyanov (2017)
    finished: added
    'X_tsne', tSNE coordinates (adata.obsm) (0:01:35)
computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm) (0:00:09)
running Leiden clustering
    finished: found 25 clusters and added
    'leiden', the cluster labels (adata.obs, categorical) (0:00:01)

We observe a batch effect.

[9]:
sc.pl.umap(adata_raw,color=['leiden'],legend_fontsize=10)
_images/tutorial_Integration_scATAC-seq_16_0.png
[10]:
sc.pl.umap(adata_raw,color=['batch'],legend_fontsize=10)
_images/tutorial_Integration_scATAC-seq_17_0.png
[11]:
adata_raw
[11]:
AnnData object with n_obs × n_vars = 13746 × 30076
    obs: 'batch', 'nb_features', 'leiden'
    var: 'n_cells', 'prop_shared_cells', 'variability_score'
    uns: 'log1p', 'pca', 'neighbors', 'umap', 'leiden', 'leiden_colors', 'batch_colors'
    obsm: 'X_pca', 'X_tsne', 'X_umap'
    varm: 'PCs'
    layers: 'binary', 'normalised'
    obsp: 'distances', 'connectivities'
Integrating data using SCALEX

The batch effects can be well-resolved using SCALEX.

Note

Here we use GPU to speed up the calculation process, however, you can get the same level of performance only using cpu.

[12]:
adata=SCALEX('mouse_brain_atac.h5ad',batch_name='batch',profile='ATAC',
            min_features=1, min_cells=5, n_top_features=30000,outdir='ATAC_output/',show=False,gpu=8)
2021-03-30 20:21:45,161 - root - INFO - Raw dataset shape: (13746, 479127)
2021-03-30 20:21:45,166 - root - INFO - Preprocessing
2021-03-30 20:21:45,639 - root - INFO - Filtering cells
2021-03-30 20:21:47,224 - root - INFO - Filtering features
filtered out 11140 genes that are detected in less than 5 cells
2021-03-30 20:21:49,461 - root - INFO - Finding variable features
2021-03-30 20:21:51,582 - root - INFO - Normalizing total per cell
normalizing counts per cell
    finished (0:00:00)
2021-03-30 20:21:51,690 - root - INFO - Batch specific maxabs scaling
2021-03-30 20:22:16,926 - root - INFO - Processed dataset shape: (13746, 30076)
2021-03-30 20:22:17,234 - root - INFO - model
VAE(
  (encoder): Encoder(
    (enc): NN(
      (net): ModuleList(
        (0): Block(
          (fc): Linear(in_features=30076, out_features=1024, bias=True)
          (norm): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (act): ReLU()
        )
      )
    )
    (mu_enc): NN(
      (net): ModuleList(
        (0): Block(
          (fc): Linear(in_features=1024, out_features=10, bias=True)
        )
      )
    )
    (var_enc): NN(
      (net): ModuleList(
        (0): Block(
          (fc): Linear(in_features=1024, out_features=10, bias=True)
        )
      )
    )
  )
  (decoder): NN(
    (net): ModuleList(
      (0): Block(
        (fc): Linear(in_features=10, out_features=30076, bias=True)
        (norm): DSBatchNorm(
          (bns): ModuleList(
            (0): BatchNorm1d(30076, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (1): BatchNorm1d(30076, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          )
        )
        (act): Sigmoid()
      )
    )
  )
)
Epochs: 100%|██████████| 141/141 [10:53<00:00,  4.63s/it, recon_loss=992.783,kl_loss=3.643]
2021-03-30 20:33:17,866 - root - INFO - Output dir: ATAC_output//
2021-03-30 20:33:22,489 - root - INFO - Plot umap
computing neighbors
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:03)
computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm) (0:00:09)
running Leiden clustering
    finished: found 16 clusters and added
    'leiden', the cluster labels (adata.obs, categorical) (0:00:01)
WARNING: saving figure to file ATAC_output/umap.pdf
[13]:
adata
[13]:
AnnData object with n_obs × n_vars = 13746 × 30076
    obs: 'batch', 'n_genes', 'leiden'
    var: 'n_cells', 'prop_shared_cells', 'variability_score'
    uns: 'neighbors', 'umap', 'leiden', 'batch_colors', 'leiden_colors'
    obsm: 'latent', 'X_umap'
    obsp: 'distances', 'connectivities'

While there seems to be some strong batch-effect in all cell types, SCALEX can integrate them homogeneously.

[14]:
sc.settings.set_figure_params(dpi=80, facecolor='white',figsize=(3,3),frameon=True)
[15]:
sc.pl.umap(adata,color=['leiden'],legend_fontsize=10)
_images/tutorial_Integration_scATAC-seq_26_0.png
[16]:
sc.pl.umap(adata,color=['batch'],legend_fontsize=10)
_images/tutorial_Integration_scATAC-seq_27_0.png

The integrated data is stored as adata.h5ad in the output directory assigned by outdir parameter in SCALE function.

Integration cross-modality data using SCALEX

The following tutorial demonstrates how to use SCALEX for integrating scRNA-seq and scATAC-seq data

There are mainly two steps:

  • Create a gene activity matrix from scATAC-seq data. This step follows the standard workflow of Signac for scATAC-seq data analysis. We uses the function GeneActivity of Signac and calculate the activity of each gene in the genome by assessing the chromatin accessibility associated with each gene, and create a new gene activity matrix derived from the scATAC-seq data. More details are here.

  • Integrate. We regard gene expression matrix and gene activity matrix as two batches of one dataset and use SCALEX for integration.

For this tutorial, we used a cross-modality PBMC data between scRNA-seq and scATAC-seq provided by 10X Genomics, and both scRNA-seq and scATAC-seq data are available through the 10x Genomics website.

Create a gene activity matrix (R)
[1]:
suppressPackageStartupMessages(library(Signac))
suppressPackageStartupMessages(library(Seurat))
suppressPackageStartupMessages(library(GenomeInfoDb))
suppressPackageStartupMessages(library(EnsDb.Hsapiens.v75))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(patchwork))
set.seed(1234)
options(warn=-1)
Pre-processing

We follow the pre-processing workflow of Signac when pre-processing chromatin data. First we creat a Seurat object by two related input files: cell matrix and fragment file.

[2]:
counts <- Read10X_h5(filename = "atac_v1_pbmc_10k_filtered_peak_bc_matrix.h5")
metadata <- read.csv(
  file = "atac_v1_pbmc_10k_singlecell.csv",
  header = TRUE,
  row.names = 1
)

chrom_assay <- CreateChromatinAssay(
  counts = counts,
  sep = c(":", "-"),
  genome = 'hg19',
  fragments = 'atac_v1_pbmc_10k_fragments.tsv.gz',
  min.cells = 10,
  min.features = 200
)

pbmc <- CreateSeuratObject(
  counts = chrom_assay,
  assay = "peaks",
  meta.data = metadata
)
Computing hash

Now add gene annotations to the pbmc object for the human genome. genome is suggested to be as same as the genome for scRNA-seq reads mapping on.

[4]:
# extract gene annotations from EnsDb
annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v75)

# change to UCSC style since the data was mapped to hg19
seqlevelsStyle(annotations) <- 'UCSC'
genome(annotations) <- "hg19"

# add the gene information to the object
Annotation(pbmc) <- annotations

Compute QC Metrics. If you don’t need to filter cells, ignore this step

[13]:
# compute nucleosome signal score per cell
pbmc <- NucleosomeSignal(object = pbmc)

# compute TSS enrichment score per cell
pbmc <- TSSEnrichment(object = pbmc, fast = FALSE)

# add blacklist ratio and fraction of reads in peaks
pbmc$pct_reads_in_peaks <- pbmc$peak_region_fragments / pbmc$passed_filters * 100
pbmc$blacklist_ratio <- pbmc$blacklist_region_fragments / pbmc$peak_region_fragments
Extracting TSS positions

Finding + strand cut sites

Finding - strand cut sites

Computing mean insertion frequency in flanking regions

Normalizing TSS score

[14]:
pbmc$high.tss <- ifelse(pbmc$TSS.enrichment > 2, 'High', 'Low')
TSSPlot(pbmc, group.by = 'high.tss') + NoLegend()
_images/tutorial_Integration_cross-modality_11_0.png
[15]:
pbmc$nucleosome_group <- ifelse(pbmc$nucleosome_signal > 4, 'NS > 4', 'NS < 4')
FragmentHistogram(object = pbmc, group.by = 'nucleosome_group')
_images/tutorial_Integration_cross-modality_12_0.png
[16]:
VlnPlot(
  object = pbmc,
  features = c('pct_reads_in_peaks', 'peak_region_fragments',
               'TSS.enrichment', 'blacklist_ratio', 'nucleosome_signal'),
  pt.size = 0.1,
  ncol = 5
)
_images/tutorial_Integration_cross-modality_13_0.png

Filter cells that are outliers for the QC metrics.

[17]:
pbmc <- subset(
  x = pbmc,
  subset = peak_region_fragments > 3000 &
    peak_region_fragments < 20000 &
    pct_reads_in_peaks > 15 &
    blacklist_ratio < 0.05 &
    nucleosome_signal < 4 &
    TSS.enrichment > 2
)
pbmc
An object of class Seurat
87561 features across 7060 samples within 1 assay
Active assay: peaks (87561 features, 0 variable features)
Calculate gene activity matrix by GeneActivity() function
[5]:
gene.activities <- GeneActivity(pbmc)
Extracting gene coordinates

Extracting reads overlapping genomic regions

save results

[ ]:
wk_dir <- './'
[2]:
write.table(t(gene.activities), paste(wk_dir,"gene_activity_score.txt",sep=''), sep='\t',quote = F)
Integrate (python)

Now you can use the gene_activity_score together with the gene expression matrix for integration.

You can run SCALE command line directly: SCALEX.py –data_list data1 data2 –batch_categories RNA ATAC -o output_path

  • data1: path or file of scRNA-seq data

  • data2: file of gene_activity_score

and here we show the results before integration and after integration.

[3]:
import scalex
from scalex import SCALEX
from scalex.plot import embedding
import scanpy as sc
import pandas as pd
import numpy as np
import matplotlib
from matplotlib import pyplot as plt
import seaborn as sns
[19]:
sc.settings.verbosity = 3
sc.settings.set_figure_params(dpi=80, facecolor='white',figsize=(3,3),frameon=True)
sc.logging.print_header()
plt.rcParams['axes.unicode_minus']=False
scanpy==1.7.0 anndata==0.7.5 umap==0.5.0 numpy==1.19.2 scipy==1.5.2 pandas==1.1.3 scikit-learn==0.23.2 statsmodels==0.12.0 python-igraph==0.9.0 leidenalg==0.8.3
[3]:
scalex.__version__
[3]:
'2.0.3.dev'

First we merge the RNA and ATAC data and add metadata information, and this processed data is available here

[ ]:
wk_dir = './'
[10]:
adata = sc.read_h5ad(wk_dir+'pbmc_RNA-ATAC.h5ad')
[11]:
adata
[11]:
AnnData object with n_obs × n_vars = 16492 × 13928
    obs: 'celltype', 'tech', 'batch'
[12]:
sc.pp.filter_cells(adata, min_genes=0)
sc.pp.filter_genes(adata, min_cells=0)

sc.pp.normalize_total(adata, inplace=True)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
adata = adata[:, adata.var.highly_variable]

sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata)
sc.pp.neighbors(adata, n_pcs=30, n_neighbors=30)
sc.tl.umap(adata, min_dist=0.1)
normalizing counts per cell
    finished (0:00:00)
If you pass `n_top_genes`, all cutoffs are ignored.
extracting highly variable genes
    finished (0:00:01)
--> added
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    'dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)
... as `zero_center=True`, sparse input is densified and may lead to large memory consumption
computing PCA
    on highly variable genes
    with n_comps=50
    finished (0:00:02)
computing neighbors
    using 'X_pca' with n_pcs = 30
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:14)
computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm) (0:00:29)
[15]:
sc.pl.umap(adata,color=['batch','celltype'],legend_fontsize=10, ncols=2)
_images/tutorial_Integration_cross-modality_32_0.png
[16]:
wk_dir='./' # wk_dir is your local path to store data and results
[17]:
adata = SCALEX(data_list = [wk_dir+'RNA-ATAC.h5ad'],
              min_features=0,
              min_cells=0,
              outdir=wk_dir+'/pbmc_RNA_ATAC/',
              show=False,
              gpu=7)
2021-03-26 11:46:17,234 - root - INFO - Raw dataset shape: (16492, 13928)
2021-03-26 11:46:17,237 - root - INFO - Preprocessing
2021-03-26 11:46:17,260 - root - INFO - Filtering cells
Trying to set attribute `.obs` of view, copying.
2021-03-26 11:46:21,627 - root - INFO - Filtering features
2021-03-26 11:46:24,745 - root - INFO - Normalizing total per cell
normalizing counts per cell
    finished (0:00:00)
2021-03-26 11:46:25,022 - root - INFO - Log1p transforming
2021-03-26 11:46:26,014 - root - INFO - Finding variable features
If you pass `n_top_genes`, all cutoffs are ignored.
extracting highly variable genes
    finished (0:00:03)
--> added
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    'dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)
2021-03-26 11:46:30,637 - root - INFO - Batch specific maxabs scaling
2021-03-26 11:46:32,857 - root - INFO - Processed dataset shape: (16492, 2000)
2021-03-26 11:46:32,908 - root - INFO - model
VAE(
  (encoder): Encoder(
    (enc): NN(
      (net): ModuleList(
        (0): Block(
          (fc): Linear(in_features=2000, out_features=1024, bias=True)
          (norm): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (act): ReLU()
        )
      )
    )
    (mu_enc): NN(
      (net): ModuleList(
        (0): Block(
          (fc): Linear(in_features=1024, out_features=10, bias=True)
        )
      )
    )
    (var_enc): NN(
      (net): ModuleList(
        (0): Block(
          (fc): Linear(in_features=1024, out_features=10, bias=True)
        )
      )
    )
  )
  (decoder): NN(
    (net): ModuleList(
      (0): Block(
        (fc): Linear(in_features=10, out_features=2000, bias=True)
        (norm): DSBatchNorm(
          (bns): ModuleList(
            (0): BatchNorm1d(2000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (1): BatchNorm1d(2000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          )
        )
        (act): Sigmoid()
      )
    )
  )
)
Epochs: 100%|██████████| 117/117 [06:54<00:00,  3.54s/it, recon_loss=274.876,kl_loss=2.892]
2021-03-26 11:53:35,672 - root - INFO - Output dir: .//pbmc_RNA_ATAC//
2021-03-26 11:53:45,553 - root - INFO - Plot umap
computing neighbors
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:03)
computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm) (0:00:26)
running Leiden clustering
    finished: found 14 clusters and added
    'leiden', the cluster labels (adata.obs, categorical) (0:00:05)
WARNING: saving figure to file pbmc_RNA_ATAC/umap.pdf
[20]:
sc.pl.umap(adata,color=['batch','celltype'],legend_fontsize=10, ncols=2)
_images/tutorial_Integration_cross-modality_36_0.png
[ ]:

Installation

PyPI install

Pull SCALE from PyPI (consider using pip3 to access Python 3):

pip install scalex

Pytorch

If you have cuda devices, consider install Pytorch cuda version.:

conda install pytorch torchvision torchaudio -c pytorch

Troubleshooting

Anaconda

If you do not have a working installation of Python 3.6 (or later), consider installing Miniconda (see Installing Miniconda).

Installing Miniconda

After downloading Miniconda, in a unix shell (Linux, Mac), run

cd DOWNLOAD_DIR
chmod +x Miniconda3-latest-VERSION.sh
./Miniconda3-latest-VERSION.sh

and accept all suggestions. Either reopen a new terminal or source ~/.bashrc on Linux/ source ~/.bash_profile on Mac. The whole process takes just a couple of minutes.

Usage

SCALEX provide both commanline tool and api function used in jupyter notebook

Command line

Run SCALEX after installation:

SCALEX.py --data_list data1 data2 --batch_categories batch_name1 batch_name2

data_list: data path of each batch of single-cell dataset

batch_categories: name of each batch, batch_categories will range from 0 to N if not specified

Input

Input can be one of following:

  • single file of format h5ad, csv, txt, mtx or their compression file

  • multiple files of above format

Note

h5ad file input

  • SCALEX will use the batch column in the obs of adata format read from h5ad file as batch information

  • Users can specify any columns in the obs with option: --batch_name name

  • If multiple inputs are given, SCALEX can take each file as individual batch by default, and overload previous batch information, users can change the concat name via option --batch_key other_name

Output

Output will be saved in the output folder including:

  • checkpoint: saved model to reproduce results cooperated with option --checkpoint or -c

  • adata.h5ad: preprocessed data and results including, latent, clustering and imputation

  • umap.png: UMAP visualization of latent representations of cells

  • log.txt: log file of training process

Useful options

  • output folder for saveing results: [-o] or [–outdir]

  • filter rare genes, default 3: [–min_cell]

  • filter low quality cells, default 600: [–min_gene]

  • select the number of highly variable genes, keep all genes with -1, default 2000: [–n_top_genes]

Help

Look for more usage of SCALEX:

SCALEX.py --help

API function

Use SCALEX in jupyter notebook:

from scalex.function import SCALEX
adata = SCALEX(data_list, batch_categories)
or

adata = SCALEX([adata_1, adata_2])

Function of parameters are similar to command line options. Input can be the files of adata or a list of AnnData or one concatenated AnnData Output is a Anndata object for further analysis with scanpy.

AnnData

SCALEX supports scanpy and anndata, which provides the AnnData class.

http://falexwolf.de/img/scanpy/anndata.svg

At the most basic level, an AnnData object adata stores a data matrix adata.X, annotation of observations adata.obs and variables adata.var as pd.DataFrame and unstructured annotation adata.uns as dict. Names of observations and variables can be accessed via adata.obs_names and adata.var_names, respectively. AnnData objects can be sliced like dataframes, for example, adata_subset = adata[:, list_of_gene_names]. For more, see this blog post.

To read a data file to an AnnData object, call:

import scanpy as sc
adata = sc.read(filename)

to initialize an AnnData object. Possibly add further annotation using, e.g., pd.read_csv:

import pandas as pd
anno = pd.read_csv(filename_sample_annotation)
adata.obs['cell_groups'] = anno['cell_groups']  # categorical annotation of type pandas.Categorical
adata.obs['time'] = anno['time']                # numerical annotation of type float
# alternatively, you could also set the whole dataframe
# adata.obs = anno

To write, use:

adata.write(filename)
adata.write_csvs(filename)
adata.write_loom(filename)

API

Import SCALEX:

import scalex

Function

SCALEX([data_list, batch_categories, ...])

Online single-cell data integration through projecting heterogeneous datasets into a common cell-embedding space

label_transfer(ref, query[, rep, label])

Label transfer

Data

Load data

data.load_data(data_list[, ...])

Load dataset with preprocessing

data.concat_data(data_list[, ...])

Concatenate multiple datasets along the observations axis with name batch_key.

data.load_files(root)

Load single cell dataset from files

data.load_file(path[, backed])

Load single cell dataset from file

data.read_mtx(path)

Read mtx format data folder including:

Preprocessing

data.preprocessing(adata[, profile, ...])

Preprocessing single-cell data

data.preprocessing_rna(adata[, ...])

Preprocessing single-cell RNA-seq data

data.preprocessing_atac(adata[, ...])

Preprocessing single-cell ATAC-seq

data.batch_scale(adata[, chunk_size])

Batch-specific scale data

data.reindex(adata, genes[, chunk_size])

Reindex AnnData with gene list

DataLoader

data.SingleCellDataset(adata[, use_layer])

Dataloader of single-cell data

data.BatchSampler(batch_size, batch_id[, ...])

Batch-specific Sampler sampled data of each batch is from the same dataset.

Net

Model

net.vae.VAE(enc, dec[, n_domain])

VAE framework

Layer

net.layer.DSBatchNorm(num_features, n_domain)

Domain-specific Batch Normalization

net.layer.Block(input_dim, output_dim[, ...])

Basic block consist of:

net.layer.NN(input_dim, cfg)

Neural network consist of multi Blocks

net.layer.Encoder(input_dim, cfg)

VAE Encoder

Loss

net.loss.kl_div(mu, var)

net.loss.binary_cross_entropy(recon_x, x)

Utils

net.utils.onehot(y, n)

Make the input tensor one hot tensors

net.utils.EarlyStopping([patience, verbose, ...])

Early stops the training if loss doesn't improve after a given patience.

Plot

plot.embedding(adata[, color, color_map, ...])

plot separated embeddings with others as background

plot.plot_meta(adata[, use_rep, color, ...])

Plot meta correlations among batches

plot.plot_meta2(adata[, use_rep, color, ...])

Plot meta correlations between two batches

plot.plot_confusion(y, y_pred[, save, cmap])

Plot confusion matrix

Metric

Collections of useful measurements for evaluating results.

metrics.batch_entropy_mixing_score(data, batches)

Calculate batch entropy mixing score

metrics.silhouette_score(X, labels, *[, ...])

Compute the mean Silhouette Coefficient of all samples.

Logger

logger.create_logger([name, ch, fh, ...])

News

SCALEX is online on Nature Communications 2022-10-17 SCALEX is available on bioRxiv 2021-04-09

Release

Version 1.0

1.0.0 2022-08-29

Online single-cell data integration through projecting heterogeneous datasets into a common cell-embedding space

News

SCALEX is online on Nature Communications 2022-10-17 SCALEX is available on bioRxiv 2021-04-09

Indices and tables