Online single-cell data integration through projecting heterogeneous datasets into a common cell-embedding space
Tutorial
Integration
before integration

after SCALEX integration

Projection
Map new data to the embeddings of reference
A pancreas reference was created by integrating eight batches.
Here, map pancreas_gse81547, pancreas_gse83139 and pancreas_gse114297 to the embeddings of pancreas reference.

Label transfer
Annotate cells in new data through label transfer
Label transfer tabula muris data and mouse kidney data from mouse atlas reference
mouse atlas reference

query tabula muris aging and query mouse kidney


Integration scATAC-seq data

Integration cross-modality data
Integrate scRNA-seq and scATAC-seq dataset

Spatial data (To be updated)
Integrating spatial data with scRNA-seq
Examples
Integrating PBMC data using SCALEX
The following tutorial demonstrates how to use SCALEX for integrating PBMC data.
There are two parts of this tutorial:
Seeing the batch effect. This part will show the batch effects of two PBMC datasets from single cell 3’ and 5’ gene expression libraries that used in SCALEX manuscript.
Integrating data using SCALEX. This part will show you how to perform batch correction using SCALEX function in SCALEX.
[1]:
import scalex
from scalex.function import SCALEX
from scalex.plot import embedding
import scanpy as sc
import pandas as pd
import numpy as np
import matplotlib
from matplotlib import pyplot as plt
import seaborn as sns
[2]:
sc.settings.verbosity = 3
sc.settings.set_figure_params(dpi=80, facecolor='white',figsize=(3,3),frameon=True)
sc.logging.print_header()
plt.rcParams['axes.unicode_minus']=False
scanpy==1.6.1 anndata==0.7.5 umap==0.4.6 numpy==1.20.1 scipy==1.6.1 pandas==1.1.3 scikit-learn==0.23.2 statsmodels==0.12.0 python-igraph==0.8.3 louvain==0.7.0 leidenalg==0.8.3
[3]:
sns.__version__
[3]:
'0.10.1'
[4]:
scalex.__version__
[4]:
'0.2.0'
Integrating data using SCALEX
The batch effects can be well-resolved using SCALEX.
Note
Here we use GPU to speed up the calculation process, however, you can get the same level of performance only using cpu.
[12]:
# ! wget http://zhanglab.net/scalex-tutorial/pbmc.h5ad
adata=SCALEX('pbmc.h5ad', batch_name='batch', outdir='pbmc_output/')
2021-03-30 20:16:38,586 - root - INFO - Raw dataset shape: (15476, 33694)
2021-03-30 20:16:38,588 - root - INFO - Preprocessing
2021-03-30 20:16:38,616 - root - INFO - Filtering cells
Trying to set attribute `.obs` of view, copying.
2021-03-30 20:16:39,462 - root - INFO - Filtering features
filtered out 13316 genes that are detected in less than 3 cells
2021-03-30 20:16:39,952 - root - INFO - Normalizing total per cell
normalizing counts per cell
finished (0:00:00)
2021-03-30 20:16:40,076 - root - INFO - Log1p transforming
2021-03-30 20:16:40,493 - root - INFO - Finding variable features
If you pass `n_top_genes`, all cutoffs are ignored.
extracting highly variable genes
finished (0:00:01)
--> added
'highly_variable', boolean vector (adata.var)
'means', float vector (adata.var)
'dispersions', float vector (adata.var)
'dispersions_norm', float vector (adata.var)
2021-03-30 20:16:42,776 - root - INFO - Batch specific maxabs scaling
2021-03-30 20:16:44,173 - root - INFO - Processed dataset shape: (15476, 2000)
2021-03-30 20:16:44,192 - root - INFO - model
VAE(
(encoder): Encoder(
(enc): NN(
(net): ModuleList(
(0): Block(
(fc): Linear(in_features=2000, out_features=1024, bias=True)
(norm): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act): ReLU()
)
)
)
(mu_enc): NN(
(net): ModuleList(
(0): Block(
(fc): Linear(in_features=1024, out_features=10, bias=True)
)
)
)
(var_enc): NN(
(net): ModuleList(
(0): Block(
(fc): Linear(in_features=1024, out_features=10, bias=True)
)
)
)
)
(decoder): NN(
(net): ModuleList(
(0): Block(
(fc): Linear(in_features=10, out_features=2000, bias=True)
(norm): DSBatchNorm(
(bns): ModuleList(
(0): BatchNorm1d(2000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(1): BatchNorm1d(2000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(act): Sigmoid()
)
)
)
)
Epochs: 100%|██████████| 125/125 [06:11<00:00, 2.97s/it, recon_loss=182.388,kl_loss=3.988]
2021-03-30 20:23:03,050 - root - INFO - Output dir: pbmc_output//
2021-03-30 20:23:07,564 - root - INFO - Plot umap
computing neighbors
finished: added to `.uns['neighbors']`
`.obsp['distances']`, distances for each pair of neighbors
`.obsp['connectivities']`, weighted adjacency matrix (0:00:03)
computing UMAP
finished: added
'X_umap', UMAP coordinates (adata.obsm) (0:00:12)
running Leiden clustering
finished: found 15 clusters and added
'leiden', the cluster labels (adata.obs, categorical) (0:00:05)
WARNING: saving figure to file pbmc_output/umap.pdf
[13]:
adata
[13]:
AnnData object with n_obs × n_vars = 15476 × 2000
obs: 'batch', 'celltype', 'protocol', 'celltype0', 'n_genes', 'leiden'
var: 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'
uns: 'log1p', 'hvg', 'neighbors', 'umap', 'leiden', 'batch_colors', 'celltype_colors', 'leiden_colors'
obsm: 'latent', 'X_umap'
obsp: 'distances', 'connectivities'
While there seems to be some strong batch-effect in all cell types, SCALEX can integrate them homogeneously.
[14]:
sc.settings.set_figure_params(dpi=80, facecolor='white',figsize=(3,3),frameon=True)
[15]:
sc.pl.umap(adata,color=['celltype'],legend_fontsize=10)

[16]:
sc.pl.umap(adata,color=['batch'],legend_fontsize=10)

The integrated data is stored as adata.h5ad
in the output directory assigned by outdir
parameter in SCALEX function.
Projection pancreas data using SCALEX
The following tutorial demonstrates how to use SCALEX for integrating data and projection new data adata_query
onto an annotated reference adata_ref
.
There are five parts of this tutorial:
Seeing the batch effect. This part will show the batch effects of eight pancreas datasets that used in SCALEX manuscript.
Integrating data using SCALEX. This part will show you how to perform batch correction and construct a reference batch
adata_ref
using SCALEX function in SCALEX.Mapping onto a reference batch using projection function. The third part will describe the usage of projection function in SCALEX to map three batches query dataset
adata_query
onto the reference batchadata_ref
you construetd in part two.Visualizing distributions across batches. Often, batches correspond to experiments that one wants to compare. SCALEX v2 offers embedding function to convenient visualize for this.
Label transfer. SCALEX offers label_transfer function to conveniently transfer labels from reference datasets to query datasets.
[1]:
import scalex
from scalex import SCALEX, label_transfer
from scalex.plot import embedding
import scanpy as sc
import pandas as pd
import numpy as np
import matplotlib
from matplotlib import pyplot as plt
import seaborn as sns
[2]:
sc.settings.verbosity = 3
sc.settings.set_figure_params(dpi=80, facecolor='white',figsize=(3,3),frameon=True)
sc.logging.print_header()
plt.rcParams['axes.unicode_minus']=False
scanpy==1.6.1 anndata==0.7.5 umap==0.4.6 numpy==1.20.1 scipy==1.6.1 pandas==1.1.3 scikit-learn==0.23.2 statsmodels==0.12.0 python-igraph==0.8.3 louvain==0.7.0 leidenalg==0.8.3
[3]:
sns.__version__
[3]:
'0.10.1'
[4]:
scalex.__version__
[4]:
'0.2.0'
Seeing the batch effect
The pancreas data has been used in the Seurat v3 and Harmony paper.
On a unix system, you can uncomment and run the following to download the count matrix in its anndata format.
[5]:
# ! wget http://zhanglab.net/scalex-tutorial/pancreas.h5ad
# ! wget http://zhanglab.net/scalex-tutorial/pancreas_query.h5ad
[6]:
adata_raw=sc.read('pancreas.h5ad')
adata_raw
[6]:
AnnData object with n_obs × n_vars = 16401 × 14895
obs: 'batch', 'celltype', 'disease', 'donor', 'library', 'protocol'
Inspect the batches contained in the dataset.
[7]:
adata_raw.obs.batch.value_counts()
[7]:
pancreas_indrop3 3605
pancreas_celseq2 3072
pancreas_smartseq2 2394
pancreas_indrop1 1937
pancreas_celseq 1728
pancreas_indrop2 1724
pancreas_indrop4 1303
pancreas_fluidigmc1 638
Name: batch, dtype: int64
The data processing procedure is according to the scanpy tutorial [Preprocessing and clustering 3k PBMCs].
[8]:
sc.pp.filter_cells(adata_raw, min_genes=600)
sc.pp.filter_genes(adata_raw, min_cells=3)
adata_raw = adata_raw[:, [gene for gene in adata_raw.var_names if not str(gene).startswith(tuple(['ERCC', 'MT-', 'mt-']))]]
sc.pp.normalize_total(adata_raw, target_sum=1e4)
sc.pp.log1p(adata_raw)
sc.pp.highly_variable_genes(adata_raw, min_mean=0.0125, max_mean=3, min_disp=0.5)
adata_raw.raw = adata_raw
adata_raw = adata_raw[:, adata_raw.var.highly_variable]
sc.pp.scale(adata_raw, max_value=10)
sc.pp.pca(adata_raw)
sc.pp.neighbors(adata_raw)
sc.tl.umap(adata_raw)
filtered out 1124 cells that have less than 600 genes expressed
normalizing counts per cell
finished (0:00:00)
extracting highly variable genes
finished (0:00:01)
--> added
'highly_variable', boolean vector (adata.var)
'means', float vector (adata.var)
'dispersions', float vector (adata.var)
'dispersions_norm', float vector (adata.var)
... as `zero_center=True`, sparse input is densified and may lead to large memory consumption
computing PCA
on highly variable genes
with n_comps=50
finished (0:00:03)
computing neighbors
using 'X_pca' with n_pcs = 50
finished: added to `.uns['neighbors']`
`.obsp['distances']`, distances for each pair of neighbors
`.obsp['connectivities']`, weighted adjacency matrix (0:00:15)
computing UMAP
finished: added
'X_umap', UMAP coordinates (adata.obsm) (0:00:09)
We observe a batch effect.
[9]:
sc.pl.umap(adata_raw,color=['celltype'],legend_fontsize=10)

[10]:
sc.pl.umap(adata_raw,color=['batch'],legend_fontsize=10)

[11]:
adata_raw
[11]:
AnnData object with n_obs × n_vars = 15277 × 2086
obs: 'batch', 'celltype', 'disease', 'donor', 'library', 'protocol', 'n_genes'
var: 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
uns: 'log1p', 'hvg', 'pca', 'neighbors', 'umap', 'celltype_colors', 'batch_colors'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
obsp: 'distances', 'connectivities'
Integrating data using SCALEX
The batch effects can be well-resolved using SCALEX.
Note
Here we use GPU to speed up the calculation process, however, you can get the same level of performance only using cpu.
[12]:
adata_ref=SCALEX('pancreas.h5ad',batch_name='batch',min_features=600, min_cells=3, outdir='pancreas_output/',show=False,gpu=7)
2021-03-30 20:21:06,757 - root - INFO - Raw dataset shape: (16401, 14895)
2021-03-30 20:21:06,759 - root - INFO - Preprocessing
2021-03-30 20:21:06,785 - root - INFO - Filtering cells
filtered out 1124 cells that have less than 600 genes expressed
Trying to set attribute `.obs` of view, copying.
2021-03-30 20:21:09,107 - root - INFO - Filtering features
2021-03-30 20:21:10,607 - root - INFO - Normalizing total per cell
normalizing counts per cell
finished (0:00:00)
2021-03-30 20:21:10,850 - root - INFO - Log1p transforming
2021-03-30 20:21:11,759 - root - INFO - Finding variable features
If you pass `n_top_genes`, all cutoffs are ignored.
extracting highly variable genes
finished (0:00:04)
--> added
'highly_variable', boolean vector (adata.var)
'means', float vector (adata.var)
'dispersions', float vector (adata.var)
'dispersions_norm', float vector (adata.var)
2021-03-30 20:21:17,003 - root - INFO - Batch specific maxabs scaling
2021-03-30 20:21:19,528 - root - INFO - Processed dataset shape: (15277, 2000)
2021-03-30 20:21:19,566 - root - INFO - model
VAE(
(encoder): Encoder(
(enc): NN(
(net): ModuleList(
(0): Block(
(fc): Linear(in_features=2000, out_features=1024, bias=True)
(norm): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act): ReLU()
)
)
)
(mu_enc): NN(
(net): ModuleList(
(0): Block(
(fc): Linear(in_features=1024, out_features=10, bias=True)
)
)
)
(var_enc): NN(
(net): ModuleList(
(0): Block(
(fc): Linear(in_features=1024, out_features=10, bias=True)
)
)
)
)
(decoder): NN(
(net): ModuleList(
(0): Block(
(fc): Linear(in_features=10, out_features=2000, bias=True)
(norm): DSBatchNorm(
(bns): ModuleList(
(0): BatchNorm1d(2000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(1): BatchNorm1d(2000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): BatchNorm1d(2000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): BatchNorm1d(2000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(4): BatchNorm1d(2000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): BatchNorm1d(2000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(6): BatchNorm1d(2000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(7): BatchNorm1d(2000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(act): Sigmoid()
)
)
)
)
Epochs: 100%|██████████| 127/127 [09:11<00:00, 4.34s/it, recon_loss=266.572,kl_loss=4.667]
2021-03-30 20:30:38,361 - root - INFO - Output dir: pancreas_output//
2021-03-30 20:30:47,018 - root - INFO - Plot umap
computing neighbors
finished: added to `.uns['neighbors']`
`.obsp['distances']`, distances for each pair of neighbors
`.obsp['connectivities']`, weighted adjacency matrix (0:00:03)
computing UMAP
finished: added
'X_umap', UMAP coordinates (adata.obsm) (0:00:10)
running Leiden clustering
finished: found 13 clusters and added
'leiden', the cluster labels (adata.obs, categorical) (0:00:02)
WARNING: saving figure to file pancreas_output/umap.pdf
[13]:
adata_ref
[13]:
AnnData object with n_obs × n_vars = 15277 × 2000
obs: 'batch', 'celltype', 'disease', 'donor', 'library', 'protocol', 'n_genes', 'leiden'
var: 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'
uns: 'log1p', 'hvg', 'neighbors', 'umap', 'leiden', 'batch_colors', 'celltype_colors', 'leiden_colors'
obsm: 'latent', 'X_umap'
obsp: 'distances', 'connectivities'
While there seems to be some strong batch-effect in all cell types, SCALEX can integrate them homogeneously.
[14]:
sc.settings.set_figure_params(dpi=80, facecolor='white',figsize=(3,3),frameon=True)
[15]:
sc.pl.umap(adata_ref,color=['celltype'],legend_fontsize=10)

[16]:
sc.pl.umap(adata_ref,color=['batch'],legend_fontsize=10)

The integrated data is stored as adata.h5ad
in the output directory assigned by outdir
parameter in SCALEX function.
Mapping onto a reference batch using projection function
The pancreas query data are available through the Gene Expression Omnibus under accession GSE114297, GSE81547 and GSE83139.
[17]:
adata_query=sc.read('pancreas_query.h5ad')
adata_query
[17]:
AnnData object with n_obs × n_vars = 23963 × 31884
obs: 'batch', 'celltype', 'disease', 'donor', 'protocol'
Inspect the batches contained in adata_query
.
[18]:
adata_query.obs.batch.value_counts()
[18]:
pancreas_gse114297 20784
pancreas_gse81547 2544
pancreas_gse83139 635
Name: batch, dtype: int64
SCALEX provides a projection function for mapping new data adata_query
onto the reference batch adata_ref
.
[19]:
adata=SCALEX('pancreas_query.h5ad',batch_name='batch',min_features=600, min_cells=3,
outdir='pancreas_projection/', projection='pancreas_output/',show=False,gpu=7)
2021-03-30 20:31:47,177 - root - INFO - Raw dataset shape: (23963, 31884)
2021-03-30 20:31:47,177 - root - INFO - Raw dataset shape: (23963, 31884)
2021-03-30 20:31:47,180 - root - INFO - Preprocessing
2021-03-30 20:31:47,180 - root - INFO - Preprocessing
2021-03-30 20:31:47,236 - root - INFO - Filtering cells
2021-03-30 20:31:47,236 - root - INFO - Filtering cells
filtered out 219 cells that have less than 600 genes expressed
Trying to set attribute `.obs` of view, copying.
2021-03-30 20:31:49,744 - root - INFO - Filtering features
2021-03-30 20:31:49,744 - root - INFO - Filtering features
2021-03-30 20:31:51,094 - root - INFO - Normalizing total per cell
2021-03-30 20:31:51,094 - root - INFO - Normalizing total per cell
normalizing counts per cell
finished (0:00:00)
2021-03-30 20:31:51,430 - root - INFO - Log1p transforming
2021-03-30 20:31:51,430 - root - INFO - Log1p transforming
2021-03-30 20:31:52,607 - root - INFO - Finding variable features
2021-03-30 20:31:52,607 - root - INFO - Finding variable features
There are 2000 gene in selected genes
2021-03-30 20:31:56,488 - root - INFO - Batch specific maxabs scaling
2021-03-30 20:31:56,488 - root - INFO - Batch specific maxabs scaling
2021-03-30 20:31:59,442 - root - INFO - Processed dataset shape: (23744, 2000)
2021-03-30 20:31:59,442 - root - INFO - Processed dataset shape: (23744, 2000)
2021-03-30 20:32:04,207 - root - INFO - Output dir: pancreas_projection//
2021-03-30 20:32:04,207 - root - INFO - Output dir: pancreas_projection//
... storing 'batch' as categorical
... storing 'celltype' as categorical
... storing 'disease' as categorical
... storing 'donor' as categorical
... storing 'library' as categorical
... storing 'protocol' as categorical
... storing 'leiden' as categorical
2021-03-30 20:32:08,411 - root - INFO - Plot umap
2021-03-30 20:32:08,411 - root - INFO - Plot umap
computing neighbors
finished: added to `.uns['neighbors']`
`.obsp['distances']`, distances for each pair of neighbors
`.obsp['connectivities']`, weighted adjacency matrix (0:00:08)
computing UMAP
finished: added
'X_umap', UMAP coordinates (adata.obsm) (0:00:26)
running Leiden clustering
finished: found 13 clusters and added
'leiden', the cluster labels (adata.obs, categorical) (0:00:10)
WARNING: saving figure to file pancreas_projection/X_umap_reference.pdf
WARNING: saving figure to file pancreas_projection/X_umap_query.pdf
Load integrated data adata
that contained adata_ref
and adata_query
.
[20]:
sc.settings.set_figure_params(dpi=80, facecolor='white',figsize=(3,3),frameon=True)
[21]:
embedding(adata,groupby='projection')


[22]:
adata
[22]:
AnnData object with n_obs × n_vars = 39021 × 2000
obs: 'batch', 'celltype', 'disease', 'donor', 'library', 'protocol', 'n_genes', 'leiden', 'projection'
var: 'n_cells-reference', 'highly_variable-reference', 'means-reference', 'dispersions-reference', 'dispersions_norm-reference', 'highly_variable_nbatches-reference', 'highly_variable_intersection-reference'
uns: 'neighbors', 'umap', 'leiden'
obsm: 'latent', 'X_umap'
obsp: 'distances', 'connectivities'
Inspect the batches contained in adata
.
[23]:
adata.obs.batch.value_counts()
[23]:
pancreas_gse114297 20573
pancreas_indrop3 3605
pancreas_gse81547 2536
pancreas_celseq2 2440
pancreas_smartseq2 2394
pancreas_indrop1 1937
pancreas_indrop2 1724
pancreas_indrop4 1303
pancreas_celseq 1236
pancreas_fluidigmc1 638
pancreas_gse83139 635
Name: batch, dtype: int64
Visualizing distributions across batches
[24]:
sc.pl.umap(adata[adata.obs.batch.isin(['pancreas_gse114297','pancreas_gse81547','pancreas_gse83139'])],color='batch',legend_fontsize=10)
embedding(adata[adata.obs.batch.isin(['pancreas_gse114297','pancreas_gse81547','pancreas_gse83139'])],legend_fontsize=10)
Trying to set attribute `.uns` of view, copying.

Trying to set attribute `.obs` of view, copying.



The projection results is stored as adata.h5ad
in the output directory assigned by outdir
parameter in SCALE function.
Label transfer
We can also use SCALEX to transfer data from one dataset to another. Here, we demonstrate data transfer between two scRNA-seq datasets by transferring the cell type label from the adata_ref
and the adata_query
.
[25]:
adata_query=adata[adata.obs.projection=='query']
adata_query
[25]:
View of AnnData object with n_obs × n_vars = 23744 × 2000
obs: 'batch', 'celltype', 'disease', 'donor', 'library', 'protocol', 'n_genes', 'leiden', 'projection'
var: 'n_cells-reference', 'highly_variable-reference', 'means-reference', 'dispersions-reference', 'dispersions_norm-reference', 'highly_variable_nbatches-reference', 'highly_variable_intersection-reference'
uns: 'neighbors', 'umap', 'leiden'
obsm: 'latent', 'X_umap'
obsp: 'distances', 'connectivities'
[26]:
adata_ref=adata[adata.obs.projection=='reference']
adata_ref
[26]:
View of AnnData object with n_obs × n_vars = 15277 × 2000
obs: 'batch', 'celltype', 'disease', 'donor', 'library', 'protocol', 'n_genes', 'leiden', 'projection'
var: 'n_cells-reference', 'highly_variable-reference', 'means-reference', 'dispersions-reference', 'dispersions_norm-reference', 'highly_variable_nbatches-reference', 'highly_variable_intersection-reference'
uns: 'neighbors', 'umap', 'leiden'
obsm: 'latent', 'X_umap'
obsp: 'distances', 'connectivities'
[27]:
adata_query.obs['celltype_transfer']=label_transfer(adata_ref, adata_query, rep='latent', label='celltype')
Trying to set attribute `.obs` of view, copying.
[28]:
sc.pl.umap(adata_query,color=['celltype_transfer'])
... storing 'celltype_transfer' as categorical

[29]:
sc.pl.umap(adata_query,color=['celltype'])

Let us first focus on cell types that are conserved with the reference.
[30]:
obs_query = adata_query.obs
conserved_categories = obs_query.celltype.cat.categories.intersection(obs_query.celltype_transfer.cat.categories) # intersected categories
obs_query_conserved = obs_query.loc[obs_query.celltype.isin(conserved_categories) & obs_query.celltype_transfer.isin(conserved_categories)] # intersect categories
obs_query_conserved.celltype.cat.remove_unused_categories(inplace=True) # remove unused categoriyes
obs_query_conserved.celltype_transfer.cat.remove_unused_categories(inplace=True) # remove unused categoriyes
obs_query_conserved.celltype_transfer.cat.reorder_categories(obs_query_conserved.celltype.cat.categories, inplace=True) # fix category ordering
pd.crosstab(obs_query_conserved.celltype, obs_query_conserved.celltype_transfer)
[30]:
celltype_transfer | acinar | activated_stellate | alpha | alpha_er | beta | beta_er | delta | ductal | endothelial | epsilon | gamma | macrophage | mast | quiescent_stellate | schwann |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
celltype | |||||||||||||||
acinar | 1346 | 0 | 3 | 0 | 3 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
activated_stellate | 0 | 1033 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 119 | 1 |
alpha | 2 | 0 | 7527 | 83 | 13 | 0 | 3 | 0 | 0 | 0 | 11 | 0 | 1 | 0 | 0 |
alpha_er | 0 | 0 | 159 | 105 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
beta | 3 | 0 | 119 | 4 | 8683 | 34 | 17 | 1 | 0 | 0 | 3 | 0 | 0 | 0 | 0 |
beta_er | 0 | 0 | 0 | 0 | 41 | 56 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
delta | 1 | 0 | 8 | 0 | 34 | 0 | 1099 | 0 | 0 | 1 | 6 | 0 | 0 | 0 | 0 |
ductal | 2 | 0 | 3 | 0 | 2 | 0 | 3 | 1889 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
endothelial | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 456 | 0 | 0 | 0 | 0 | 1 | 0 |
epsilon | 1 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 10 | 3 | 0 | 0 | 0 | 0 |
gamma | 0 | 0 | 7 | 2 | 3 | 0 | 1 | 0 | 0 | 2 | 578 | 0 | 0 | 0 | 0 |
macrophage | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 58 | 14 | 0 | 0 |
mast | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 6 | 0 | 0 |
quiescent_stellate | 0 | 53 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 68 | 1 |
schwann | 1 | 0 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 34 |
Let us now move on to look at all cell types.
[31]:
pd.crosstab(adata_query.obs.celltype, adata_query.obs.celltype_transfer)
[31]:
celltype_transfer | acinar | activated_stellate | alpha | alpha_er | beta | beta_er | delta | ductal | endothelial | epithelial | epsilon | gamma | macrophage | mast | quiescent_stellate | schwann |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
celltype | ||||||||||||||||
acinar | 1346 | 0 | 3 | 0 | 3 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
activated_stellate | 0 | 1033 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 119 | 1 |
alpha | 2 | 0 | 7527 | 83 | 13 | 0 | 3 | 0 | 0 | 1 | 0 | 11 | 0 | 1 | 0 | 0 |
alpha_er | 0 | 0 | 159 | 105 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
beta | 3 | 0 | 119 | 4 | 8683 | 34 | 17 | 1 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 |
beta_er | 0 | 0 | 0 | 0 | 41 | 56 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
delta | 1 | 0 | 8 | 0 | 34 | 0 | 1099 | 0 | 0 | 0 | 1 | 6 | 0 | 0 | 0 | 0 |
ductal | 2 | 0 | 3 | 0 | 2 | 0 | 3 | 1889 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
endothelial | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 456 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
epsilon | 1 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 10 | 3 | 0 | 0 | 0 | 0 |
gamma | 0 | 0 | 7 | 2 | 3 | 0 | 1 | 0 | 0 | 0 | 2 | 578 | 0 | 0 | 0 | 0 |
macrophage | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 58 | 14 | 0 | 0 |
mast | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 3 | 6 | 0 | 0 |
quiescent_stellate | 0 | 53 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 68 | 1 |
schwann | 1 | 0 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 34 |
Integrating scATAC-seq data using SCALEX
The following tutorial demonstrates how to use SCALEX for integrating scATAC-seq data.
There are two parts of this tutorial:
Seeing the batch effect. This part will show the batch effects of two adult mouse brain datasets from single nucleus ATAC-seq (snATAC) and droplet-based platform (Mouse Brain 10X) that used in SCALEX manuscript.
Integrating data using SCALEX. This part will show you how to perform batch correction using SCALEX function in SCALEX.
[1]:
import scalex
from scalex.function import SCALEX
from scalex.plot import embedding
import scanpy as sc
import pandas as pd
import numpy as np
import matplotlib
from matplotlib import pyplot as plt
import seaborn as sns
import episcanpy as epi
[2]:
sc.settings.verbosity = 3
sc.settings.set_figure_params(dpi=80, facecolor='white',figsize=(3,3),frameon=True)
sc.logging.print_header()
plt.rcParams['axes.unicode_minus']=False
scanpy==1.6.1 anndata==0.7.5 umap==0.4.6 numpy==1.20.1 scipy==1.6.1 pandas==1.1.3 scikit-learn==0.23.2 statsmodels==0.12.0 python-igraph==0.8.3 louvain==0.7.0 leidenalg==0.8.3
[3]:
sns.__version__
[3]:
'0.10.1'
[4]:
scalex.__version__
[4]:
'0.2.0'
Seeing the batch effect
The following data has been used in the SnapATAC paper, has been used here, and can be downloaded from here.
On a unix system, you can uncomment and run the following to download the count matrix in its anndata format.
[5]:
# ! wget http://zhanglab.net/scalex-tutorial/mouse_brain_atac.h5ad
[6]:
adata_raw=sc.read('mouse_brain_atac.h5ad')
adata_raw
[6]:
AnnData object with n_obs × n_vars = 13746 × 479127
obs: 'batch'
Inspect the batches contained in the dataset.
[7]:
adata_raw.obs.batch.value_counts()
[7]:
snATAC 9646
10X 4100
Name: batch, dtype: int64
The data processing procedure is according to the epiScanpy tutorial [Buenrostro_PBMC_data_processing].
[8]:
epi.pp.filter_cells(adata_raw, min_features=1)
epi.pp.filter_features(adata_raw, min_cells=5)
adata_raw.raw = adata_raw
adata_raw = epi.pp.select_var_feature(adata_raw,
nb_features=30000,
show=False,copy=True)
adata_raw.layers['binary'] = adata_raw.X.copy()
epi.pp.normalize_total(adata_raw)
adata_raw.layers['normalised'] = adata_raw.X.copy()
epi.pp.log1p(adata_raw)
epi.pp.lazy(adata_raw)
epi.tl.leiden(adata_raw)
filtered out 11140 genes that are detected in less than 5 cells
normalizing counts per cell
finished (0:00:00)
computing PCA
with n_comps=50
finished (0:01:46)
computing neighbors
using 'X_pca' with n_pcs = 50
finished: added to `.uns['neighbors']`
`.obsp['distances']`, distances for each pair of neighbors
`.obsp['connectivities']`, weighted adjacency matrix (0:00:14)
computing tSNE
using 'X_pca' with n_pcs = 50
using the 'MulticoreTSNE' package by Ulyanov (2017)
finished: added
'X_tsne', tSNE coordinates (adata.obsm) (0:01:35)
computing UMAP
finished: added
'X_umap', UMAP coordinates (adata.obsm) (0:00:09)
running Leiden clustering
finished: found 25 clusters and added
'leiden', the cluster labels (adata.obs, categorical) (0:00:01)
We observe a batch effect.
[9]:
sc.pl.umap(adata_raw,color=['leiden'],legend_fontsize=10)

[10]:
sc.pl.umap(adata_raw,color=['batch'],legend_fontsize=10)

[11]:
adata_raw
[11]:
AnnData object with n_obs × n_vars = 13746 × 30076
obs: 'batch', 'nb_features', 'leiden'
var: 'n_cells', 'prop_shared_cells', 'variability_score'
uns: 'log1p', 'pca', 'neighbors', 'umap', 'leiden', 'leiden_colors', 'batch_colors'
obsm: 'X_pca', 'X_tsne', 'X_umap'
varm: 'PCs'
layers: 'binary', 'normalised'
obsp: 'distances', 'connectivities'
Integrating data using SCALEX
The batch effects can be well-resolved using SCALEX.
Note
Here we use GPU to speed up the calculation process, however, you can get the same level of performance only using cpu.
[12]:
adata=SCALEX('mouse_brain_atac.h5ad',batch_name='batch',profile='ATAC',
min_features=1, min_cells=5, n_top_features=30000,outdir='ATAC_output/',show=False,gpu=8)
2021-03-30 20:21:45,161 - root - INFO - Raw dataset shape: (13746, 479127)
2021-03-30 20:21:45,166 - root - INFO - Preprocessing
2021-03-30 20:21:45,639 - root - INFO - Filtering cells
2021-03-30 20:21:47,224 - root - INFO - Filtering features
filtered out 11140 genes that are detected in less than 5 cells
2021-03-30 20:21:49,461 - root - INFO - Finding variable features
2021-03-30 20:21:51,582 - root - INFO - Normalizing total per cell
normalizing counts per cell
finished (0:00:00)
2021-03-30 20:21:51,690 - root - INFO - Batch specific maxabs scaling
2021-03-30 20:22:16,926 - root - INFO - Processed dataset shape: (13746, 30076)
2021-03-30 20:22:17,234 - root - INFO - model
VAE(
(encoder): Encoder(
(enc): NN(
(net): ModuleList(
(0): Block(
(fc): Linear(in_features=30076, out_features=1024, bias=True)
(norm): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act): ReLU()
)
)
)
(mu_enc): NN(
(net): ModuleList(
(0): Block(
(fc): Linear(in_features=1024, out_features=10, bias=True)
)
)
)
(var_enc): NN(
(net): ModuleList(
(0): Block(
(fc): Linear(in_features=1024, out_features=10, bias=True)
)
)
)
)
(decoder): NN(
(net): ModuleList(
(0): Block(
(fc): Linear(in_features=10, out_features=30076, bias=True)
(norm): DSBatchNorm(
(bns): ModuleList(
(0): BatchNorm1d(30076, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(1): BatchNorm1d(30076, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(act): Sigmoid()
)
)
)
)
Epochs: 100%|██████████| 141/141 [10:53<00:00, 4.63s/it, recon_loss=992.783,kl_loss=3.643]
2021-03-30 20:33:17,866 - root - INFO - Output dir: ATAC_output//
2021-03-30 20:33:22,489 - root - INFO - Plot umap
computing neighbors
finished: added to `.uns['neighbors']`
`.obsp['distances']`, distances for each pair of neighbors
`.obsp['connectivities']`, weighted adjacency matrix (0:00:03)
computing UMAP
finished: added
'X_umap', UMAP coordinates (adata.obsm) (0:00:09)
running Leiden clustering
finished: found 16 clusters and added
'leiden', the cluster labels (adata.obs, categorical) (0:00:01)
WARNING: saving figure to file ATAC_output/umap.pdf
[13]:
adata
[13]:
AnnData object with n_obs × n_vars = 13746 × 30076
obs: 'batch', 'n_genes', 'leiden'
var: 'n_cells', 'prop_shared_cells', 'variability_score'
uns: 'neighbors', 'umap', 'leiden', 'batch_colors', 'leiden_colors'
obsm: 'latent', 'X_umap'
obsp: 'distances', 'connectivities'
While there seems to be some strong batch-effect in all cell types, SCALEX can integrate them homogeneously.
[14]:
sc.settings.set_figure_params(dpi=80, facecolor='white',figsize=(3,3),frameon=True)
[15]:
sc.pl.umap(adata,color=['leiden'],legend_fontsize=10)

[16]:
sc.pl.umap(adata,color=['batch'],legend_fontsize=10)

The integrated data is stored as adata.h5ad
in the output directory assigned by outdir
parameter in SCALE function.
Integration cross-modality data using SCALEX
The following tutorial demonstrates how to use SCALEX for integrating scRNA-seq and scATAC-seq data
There are mainly two steps:
Create a gene activity matrix from scATAC-seq data. This step follows the standard workflow of Signac for scATAC-seq data analysis. We uses the function GeneActivity of Signac and calculate the activity of each gene in the genome by assessing the chromatin accessibility associated with each gene, and create a new gene activity matrix derived from the scATAC-seq data. More details are here.
Integrate. We regard gene expression matrix and gene activity matrix as two batches of one dataset and use SCALEX for integration.
For this tutorial, we used a cross-modality PBMC data between scRNA-seq and scATAC-seq provided by 10X Genomics, and both scRNA-seq and scATAC-seq data are available through the 10x Genomics website.
Create a gene activity matrix (R)
[1]:
suppressPackageStartupMessages(library(Signac))
suppressPackageStartupMessages(library(Seurat))
suppressPackageStartupMessages(library(GenomeInfoDb))
suppressPackageStartupMessages(library(EnsDb.Hsapiens.v75))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(patchwork))
set.seed(1234)
options(warn=-1)
Pre-processing
We follow the pre-processing workflow of Signac when pre-processing chromatin data. First we creat a Seurat object by two related input files: cell matrix and fragment file.
[2]:
counts <- Read10X_h5(filename = "atac_v1_pbmc_10k_filtered_peak_bc_matrix.h5")
metadata <- read.csv(
file = "atac_v1_pbmc_10k_singlecell.csv",
header = TRUE,
row.names = 1
)
chrom_assay <- CreateChromatinAssay(
counts = counts,
sep = c(":", "-"),
genome = 'hg19',
fragments = 'atac_v1_pbmc_10k_fragments.tsv.gz',
min.cells = 10,
min.features = 200
)
pbmc <- CreateSeuratObject(
counts = chrom_assay,
assay = "peaks",
meta.data = metadata
)
Computing hash
Now add gene annotations to the pbmc object for the human genome. genome is suggested to be as same as the genome for scRNA-seq reads mapping on.
[4]:
# extract gene annotations from EnsDb
annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v75)
# change to UCSC style since the data was mapped to hg19
seqlevelsStyle(annotations) <- 'UCSC'
genome(annotations) <- "hg19"
# add the gene information to the object
Annotation(pbmc) <- annotations
Compute QC Metrics. If you don’t need to filter cells, ignore this step
[13]:
# compute nucleosome signal score per cell
pbmc <- NucleosomeSignal(object = pbmc)
# compute TSS enrichment score per cell
pbmc <- TSSEnrichment(object = pbmc, fast = FALSE)
# add blacklist ratio and fraction of reads in peaks
pbmc$pct_reads_in_peaks <- pbmc$peak_region_fragments / pbmc$passed_filters * 100
pbmc$blacklist_ratio <- pbmc$blacklist_region_fragments / pbmc$peak_region_fragments
Extracting TSS positions
Finding + strand cut sites
Finding - strand cut sites
Computing mean insertion frequency in flanking regions
Normalizing TSS score
[14]:
pbmc$high.tss <- ifelse(pbmc$TSS.enrichment > 2, 'High', 'Low')
TSSPlot(pbmc, group.by = 'high.tss') + NoLegend()

[15]:
pbmc$nucleosome_group <- ifelse(pbmc$nucleosome_signal > 4, 'NS > 4', 'NS < 4')
FragmentHistogram(object = pbmc, group.by = 'nucleosome_group')

[16]:
VlnPlot(
object = pbmc,
features = c('pct_reads_in_peaks', 'peak_region_fragments',
'TSS.enrichment', 'blacklist_ratio', 'nucleosome_signal'),
pt.size = 0.1,
ncol = 5
)

Filter cells that are outliers for the QC metrics.
[17]:
pbmc <- subset(
x = pbmc,
subset = peak_region_fragments > 3000 &
peak_region_fragments < 20000 &
pct_reads_in_peaks > 15 &
blacklist_ratio < 0.05 &
nucleosome_signal < 4 &
TSS.enrichment > 2
)
pbmc
An object of class Seurat
87561 features across 7060 samples within 1 assay
Active assay: peaks (87561 features, 0 variable features)
Calculate gene activity matrix by GeneActivity() function
[5]:
gene.activities <- GeneActivity(pbmc)
Extracting gene coordinates
Extracting reads overlapping genomic regions
save results
[ ]:
wk_dir <- './'
[2]:
write.table(t(gene.activities), paste(wk_dir,"gene_activity_score.txt",sep=''), sep='\t',quote = F)
Integrate (python)
Now you can use the gene_activity_score together with the gene expression matrix for integration.
You can run SCALE command line directly: SCALEX.py –data_list data1 data2 –batch_categories RNA ATAC -o output_path
data1: path or file of scRNA-seq data
data2: file of gene_activity_score
and here we show the results before integration and after integration.
[3]:
import scalex
from scalex import SCALEX
from scalex.plot import embedding
import scanpy as sc
import pandas as pd
import numpy as np
import matplotlib
from matplotlib import pyplot as plt
import seaborn as sns
[19]:
sc.settings.verbosity = 3
sc.settings.set_figure_params(dpi=80, facecolor='white',figsize=(3,3),frameon=True)
sc.logging.print_header()
plt.rcParams['axes.unicode_minus']=False
scanpy==1.7.0 anndata==0.7.5 umap==0.5.0 numpy==1.19.2 scipy==1.5.2 pandas==1.1.3 scikit-learn==0.23.2 statsmodels==0.12.0 python-igraph==0.9.0 leidenalg==0.8.3
[3]:
scalex.__version__
[3]:
'2.0.3.dev'
First we merge the RNA and ATAC data and add metadata information, and this processed data is available here
[ ]:
wk_dir = './'
[10]:
adata = sc.read_h5ad(wk_dir+'pbmc_RNA-ATAC.h5ad')
[11]:
adata
[11]:
AnnData object with n_obs × n_vars = 16492 × 13928
obs: 'celltype', 'tech', 'batch'
[12]:
sc.pp.filter_cells(adata, min_genes=0)
sc.pp.filter_genes(adata, min_cells=0)
sc.pp.normalize_total(adata, inplace=True)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
adata = adata[:, adata.var.highly_variable]
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata)
sc.pp.neighbors(adata, n_pcs=30, n_neighbors=30)
sc.tl.umap(adata, min_dist=0.1)
normalizing counts per cell
finished (0:00:00)
If you pass `n_top_genes`, all cutoffs are ignored.
extracting highly variable genes
finished (0:00:01)
--> added
'highly_variable', boolean vector (adata.var)
'means', float vector (adata.var)
'dispersions', float vector (adata.var)
'dispersions_norm', float vector (adata.var)
... as `zero_center=True`, sparse input is densified and may lead to large memory consumption
computing PCA
on highly variable genes
with n_comps=50
finished (0:00:02)
computing neighbors
using 'X_pca' with n_pcs = 30
finished: added to `.uns['neighbors']`
`.obsp['distances']`, distances for each pair of neighbors
`.obsp['connectivities']`, weighted adjacency matrix (0:00:14)
computing UMAP
finished: added
'X_umap', UMAP coordinates (adata.obsm) (0:00:29)
[15]:
sc.pl.umap(adata,color=['batch','celltype'],legend_fontsize=10, ncols=2)

[16]:
wk_dir='./' # wk_dir is your local path to store data and results
[17]:
adata = SCALEX(data_list = [wk_dir+'RNA-ATAC.h5ad'],
min_features=0,
min_cells=0,
outdir=wk_dir+'/pbmc_RNA_ATAC/',
show=False,
gpu=7)
2021-03-26 11:46:17,234 - root - INFO - Raw dataset shape: (16492, 13928)
2021-03-26 11:46:17,237 - root - INFO - Preprocessing
2021-03-26 11:46:17,260 - root - INFO - Filtering cells
Trying to set attribute `.obs` of view, copying.
2021-03-26 11:46:21,627 - root - INFO - Filtering features
2021-03-26 11:46:24,745 - root - INFO - Normalizing total per cell
normalizing counts per cell
finished (0:00:00)
2021-03-26 11:46:25,022 - root - INFO - Log1p transforming
2021-03-26 11:46:26,014 - root - INFO - Finding variable features
If you pass `n_top_genes`, all cutoffs are ignored.
extracting highly variable genes
finished (0:00:03)
--> added
'highly_variable', boolean vector (adata.var)
'means', float vector (adata.var)
'dispersions', float vector (adata.var)
'dispersions_norm', float vector (adata.var)
2021-03-26 11:46:30,637 - root - INFO - Batch specific maxabs scaling
2021-03-26 11:46:32,857 - root - INFO - Processed dataset shape: (16492, 2000)
2021-03-26 11:46:32,908 - root - INFO - model
VAE(
(encoder): Encoder(
(enc): NN(
(net): ModuleList(
(0): Block(
(fc): Linear(in_features=2000, out_features=1024, bias=True)
(norm): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act): ReLU()
)
)
)
(mu_enc): NN(
(net): ModuleList(
(0): Block(
(fc): Linear(in_features=1024, out_features=10, bias=True)
)
)
)
(var_enc): NN(
(net): ModuleList(
(0): Block(
(fc): Linear(in_features=1024, out_features=10, bias=True)
)
)
)
)
(decoder): NN(
(net): ModuleList(
(0): Block(
(fc): Linear(in_features=10, out_features=2000, bias=True)
(norm): DSBatchNorm(
(bns): ModuleList(
(0): BatchNorm1d(2000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(1): BatchNorm1d(2000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(act): Sigmoid()
)
)
)
)
Epochs: 100%|██████████| 117/117 [06:54<00:00, 3.54s/it, recon_loss=274.876,kl_loss=2.892]
2021-03-26 11:53:35,672 - root - INFO - Output dir: .//pbmc_RNA_ATAC//
2021-03-26 11:53:45,553 - root - INFO - Plot umap
computing neighbors
finished: added to `.uns['neighbors']`
`.obsp['distances']`, distances for each pair of neighbors
`.obsp['connectivities']`, weighted adjacency matrix (0:00:03)
computing UMAP
finished: added
'X_umap', UMAP coordinates (adata.obsm) (0:00:26)
running Leiden clustering
finished: found 14 clusters and added
'leiden', the cluster labels (adata.obs, categorical) (0:00:05)
WARNING: saving figure to file pbmc_RNA_ATAC/umap.pdf
[20]:
sc.pl.umap(adata,color=['batch','celltype'],legend_fontsize=10, ncols=2)

[ ]:
Installation
PyPI install
Pull SCALE from PyPI (consider using pip3
to access Python 3):
pip install scalex
Pytorch
If you have cuda devices, consider install Pytorch cuda version.:
conda install pytorch torchvision torchaudio -c pytorch
Troubleshooting
Anaconda
If you do not have a working installation of Python 3.6 (or later), consider installing Miniconda (see Installing Miniconda).
Installing Miniconda
After downloading Miniconda, in a unix shell (Linux, Mac), run
cd DOWNLOAD_DIR
chmod +x Miniconda3-latest-VERSION.sh
./Miniconda3-latest-VERSION.sh
and accept all suggestions. Either reopen a new terminal or source ~/.bashrc on Linux/ source ~/.bash_profile on Mac. The whole process takes just a couple of minutes.
Usage
SCALEX provide both commanline tool and api function used in jupyter notebook
Command line
Run SCALEX after installation:
SCALEX.py --data_list data1 data2 --batch_categories batch_name1 batch_name2
data_list
: data path of each batch of single-cell dataset
batch_categories
: name of each batch, batch_categories will range from 0 to N if not specified
Input
Input can be one of following:
single file of format h5ad, csv, txt, mtx or their compression file
multiple files of above format
Note
h5ad file input
SCALEX will use the
batch
column in the obs of adata format read from h5ad file as batch informationUsers can specify any columns in the obs with option:
--batch_name
nameIf multiple inputs are given, SCALEX can take each file as individual batch by default, and overload previous batch information, users can change the concat name via option
--batch_key
other_name
Output
Output will be saved in the output folder including:
checkpoint: saved model to reproduce results cooperated with option
--checkpoint
or -cadata.h5ad: preprocessed data and results including, latent, clustering and imputation
umap.png: UMAP visualization of latent representations of cells
log.txt: log file of training process
Useful options
output folder for saveing results: [-o] or [–outdir]
filter rare genes, default 3: [–min_cell]
filter low quality cells, default 600: [–min_gene]
select the number of highly variable genes, keep all genes with -1, default 2000: [–n_top_genes]
Help
Look for more usage of SCALEX:
SCALEX.py --help
API function
Use SCALEX in jupyter notebook:
from scalex.function import SCALEX
adata = SCALEX(data_list, batch_categories)
- or
adata = SCALEX([adata_1, adata_2])
Function of parameters are similar to command line options. Input can be the files of adata or a list of AnnData or one concatenated AnnData Output is a Anndata object for further analysis with scanpy.
AnnData
SCALEX supports scanpy
and anndata
, which provides the AnnData
class.
At the most basic level, an AnnData
object adata stores
a data matrix adata.X, annotation of observations
adata.obs and variables adata.var as pd.DataFrame and unstructured
annotation adata.uns as dict. Names of observations and
variables can be accessed via adata.obs_names and adata.var_names,
respectively. AnnData
objects can be sliced like
dataframes, for example, adata_subset = adata[:, list_of_gene_names].
For more, see this blog post.
To read a data file to an AnnData
object, call:
import scanpy as sc
adata = sc.read(filename)
to initialize an AnnData
object. Possibly add further annotation using, e.g., pd.read_csv:
import pandas as pd
anno = pd.read_csv(filename_sample_annotation)
adata.obs['cell_groups'] = anno['cell_groups'] # categorical annotation of type pandas.Categorical
adata.obs['time'] = anno['time'] # numerical annotation of type float
# alternatively, you could also set the whole dataframe
# adata.obs = anno
To write, use:
adata.write(filename)
adata.write_csvs(filename)
adata.write_loom(filename)
API
Import SCALEX:
import scalex
Function
|
Online single-cell data integration through projecting heterogeneous datasets into a common cell-embedding space |
|
Label transfer |
Data
Load data
|
Load dataset with preprocessing |
|
Concatenate multiple datasets along the observations axis with name |
|
Load single cell dataset from files |
|
Load single cell dataset from file |
|
Read mtx format data folder including: |
Preprocessing
|
Preprocessing single-cell data |
|
Preprocessing single-cell RNA-seq data |
|
Preprocessing single-cell ATAC-seq |
|
Batch-specific scale data |
|
Reindex AnnData with gene list |
DataLoader
|
Dataloader of single-cell data |
|
Batch-specific Sampler sampled data of each batch is from the same dataset. |
Net
Model
|
VAE framework |
Layer
|
Domain-specific Batch Normalization |
|
Basic block consist of: |
|
Neural network consist of multi Blocks |
|
VAE Encoder |
Loss
|
|
|
Utils
|
Make the input tensor one hot tensors |
|
Early stops the training if loss doesn't improve after a given patience. |
Plot
|
plot separated embeddings with others as background |
|
Plot meta correlations among batches |
|
Plot meta correlations between two batches |
|
Plot confusion matrix |
Metric
Collections of useful measurements for evaluating results.
|
Calculate batch entropy mixing score |
|
Compute the mean Silhouette Coefficient of all samples. |
Logger
|
News
SCALEX is online on Nature Communications 2022-10-17 SCALEX is available on bioRxiv 2021-04-09
Release
Version 1.0
1.0.0 2022-08-29
Online single-cell data integration through projecting heterogeneous datasets into a common cell-embedding space
News
SCALEX is online on Nature Communications 2022-10-17 SCALEX is available on bioRxiv 2021-04-09