scalex.SCALEX(data_list=None, batch_categories=None, profile='RNA', batch_name='batch', min_features=600, min_cells=3, target_sum=None, n_top_features=None, join='inner', batch_key='batch', processed=False, fraction=None, n_obs=None, use_layer='X', backed=False, batch_size=64, lr=0.0002, max_iteration=30000, seed=124, gpu=0, outdir=None, projection=None, repeat=False, impute=None, chunk_size=20000, ignore_umap=False, verbose=False, assess=False, show=True, eval=False, num_workers=4)

Online single-cell data integration by projecting heterogeneous datasets into a common cell-embedding space.

  • data_list (Union[str, AnnData, List, None]) – A path, an AnnData matrix, or a list of these to concatenate. Each matrix is referred to as a ‘batch’.

  • batch_categories (Optional[List]) – Categories for the batch annotation. By default, use increasing numbers.

  • profile (str) – Single-cell profile type, ‘RNA’ or ‘ATAC’. Default: ‘RNA’.

  • batch_name (str) – Use this annotation in obs as the batches for training the model. Default: ‘batch’.

  • min_features (int) – Filter out cells with fewer than min_features detected features. Default: 600.

  • min_cells (int) – Filter out genes detected in fewer than min_cells cells. Default: 3.

  • n_top_features (Optional[int]) – Number of highly variable genes to keep. Documented default: 2000 (the signature default is None).

  • join (str) – Use intersection (‘inner’) or union (‘outer’) of variables of different batches.

  • batch_key (str) – Add the batch annotation to obs using this key. By default, batch_key=’batch’.

  • batch_size (int) – Number of samples per training mini-batch. Default: 64.

  • lr (float) – Learning rate. Default: 2e-4.

  • max_iteration (int) – Maximum number of training iterations. One iteration trains on one mini-batch of batch_size samples. Default: 30000.

  • seed (int) – Random seed for torch and numpy. Default: 124.

  • gpu (int) – Index of GPU to use if GPU is available. Default: 0.

  • outdir (Optional[str]) – Output directory. Documented default: ‘output/’ (the signature default is None).

  • projection (Optional[str]) – Path to the folder containing a pre-trained model, used to project new datasets. If None, no projection is performed. Default: None.

  • repeat (bool) – Use with projection. If False, concatenate the reference and projection datasets for downstream analysis. If True, only use projection datasets. Default: False.

  • impute (Optional[str]) – If set, calculate the imputed gene expression and store it in adata.layers[‘impute’]. Documented default: False (the signature default is None).

  • chunk_size (int) – Number of samples from the same batch to transform. Default: 20000.

  • ignore_umap (bool) – If True, skip UMAP visualization and Leiden clustering. Default: False.

  • verbose (bool) – Verbosity, True or False. Default: False.

  • assess (bool) – If True, calculate the entropy_batch_mixing score and silhouette score to evaluate integration results. Default: False.
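
A minimal usage sketch, assuming the scalex package is installed and that the input files exist. The helper names (build_scalex_kwargs, run_scalex), the file names, and the batch labels are hypothetical and only illustrate how the documented keyword arguments fit together. n_top_features and outdir are passed explicitly because their signature defaults are None.

```python
def build_scalex_kwargs(paths, outdir="output/", assess=False):
    """Collect a subset of the documented keyword arguments in one dict."""
    paths = list(paths)
    return dict(
        data_list=paths,
        # Hypothetical labels; by default increasing numbers are used.
        batch_categories=[f"batch{i}" for i in range(len(paths))],
        profile="RNA",
        min_features=600,     # drop cells with fewer detected features
        min_cells=3,          # drop genes detected in fewer cells
        n_top_features=2000,  # documented default, passed explicitly
        join="inner",         # intersection of variables across batches
        outdir=outdir,
        assess=assess,
    )


def run_scalex(paths, **overrides):
    """Call scalex.SCALEX with the arguments above (requires scalex)."""
    import scalex  # imported lazily so the helper above works without it

    kwargs = build_scalex_kwargs(paths)
    kwargs.update(overrides)
    return scalex.SCALEX(**kwargs)


if __name__ == "__main__":
    # Hypothetical input files, one AnnData file per batch.
    print(build_scalex_kwargs(["pbmc_a.h5ad", "pbmc_b.h5ad"]))
```

Any other documented argument, such as projection or ignore_umap, can be supplied through run_scalex as a keyword override.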

Return type:



The output folder contains:

  • adata.h5ad – The AnnData matrix after batch-effect removal. The low-dimensional representation of the data is stored in adata.obsm[‘latent’].

  • checkpoint – The saved model, including its variables and parameters.

  • log.txt – Records raw data information, filter conditions, model parameters, etc.

  • umap.pdf – UMAP plot for visualization.
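
The outputs listed above can be inspected after a run. The following is a sketch only: the helper names expected_outputs and load_latent are hypothetical, the output names are taken from the list above, and reading the file assumes the anndata package is installed.

```python
from pathlib import Path


def expected_outputs(outdir="output/"):
    """Map each documented output name to its path under outdir."""
    root = Path(outdir)
    names = ["adata.h5ad", "checkpoint", "log.txt", "umap.pdf"]
    return {name: root / name for name in names}


def load_latent(outdir="output/"):
    """Read adata.h5ad and return the embedding in adata.obsm['latent']."""
    import anndata  # imported lazily; only needed when reading the file

    adata = anndata.read_h5ad(expected_outputs(outdir)["adata.h5ad"])
    return adata.obsm["latent"]


if __name__ == "__main__":
    # Report which of the documented outputs are present on disk.
    for name, path in expected_outputs().items():
        print(f"{name}: {'found' if path.exists() else 'missing'}")
```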