CCIM workflow

Generation of CCIM

We define the cell–cell interaction vector between a pair of cells as the geometric mean of expression values of each cognate ligand–receptor pair. Formally, the interaction vector V between sender cell Ni and receiver cell Nj is given by

$$V_N_iN_j = \left[ \sqrt N_i^l_1 \ast N_j^r_1 ,\sqrt N_i^l_2 \ast N_j^r_2 , \ldots ,\sqrt N_i^l_n \ast N_j^r_n \right],$$

where ln,rn represent a cognate ligand–receptor pair. We chose to multiply ligand and receptor expression values so that zero values of either ligand or receptor expression would result in a zero value for the corresponding index of the interaction vector. Additionally, we chose to take the square root of the product of ligand–receptor expression values so that highly expressed ligand–receptor pairs do not disproportionately drive downstream analysis. This definition is equivalent to the geometric mean. The cell–cell interaction matrix M is constructed by concatenating the cell–cell interaction vectors. M is used as input to low-dimensional embeddings for visualization and nearest neighbor graphs for graph-based clustering.

Weighting CCIM by upstream regulome

The CCIM M can be weighted by ligand–receptor edges that are predicted to be active based on observed downstream gene expression changes. First, we identify genes in the dataset that are variable across some axis of interest. For analyses of single datasets, variable genes can be defined as the set of genes with the highest residual variance in the dataset—for example, by calling FindVariableFeatures as implemented by Seurat. For comparative analyses, Scriabin provides several utility functions to aid in the identification of variable genes between samples or between timepoints, depending on the user’s analytical questions.

Next, the package CelliD23, which provides a convenient and scalable workflow to define single-cell gene signatures, is used to define per-cell gene signatures. In brief, user-defined variable genes are used to embed the dataset into low-dimensional space by MCA. A cell’s gene signature is then defined as the set of genes to which that cell is nearest in the MCA bi-plot. A quantile cutoff is used to threshold gene proximity, by default the 5% of nearest genes.

NicheNet’s20 ligand–target matrix, which denotes the regulatory potential scores between ligands and target genes, is then used to rank ligands based on their predicted ability to result in the per-cell gene signature. First, expressed genes are defined by the percentage of cells in which they are detected (by default, 2.5%). Next, a set of potential ligands is defined as those ligands that are expressed genes and for which at least one receptor is also an expressed gene. Next, the ligand–target matrix is filtered to contain only the set of potential ligands and targets in the set of expressed genes. The authors of NicheNet showed that the Pearson correlation coefficient between a ligand’s target prediction and observed transcriptional response is the most informative metric of ligand activity20. Therefore, the activity of a single ligand for a single cell is defined as the Pearson correlation coefficient between the vector of that cell’s gene signature and the target gene scores for that ligand. For each active ligand, target gene weights for each cell are defined as the ligand–target matrix regulatory score for the top 250 targets for each ligand that appear in a given cell’s gene signature. We selected a Pearson coefficient threshold (by default, 0.075) to define active ligands in each cell.

Finally, we weight individual values of \(V_N_iN_j\). Scriabin supports two methods for weighting the CCIM by predicted ligand activities. Method ‘product’ (default) weights interaction vectors proportionally to predicted ligand activities. The vector of ligand activities for receiver cell Nj, Aj, is scaled so that values above the Pearson threshold lie between two scaling factors (by default, 1.5 and 3), and values below the Pearson threshold are set to 1. The interaction vector is then given by:

$$\beginarrayl\mathopV\nolimits_N_iN_j^\,product = \left[ \sqrt N_i^l_1 \ast N_j^r_1 \ast A_j^l_1 ,\sqrt N_i^l_2 \ast N_j^r_2 \ast A_j^l_2 , \ldots ,\sqrt N_i^l_n \ast N_j^r_n \ast A_j^l_n \right]\endarray$$

Method ‘sum’ treats a ligand activity as orthogonal evidence of receptor expression. Pearson coefficients in the vector of ligand activities for receiver cell Nj, Aj, that are below the Pearson threshold are set to 0. The interaction vector is then given by:

$$\beginarrayl\mathopV\nolimits_N_iN_j^\,sum = \left[ \sqrt N_i^l_1 \ast \left( N_j^r_1 + A_j^l_1 \right) ,\sqrt N_i^l_2 \ast \left( N_j^r_2 + A_j^l_2 \right) , \ldots ,\sqrt N_i^l_n \ast \left( N_j^r_n + A_j^l_n \right) \right]\endarray$$

Use cases for ligand activity weighting methods, as well as other parameters involved in calculating ligand activities, are described in the Supplementary Text.

Downstream analysis of weighted CCIMs

M can be treated analogously to the gene expression matrix and used for downstream analysis tasks, such as dimensionality reduction. After generation and (optional) weighting of M by active ligands, M is placed into an assay of a Seurat object for downstream analysis. M is scaled by ScaleData; latent variables are found via PCA; and the top principal components (PCs) (identified by ElbowPlot for each dataset) are used to embed the dataset in two dimensions using uniform manifold approximation and projection (UMAP)74. Neighbor graphs are constructed by FindNeighbors, which can then be clustered via modularity optimization graph-based clustering75 as implemented by Seurat’s FindClusters73. Differential ligand–receptor edges among clusters, cell types or samples can be identified using FindMarkers. Scriabin provides several utility functions to facilitate visualization of gene expression profiles or other metadata on Seurat objects built from cell–cell interaction matrices.

Summarized interaction graph and binning workflow

Generation of summarized interaction graph

Because M scales exponentially with dataset size, it is frequently impractical to calculate M for all cell–cell pairs Ni,Nj. In this situation, Scriabin supports two workflows that do not require aggregation or subsampling. In the first workflow, a summarized cell–cell interaction graph S is built in lieu of M where \(S_i,\,j = \Sigma V_N_iN_j\). S thus represents the magnitude of predicted interaction across all cognate ligand–receptor pairs expressed by all sender–receiver cell pairs. S is then corrected for associations with sequencing depth by linear regression. The sequencing depth of cell–cell pair Ni,Nj is defined as \(nUMI_N_i + nUMI_N_j\). A linear model is fit to describe the relationship between the summarized interaction score (\(S_i,j = \Sigma V_N_iN_j\), where S is the summarized interaction matrix and \(V_N_iN_j\) is the interaction vector for cell–cell pair Ni,Nj) and the total sequencing depth of each cell–cell pair. The residuals of this model are used as a sequencing depth-corrected S. S may optionally be weighted through prediction of ligand activity as described above. The second workflow is described below in the ‘Interaction program discovery workflow’ subsection.

Dataset binning for comparative CCC analyses

Once summarized interaction graphs are built for multiple samples, alignment of these graphs requires knowledge about which cells between samples represent a shared molecular state. The goal of binning is to assign each cell a bin identity so that S from multiple samples can be summarized into equidimensional matrices based on shared bin identities.

The binning process begins by constructing a shared nearest neighbor (SNN) graph using FindNeighbors, defining connectivity between all cells to be compared. Alternate neighbor graphs—for example, those produced using Seurat’s weighted nearest neighbor (WNN) workflow, which leverages information from multimodal references—can also be used. Next, mutual nearest neighbors (MNNs) are identified between all sub-datasets to be compared using Seurat’s integration workflow (FindIntegrationAnchors)21. In brief, two sub-datasets to be compared are placed into a shared low-dimensional space via diagonalized canonical correlation analysis (CCA), and the canonical correlation vectors are log-normalized. Normalized canonical correlation vectors are then used to identify k-nearest neighbors for each cell in its paired dataset, and the resulting MNN pairings are scored as described21. Low-scoring MNN pairings are then removed, as they have a higher tendency to represent incorrect cell–cell correspondences when orthogonal data are available (Extended Data Fig. 10).

For each cell that participates in an MNN pair, Scriabin defines a bin as that cell and all cells with which it participates in an MNN pair. Considering a dataset of n cells i of which a subset i′ participates in an MNN pair, for each cell in we define a bin jn that contains in and all MNNs of in. Next, Scriabin constructs a connectivity matrix G where Gi,j is the mean connectivity in the SNN graph between cell i and the cells within bin j. Each cell in is assigned a bin identity of the bin jn with which it shares the highest connectivity in G. Thus, at the end of this process, each cell has a single bin identity, which reflects its SNN similarity to a group of cells with cross-dataset MNN connectivity.

However, at this stage, each bin jn may not contain cells from all the samples being compared. Thus, we next optimize for the set of bins that results in the best representation of all samples. Bins j with the lowest total connectivity and lowest multi-sample representation in G are iteratively removed, and cell bin identities are re-scored until the mean sample representation of each bin plateaus. Within-bin connectivity and sample representation are further improved by reassigning cells that result in better sample representation of an incompletely represented bin while maintaining equal or greater SNN connectivity with the cells in that bin. Finally, remaining incompletely represented bins are merged with the nearest completely represented bin with which it shares the highest SNN connectivity. At the end of this process, each cell will, thus, have a single assigned bin identity, where each bin contains cells from all samples to be compared.

Statistical analysis of bin significance

Bins are then tested for the statistical significance of their connectivity structure using a permutation test. For each bin, random bins of the same size and number of cells per sample are generated iteratively (by default, 10,000 times). The connectivity vector of the real bins is tested against each of the random bins by a one-sided Mann–Whitney U-test. If the bin fails 500 or more of these tests (P> 0.05), it is considered non-significant.

Because bin SNN connectivity is generally non-zero, but randomly sampled cells generally have an SNN connectivity of zero, this strategy will tend to return most bins as statistically significantly connected. Thus, we recommend passing high-resolution cell type labels to the binning significance testing. In this situation, randomly generated bins are generated by randomly selecting cells from the same sample and cell type annotation, and the permutation test proceeds as described above. Bins where more than a threshold (by default, 95%) of cells belong to the same cell type annotation are automatically considered significant. This avoids rare cell types that may only form a single bin from being discarded. Cells that were assigned to bins that failed the significance testing are reassigned to the bin with which they share the highest SNN connectivity.

Identification of variable bins

For each bin, a Kruskal–Wallis test is used to assess differences in the magnitude of CCC between cell–cell pairs from different samples. The Kruskal–Wallis P value and test statistic can be used to identify which bins contain cells that exhibit the highest change in prediction interaction scores. Specific samples that contribute to each significantly variable bin’s perturbation are then identified through the Dunn post hoc test. This set of sender and receiver cells can then be used to construct M as described above.

Interaction program discovery workflow

Iterative approximation of a ligand–receptor pair TOM

An alternative to the summarized interaction graph workflow is to instead identify co-expressed ligand–receptor pairs, which we refer to as ‘interaction programs’. This approach represents an adaptation of the well-established WGCNA22 and is scalable to any dataset size and still permits analysis of CCC at single-cell resolution. The first step in this workflow is to generate a signed covariance matrix of ligand–receptor pairs for each sample, defined as

$$s_ij^signed = 0.5 + 0.5cor\left( lr_i,lr_j \right),$$

where \(lr_i,lr_j\) are individual ligand–receptor pair vectors of M. In large datasets, \(s_ij^signed\) is approximated by iteratively generating subsets of M. \(s_ij^signed\) is next converted into an adjacency matrix via soft thresholding

$$a_ij = \left( s_ij^signed \right)^\beta ,$$

where β is the soft power. Soft power is a user-defined parameter that is recommended to be the lowest value that results in a scale-free topology model fit of >0.6. Next, this adjacency matrix is converted into a TOM as described76. This process proceeds separately for each sample to be analyzed in a multi-sample dataset.

Identification and significance testing of interaction programs

The TOM is hierarchically clustered, and interaction programs are identified through adaptive branch pruning of the hierarchical clustering dendrogram. Intramodular connectivity for each ligand–receptor pair in each interaction program is then calculated as described77. If interaction programs are being discovered in a multi-sample dataset, similar modules (defined by Jaccard overlap index above a user-defined threshold) are merged. Next, interaction programs are then tested for statistically significant co-expression structure via a permutation test where random interaction programs are generated 10,000 times. The correlation vector of the real module is tested against each of the random modules by a one-sided Mann–Whitney U-test. If the module fails 500 or more of these tests (P > 0.05), it is considered non-significant. Each sample is tested for significant correlation of each module.

Downstream analysis of interaction programs

Single cells are scored separately for the expression of the ligands and receptors of each significant module with Seurat’s AddModuleScore. This function calculates a module score by comparing the expression level of an individual query gene to other randomly selected control genes expressed at similar levels to the query genes and is, therefore, robust to scoring modules containing both lowly and highly expressed genes as well as to scoring cells with different sequencing depth. Scriabin includes several utility functions to conveniently visualize interaction program expression for sender and receiver cells.

Identification of longitudinal CCC circuits

A longitudinal CCC circuit is composed of S1–L1–R1–S2–L2–R2, where S are sender cells and R are receiver cells at timepoints 1 and 2 and where L1 is expressed by/sensed by S1/R1, and L2 is expressed by/sensed by S2/R2. For computational efficiency, construction of longitudinal CCC circuits starts at the end of the circuit and proceeds upstream. First, ligands L2 predicted by NicheNet to be active in receiver cells at timepoint 2 are identified. Next, sender cells that express L2 and have the L2 in its per-cell gene signature are identified. Among the bins occupied by these S2 candidates, Scriabin then searches for receiver cells at timepoint 1 that occupy the same bin and have the corresponding timepoint 2 ligand L2 within its list of upregulated target genes and identifies the ligand(s) L1 predicted by NicheNet to result in upregulation of that target. Finally, Scriabin identifies S1 candidates that express the timepoint 1 ligands L1 and have L1 in its per-cell gene signature. S1–R1–S2–R2 cell groups that meet these criteria are retained for further analysis. This process repeats for every pair of timepoints. Finally, Scriabin searches for overlap between circuits of sequential timepoint pairs to identify circuits that operate over more than two timepoints.

Ligand–receptor pair databases for analysis

Scriabin supports the use of 15 ligand–receptor interaction databases for all analytical functions; these resources were collected from LIANA78. By default, Scriabin uses the OmniPath database18,19 filtered for curation strength of >7 to ensure that ligand–receptor interactions with strong experimental evidence are included in downstream analysis. Scriabin also supports the use of custom ligand–receptor pair lists for users with specific analytical questions.

Transfection and co-culture of primary NK and B cells

Peripheral blood mononuclear cells (PBMCs) were acquired from a healthy blood donor who was consented for release of genetic data by the Stanford Blood Center. PBMCs were isolated by Ficoll-Paque (GE Healthcare) density gradient centrifugation and cryopreserved in 90% FBS + 10% DMSO (v/v). PBMCs were thawed at 37 °C in complete RPMI 1640 media (supplemented with 10% FBS, L-glutamine and penicillin–streptomycin–amphotericin; RP10) containing benzonase (EMD Millipore). NK cells and B cells were purified from thawed PBMCs by magnetic bead isolation via negative selection according to the manufacturer’s specifications (Miltenyi Biotec, 130-092-657 and 130-101-638, respectively). NK and B cells were maintained in complete RP10 media without additional cytokines to ensure a resting state. All cell culture was performed at 37 °C/5% CO2 in a humidified environment.

eGFP-encoding, CD40-encoding and CD40L-encoding mRNAs were purchased from TriLink BioTechnologies and used without further purification. Notably, open reading frame (ORF) sequences for mRNAs encoding CD40 and CD40L were codon optimized using the codon optimization tool developed by Integrated DNA Technologies: this serves both to improve translational efficacy as well as to enable distinguishing endogenous versus exogenous CD40 and CD40LG mRNAs through sequencing.

mRNAs were delivered to isolated NK and B cells via transfection by charge-altering releasable transporters (CARTs) as previously described79. In brief, CART/mRNA polyplexes were prepared by diluting 0.84 of mRNA (1 μg μl−1) into 14.52 μl of PBS (pH 5.5). To this solution was added 1.44 μl of CART BDK-O7:N7:A13 (2 mM DMSO) to achieve a charge ratio of 10:1 (±, assuming all ionizable cationic groups are protonated). After mixing by finger vortex for 15 s, 2.5 μl of the polyplexes was added to cells and incubated for 6 h in serum-free media. After this incubation, an aliquot was taken from each transfection condition for flow cytometric analysis; FBS was added to a final concentration of 10%; the cells were counted; and NK cells and B cells from the respective transfection conditions were mixed together in a 1:1 ratio for co-culture. Cells were co-cultured for 12 h before analysis by flow cytometry and scRNA-seq.

Flow cytometry

Antibodies used for flow cytometric analyses are listed in Supplementary Table 2. eBioscience Fixable Viability Dye eFluor 780 (Thermo Fisher Scientific) was used as a viability stain. After application of viability stain, cells were surface stained for 20 min at room temperature before acquisition on an Aurora flow cytometer (Cytek Biosciences) and analysis by FlowJo version 10.6.1 software.

scRNA-seq by Seq-Well

The Seq-Well platform for scRNA-seq was used as described previously56,80,81,82,83. Immediately after co-culture, cells were counted and diluted in RP10 to a concentration of 75,000 cells per milliliter. Then, 200 μl of this cell suspension (15,000 cells) was loaded onto Seq-Well arrays pre-loaded with mRNA capture beads (ChemGenes). After four washes with DPBS to remove serum, the arrays were sealed with a polycarbonate membrane (pore size, 0.01 µm) for 30 min at 37 °C. Next, arrays were placed in lysis buffer; transcripts were hybridized to the mRNA capture beads; beads were recovered from the arrays and pooled for downstream processing. Immediately after bead recovery, mRNA transcripts were reverse transcribed using Maxima H-RT (Thermo Fisher Scientific, EPO0753) in a template-switching-based RACE reaction; excess unhybridized bead-conjugated oligonucleotides were removed with Exonuclease I (New England Biolabs (NEB), M0293L); second-strand synthesis was performed with Klenow fragment (NEB, M0212L) to enhance transcript recovery in the event of failed template switching81. Whole-transcriptome amplification (WTA) was performed with KAPA HiFi PCR Mastermix (Kapa Biosystems, KK2602) using approximately 6,000 beads per 50-μl reaction volume. Resulting libraries were then pooled in sets of six (approximately 36,000 beads per pool), and products were purified by Agencourt AMPure XP beads (Beckman Coulter, A63881) with a 0.6× volume wash followed by a 0.8× volume wash. Quality and concentration of WTA products were determined using an Agilent TapeStation, with a mean product size of more than 800 base pairs (bp) and a non-existent primer peak indicating successful preparation. Library preparation was performed with a Nextera XT DNA Library Preparation Kit (Illumina, FC-131-1096) with 1 ng of pooled library using single-index primers. Tagmented and amplified libraries were again purified by Agencourt AMPure XP beads with a 0.6× volume wash followed by a 1.0× volume wash, and quality and concentration were determined by TapeStation analysis. Libraries between 400 bp and 1,000 bp with no primer peaks were considered successful and pooled for sequencing. Sequencing was performed on a NovaSeq 6000 instrument (Illumina; Chan Zuckerberg Biohub). The read structure was paired-end with read 1 beginning from a custom read 1 primer80 containing a 12-bp cell barcode and an 8-bp UMI and with read 2 containing 50 bp of mRNA sequence.

Alignment and quality control of scRNA-seq data

Sequencing reads were aligned and count matrices assembled using STAR84 and dropEst85, respectively. In brief, the mRNA reads in read 2 demultiplexed FASTQ files were tagged with the cell barcode and UMI for the corresponding read in the read 1 FASTQ file using the dropTag function of dropEst. Next, reads were aligned with STAR using the GRCh38.p13 (hg38) human reference genome from Ensembl. This reference also included sequences and annotations for the codon-optimized ORFs for GFP-encoding, CD40-encoding and CD40L-encoding mRNAs so that both endogenous and exogenous mRNAs could be quantified. Count matrices were built from resulting BAM files using dropEst85. Cells that had fewer than 750 UMIs or more than 15,000 UMIs, as well as cells that contained more than 20% of reads from mitochondrial genes or rRNA genes (RNA18S5 or RNA28S5), were considered low quality and removed from further analysis. To remove putative multiplets, cells that expressed more than 75 genes per 100 UMIs were also filtered out.

Pre-processing of scRNA-seq data

The R package Seurat21,73,86 was used for data scaling, transformation, clustering, dimensionality reduction, differential expression analysis and most visualizations. Unless otherwise noted, data were scaled and transformed and variable genes identified using the SCTransform() function, and linear regression was performed to remove unwanted variation due to cell quality (% mitochondrial reads and % rRNA reads). PCA was performed using the 3,000 most highly variable genes, and the first 50 PCs were used to perform UMAP to embed the dataset into two dimensions74,87. Next, the first 50 PCs were used to construct an SNN graph (FindNeighbors()), and this SNN was used to cluster the dataset (FindClusters()). Although upstream quality control removed many dead or low-quality cells, if any clusters were identified that were defined by few canonical cell lineage markers and enriched for genes of mitochondrial or ribosomal origin, these clusters were removed from further analysis88,89.

Annotation of transfected NK and B cells in scRNA-seq data

Because of the strong degree of transcriptional perturbation caused by transfection (Extended Data Fig. 3), we elected to annotate NK and B cells in this dataset by integration with a multimodal reference rather than by graph-based clustering. First, we noted two clusters with high expression of CD3-encoding genes and monocyte-specific genes (including LYZ and CD14), respectively; we considered these clusters contaminating T cells and monocytes and removed them from further analysis. Next, we used the multimodal (whole transcriptome plus 228 cell surface proteins) PBMC dataset published by Hao et al.73 as a reference. We subsetted the reference to contain only NK and B cells, scaled both the transcriptome and protein assays and ran PCA on both modalities. Next, we found multimodal neighbors between the modalities using WNN analysis, which learns the relative utility of each data modality in each cell. Supervised PCA (SPCA) was then run on the WNN SNN graph, which seeks to capture a linear transformation that maximizes its dependency to the WNN SNN graph. These SPCA-reduced dimensions were then used for identification of anchors between the reference and query datasets as previously described21. Finally, these anchors were used to transfer reference cell type annotations to the query dataset.

Processing, analysis and visualization of public scRNA-seq datasets

Published scRNA-seq datasets were acquired as described in the ‘Data availability’ section. In each case, we acquired raw count matrices or processed Seurat objects containing raw count matrices. Any upstream processing was performed as described in the respective manuscripts.

Raw count matrices from Ravindra et al.58 required filtering before downstream analysis; cells meeting the following criteria were kept: >1,000 UMIs, <20,000 UMIs, >500 unique features, <0.85 UMI-to-unique feature ratio, <20% UMIs of mitochondrial origin and <35% reads from ribosomal protein-encoding genes. Pbmc5k and pbmc10k datasets from 10x Genomics were filtered to enforce a minimum number of features per cell of 200 and to remove genes not expressed in at least three cells.

Cell type annotations were provided for the Ji et al.29, Ma et al.41 and Fawkner-Corbett et al.49 datasets, which were used for downstream analytical tasks. For the Ravindra et al.58 dataset, manual annotation of cellular identity was performed by finding differentially expressed genes for each cluster using Seurat’s implementation of the Wilcoxon rank-sum test (FindMarkers()) and comparing those markers to known cell-type-specific genes listed in Ravindra et al.58. PBMC datasets were annotated by WNN projection and label transfer from a multimodal PBMC reference as described73,83.

For analysis of T cell exhaustion in the SCC dataset from Ji et al.29, an exhaustion signature was defined by PDCD1, TOX, CXCL13, CTLA4, TNFRSF9, HAVCR2, LAG3, CD160 and CD244. This signature incorporates several markers of exhausted T cell reported in the literature66,90,91,92. Individual T cells were scored for expression of this signature using Seurat’s AddModuleScore.

Analysis of the impact of sparsity on single-cell-resolution CCC analysis

We collected three scRNA-seq datasets generated from methods with high coverage: Fluidigm C1 pancreas islets21,33, Smart-seq2 uterine decidua13 and Smart-seq2 HNSCC8. We used Scriabin’s CCIM workflow to generate a CCIM for each of these datasets, which we consider to be the GT of CCC for that dataset. Next, we randomly downsampled the datasets to inDrop coverage using downsampleMatrix() from scuttle93,94,95,96,97,98,99,100. For these analyses, inDrop coverage was defined as the mean UMIs per cell of the inDrop dataset included in the pancreas islet dataset available through SeuratData21, which is 5,828 UMIs per cell.

From these downsampled datasets, we generated several interaction matrices:

  1. 1.

    Raw. We calculated an unweighted CCIM through Scriabin’s standard CCIM workflow.

  2. 2.

    ALRA-denoised. We used ALRA32, a denoising algorithm for scRNA-seq, to denoise the downsampled dataset and then built an unweighted CCIM from the denoised dataset.

  3. 3.

    Neighborhood-aggregated. We generated nearest neighbor graphs from each downsampled dataset using Seurat’s FindNeighbors() using the first 15 PCs. Next, we defined the neighborhood for each cell as that cell and its nearest k neighbors. Finally, we defined the transcriptome of each cell as the mean of that cell and its nearest k neighbors. We used values of k between 5 and 100. We used this matrix of neighborhood-aggregated expression values to generate CCIMs.

  4. 4.

    Cluster-aggregated. We used Seurat’s graph-based clustering algorithm with resolutions between 2.5 and 5 to cluster each downsampled dataset. We then generated a matrix of pseudobulk expression vectors for each cluster and used this matrix to generate a cluster–cluster interaction matrix.

Next, we used Seurat’s dataset integration pipeline to integrate each CCIM from the downsampled dataset with the GT CCIM21. The first 30 PCs were used for CCIM integration. Finally, we used the LISI36 to quantify the degree to which the CCIMs from the downsampled datasets recapitulated the GT CCIMs. This value defines the number of datasets in the neighborhood of each GT cell–cell pair and ranges between 1, denoting that only GT cell–cell pairs are present in the neighborhood, and 2, denoting an equal mixture of GT and downsampled cell–cell pairs.

Comparative analyses between Scriabin and published CCC analysis methods

Pbmc5k and pbmc10k datasets from 10x Genomics were used to benchmark the computational efficiency of Scriabin. For single dataset analyses, pbmc5k was randomly subsetted to multiple dataset sizes. Cell type annotations were passed to Connectome38, NATMI17, CellChat15, iTALK16 and SingleCellSignalR (SCA)37, which were run using default parameters defined by LIANA39. The time for these methods to return results was compared to a version of Scriabin that generated and visualized a full dataset summarized interaction graph and returned pseudobulk ligand–receptor pair scores for each cell type annotation. Connectome38 is the only of these packages that supports a full comparative workflow. For comparative analysis, we analyzed differences in CCC between the pbmc5k and pbmc10k datasets. We compared Connectome’s total runtime to the runtime of Scriabin to generate full dataset summarized interaction graphs, perform dataset binning and visualize the most perturbed bins.

Multiple ligand–receptor resources compiled by LIANA39 were used to compare results returned by published CCC analysis methods and Scriabin. This analysis was performed on four datasets: 10x PBMC 5k, Fluidigm C1 pancreas islets21,33, Smart-seq2 uterine decidua13 and Smart-seq2 HNSCC8. The following results parameters were used from each method: prob (CellChat), LRscore (SingleCellSignalR), weight_norm (Connectome), weight_comb (iTALK) and edge_avg_expr (NATMI). To visualize the overlap in results between the methods and resources, we extracted the top 1,000 results from each method–resource pair and calculated the Jaccard index between these top results (as described by ref. 39).

Analysis of spatial transcriptomic datasets with Scriabin

To evaluate if Scriabin returns biologically meaningful CCC edges, we downloaded spatial coordinates and gene expression count matrices from 10 spatial transcriptomic datasets from the 10x Visium platform available at We also analyzed a spatial transcriptomic dataset published by Ma et al.41 of a human leprosy granuloma. We treated each count matrix analogously to scRNA-seq data, performing data transformation and dimensionality reduction as described above. We calculated per-cell gene signatures for each dataset based on variable genes across the dataset, which we then used to rank ligands based on their predicted ability to result in the observed gene expression profile using NicheNet20. Next, we constructed a summarized interaction graph using a ligand–receptor pair database that was restricted to membrane-bound ligands and receptors, which we weighted according to the predicted ligand activities. Finally, we compared the distance quantile of the top 1% of interacting cell–cell pairs compared to randomly permuted distances.

Analysis of CRISPRa Perturb-seq data

To quantify Scriabin’s ability to detect changes in CCC at single-cell resolution, we analyzed data from a pooled genetic perturbation screen. We elected to analyze the Perturb-seq dataset published by Schmidt et al.40 as this dataset was collected on primary cells and contained a high number of gRNAs (15) targeting cell surface ligands or receptors used in CCC. We collected a processed and publicly available h5Seurat object of the anti-human CD3/CD28 re-stimulated T cells from the Schmidt et al.40 dataset from The authors’ gRNA calls were used for all downstream analysis; we identified gRNAs g in this dataset that targeted cell surface ligands or receptors that were present in OmniPath’s ligand–receptor interaction database. The dataset was then subsetted to include only cells transduced with a gRNA targeting one of these cell surface ligands or receptors or cells transduced with a non-targeting gRNA. Untransduced T cells were removed from further analysis. We repeated the following process for each gRNA gi. Given a gRNA, gA, targeting a ligand-encoding gene A: we isolated cells transduced with gA and cells transduced with a non-targeting gRNA. From this subsetted dataset, we generated a CCIM without ligand activity ranking using Scriabin’s CCIM workflow. We next isolated interaction vectors V for ligand A and all receptors of A, RA. For each interaction vector VAR, we constructed a receiver operating characteristic (ROC) curve using VAR as the predictor variable and the gRNA assignment (either gA or non-targeting) as the response variable to quantify and visualize the sensitivity and specificity of the prediction.


For all box plot features: minimum whisker, 25th percentile −1.5 × interquartile range (IQR) or the lowest value within; minimum box, 25th percentile; center, median; maximum box, 75th percentile; maximum whisker, 75th percentile +1.5 × IQR or greatest value within.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.


Leave a Reply

Your email address will not be published. Required fields are marked *