Preprocessing and clustering 3k PBMCs
This showcase reproduces Seurat's Guided Clustering Tutorial
Overview
Loading data
The data consists of 3k PBMCs from a Healthy Donor and is freely available from 10x Genomics.
You can either download the data manually
mkdir pbmc3k
wget http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz -O pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz
cd pbmc3k; tar -xzf pbmc3k_filtered_gene_bc_matrices.tar.gz --strip-components 2
The read_10X
function reads the data from the cellranger 10X pipeline and returns a labelled count matrix. Each entry indicated the number of molecules detected for each feature/gene (columns) and each cell (rows).
using Severo
X = read_10X("pbmc3k/")
The meaning of the rows and columns is different from the representation used by other packages like Seurat
Alternatively, the dataset
function can be used to load a dataset from a predefined collection. For example, the PBMC collection is known by Cell.jl
and can be easily loaded as follows:
using Severo
X = dataset("PBMC", "3k")
# The matrix can be indexed using names or indices. For instance,
# we can look at specific genes in the first thirty cells
#X[1:30, ["CD3D", "TCL1A", "MS4A1"]]
2700×32738 Named sparse matrix with 2286884 Int32 nonzero entries:
[AGATTCCTGACGAG-1, "AL627309.1"] = 1
[CCAAAGTGCTACGA-1, "AL627309.1"] = 1
[CCGTACACGTTGGT-1, "AL627309.1"] = 1
[CGACTGCTTCCTCG-1, "AL627309.1"] = 1
[CTATGTTGTCTCGC-1, "AL627309.1"] = 1
[CTGTGAGACGAACT-1, "AL627309.1"] = 1
[GATACTCTATCGGT-1, "AL627309.1"] = 1
[GCTAGATGAGCTCA-1, "AL627309.1"] = 1
[GCTTAACTACTGGT-1, "AL627309.1"] = 1
⋮
[TCTCAAACCTAAGC-1, "SRSF10.1"] = 1
[TGAGACACTCAAGC-1, "SRSF10.1"] = 1
[TGATCGGATATGCG-1, "SRSF10.1"] = 1
[TGTAGGTGCTATGG-1, "SRSF10.1"] = 1
[TTATCCGAGAAAGT-1, "SRSF10.1"] = 1
[TTCAGTACCGACTA-1, "SRSF10.1"] = 1
[TTGAATGATCTCAT-1, "SRSF10.1"] = 1
[TTGAGGACTACGCA-1, "SRSF10.1"] = 1
[TTTATCCTGTTGTG-1, "SRSF10.1"] = 1
[TTTCCAGAGGTGAG-1, "SRSF10.1"] = 1
The count data is stored in a sparse matrix format, only storing non-zero elements of the matrix. Any values not shown are zero.
Severo.read_data
— Functionread_data(path::AbstractString; kw...)
Tries to identify and read a count matrix in any of the supported formats
Arguments:
fname
: pathkw
: additional keyword arguments are passed on
Returns values:
Returns labeled sparse matrix containing the counts
Severo.read_10X
— Functionread_10X(dirname::AbstractString; unique_features=true)
Read count matrix from 10X genomics
Arguments:
dirname
: path to directory containing matrix.mtx, genes.tsv (or features.tsv), and barcodes.tsv from 10Xunique_features
: should feature names be made unique (default: true)
Returns values:
Returns labeled sparse matrix containing the counts
Missing docstring for dataset
. Check Documenter's build log for details.
Preprocessing
using Plots
using StatsPlots
pyplot() # hide
plot_highest_expressed(X)
savefig("he-plot.svg"); nothing # hide
Filtering
Two filtering function are available for filtering cells and features/genes based on basic criteria such as the number of features detected, the number of cells for which a feature is detected and the total number of counts.
X = filter_cells(X, min_features=200)
X = filter_features(X, min_cells=3)
2700×13714 Named sparse matrix with 2282976 Int32 nonzero entries:
[AGATTCCTGACGAG-1, "AL627309.1"] = 1
[CCAAAGTGCTACGA-1, "AL627309.1"] = 1
[CCGTACACGTTGGT-1, "AL627309.1"] = 1
[CGACTGCTTCCTCG-1, "AL627309.1"] = 1
[CTATGTTGTCTCGC-1, "AL627309.1"] = 1
[CTGTGAGACGAACT-1, "AL627309.1"] = 1
[GATACTCTATCGGT-1, "AL627309.1"] = 1
[GCTAGATGAGCTCA-1, "AL627309.1"] = 1
[GCTTAACTACTGGT-1, "AL627309.1"] = 1
⋮
[TCTCAAACCTAAGC-1, "SRSF10.1"] = 1
[TGAGACACTCAAGC-1, "SRSF10.1"] = 1
[TGATCGGATATGCG-1, "SRSF10.1"] = 1
[TGTAGGTGCTATGG-1, "SRSF10.1"] = 1
[TTATCCGAGAAAGT-1, "SRSF10.1"] = 1
[TTCAGTACCGACTA-1, "SRSF10.1"] = 1
[TTGAATGATCTCAT-1, "SRSF10.1"] = 1
[TTGAGGACTACGCA-1, "SRSF10.1"] = 1
[TTTATCCTGTTGTG-1, "SRSF10.1"] = 1
[TTTCCAGAGGTGAG-1, "SRSF10.1"] = 1
This filters out cells with less than 200 features/genes and features detected in less than 3 cells. For convenience, filter_counts
combines these two into a single function, similar to Seurat. Beware that the order in which you call these functions can matter.
Severo.filter_features
— Functionfilter_features(; min_cells = 0)
Partial version of filterfeatures(A::NamedCountMatrix; mincells = 0)
Severo.filter_cells
— Functionfilter_cells(; min_features = 0, min_feature_count = 0, min_umi = 0)
Partial version of filtercells(A::NamedCountMatrix; minfeatures = 0, minfeaturecount = 0, min_umi = 0)
Severo.filter_counts
— Functionfilter_counts(; min_cells = 0, min_features = 0, min_feature_count = 0, min_umi = 0)
Partial version of filtercounts(A::NamedCountMatrix; mincells = 0, minfeatures = 0, minfeaturecount = 0, minumi = 0)
Normalization
After removing unwanted cells from the dataset, the next step is to normalize the data. The basic method normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor and optionally log-transforming the result.
Y = normalize_cells(X, method=:lognormalize, scale_factor=1e4)
2700×13714 Named sparse matrix with 2282976 Float64 nonzero entries:
[AGATTCCTGACGAG-1, "AL627309.1"] = 1.538222672557476
[CCAAAGTGCTACGA-1, "AL627309.1"] = 1.3631966075373008
[CCGTACACGTTGGT-1, "AL627309.1"] = 1.3945254817849748
[CGACTGCTTCCTCG-1, "AL627309.1"] = 1.787190785781838
[CTATGTTGTCTCGC-1, "AL627309.1"] = 1.3514100666239552
[CTGTGAGACGAACT-1, "AL627309.1"] = 1.6848688061171004
[GATACTCTATCGGT-1, "AL627309.1"] = 1.6623488639622728
[GCTAGATGAGCTCA-1, "AL627309.1"] = 1.6749250073944073
[GCTTAACTACTGGT-1, "AL627309.1"] = 1.8821985210730219
⋮
[TCTCAAACCTAAGC-1, "SRSF10.1"] = 2.050332029149289
[TGAGACACTCAAGC-1, "SRSF10.1"] = 1.873284442922057
[TGATCGGATATGCG-1, "SRSF10.1"] = 2.021955420152209
[TGTAGGTGCTATGG-1, "SRSF10.1"] = 1.2128182558075358
[TTATCCGAGAAAGT-1, "SRSF10.1"] = 1.2133165770732028
[TTCAGTACCGACTA-1, "SRSF10.1"] = 1.529100016033425
[TTGAATGATCTCAT-1, "SRSF10.1"] = 1.7863632500815836
[TTGAGGACTACGCA-1, "SRSF10.1"] = 0.9465443048463601
[TTTATCCTGTTGTG-1, "SRSF10.1"] = 1.31322083398467
[TTTCCAGAGGTGAG-1, "SRSF10.1"] = 1.7178390681467346
Severo.normalize_cells
— Functionnormalize_cells(; method = :lognormalize, scale_factor::Real = 1.0, dtype::Type{T} = Float64) where T <: AbstractFloat
Partial version of normalizecells(X::NamedCountMatrix; method = :lognormalize, scalefactor::Real = 1.0, dtype::Type{T} = Float64) where T <: AbstractFloat
Identifying highly variable features
Next, we select a subset of features that exhibit high cell-to-cell variation in the dataset (i.e, they are highly expressed in some cells, and lowly expressed in others).
hvf = find_variable_features(X, 2000; method=:vst)
2000-element Named SubArray{Int64, 1, Vector{Int64}, Tuple{UnitRange{Int64}}, true}
features │
──────────────────────┼──────
"PPBP" │ 3316
"LYZ" │ 8682
"S100A9" │ 908
"IGLL5" │ 13275
"GNLY" │ 1804
"FTL" │ 12944
"PF4" │ 3315
"FTH1" │ 7755
⋮ ⋮
"EIF3D" │ 13383
"POLR2I" │ 12681
"ZCCHC10" │ 3924
"SEC61B" │ 6669
"RP11-164H13.1" │ 9585
"NELL2" │ 8455
"NF1" │ 11047
"GTF2H2" │ 3747
Severo.find_variable_features
— Functionfind_variable_features(nfeatures = 2000; method = :vst, dtype::Type{<:AbstractFloat} = Float64, kw...)
Partial version of findvariablefeatures(counts::NamedCountMatrix, nfeatures = 2000; method = :vst, dtype::Type{<:AbstractFloat} = Float64, kw...)
Dimensionality reduction
Scaling
Prior to dimensional reduction techniques like PCA, it's a good idea to scale_features
the data. Scaling performs two basic operations:
- Shifts the expression of each gene, so that the mean expression across cells is 0
- Scales the expression of each gene, so that the standard deviation across cells is 1. This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate
- Clips values exceeding standard deviation of
scale_max
Y = Y[:,hvf] # only use highly-variable features
S = scale_features(Y; scale_max=10)
CenteredMatrix:
A = 2700×2000 Named sparse matrix with 515841 Float64 nonzero entries:
[AAACCGTGCTTCCG-1, "PPBP"] = 3.05759
[AAATCAACCCTATT-1, "PPBP"] = 1.983
[AAATCCCTGCTATG-1, "PPBP"] = 5.03146
[AACCGCCTCTACGA-1, "PPBP"] = 2.44748
[AAGCCTGACATGCA-1, "PPBP"] = 3.68051
[AATCCTTGACGGGA-1, "PPBP"] = 3.12117
[ACCCACTGGTTCAG-1, "PPBP"] = 10.143
[ACCTGAGATATCGG-1, "PPBP"] = 9.63048
[ACCTGGCTGTCTTT-1, "PPBP"] = 7.59301
⋮
[TGAGTGACTGAGCT-1, "GTF2H2"] = 7.15279
[TGCGATGACCTCGT-1, "GTF2H2"] = 7.51358
[TGGGTATGAAGAGT-1, "GTF2H2"] = 6.33385
[TGTGAGACTGTCAG-1, "GTF2H2"] = 6.86631
[TTAGGGACGCGAAG-1, "GTF2H2"] = 7.54787
[TTAGTCTGAAAGCA-1, "GTF2H2"] = 6.76249
[TTCCAAACTCCCAC-1, "GTF2H2"] = 10.1315
[TTCGTATGTCCTTA-1, "GTF2H2"] = 5.26091
[TTGGTACTGAATCC-1, "GTF2H2"] = 8.17128
[TTTAGCTGTACTCT-1, "GTF2H2"] = 4.47778
mu = 2000-element Named Vector{Float64}
features │
──────────────────────┼──────────
"PPBP" │ 0.142967
"LYZ" │ 0.972527
"S100A9" │ 0.645189
"IGLL5" │ 0.18421
"GNLY" │ 0.407293
"FTL" │ 2.94261
"PF4" │ 0.110369
"FTH1" │ 3.22873
⋮ ⋮
"EIF3D" │ 0.775623
"POLR2I" │ 0.396716
"ZCCHC10" │ 0.211964
"SEC61B" │ 0.656028
"RP11-164H13.1" │ 0.0889054
"NELL2" │ 0.25572
"NF1" │ 0.130034
"GTF2H2" │ 0.131466
Severo.scale_features
— Functionscale_features(; scale_max::Real = Inf, dtype::Type{<:AbstractFloat} = Float64)
Partial version of scalefeatures(X::(NamedArray{T, 2, SparseMatrixCSC{T, Int64}} where T); scalemax::Real = Inf, dtype::Type{<:AbstractFloat} = Float64)
Principal component analysis
Reduce the dimensionality of the data by running principal component analysis (PCA), which reveals the main axes of variation and de-noises the data.
em = embedding(S, 15, method=:pca)
PC-1
Positive: WeakRefStrings.String31["CST3", "TYROBP", "LST1", "AIF1", "FTL", "FCN1", "LYZ", "FTH1", "FCER1G", "S100A9"]
Negative: WeakRefStrings.String31["LTB", "IL32", "CD2", "ACAP1", "STK17A", "CTSW", "CD247", "GIMAP5", "AQP3", "CCL5"]
PC-2
Positive: WeakRefStrings.String31["CD79A", "MS4A1", "TCL1A", "HLA-DRA", "HLA-DQA1", "HLA-DQB1", "LINC00926", "CD79B", "HLA-DRB1", "CD74"]
Negative: WeakRefStrings.String31["NKG7", "PRF1", "CST7", "GZMA", "GZMB", "FGFBP2", "CTSW", "GNLY", "GZMH", "SPON2"]
PC-3
Positive: WeakRefStrings.String31["HLA-DQA1", "CD79A", "HLA-DQB1", "CD79B", "HLA-DPB1", "CD74", "HLA-DPA1", "MS4A1", "HLA-DRB1", "HLA-DRB5"]
Negative: WeakRefStrings.String31["PPBP", "PF4", "SDPR", "SPARC", "GNG11", "NRGN", "GP9", "RGS18", "TUBB1", "HIST1H2AC"]
PC-4
Positive: WeakRefStrings.String31["HLA-DQA1", "CD79A", "CD79B", "HLA-DQB1", "MS4A1", "CD74", "HLA-DPB1", "HIST1H2AC", "PF4", "HLA-DQA2"]
Negative: WeakRefStrings.String31["RPS2", "VIM", "S100A6", "S100A8", "S100A4", "IL32", "S100A9", "TMSB10", "GIMAP7", "S100A10"]
PC-5
Positive: WeakRefStrings.String31["GZMB", "FGFBP2", "NKG7", "GNLY", "PRF1", "CCL4", "CST7", "SPON2", "GZMA", "GZMH"]
Negative: WeakRefStrings.String31["LTB", "VIM", "AQP3", "PPA1", "MAL", "RPS2", "KIAA0101", "CD2", "CORO1B", "CYTIP"]
plot_elbow(em)
savefig("elbow-plot.svg"); nothing # hide
Severo.embedding
— Functionembedding(ncomponents::Int64 = 50; method = :pca, kw...)
Partial version of embedding(X, ncomponents::Int64 = 50; method = :pca, kw...)
Clustering the cells
Briefly, the clustering embed cells in a graph structure with edges between cells with similar feature expression patterns, and then attempt to partition this graph into highly interconnected 'quasi-cliques' or 'communities'.
Computing the neighborhood graph
First, we need to find the k-nearest neighbors of all the cells based on the euclidean distance in PCA space, and then compute a graph based on the shared overlap in the local neighborhoods between any two cells (Jaccard similarity). These step are performed using the nearest_neighbours
and jaccard_index
functions. For convenience, the two steps are also combined in the shared_nearest_neighours
function.
nn = nearest_neighbours(em, 20, dims=1:10)
snn = jaccard_index(nn, prune=1/15)
# or as a single step
#snn = nearest_neighbours(em, 20, dims=1:10, prune=1/15)
2700×2700 Named sparse matrix with 198466 Float64 nonzero entries:
[AAACATACAACCAC-1, AAACATACAACCAC-1] = 1.0
[AACTCTTGCAGGAG-1, AAACATACAACCAC-1] = 0.1111111111111111
[AATCTAGAATCGGT-1, AAACATACAACCAC-1] = 0.08108108108108109
[ACCACGCTGCTGTA-1, AAACATACAACCAC-1] = 0.17647058823529413
[ACGCCGGAAATGCC-1, AAACATACAACCAC-1] = 0.17647058823529413
[AGAGATGATTGTGG-1, AAACATACAACCAC-1] = 0.08108108108108109
[AGCGGGCTTGCCAA-1, AAACATACAACCAC-1] = 0.1111111111111111
[AGGAGTCTGGTTTG-1, AAACATACAACCAC-1] = 0.1111111111111111
[AGGCAACTGAAGGC-1, AAACATACAACCAC-1] = 0.08108108108108109
⋮
[TCGGTAGAGTAGGG-1, TTTGCATGCCTCAC-1] = 0.08108108108108109
[TGATAAACTTTCAC-1, TTTGCATGCCTCAC-1] = 0.08108108108108109
[TGATACCTGTTGGT-1, TTTGCATGCCTCAC-1] = 0.08108108108108109
[TGCCAGCTTGGCAT-1, TTTGCATGCCTCAC-1] = 0.14285714285714285
[TGGTTACTGTTCTT-1, TTTGCATGCCTCAC-1] = 0.1111111111111111
[TGTAACCTAGAGGC-1, TTTGCATGCCTCAC-1] = 0.1111111111111111
[TTAGAATGTGGTGT-1, TTTGCATGCCTCAC-1] = 0.08108108108108109
[TTAGCTACAACCGT-1, TTTGCATGCCTCAC-1] = 0.1111111111111111
[TTCGATTGAGCATC-1, TTTGCATGCCTCAC-1] = 0.14285714285714285
[TTTGCATGCCTCAC-1, TTTGCATGCCTCAC-1] = 1.0
Severo.nearest_neighbours
— Functionnearest_neighbours(k::Int64; dims = (:), metric::SemiMetric = Euclidean(), include_self::Bool = true, ntables::Int64 = 2 * size(X, 2)) where T
Partial version of nearestneighbours(X::NamedArray{T, 2}, k::Int64; dims = (:), metric::SemiMetric = Euclidean(), includeself::Bool = true, ntables::Int64 = 2 * size(X, 2)) where T
Severo.jaccard_index
— Functionjaccard_index(k::Int64; prune::Real = 1 / 15, dtype::Type{R} = Float64) where R <: AbstractFloat
Partial version of jaccard_index(nn::(NamedArray{T, 2} where T), k::Int64; prune::Real = 1 / 15, dtype::Type{R} = Float64) where R <: AbstractFloat
Severo.shared_nearest_neighbours
— Functionshared_nearest_neighbours(k::Int64; kw...) where T
Partial version of sharednearestneighbours(X::NamedArray{T, 2}, k::Int64; kw...) where T
Clustering the neighborhood graph
Clustering is performed using a graph-based method called Louvain clustering
which tries to detect communities based on optimizing a modularity function. Since this is a greedy algorithm, which can get stuck in local minima based on random initialization, multiple restarts are required to find a more "global" minima.
lbls = cluster(snn)
2700-element Named Vector{Int64}
cells │
─────────────────┼───
AAACATACAACCAC-1 │ 1
AAACATTGAGCTAC-1 │ 2
AAACATTGATCAGC-1 │ 1
AAACCGTGCTTCCG-1 │ 3
AAACCGTGTATGCG-1 │ 4
AAACGCACTGGTAC-1 │ 1
AAACGCTGACCAGT-1 │ 5
AAACGCTGGTTCTT-1 │ 5
⋮ ⋮
TTTCAGTGTGCAGT-1 │ 2
TTTCCAGAGGTGAG-1 │ 1
TTTCGAACACCTGA-1 │ 10
TTTCGAACTCTCAT-1 │ 8
TTTCTACTGAGGCA-1 │ 2
TTTCTACTTCCTCG-1 │ 2
TTTGCATGAGAGGC-1 │ 2
TTTGCATGCCTCAC-1 │ 7
Severo.cluster
— Functioncluster(; kw...)
Partial version of cluster(SNN::NeighbourGraph; kw...)
Finding differentially expressed features
We can find markers that define clusters via differential expression. It identifies positive and negative markers for a single cluster compared to the other cells. find_markers
performs this process for every cluster. The function returns a DataFrame
containing for each marker: the p-value, the log-foldchange and statistical score.
dx = find_markers(X, lbls; method=:wilcoxon)
dx = filter_rank_markers(dx)
dx[1:10, :]
10 rows × 5 columns
group | score | pval | logfc | feature | |
---|---|---|---|---|---|
Int64 | Float64 | Float64 | Float64 | String3… | |
1 | 1 | 15.9247 | 4.26806e-57 | -1.70564 | HLA-DRB1 |
2 | 1 | 15.0714 | 2.49925e-51 | -2.51564 | HLA-DRA |
3 | 1 | 14.9186 | 2.49458e-50 | -2.20198 | TYROBP |
4 | 1 | 14.8299 | 9.38941e-50 | -1.10674 | HLA-DRB5 |
5 | 1 | 14.6245 | 1.96067e-48 | -1.66733 | FCER1G |
6 | 1 | 14.5823 | 3.64305e-48 | -1.65854 | HLA-DPA1 |
7 | 1 | 13.9825 | 1.99417e-44 | -1.66391 | HLA-DPB1 |
8 | 1 | 12.4352 | 1.68319e-35 | -1.96046 | CD74 |
9 | 1 | 12.3283 | 6.37965e-35 | -0.700372 | HLA-DMA |
10 | 1 | 12.0139 | 3.00308e-33 | -0.845171 | CFD |
Performance can be improved by first filtering the features based on the percentage of cells and the fold-change between the groups.
# filtering and DE on the lognormalized data
sel = prefilter_markers(Y, lbls, logfc_threshold=0.25, min_pct=0.1, log=true, only_pos=true)
dx = find_markers(Y, lbls; method=:t, log=true, selection=sel) # t-test
dx = filter_rank_markers(dx)
dx[1:10, :]
10 rows × 5 columns
group | score | pval | logfc | feature | |
---|---|---|---|---|---|
Int64 | Float64 | Float64 | Float64 | String | |
1 | 1 | 30.3806 | 6.1892e-150 | 1.20543 | IL32 |
2 | 1 | 29.6015 | 1.24438e-146 | 1.27376 | LTB |
3 | 1 | 15.2485 | 2.05504e-45 | 1.20156 | CD2 |
4 | 1 | 13.6709 | 2.71139e-39 | 0.473947 | VIM |
5 | 1 | 13.6335 | 2.68983e-38 | 0.565327 | GIMAP7 |
6 | 1 | 12.7573 | 3.69621e-33 | 1.23104 | AQP3 |
7 | 1 | 12.064 | 5.70538e-31 | 0.613794 | ANXA1 |
8 | 1 | 10.583 | 3.51297e-24 | 0.859993 | TRADD |
9 | 1 | 9.05947 | 1.7556e-18 | 0.751047 | MAL |
10 | 1 | 8.72365 | 2.06908e-17 | 0.744041 | TNFAIP8 |
Severo.find_markers
— Functionfind_markers(X::Union{NamedCountMatrix, NamedDataMatrix}, idents::NamedVector{<:Integer};
method=:wilcoxon, selection::Union{Nothing, NamedArray{Bool, 2}, AbstractArray{Bool,2}}=nothing, log::Bool=false, kw...)
Finds markers (differentially expressed genes) for each of the classes in a dataset.
Arguments:
-`X`: count or data matrix
-`idents`: class identity for each cell
-`method`: Which test to use, supported are: [wilcoxon, t]
-`selection`: a selection of features and groups that should be considered
-`log`: the data is in log-scale (default = false)
-`kw...`: additional parameters passed down to the method
Return values:
A DataFrame
containing a list of putative markers with associated statistics (p-values and scores) and log fold-changes.
Severo.filter_rank_markers
— Functionfilter_rank_markers(de::DataFrame; pval_thresh::Real=1e-2, ngenes::Integer=typemax(Int64))
Filters and ranks a list of markers (differentially expressed genes).
Arguments:
-`de`: list of markers returned by [find_markers](@ref)
-`pval_thresh`: only keep markers with pval < pval_thresh
-`count`: the number of highest-ranked markers to keep
-`rankby_abs`: rank based on absolute value of the scores
Return values:
A DataFrame
containing a ranked list of filtered markers.
Severo.prefilter_markers
— Functionprefilter_markers(X::Union{NamedCountMatrix, NamedDataMatrix}, idents::NamedVector{<:Integer};
logfc_threshold::Real=0.0, min_pct::Real=0.0, min_diff_pct::Real=-Inf, only_pos:Bool=false, log::Bool=false)
Filter features for each of the classes in a dataset.
Arguments:
-`X`: count or data matrix
-`idents`: class identity for each cell
-`logfc_threshold`: Limit testing to features which show, on average, at least X-fold difference (log-scale) between the two groups of cells
-`min_pct`: only test features that are detected in a minimum fraction of `min_pct` cells in either of the two populations
-`min_diff_pct`: only test features that show a minimum difference in the fraction of detection between the two groups.
-`only_pos`: only return features with positive log fold-change
-`log`: the data is in log-scale (default = false)
Return values:
Selection matrix for each feature and class
plot_violin(X, ["CST3", "NKG7", "PPBP", "S100A4"], lbls)
savefig("violin-plot.svg"); nothing # hide