Preprocessing and clustering 3k PBMCs

This showcase reproduces Seurat's Guided Clustering Tutorial

Overview

Loading data

The data consists of 3k PBMCs from a Healthy Donor and is freely available from 10x Genomics.

You can either download the data manually

mkdir pbmc3k
wget http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz -O pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz
cd pbmc3k; tar -xzf pbmc3k_filtered_gene_bc_matrices.tar.gz --strip-components 2

The read_10X function reads the data from the cellranger 10X pipeline and returns a labelled count matrix. Each entry indicated the number of molecules detected for each feature/gene (columns) and each cell (rows).

using Severo
X = read_10X("pbmc3k/")
Warning

The meaning of the rows and columns is different from the representation used by other packages like Seurat

Alternatively, the dataset function can be used to load a dataset from a predefined collection. For example, the PBMC collection is known by Cell.jl and can be easily loaded as follows:

using Severo
X = dataset("PBMC", "3k")

# The matrix can be indexed using names or indices. For instance,
# we can look at specific genes in the first thirty cells
#X[1:30, ["CD3D", "TCL1A", "MS4A1"]]
2700×32738 Named sparse matrix with 2286884 Int32 nonzero entries:
	[AGATTCCTGACGAG-1,          "AL627309.1"]  =  1
	[CCAAAGTGCTACGA-1,          "AL627309.1"]  =  1
	[CCGTACACGTTGGT-1,          "AL627309.1"]  =  1
	[CGACTGCTTCCTCG-1,          "AL627309.1"]  =  1
	[CTATGTTGTCTCGC-1,          "AL627309.1"]  =  1
	[CTGTGAGACGAACT-1,          "AL627309.1"]  =  1
	[GATACTCTATCGGT-1,          "AL627309.1"]  =  1
	[GCTAGATGAGCTCA-1,          "AL627309.1"]  =  1
	[GCTTAACTACTGGT-1,          "AL627309.1"]  =  1
	                                           ⋮
	[TCTCAAACCTAAGC-1,            "SRSF10.1"]  =  1
	[TGAGACACTCAAGC-1,            "SRSF10.1"]  =  1
	[TGATCGGATATGCG-1,            "SRSF10.1"]  =  1
	[TGTAGGTGCTATGG-1,            "SRSF10.1"]  =  1
	[TTATCCGAGAAAGT-1,            "SRSF10.1"]  =  1
	[TTCAGTACCGACTA-1,            "SRSF10.1"]  =  1
	[TTGAATGATCTCAT-1,            "SRSF10.1"]  =  1
	[TTGAGGACTACGCA-1,            "SRSF10.1"]  =  1
	[TTTATCCTGTTGTG-1,            "SRSF10.1"]  =  1
	[TTTCCAGAGGTGAG-1,            "SRSF10.1"]  =  1

The count data is stored in a sparse matrix format, only storing non-zero elements of the matrix. Any values not shown are zero.

Severo.read_dataFunction
read_data(path::AbstractString; kw...)

Tries to identify and read a count matrix in any of the supported formats

Arguments:

  • fname: path
  • kw: additional keyword arguments are passed on

Returns values:

Returns labeled sparse matrix containing the counts

source
Severo.read_10XFunction
read_10X(dirname::AbstractString; unique_features=true)

Read count matrix from 10X genomics

Arguments:

  • dirname: path to directory containing matrix.mtx, genes.tsv (or features.tsv), and barcodes.tsv from 10X
  • unique_features: should feature names be made unique (default: true)

Returns values:

Returns labeled sparse matrix containing the counts

source
Missing docstring.

Missing docstring for dataset. Check Documenter's build log for details.

Preprocessing

using Plots
using StatsPlots
pyplot() # hide
plot_highest_expressed(X)
savefig("he-plot.svg"); nothing # hide

Filtering

Two filtering function are available for filtering cells and features/genes based on basic criteria such as the number of features detected, the number of cells for which a feature is detected and the total number of counts.

X = filter_cells(X, min_features=200)
X = filter_features(X, min_cells=3)
2700×13714 Named sparse matrix with 2282976 Int32 nonzero entries:
	[AGATTCCTGACGAG-1,          "AL627309.1"]  =  1
	[CCAAAGTGCTACGA-1,          "AL627309.1"]  =  1
	[CCGTACACGTTGGT-1,          "AL627309.1"]  =  1
	[CGACTGCTTCCTCG-1,          "AL627309.1"]  =  1
	[CTATGTTGTCTCGC-1,          "AL627309.1"]  =  1
	[CTGTGAGACGAACT-1,          "AL627309.1"]  =  1
	[GATACTCTATCGGT-1,          "AL627309.1"]  =  1
	[GCTAGATGAGCTCA-1,          "AL627309.1"]  =  1
	[GCTTAACTACTGGT-1,          "AL627309.1"]  =  1
	                                           ⋮
	[TCTCAAACCTAAGC-1,            "SRSF10.1"]  =  1
	[TGAGACACTCAAGC-1,            "SRSF10.1"]  =  1
	[TGATCGGATATGCG-1,            "SRSF10.1"]  =  1
	[TGTAGGTGCTATGG-1,            "SRSF10.1"]  =  1
	[TTATCCGAGAAAGT-1,            "SRSF10.1"]  =  1
	[TTCAGTACCGACTA-1,            "SRSF10.1"]  =  1
	[TTGAATGATCTCAT-1,            "SRSF10.1"]  =  1
	[TTGAGGACTACGCA-1,            "SRSF10.1"]  =  1
	[TTTATCCTGTTGTG-1,            "SRSF10.1"]  =  1
	[TTTCCAGAGGTGAG-1,            "SRSF10.1"]  =  1

This filters out cells with less than 200 features/genes and features detected in less than 3 cells. For convenience, filter_counts combines these two into a single function, similar to Seurat. Beware that the order in which you call these functions can matter.

Severo.filter_featuresFunction
filter_features(; min_cells = 0)

Partial version of filterfeatures(A::NamedCountMatrix; mincells = 0)

source
Severo.filter_cellsFunction
filter_cells(; min_features = 0, min_feature_count = 0, min_umi = 0)

Partial version of filtercells(A::NamedCountMatrix; minfeatures = 0, minfeaturecount = 0, min_umi = 0)

source
Severo.filter_countsFunction
filter_counts(; min_cells = 0, min_features = 0, min_feature_count = 0, min_umi = 0)

Partial version of filtercounts(A::NamedCountMatrix; mincells = 0, minfeatures = 0, minfeaturecount = 0, minumi = 0)

source

Normalization

After removing unwanted cells from the dataset, the next step is to normalize the data. The basic method normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor and optionally log-transforming the result.

Y = normalize_cells(X, method=:lognormalize, scale_factor=1e4)
2700×13714 Named sparse matrix with 2282976 Float64 nonzero entries:
	[AGATTCCTGACGAG-1,          "AL627309.1"]  =  1.538222672557476
	[CCAAAGTGCTACGA-1,          "AL627309.1"]  =  1.3631966075373008
	[CCGTACACGTTGGT-1,          "AL627309.1"]  =  1.3945254817849748
	[CGACTGCTTCCTCG-1,          "AL627309.1"]  =  1.787190785781838
	[CTATGTTGTCTCGC-1,          "AL627309.1"]  =  1.3514100666239552
	[CTGTGAGACGAACT-1,          "AL627309.1"]  =  1.6848688061171004
	[GATACTCTATCGGT-1,          "AL627309.1"]  =  1.6623488639622728
	[GCTAGATGAGCTCA-1,          "AL627309.1"]  =  1.6749250073944073
	[GCTTAACTACTGGT-1,          "AL627309.1"]  =  1.8821985210730219
	                                           ⋮
	[TCTCAAACCTAAGC-1,            "SRSF10.1"]  =  2.050332029149289
	[TGAGACACTCAAGC-1,            "SRSF10.1"]  =  1.873284442922057
	[TGATCGGATATGCG-1,            "SRSF10.1"]  =  2.021955420152209
	[TGTAGGTGCTATGG-1,            "SRSF10.1"]  =  1.2128182558075358
	[TTATCCGAGAAAGT-1,            "SRSF10.1"]  =  1.2133165770732028
	[TTCAGTACCGACTA-1,            "SRSF10.1"]  =  1.529100016033425
	[TTGAATGATCTCAT-1,            "SRSF10.1"]  =  1.7863632500815836
	[TTGAGGACTACGCA-1,            "SRSF10.1"]  =  0.9465443048463601
	[TTTATCCTGTTGTG-1,            "SRSF10.1"]  =  1.31322083398467
	[TTTCCAGAGGTGAG-1,            "SRSF10.1"]  =  1.7178390681467346
Severo.normalize_cellsFunction
normalize_cells(; method = :lognormalize, scale_factor::Real = 1.0, dtype::Type{T} = Float64) where T <: AbstractFloat

Partial version of normalizecells(X::NamedCountMatrix; method = :lognormalize, scalefactor::Real = 1.0, dtype::Type{T} = Float64) where T <: AbstractFloat

source

Identifying highly variable features

Next, we select a subset of features that exhibit high cell-to-cell variation in the dataset (i.e, they are highly expressed in some cells, and lowly expressed in others).

hvf = find_variable_features(X, 2000; method=:vst)
2000-element Named SubArray{Int64, 1, Vector{Int64}, Tuple{UnitRange{Int64}}, true}
features              │ 
──────────────────────┼──────
"PPBP"                │  3316
"LYZ"                 │  8682
"S100A9"              │   908
"IGLL5"               │ 13275
"GNLY"                │  1804
"FTL"                 │ 12944
"PF4"                 │  3315
"FTH1"                │  7755
⋮                           ⋮
"EIF3D"               │ 13383
"POLR2I"              │ 12681
"ZCCHC10"             │  3924
"SEC61B"              │  6669
"RP11-164H13.1"       │  9585
"NELL2"               │  8455
"NF1"                 │ 11047
"GTF2H2"              │  3747
Severo.find_variable_featuresFunction
find_variable_features(nfeatures = 2000; method = :vst, dtype::Type{<:AbstractFloat} = Float64, kw...)

Partial version of findvariablefeatures(counts::NamedCountMatrix, nfeatures = 2000; method = :vst, dtype::Type{<:AbstractFloat} = Float64, kw...)

source

Dimensionality reduction

Scaling

Prior to dimensional reduction techniques like PCA, it's a good idea to scale_features the data. Scaling performs two basic operations:

  • Shifts the expression of each gene, so that the mean expression across cells is 0
  • Scales the expression of each gene, so that the standard deviation across cells is 1. This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate
  • Clips values exceeding standard deviation of scale_max
Y = Y[:,hvf] # only use highly-variable features
S = scale_features(Y; scale_max=10)
CenteredMatrix:
  A = 2700×2000 Named sparse matrix with 515841 Float64 nonzero entries:
	[AAACCGTGCTTCCG-1,                "PPBP"]  =  3.05759
	[AAATCAACCCTATT-1,                "PPBP"]  =  1.983
	[AAATCCCTGCTATG-1,                "PPBP"]  =  5.03146
	[AACCGCCTCTACGA-1,                "PPBP"]  =  2.44748
	[AAGCCTGACATGCA-1,                "PPBP"]  =  3.68051
	[AATCCTTGACGGGA-1,                "PPBP"]  =  3.12117
	[ACCCACTGGTTCAG-1,                "PPBP"]  =  10.143
	[ACCTGAGATATCGG-1,                "PPBP"]  =  9.63048
	[ACCTGGCTGTCTTT-1,                "PPBP"]  =  7.59301
	                                           ⋮
	[TGAGTGACTGAGCT-1,              "GTF2H2"]  =  7.15279
	[TGCGATGACCTCGT-1,              "GTF2H2"]  =  7.51358
	[TGGGTATGAAGAGT-1,              "GTF2H2"]  =  6.33385
	[TGTGAGACTGTCAG-1,              "GTF2H2"]  =  6.86631
	[TTAGGGACGCGAAG-1,              "GTF2H2"]  =  7.54787
	[TTAGTCTGAAAGCA-1,              "GTF2H2"]  =  6.76249
	[TTCCAAACTCCCAC-1,              "GTF2H2"]  =  10.1315
	[TTCGTATGTCCTTA-1,              "GTF2H2"]  =  5.26091
	[TTGGTACTGAATCC-1,              "GTF2H2"]  =  8.17128
	[TTTAGCTGTACTCT-1,              "GTF2H2"]  =  4.47778
  mu = 2000-element Named Vector{Float64}
features              │ 
──────────────────────┼──────────
"PPBP"                │  0.142967
"LYZ"                 │  0.972527
"S100A9"              │  0.645189
"IGLL5"               │   0.18421
"GNLY"                │  0.407293
"FTL"                 │   2.94261
"PF4"                 │  0.110369
"FTH1"                │   3.22873
⋮                               ⋮
"EIF3D"               │  0.775623
"POLR2I"              │  0.396716
"ZCCHC10"             │  0.211964
"SEC61B"              │  0.656028
"RP11-164H13.1"       │ 0.0889054
"NELL2"               │   0.25572
"NF1"                 │  0.130034
"GTF2H2"              │  0.131466
Severo.scale_featuresFunction
scale_features(; scale_max::Real = Inf, dtype::Type{<:AbstractFloat} = Float64)

Partial version of scalefeatures(X::(NamedArray{T, 2, SparseMatrixCSC{T, Int64}} where T); scalemax::Real = Inf, dtype::Type{<:AbstractFloat} = Float64)

source

Principal component analysis

Reduce the dimensionality of the data by running principal component analysis (PCA), which reveals the main axes of variation and de-noises the data.

em = embedding(S, 15, method=:pca)
PC-1
Positive: WeakRefStrings.String31["CST3", "TYROBP", "LST1", "AIF1", "FTL", "FCN1", "LYZ", "FTH1", "FCER1G", "S100A9"]
Negative: WeakRefStrings.String31["LTB", "IL32", "CD2", "ACAP1", "STK17A", "CTSW", "CD247", "GIMAP5", "AQP3", "CCL5"]
PC-2
Positive: WeakRefStrings.String31["CD79A", "MS4A1", "TCL1A", "HLA-DRA", "HLA-DQA1", "HLA-DQB1", "LINC00926", "CD79B", "HLA-DRB1", "CD74"]
Negative: WeakRefStrings.String31["NKG7", "PRF1", "CST7", "GZMA", "GZMB", "FGFBP2", "CTSW", "GNLY", "GZMH", "SPON2"]
PC-3
Positive: WeakRefStrings.String31["HLA-DQA1", "CD79A", "HLA-DQB1", "CD79B", "HLA-DPB1", "CD74", "HLA-DPA1", "MS4A1", "HLA-DRB1", "HLA-DRB5"]
Negative: WeakRefStrings.String31["PPBP", "PF4", "SDPR", "SPARC", "GNG11", "NRGN", "GP9", "RGS18", "TUBB1", "HIST1H2AC"]
PC-4
Positive: WeakRefStrings.String31["HLA-DQA1", "CD79A", "CD79B", "HLA-DQB1", "MS4A1", "CD74", "HLA-DPB1", "HIST1H2AC", "PF4", "HLA-DQA2"]
Negative: WeakRefStrings.String31["RPS2", "VIM", "S100A6", "S100A8", "S100A4", "IL32", "S100A9", "TMSB10", "GIMAP7", "S100A10"]
PC-5
Positive: WeakRefStrings.String31["GZMB", "FGFBP2", "NKG7", "GNLY", "PRF1", "CCL4", "CST7", "SPON2", "GZMA", "GZMH"]
Negative: WeakRefStrings.String31["LTB", "VIM", "AQP3", "PPA1", "MAL", "RPS2", "KIAA0101", "CD2", "CORO1B", "CYTIP"]
plot_elbow(em)
savefig("elbow-plot.svg"); nothing # hide

Severo.embeddingFunction
embedding(ncomponents::Int64 = 50; method = :pca, kw...)

Partial version of embedding(X, ncomponents::Int64 = 50; method = :pca, kw...)

source

Clustering the cells

Briefly, the clustering embed cells in a graph structure with edges between cells with similar feature expression patterns, and then attempt to partition this graph into highly interconnected 'quasi-cliques' or 'communities'.

Computing the neighborhood graph

First, we need to find the k-nearest neighbors of all the cells based on the euclidean distance in PCA space, and then compute a graph based on the shared overlap in the local neighborhoods between any two cells (Jaccard similarity). These step are performed using the nearest_neighbours and jaccard_index functions. For convenience, the two steps are also combined in the shared_nearest_neighours function.

nn = nearest_neighbours(em, 20, dims=1:10)
snn = jaccard_index(nn, prune=1/15)
# or as a single step
#snn = nearest_neighbours(em, 20, dims=1:10, prune=1/15)
2700×2700 Named sparse matrix with 198466 Float64 nonzero entries:
	[AAACATACAACCAC-1, AAACATACAACCAC-1]  =  1.0
	[AACTCTTGCAGGAG-1, AAACATACAACCAC-1]  =  0.1111111111111111
	[AATCTAGAATCGGT-1, AAACATACAACCAC-1]  =  0.08108108108108109
	[ACCACGCTGCTGTA-1, AAACATACAACCAC-1]  =  0.17647058823529413
	[ACGCCGGAAATGCC-1, AAACATACAACCAC-1]  =  0.17647058823529413
	[AGAGATGATTGTGG-1, AAACATACAACCAC-1]  =  0.08108108108108109
	[AGCGGGCTTGCCAA-1, AAACATACAACCAC-1]  =  0.1111111111111111
	[AGGAGTCTGGTTTG-1, AAACATACAACCAC-1]  =  0.1111111111111111
	[AGGCAACTGAAGGC-1, AAACATACAACCAC-1]  =  0.08108108108108109
	                                      ⋮
	[TCGGTAGAGTAGGG-1, TTTGCATGCCTCAC-1]  =  0.08108108108108109
	[TGATAAACTTTCAC-1, TTTGCATGCCTCAC-1]  =  0.08108108108108109
	[TGATACCTGTTGGT-1, TTTGCATGCCTCAC-1]  =  0.08108108108108109
	[TGCCAGCTTGGCAT-1, TTTGCATGCCTCAC-1]  =  0.14285714285714285
	[TGGTTACTGTTCTT-1, TTTGCATGCCTCAC-1]  =  0.1111111111111111
	[TGTAACCTAGAGGC-1, TTTGCATGCCTCAC-1]  =  0.1111111111111111
	[TTAGAATGTGGTGT-1, TTTGCATGCCTCAC-1]  =  0.08108108108108109
	[TTAGCTACAACCGT-1, TTTGCATGCCTCAC-1]  =  0.1111111111111111
	[TTCGATTGAGCATC-1, TTTGCATGCCTCAC-1]  =  0.14285714285714285
	[TTTGCATGCCTCAC-1, TTTGCATGCCTCAC-1]  =  1.0
Severo.nearest_neighboursFunction
nearest_neighbours(k::Int64; dims = (:), metric::SemiMetric = Euclidean(), include_self::Bool = true, ntables::Int64 = 2 * size(X, 2)) where T

Partial version of nearestneighbours(X::NamedArray{T, 2}, k::Int64; dims = (:), metric::SemiMetric = Euclidean(), includeself::Bool = true, ntables::Int64 = 2 * size(X, 2)) where T

source
Severo.jaccard_indexFunction
jaccard_index(k::Int64; prune::Real = 1 / 15, dtype::Type{R} = Float64) where R <: AbstractFloat

Partial version of jaccard_index(nn::(NamedArray{T, 2} where T), k::Int64; prune::Real = 1 / 15, dtype::Type{R} = Float64) where R <: AbstractFloat

source

Clustering the neighborhood graph

Clustering is performed using a graph-based method called Louvain clustering which tries to detect communities based on optimizing a modularity function. Since this is a greedy algorithm, which can get stuck in local minima based on random initialization, multiple restarts are required to find a more "global" minima.

lbls = cluster(snn)
2700-element Named Vector{Int64}
cells            │ 
─────────────────┼───
AAACATACAACCAC-1 │  1
AAACATTGAGCTAC-1 │  2
AAACATTGATCAGC-1 │  1
AAACCGTGCTTCCG-1 │  3
AAACCGTGTATGCG-1 │  4
AAACGCACTGGTAC-1 │  1
AAACGCTGACCAGT-1 │  5
AAACGCTGGTTCTT-1 │  5
⋮                   ⋮
TTTCAGTGTGCAGT-1 │  2
TTTCCAGAGGTGAG-1 │  1
TTTCGAACACCTGA-1 │ 10
TTTCGAACTCTCAT-1 │  8
TTTCTACTGAGGCA-1 │  2
TTTCTACTTCCTCG-1 │  2
TTTGCATGAGAGGC-1 │  2
TTTGCATGCCTCAC-1 │  7
Severo.clusterFunction
cluster(; kw...)

Partial version of cluster(SNN::NeighbourGraph; kw...)

source

Finding differentially expressed features

We can find markers that define clusters via differential expression. It identifies positive and negative markers for a single cluster compared to the other cells. find_markers performs this process for every cluster. The function returns a DataFrame containing for each marker: the p-value, the log-foldchange and statistical score.

dx = find_markers(X, lbls; method=:wilcoxon)
dx = filter_rank_markers(dx)
dx[1:10, :]

10 rows × 5 columns

groupscorepvallogfcfeature
Int64Float64Float64Float64String3…
1115.92474.26806e-57-1.70564HLA-DRB1
2115.07142.49925e-51-2.51564HLA-DRA
3114.91862.49458e-50-2.20198TYROBP
4114.82999.38941e-50-1.10674HLA-DRB5
5114.62451.96067e-48-1.66733FCER1G
6114.58233.64305e-48-1.65854HLA-DPA1
7113.98251.99417e-44-1.66391HLA-DPB1
8112.43521.68319e-35-1.96046CD74
9112.32836.37965e-35-0.700372HLA-DMA
10112.01393.00308e-33-0.845171CFD

Performance can be improved by first filtering the features based on the percentage of cells and the fold-change between the groups.

# filtering and DE on the lognormalized data
sel = prefilter_markers(Y, lbls, logfc_threshold=0.25, min_pct=0.1, log=true, only_pos=true)
dx = find_markers(Y, lbls; method=:t, log=true, selection=sel) # t-test
dx = filter_rank_markers(dx)
dx[1:10, :]

10 rows × 5 columns

groupscorepvallogfcfeature
Int64Float64Float64Float64String
1130.38066.1892e-1501.20543IL32
2129.60151.24438e-1461.27376LTB
3115.24852.05504e-451.20156CD2
4113.67092.71139e-390.473947VIM
5113.63352.68983e-380.565327GIMAP7
6112.75733.69621e-331.23104AQP3
7112.0645.70538e-310.613794ANXA1
8110.5833.51297e-240.859993TRADD
919.059471.7556e-180.751047MAL
1018.723652.06908e-170.744041TNFAIP8
Severo.find_markersFunction
find_markers(X::Union{NamedCountMatrix, NamedDataMatrix}, idents::NamedVector{<:Integer};
    method=:wilcoxon, selection::Union{Nothing, NamedArray{Bool, 2}, AbstractArray{Bool,2}}=nothing, log::Bool=false, kw...)

Finds markers (differentially expressed genes) for each of the classes in a dataset.

Arguments:

-`X`: count or data matrix
-`idents`: class identity for each cell
-`method`: Which test to use, supported are: [wilcoxon, t]
-`selection`: a selection of features and groups that should be considered
-`log`: the data is in log-scale (default = false)
-`kw...`: additional parameters passed down to the method

Return values:

A DataFrame containing a list of putative markers with associated statistics (p-values and scores) and log fold-changes.

source
Severo.filter_rank_markersFunction
filter_rank_markers(de::DataFrame; pval_thresh::Real=1e-2, ngenes::Integer=typemax(Int64))

Filters and ranks a list of markers (differentially expressed genes).

Arguments:

-`de`: list of markers returned by [find_markers](@ref)
-`pval_thresh`: only keep markers with pval < pval_thresh
-`count`: the number of highest-ranked markers to keep
-`rankby_abs`: rank based on absolute value of the scores

Return values:

A DataFrame containing a ranked list of filtered markers.

source
Severo.prefilter_markersFunction
prefilter_markers(X::Union{NamedCountMatrix, NamedDataMatrix}, idents::NamedVector{<:Integer};
        logfc_threshold::Real=0.0, min_pct::Real=0.0, min_diff_pct::Real=-Inf, only_pos:Bool=false, log::Bool=false)

Filter features for each of the classes in a dataset.

Arguments:

-`X`: count or data matrix
-`idents`: class identity for each cell
-`logfc_threshold`: Limit testing to features which show, on average, at least X-fold difference (log-scale) between the two groups of cells
-`min_pct`: only test features that are detected in a minimum fraction of `min_pct` cells in either of the two populations
-`min_diff_pct`: only test features that show a minimum difference in the fraction of detection between the two groups.
-`only_pos`: only return features with positive log fold-change
-`log`: the data is in log-scale (default = false)

Return values:

Selection matrix for each feature and class

source
plot_violin(X, ["CST3", "NKG7", "PPBP", "S100A4"], lbls)
savefig("violin-plot.svg"); nothing # hide