Preprocessing and clustering 3k PBMCs

This showcase reproduces Seurat's Guided Clustering Tutorial

Overview

Loading data

The data consists of 3k PBMCs from a Healthy Donor and is freely available from 10x Genomics.

You can either download the data manually

mkdir pbmc3k
wget http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz -O pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz
cd pbmc3k; tar -xzf pbmc3k_filtered_gene_bc_matrices.tar.gz --strip-components 2

The read_10X function reads the data from the cellranger 10X pipeline and returns a labelled count matrix. Each entry indicated the number of molecules detected for each feature/gene (columns) and each cell (rows).

using Severo
X = read_10X("pbmc3k/")
Warning

The meaning of the rows and columns is different from the representation used by other packages like Seurat

Alternatively, the dataset function can be used to load a dataset from a predefined collection. For example, the PBMC collection is known by Cell.jl and can be easily loaded as follows:

using Severo
X = dataset("PBMC", "3k")

# The matrix can be indexed using names or indices. For instance,
# we can look at specific genes in the first thirty cells
#X[1:30, ["CD3D", "TCL1A", "MS4A1"]]
2700×32738 Named sparse matrix with 2286884 Int32 nonzero entries:
	[AGATTCCTGACGAG-1,          "AL627309.1"]  =  1
	[CCAAAGTGCTACGA-1,          "AL627309.1"]  =  1
	[CCGTACACGTTGGT-1,          "AL627309.1"]  =  1
	[CGACTGCTTCCTCG-1,          "AL627309.1"]  =  1
	[CTATGTTGTCTCGC-1,          "AL627309.1"]  =  1
	[CTGTGAGACGAACT-1,          "AL627309.1"]  =  1
	[GATACTCTATCGGT-1,          "AL627309.1"]  =  1
	[GCTAGATGAGCTCA-1,          "AL627309.1"]  =  1
	[GCTTAACTACTGGT-1,          "AL627309.1"]  =  1
	                                           ⋮
	[TCTCAAACCTAAGC-1,            "SRSF10.1"]  =  1
	[TGAGACACTCAAGC-1,            "SRSF10.1"]  =  1
	[TGATCGGATATGCG-1,            "SRSF10.1"]  =  1
	[TGTAGGTGCTATGG-1,            "SRSF10.1"]  =  1
	[TTATCCGAGAAAGT-1,            "SRSF10.1"]  =  1
	[TTCAGTACCGACTA-1,            "SRSF10.1"]  =  1
	[TTGAATGATCTCAT-1,            "SRSF10.1"]  =  1
	[TTGAGGACTACGCA-1,            "SRSF10.1"]  =  1
	[TTTATCCTGTTGTG-1,            "SRSF10.1"]  =  1
	[TTTCCAGAGGTGAG-1,            "SRSF10.1"]  =  1

The count data is stored in a sparse matrix format, only storing non-zero elements of the matrix. Any values not shown are zero.

Severo.read_dataFunction
read_data(path::AbstractString; kw...)

Tries to identify and read a count matrix in any of the supported formats

Arguments:

  • fname: path
  • kw: additional keyword arguments are passed on

Returns values:

Returns labeled sparse matrix containing the counts

source
Severo.read_10XFunction
read_10X(dirname::AbstractString; unique_features=true)

Read count matrix from 10X genomics

Arguments:

  • dirname: path to directory containing matrix.mtx, genes.tsv (or features.tsv), and barcodes.tsv from 10X
  • unique_features: should feature names be made unique (default: true)

Returns values:

Returns labeled sparse matrix containing the counts

source
Missing docstring.

Missing docstring for dataset. Check Documenter's build log for details.

Preprocessing

using Plots
using StatsPlots
plot_highest_expressed(X)
┌ Warning: Package Severo does not have StatsPlots in its dependencies:
│ - If you have Severo checked out for development and have
│   added StatsPlots as a dependency but haven't updated your primary
│   environment's manifest file, try `Pkg.resolve()`.
│ - Otherwise you may need to report an issue with Severo
└ Loading StatsPlots into Severo from project dependency, future warnings for Severo are suppressed.
[ Info: Installing matplotlib via the Conda matplotlib package...
[ Info: Running `conda install -q -y matplotlib` in root environment
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: /home/runner/.julia/conda/3

  added / updated specs:
    - matplotlib


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    alsa-lib-1.2.3             |       h516909a_0         560 KB  conda-forge
    brotli-1.0.9               |       h7f98852_6          18 KB  conda-forge
    brotli-bin-1.0.9           |       h7f98852_6          19 KB  conda-forge
    cycler-0.11.0              |     pyhd8ed1ab_0          10 KB  conda-forge
    dbus-1.13.6                |       h5008d03_3         604 KB  conda-forge
    expat-2.4.4                |       h9c3ff4c_0         183 KB  conda-forge
    font-ttf-dejavu-sans-mono-2.37|       hab24e00_0         388 KB  conda-forge
    font-ttf-inconsolata-3.000 |       h77eed37_0          94 KB  conda-forge
    font-ttf-source-code-pro-2.038|       h77eed37_0         684 KB  conda-forge
    font-ttf-ubuntu-0.83       |       hab24e00_0         1.9 MB  conda-forge
    fontconfig-2.13.94         |       ha180cfb_0         323 KB  conda-forge
    fonts-conda-ecosystem-1    |                0           4 KB  conda-forge
    fonts-conda-forge-1        |                0           4 KB  conda-forge
    fonttools-4.29.1           |   py39h3811e60_0         1.6 MB  conda-forge
    freetype-2.10.4            |       h0708190_1         890 KB  conda-forge
    gettext-0.19.8.1           |    h73d1719_1008         3.6 MB  conda-forge
    gst-plugins-base-1.18.5    |       hf529b03_3         2.6 MB  conda-forge
    gstreamer-1.18.5           |       h9f60fe5_3         2.0 MB  conda-forge
    icu-69.1                   |       h9c3ff4c_0        13.2 MB  conda-forge
    jbig-2.1                   |    h7f98852_2003          43 KB  conda-forge
    jpeg-9e                    |       h7f98852_0         269 KB  conda-forge
    kiwisolver-1.3.2           |   py39h1a9c180_1          80 KB  conda-forge
    krb5-1.19.2                |       hcc1bbae_3         1.4 MB  conda-forge
    lcms2-2.12                 |       hddcbb42_0         443 KB  conda-forge
    lerc-3.0                   |       h9c3ff4c_0         216 KB  conda-forge
    libbrotlicommon-1.0.9      |       h7f98852_6          65 KB  conda-forge
    libbrotlidec-1.0.9         |       h7f98852_6          33 KB  conda-forge
    libbrotlienc-1.0.9         |       h7f98852_6         286 KB  conda-forge
    libclang-13.0.0            |default_hc23dcda_0        11.1 MB  conda-forge
    libdeflate-1.8             |       h7f98852_0          67 KB  conda-forge
    libedit-3.1.20191231       |       he28a2e2_2         121 KB  conda-forge
    libevent-2.1.10            |       h9b69904_4         1.1 MB  conda-forge
    libglib-2.70.2             |       h174f98d_1         3.1 MB  conda-forge
    libiconv-1.16              |       h516909a_0         1.4 MB  conda-forge
    libllvm13-13.0.0           |       hf817b99_0        33.6 MB  conda-forge
    libogg-1.3.4               |       h7f98852_1         206 KB  conda-forge
    libopus-1.3.1              |       h7f98852_1         255 KB  conda-forge
    libpng-1.6.37              |       h21135ba_2         306 KB  conda-forge
    libpq-14.1                 |       hd57d9b9_1         3.0 MB  conda-forge
    libtiff-4.3.0              |       h6f004c6_2         614 KB  conda-forge
    libuuid-2.32.1             |    h7f98852_1000          28 KB  conda-forge
    libvorbis-1.3.7            |       h9c3ff4c_0         280 KB  conda-forge
    libwebp-base-1.2.2         |       h7f98852_1         824 KB  conda-forge
    libxcb-1.13                |    h7f98852_1004         391 KB  conda-forge
    libxkbcommon-1.0.3         |       he3ba5ed_0         581 KB  conda-forge
    libxml2-2.9.12             |       h885dcf4_1         760 KB  conda-forge
    lz4-c-1.9.3                |       h9c3ff4c_1         179 KB  conda-forge
    matplotlib-3.5.1           |   py39hf3d152e_0           6 KB  conda-forge
    matplotlib-base-3.5.1      |   py39h2fa2bec_0         7.4 MB  conda-forge
    munkres-1.1.4              |     pyh9f0ad1d_0          12 KB  conda-forge
    mysql-common-8.0.28        |       ha770c72_0         1.8 MB  conda-forge
    mysql-libs-8.0.28          |       hfa10184_0         1.9 MB  conda-forge
    nspr-4.32                  |       h9c3ff4c_1         233 KB  conda-forge
    nss-3.74                   |       hb5efdd6_0         2.1 MB  conda-forge
    olefile-0.46               |     pyh9f0ad1d_1          32 KB  conda-forge
    openjpeg-2.4.0             |       hb52868f_1         444 KB  conda-forge
    packaging-21.3             |     pyhd8ed1ab_0          36 KB  conda-forge
    pcre-8.45                  |       h9c3ff4c_0         253 KB  conda-forge
    pillow-8.4.0               |   py39ha612740_0         706 KB  conda-forge
    pthread-stubs-0.4          |    h36c2ea0_1001           5 KB  conda-forge
    pyparsing-3.0.7            |     pyhd8ed1ab_0          79 KB  conda-forge
    pyqt-5.12.3                |   py39hf3d152e_8          22 KB  conda-forge
    pyqt-impl-5.12.3           |   py39hde8b62d_8         5.9 MB  conda-forge
    pyqt5-sip-4.19.18          |   py39he80948d_8         311 KB  conda-forge
    pyqtchart-5.12             |   py39h0fcd23e_8         257 KB  conda-forge
    pyqtwebengine-5.12.1       |   py39h0fcd23e_8         174 KB  conda-forge
    python-dateutil-2.8.2      |     pyhd8ed1ab_0         240 KB  conda-forge
    qt-5.12.9                  |       ha98a1a1_5        98.8 MB  conda-forge
    tornado-6.1                |   py39h3811e60_2         646 KB  conda-forge
    unicodedata2-14.0.0        |   py39h3811e60_0         545 KB  conda-forge
    xorg-libxau-1.0.9          |       h7f98852_0          13 KB  conda-forge
    xorg-libxdmcp-1.1.3        |       h7f98852_0          19 KB  conda-forge
    zstd-1.5.2                 |       ha95c52a_0         458 KB  conda-forge
    ------------------------------------------------------------
                                           Total:       211.3 MB

The following NEW packages will be INSTALLED:

  alsa-lib           conda-forge/linux-64::alsa-lib-1.2.3-h516909a_0
  brotli             conda-forge/linux-64::brotli-1.0.9-h7f98852_6
  brotli-bin         conda-forge/linux-64::brotli-bin-1.0.9-h7f98852_6
  cycler             conda-forge/noarch::cycler-0.11.0-pyhd8ed1ab_0
  dbus               conda-forge/linux-64::dbus-1.13.6-h5008d03_3
  expat              conda-forge/linux-64::expat-2.4.4-h9c3ff4c_0
  font-ttf-dejavu-s~ conda-forge/noarch::font-ttf-dejavu-sans-mono-2.37-hab24e00_0
  font-ttf-inconsol~ conda-forge/noarch::font-ttf-inconsolata-3.000-h77eed37_0
  font-ttf-source-c~ conda-forge/noarch::font-ttf-source-code-pro-2.038-h77eed37_0
  font-ttf-ubuntu    conda-forge/noarch::font-ttf-ubuntu-0.83-hab24e00_0
  fontconfig         conda-forge/linux-64::fontconfig-2.13.94-ha180cfb_0
  fonts-conda-ecosy~ conda-forge/noarch::fonts-conda-ecosystem-1-0
  fonts-conda-forge  conda-forge/noarch::fonts-conda-forge-1-0
  fonttools          conda-forge/linux-64::fonttools-4.29.1-py39h3811e60_0
  freetype           conda-forge/linux-64::freetype-2.10.4-h0708190_1
  gettext            conda-forge/linux-64::gettext-0.19.8.1-h73d1719_1008
  gst-plugins-base   conda-forge/linux-64::gst-plugins-base-1.18.5-hf529b03_3
  gstreamer          conda-forge/linux-64::gstreamer-1.18.5-h9f60fe5_3
  icu                conda-forge/linux-64::icu-69.1-h9c3ff4c_0
  jbig               conda-forge/linux-64::jbig-2.1-h7f98852_2003
  jpeg               conda-forge/linux-64::jpeg-9e-h7f98852_0
  kiwisolver         conda-forge/linux-64::kiwisolver-1.3.2-py39h1a9c180_1
  krb5               conda-forge/linux-64::krb5-1.19.2-hcc1bbae_3
  lcms2              conda-forge/linux-64::lcms2-2.12-hddcbb42_0
  lerc               conda-forge/linux-64::lerc-3.0-h9c3ff4c_0
  libbrotlicommon    conda-forge/linux-64::libbrotlicommon-1.0.9-h7f98852_6
  libbrotlidec       conda-forge/linux-64::libbrotlidec-1.0.9-h7f98852_6
  libbrotlienc       conda-forge/linux-64::libbrotlienc-1.0.9-h7f98852_6
  libclang           conda-forge/linux-64::libclang-13.0.0-default_hc23dcda_0
  libdeflate         conda-forge/linux-64::libdeflate-1.8-h7f98852_0
  libedit            conda-forge/linux-64::libedit-3.1.20191231-he28a2e2_2
  libevent           conda-forge/linux-64::libevent-2.1.10-h9b69904_4
  libglib            conda-forge/linux-64::libglib-2.70.2-h174f98d_1
  libiconv           conda-forge/linux-64::libiconv-1.16-h516909a_0
  libllvm13          conda-forge/linux-64::libllvm13-13.0.0-hf817b99_0
  libogg             conda-forge/linux-64::libogg-1.3.4-h7f98852_1
  libopus            conda-forge/linux-64::libopus-1.3.1-h7f98852_1
  libpng             conda-forge/linux-64::libpng-1.6.37-h21135ba_2
  libpq              conda-forge/linux-64::libpq-14.1-hd57d9b9_1
  libtiff            conda-forge/linux-64::libtiff-4.3.0-h6f004c6_2
  libuuid            conda-forge/linux-64::libuuid-2.32.1-h7f98852_1000
  libvorbis          conda-forge/linux-64::libvorbis-1.3.7-h9c3ff4c_0
  libwebp-base       conda-forge/linux-64::libwebp-base-1.2.2-h7f98852_1
  libxcb             conda-forge/linux-64::libxcb-1.13-h7f98852_1004
  libxkbcommon       conda-forge/linux-64::libxkbcommon-1.0.3-he3ba5ed_0
  libxml2            conda-forge/linux-64::libxml2-2.9.12-h885dcf4_1
  lz4-c              conda-forge/linux-64::lz4-c-1.9.3-h9c3ff4c_1
  matplotlib         conda-forge/linux-64::matplotlib-3.5.1-py39hf3d152e_0
  matplotlib-base    conda-forge/linux-64::matplotlib-base-3.5.1-py39h2fa2bec_0
  munkres            conda-forge/noarch::munkres-1.1.4-pyh9f0ad1d_0
  mysql-common       conda-forge/linux-64::mysql-common-8.0.28-ha770c72_0
  mysql-libs         conda-forge/linux-64::mysql-libs-8.0.28-hfa10184_0
  nspr               conda-forge/linux-64::nspr-4.32-h9c3ff4c_1
  nss                conda-forge/linux-64::nss-3.74-hb5efdd6_0
  olefile            conda-forge/noarch::olefile-0.46-pyh9f0ad1d_1
  openjpeg           conda-forge/linux-64::openjpeg-2.4.0-hb52868f_1
  packaging          conda-forge/noarch::packaging-21.3-pyhd8ed1ab_0
  pcre               conda-forge/linux-64::pcre-8.45-h9c3ff4c_0
  pillow             conda-forge/linux-64::pillow-8.4.0-py39ha612740_0
  pthread-stubs      conda-forge/linux-64::pthread-stubs-0.4-h36c2ea0_1001
  pyparsing          conda-forge/noarch::pyparsing-3.0.7-pyhd8ed1ab_0
  pyqt               conda-forge/linux-64::pyqt-5.12.3-py39hf3d152e_8
  pyqt-impl          conda-forge/linux-64::pyqt-impl-5.12.3-py39hde8b62d_8
  pyqt5-sip          conda-forge/linux-64::pyqt5-sip-4.19.18-py39he80948d_8
  pyqtchart          conda-forge/linux-64::pyqtchart-5.12-py39h0fcd23e_8
  pyqtwebengine      conda-forge/linux-64::pyqtwebengine-5.12.1-py39h0fcd23e_8
  python-dateutil    conda-forge/noarch::python-dateutil-2.8.2-pyhd8ed1ab_0
  qt                 conda-forge/linux-64::qt-5.12.9-ha98a1a1_5
  tornado            conda-forge/linux-64::tornado-6.1-py39h3811e60_2
  unicodedata2       conda-forge/linux-64::unicodedata2-14.0.0-py39h3811e60_0
  xorg-libxau        conda-forge/linux-64::xorg-libxau-1.0.9-h7f98852_0
  xorg-libxdmcp      conda-forge/linux-64::xorg-libxdmcp-1.1.3-h7f98852_0
  zstd               conda-forge/linux-64::zstd-1.5.2-ha95c52a_0


Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done
┌ Warning: seriestype boxplot has been moved to StatsPlots.  To use: `Pkg.add("StatsPlots"); using StatsPlots`
└ @ Plots ~/.julia/packages/Plots/Xv2qA/src/args.jl:1575
┌ Warning: seriestype boxplot has been moved to StatsPlots.  To use: `Pkg.add("StatsPlots"); using StatsPlots`
└ @ Plots ~/.julia/packages/Plots/Xv2qA/src/args.jl:1575
┌ Warning: seriestype boxplot has been moved to StatsPlots.  To use: `Pkg.add("StatsPlots"); using StatsPlots`
└ @ Plots ~/.julia/packages/Plots/Xv2qA/src/args.jl:1575
┌ Warning: seriestype boxplot has been moved to StatsPlots.  To use: `Pkg.add("StatsPlots"); using StatsPlots`
└ @ Plots ~/.julia/packages/Plots/Xv2qA/src/args.jl:1575
┌ Warning: seriestype boxplot has been moved to StatsPlots.  To use: `Pkg.add("StatsPlots"); using StatsPlots`
└ @ Plots ~/.julia/packages/Plots/Xv2qA/src/args.jl:1575
┌ Warning: seriestype boxplot has been moved to StatsPlots.  To use: `Pkg.add("StatsPlots"); using StatsPlots`
└ @ Plots ~/.julia/packages/Plots/Xv2qA/src/args.jl:1575
┌ Warning: seriestype boxplot has been moved to StatsPlots.  To use: `Pkg.add("StatsPlots"); using StatsPlots`
└ @ Plots ~/.julia/packages/Plots/Xv2qA/src/args.jl:1575
┌ Warning: seriestype boxplot has been moved to StatsPlots.  To use: `Pkg.add("StatsPlots"); using StatsPlots`
└ @ Plots ~/.julia/packages/Plots/Xv2qA/src/args.jl:1575
┌ Warning: seriestype boxplot has been moved to StatsPlots.  To use: `Pkg.add("StatsPlots"); using StatsPlots`
└ @ Plots ~/.julia/packages/Plots/Xv2qA/src/args.jl:1575
┌ Warning: seriestype boxplot has been moved to StatsPlots.  To use: `Pkg.add("StatsPlots"); using StatsPlots`
└ @ Plots ~/.julia/packages/Plots/Xv2qA/src/args.jl:1575
┌ Warning: seriestype boxplot has been moved to StatsPlots.  To use: `Pkg.add("StatsPlots"); using StatsPlots`
└ @ Plots ~/.julia/packages/Plots/Xv2qA/src/args.jl:1575
┌ Warning: Keyword argument hover not supported with Plots.PyPlotBackend().  Choose from: Set([:zforeground_color_guide, :top_margin, :group, :inset_subplots, :stride, :background_color, :ytickfontsize, :yforeground_color_text, :yguidefontcolor, :tickfontfamily, :show_empty_bins, :seriesalpha, :seriescolor, :xrotation, :ztick_direction, :xgrid, :colorbar_tickfontfamily, :ygridalpha, :colorbar_tickfonthalign, :dpi, :zlims, :colorbar, :xtick_direction, :legend_font_family, :zflip, :ticks, :linealpha, :overwrite_figure, :arrow, :normalize, :linestyle, :xflip, :yforeground_color_guide, :zgrid, :fillcolor, :ygrid, :bar_width, :colorbar_scale, :background_color_inside, :bins, :zguide, :zforeground_color_text, :yscale, :legend_font_color, :weights, :xgridalpha, :ygridstyle, :clims, :xtickfontcolor, :fill_z, :xguide, :markershape, :background_color_subplot, :colorbar_tickfontrotation, :fillalpha, :markerstrokewidth, :tick_direction, :ztickfontfamily, :xguidefontfamily, :gridlinewidth, :foreground_color_subplot, :xgridlinewidth, :yguidefontsize, :foreground_color, :foreground_color_text, :x, :yerror, :rotation, :zgridlinewidth, :discrete_values, :ytick_direction, :grid, :ribbon, :xguidefontsize, :xforeground_color_axis, :xdiscrete_values, :background_color_outside, :titlefontcolor, :xgridstyle, :line_z, :size, :orientation, :gridstyle, :projection, :markersize, :legend_foreground_color, :camera, :ydiscrete_values, :xforeground_color_grid, :seriestype, :yflip, :quiver, :zticks, :markerstrokecolor, :fillrange, :colorbar_ticks, :xlims, :xforeground_color_border, :markercolor, :yrotation, :xtickfontsize, :ylink, :levels, :color_palette, :connections, :yforeground_color_grid, :lims, :zgridstyle, :foreground_color_border, :xscale, :marker_z, :markerstrokealpha, :left_margin, :markeralpha, :annotations, :window_title, :zrotation, :foreground_color_axis, :zguidefontcolor, :ygridlinewidth, :zlink, :zscale, :smooth, :colorbar_tickfontcolor, :xticks, :guidefontsize, :zguidefontsize, :y, :margin, :ytickfontcolor, :foreground_color_title, :bottom_margin, :zdiscrete_values, :yforeground_color_border, :zguidefontfamily, :framestyle, :zgridalpha, :zerror, :ztickfontcolor, :scale, :legend_position, :linecolor, :html_output_format, :legend_title, :zforeground_color_border, :yguide_position, :legend_font_pointsize, :contours, :title, :tickfontcolor, :subplot_index, :flip, :xguide_position, :legend_background_color, :tickfontsize, :z, :yforeground_color_axis, :foreground_color_grid, :linewidth, :ztickfontsize, :gridalpha, :xerror, :guidefontfamily, :ylims, :bar_position, :contour_labels, :zguide_position, :xguidefontcolor, :primary, :colorbar_tickfontsize, :xtickfontfamily, :ytickfontfamily, :aspect_ratio, :xforeground_color_text, :foreground_color_guide, :show, :link, :guide_position, :colorbar_title, :xforeground_color_guide, :subplot, :label, :guide, :titlelocation, :guidefontcolor, :colorbar_tickfontvalign, :bar_edges, :yguide, :titlefontsize, :titlefont, :titlefontfamily, :zforeground_color_axis, :yguidefontfamily, :zforeground_color_grid, :layout, :colorbar_entry, :polar, :right_margin, :xlink, :series_annotations, :yticks])
└ @ Plots ~/.julia/packages/Plots/Xv2qA/src/args.jl:1607

Filtering

Two filtering function are available for filtering cells and features/genes based on basic criteria such as the number of features detected, the number of cells for which a feature is detected and the total number of counts.

X = filter_cells(X, min_features=200)
X = filter_features(X, min_cells=3)
2700×13714 Named sparse matrix with 2282976 Int32 nonzero entries:
	[AGATTCCTGACGAG-1,          "AL627309.1"]  =  1
	[CCAAAGTGCTACGA-1,          "AL627309.1"]  =  1
	[CCGTACACGTTGGT-1,          "AL627309.1"]  =  1
	[CGACTGCTTCCTCG-1,          "AL627309.1"]  =  1
	[CTATGTTGTCTCGC-1,          "AL627309.1"]  =  1
	[CTGTGAGACGAACT-1,          "AL627309.1"]  =  1
	[GATACTCTATCGGT-1,          "AL627309.1"]  =  1
	[GCTAGATGAGCTCA-1,          "AL627309.1"]  =  1
	[GCTTAACTACTGGT-1,          "AL627309.1"]  =  1
	                                           ⋮
	[TCTCAAACCTAAGC-1,            "SRSF10.1"]  =  1
	[TGAGACACTCAAGC-1,            "SRSF10.1"]  =  1
	[TGATCGGATATGCG-1,            "SRSF10.1"]  =  1
	[TGTAGGTGCTATGG-1,            "SRSF10.1"]  =  1
	[TTATCCGAGAAAGT-1,            "SRSF10.1"]  =  1
	[TTCAGTACCGACTA-1,            "SRSF10.1"]  =  1
	[TTGAATGATCTCAT-1,            "SRSF10.1"]  =  1
	[TTGAGGACTACGCA-1,            "SRSF10.1"]  =  1
	[TTTATCCTGTTGTG-1,            "SRSF10.1"]  =  1
	[TTTCCAGAGGTGAG-1,            "SRSF10.1"]  =  1

This filters out cells with less than 200 features/genes and features detected in less than 3 cells. For convenience, filter_counts combines these two into a single function, similar to Seurat. Beware that the order in which you call these functions can matter.

Severo.filter_featuresFunction
filter_features(; min_cells = 0)

Partial version of filterfeatures(A::NamedCountMatrix; mincells = 0)

source
Severo.filter_cellsFunction
filter_cells(; min_features = 0, min_feature_count = 0, min_umi = 0)

Partial version of filtercells(A::NamedCountMatrix; minfeatures = 0, minfeaturecount = 0, min_umi = 0)

source
Severo.filter_countsFunction
filter_counts(; min_cells = 0, min_features = 0, min_feature_count = 0, min_umi = 0)

Partial version of filtercounts(A::NamedCountMatrix; mincells = 0, minfeatures = 0, minfeaturecount = 0, minumi = 0)

source

Normalization

After removing unwanted cells from the dataset, the next step is to normalize the data. The basic method normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor and optionally log-transforming the result.

Y = normalize_cells(X, method=:lognormalize, scale_factor=1e4)
2700×13714 Named sparse matrix with 2282976 Float64 nonzero entries:
	[AGATTCCTGACGAG-1,          "AL627309.1"]  =  1.538222672557476
	[CCAAAGTGCTACGA-1,          "AL627309.1"]  =  1.3631966075373008
	[CCGTACACGTTGGT-1,          "AL627309.1"]  =  1.3945254817849748
	[CGACTGCTTCCTCG-1,          "AL627309.1"]  =  1.787190785781838
	[CTATGTTGTCTCGC-1,          "AL627309.1"]  =  1.3514100666239552
	[CTGTGAGACGAACT-1,          "AL627309.1"]  =  1.6848688061171004
	[GATACTCTATCGGT-1,          "AL627309.1"]  =  1.6623488639622728
	[GCTAGATGAGCTCA-1,          "AL627309.1"]  =  1.6749250073944073
	[GCTTAACTACTGGT-1,          "AL627309.1"]  =  1.8821985210730219
	                                           ⋮
	[TCTCAAACCTAAGC-1,            "SRSF10.1"]  =  2.050332029149289
	[TGAGACACTCAAGC-1,            "SRSF10.1"]  =  1.873284442922057
	[TGATCGGATATGCG-1,            "SRSF10.1"]  =  2.021955420152209
	[TGTAGGTGCTATGG-1,            "SRSF10.1"]  =  1.2128182558075358
	[TTATCCGAGAAAGT-1,            "SRSF10.1"]  =  1.2133165770732028
	[TTCAGTACCGACTA-1,            "SRSF10.1"]  =  1.529100016033425
	[TTGAATGATCTCAT-1,            "SRSF10.1"]  =  1.7863632500815836
	[TTGAGGACTACGCA-1,            "SRSF10.1"]  =  0.9465443048463601
	[TTTATCCTGTTGTG-1,            "SRSF10.1"]  =  1.31322083398467
	[TTTCCAGAGGTGAG-1,            "SRSF10.1"]  =  1.7178390681467346
Severo.normalize_cellsFunction
normalize_cells(; method = :lognormalize, scale_factor::Real = 1.0, dtype::Type{T} = Float64) where T <: AbstractFloat

Partial version of normalizecells(X::NamedCountMatrix; method = :lognormalize, scalefactor::Real = 1.0, dtype::Type{T} = Float64) where T <: AbstractFloat

source

Identifying highly variable features

Next, we select a subset of features that exhibit high cell-to-cell variation in the dataset (i.e, they are highly expressed in some cells, and lowly expressed in others).

hvf = find_variable_features(X, 2000; method=:vst)
2000-element Named SubArray{Int64, 1, Vector{Int64}, Tuple{UnitRange{Int64}}, true}
features              │ 
──────────────────────┼──────
"PPBP"                │  3316
"LYZ"                 │  8682
"S100A9"              │   908
"IGLL5"               │ 13275
"GNLY"                │  1804
"FTL"                 │ 12944
"PF4"                 │  3315
"FTH1"                │  7755
⋮                           ⋮
"EIF3D"               │ 13383
"POLR2I"              │ 12681
"ZCCHC10"             │  3924
"SEC61B"              │  6669
"RP11-164H13.1"       │  9585
"NELL2"               │  8455
"NF1"                 │ 11047
"GTF2H2"              │  3747
Severo.find_variable_featuresFunction
find_variable_features(nfeatures = 2000; method = :vst, dtype::Type{<:AbstractFloat} = Float64, kw...)

Partial version of findvariablefeatures(counts::NamedCountMatrix, nfeatures = 2000; method = :vst, dtype::Type{<:AbstractFloat} = Float64, kw...)

source

Dimensionality reduction

Scaling

Prior to dimensional reduction techniques like PCA, it's a good idea to scale_features the data. Scaling performs two basic operations:

  • Shifts the expression of each gene, so that the mean expression across cells is 0
  • Scales the expression of each gene, so that the standard deviation across cells is 1. This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate
  • Clips values exceeding standard deviation of scale_max
Y = Y[:,hvf] # only use highly-variable features
S = scale_features(Y; scale_max=10)
CenteredMatrix:
  A = 2700×2000 Named sparse matrix with 515841 Float64 nonzero entries:
	[AAACCGTGCTTCCG-1,                "PPBP"]  =  3.05759
	[AAATCAACCCTATT-1,                "PPBP"]  =  1.983
	[AAATCCCTGCTATG-1,                "PPBP"]  =  5.03146
	[AACCGCCTCTACGA-1,                "PPBP"]  =  2.44748
	[AAGCCTGACATGCA-1,                "PPBP"]  =  3.68051
	[AATCCTTGACGGGA-1,                "PPBP"]  =  3.12117
	[ACCCACTGGTTCAG-1,                "PPBP"]  =  10.143
	[ACCTGAGATATCGG-1,                "PPBP"]  =  9.63048
	[ACCTGGCTGTCTTT-1,                "PPBP"]  =  7.59301
	                                           ⋮
	[TGAGTGACTGAGCT-1,              "GTF2H2"]  =  7.15279
	[TGCGATGACCTCGT-1,              "GTF2H2"]  =  7.51358
	[TGGGTATGAAGAGT-1,              "GTF2H2"]  =  6.33385
	[TGTGAGACTGTCAG-1,              "GTF2H2"]  =  6.86631
	[TTAGGGACGCGAAG-1,              "GTF2H2"]  =  7.54787
	[TTAGTCTGAAAGCA-1,              "GTF2H2"]  =  6.76249
	[TTCCAAACTCCCAC-1,              "GTF2H2"]  =  10.1315
	[TTCGTATGTCCTTA-1,              "GTF2H2"]  =  5.26091
	[TTGGTACTGAATCC-1,              "GTF2H2"]  =  8.17128
	[TTTAGCTGTACTCT-1,              "GTF2H2"]  =  4.47778
  mu = 2000-element Named Vector{Float64}
features              │ 
──────────────────────┼──────────
"PPBP"                │  0.142967
"LYZ"                 │  0.972527
"S100A9"              │  0.645189
"IGLL5"               │   0.18421
"GNLY"                │  0.407293
"FTL"                 │   2.94261
"PF4"                 │  0.110369
"FTH1"                │   3.22873
⋮                               ⋮
"EIF3D"               │  0.775623
"POLR2I"              │  0.396716
"ZCCHC10"             │  0.211964
"SEC61B"              │  0.656028
"RP11-164H13.1"       │ 0.0889054
"NELL2"               │   0.25572
"NF1"                 │  0.130034
"GTF2H2"              │  0.131466
Severo.scale_featuresFunction
scale_features(; scale_max::Real = Inf, dtype::Type{<:AbstractFloat} = Float64, features::Union{Nothing, AbstractArray} = nothing)

Partial version of scalefeatures(X::(NamedArray{T, 2, SparseMatrixCSC{T, Int64}} where T); scalemax::Real = Inf, dtype::Type{<:AbstractFloat} = Float64, features::Union{Nothing, AbstractArray} = nothing)

source

Principal component analysis

Reduce the dimensionality of the data by running principal component analysis (PCA), which reveals the main axes of variation and de-noises the data.

em = embedding(S, 15, method=:pca)
PC-1
Positive: InlineStrings.String31["CST3", "TYROBP", "LST1", "AIF1", "FTL", "FCN1", "LYZ", "FTH1", "FCER1G", "S100A9"]
Negative: InlineStrings.String31["LTB", "IL32", "CD2", "ACAP1", "STK17A", "CTSW", "CD247", "GIMAP5", "AQP3", "CCL5"]
PC-2
Positive: InlineStrings.String31["CD79A", "MS4A1", "TCL1A", "HLA-DRA", "HLA-DQA1", "HLA-DQB1", "LINC00926", "CD79B", "HLA-DRB1", "CD74"]
Negative: InlineStrings.String31["NKG7", "PRF1", "CST7", "GZMA", "GZMB", "FGFBP2", "CTSW", "GNLY", "GZMH", "SPON2"]
PC-3
Positive: InlineStrings.String31["HLA-DQA1", "CD79A", "HLA-DQB1", "CD79B", "HLA-DPB1", "CD74", "HLA-DPA1", "MS4A1", "HLA-DRB1", "HLA-DRB5"]
Negative: InlineStrings.String31["PPBP", "PF4", "SDPR", "SPARC", "GNG11", "NRGN", "GP9", "RGS18", "TUBB1", "HIST1H2AC"]
PC-4
Positive: InlineStrings.String31["HLA-DQA1", "CD79A", "CD79B", "HLA-DQB1", "MS4A1", "CD74", "HLA-DPB1", "HIST1H2AC", "PF4", "HLA-DQA2"]
Negative: InlineStrings.String31["RPS2", "VIM", "S100A6", "S100A8", "S100A4", "IL32", "S100A9", "TMSB10", "GIMAP7", "S100A10"]
PC-5
Positive: InlineStrings.String31["GZMB", "FGFBP2", "NKG7", "GNLY", "PRF1", "CCL4", "CST7", "SPON2", "GZMA", "GZMH"]
Negative: InlineStrings.String31["LTB", "VIM", "AQP3", "PPA1", "MAL", "RPS2", "KIAA0101", "CD2", "CORO1B", "CYTIP"]
plot_elbow(em)

Severo.embeddingFunction
embedding(ncomponents::Int64 = 50; method = :pca, kw...)

Partial version of embedding(X, ncomponents::Int64 = 50; method = :pca, kw...)

source

Clustering the cells

Briefly, the clustering embed cells in a graph structure with edges between cells with similar feature expression patterns, and then attempt to partition this graph into highly interconnected 'quasi-cliques' or 'communities'.

Computing the neighborhood graph

First, we need to find the k-nearest neighbors of all the cells based on the euclidean distance in PCA space, and then compute a graph based on the shared overlap in the local neighborhoods between any two cells (Jaccard similarity). These step are performed using the nearest_neighbours and jaccard_index functions. For convenience, the two steps are also combined in the shared_nearest_neighours function.

nn = nearest_neighbours(em, 20, dims=1:10)
snn = jaccard_index(nn, prune=1/15)
# or as a single step
#snn = nearest_neighbours(em, 20, dims=1:10, prune=1/15)
2700×2700 Named sparse matrix with 198538 Float64 nonzero entries:
	[AAACATACAACCAC-1, AAACATACAACCAC-1]  =  1.0
	[AAATTGACTCGCTC-1, AAACATACAACCAC-1]  =  0.08108108108108109
	[AACTCTTGCAGGAG-1, AAACATACAACCAC-1]  =  0.1111111111111111
	[AATCTAGAATCGGT-1, AAACATACAACCAC-1]  =  0.08108108108108109
	[ACCACGCTGCTGTA-1, AAACATACAACCAC-1]  =  0.14285714285714285
	[ACGCCGGAAATGCC-1, AAACATACAACCAC-1]  =  0.14285714285714285
	[ACTAGGTGGAACTC-1, AAACATACAACCAC-1]  =  0.08108108108108109
	[AGAGATGATTGTGG-1, AAACATACAACCAC-1]  =  0.08108108108108109
	[AGCCGTCTGAGAGC-1, AAACATACAACCAC-1]  =  0.08108108108108109
	                                      ⋮
	[TCGGTAGAGTAGGG-1, TTTGCATGCCTCAC-1]  =  0.08108108108108109
	[TGATAAACTTTCAC-1, TTTGCATGCCTCAC-1]  =  0.08108108108108109
	[TGATACCTGTTGGT-1, TTTGCATGCCTCAC-1]  =  0.08108108108108109
	[TGCCAGCTTGGCAT-1, TTTGCATGCCTCAC-1]  =  0.14285714285714285
	[TGGTTACTGTTCTT-1, TTTGCATGCCTCAC-1]  =  0.1111111111111111
	[TGTAACCTAGAGGC-1, TTTGCATGCCTCAC-1]  =  0.1111111111111111
	[TTAGAATGTGGTGT-1, TTTGCATGCCTCAC-1]  =  0.08108108108108109
	[TTAGCTACAACCGT-1, TTTGCATGCCTCAC-1]  =  0.1111111111111111
	[TTCGATTGAGCATC-1, TTTGCATGCCTCAC-1]  =  0.14285714285714285
	[TTTGCATGCCTCAC-1, TTTGCATGCCTCAC-1]  =  1.0
Severo.nearest_neighboursFunction
nearest_neighbours(k::Int64; dims = (:), metric::SemiMetric = Euclidean(), include_self::Bool = true, ntables::Int64 = 2 * size(X, 2)) where T

Partial version of nearestneighbours(X::NamedArray{T, 2}, k::Int64; dims = (:), metric::SemiMetric = Euclidean(), includeself::Bool = true, ntables::Int64 = 2 * size(X, 2)) where T

source
Severo.jaccard_indexFunction
jaccard_index(k::Int64; prune::Real = 1 / 15, dtype::Type{R} = Float64) where R <: AbstractFloat

Partial version of jaccard_index(nn::(NamedArray{T, 2} where T), k::Int64; prune::Real = 1 / 15, dtype::Type{R} = Float64) where R <: AbstractFloat

source

Clustering the neighborhood graph

Clustering is performed using a graph-based method called Louvain clustering which tries to detect communities based on optimizing a modularity function. Since this is a greedy algorithm, which can get stuck in local minima based on random initialization, multiple restarts are required to find a more "global" minima.

lbls = cluster(snn)
2700-element Named Vector{Int64}
cells            │ 
─────────────────┼───
AAACATACAACCAC-1 │  1
AAACATTGAGCTAC-1 │  2
AAACATTGATCAGC-1 │  1
AAACCGTGCTTCCG-1 │  3
AAACCGTGTATGCG-1 │  4
AAACGCACTGGTAC-1 │  1
AAACGCTGACCAGT-1 │  5
AAACGCTGGTTCTT-1 │  5
⋮                   ⋮
TTTCAGTGTGCAGT-1 │  2
TTTCCAGAGGTGAG-1 │  1
TTTCGAACACCTGA-1 │ 10
TTTCGAACTCTCAT-1 │  8
TTTCTACTGAGGCA-1 │  2
TTTCTACTTCCTCG-1 │  2
TTTGCATGAGAGGC-1 │  2
TTTGCATGCCTCAC-1 │  7
Severo.clusterFunction
cluster(; kw...)

Partial version of cluster(SNN::NeighbourGraph; kw...)

source

Finding differentially expressed features

We can find markers that define clusters via differential expression. It identifies positive and negative markers for a single cluster compared to the other cells. find_markers performs this process for every cluster. The function returns a DataFrame containing for each marker: the p-value, the log-foldchange and statistical score.

dx = find_markers(X, lbls; method=:wilcoxon)
dx = filter_rank_markers(dx)
dx[1:10, :]

10 rows × 5 columns

groupscorepvallogfcfeature
Int64Float64Float64Float64String31
1115.58099.8217e-55-1.67251HLA-DRB1
2115.26181.37483e-52-2.51018HLA-DRA
3115.18514.43631e-52-2.2133TYROBP
4114.56364.78877e-48-1.66148FCER1G
5114.41783.99753e-47-1.06897HLA-DRB5
6114.2246.50439e-46-1.64141HLA-DPA1
7113.73865.96083e-43-1.64782HLA-DPB1
8112.48998.47673e-36-1.94933CD74
9112.33755.68872e-35-0.694127HLA-DMA
10112.23332.06408e-34-0.850923CFD

Performance can be improved by first filtering the features based on the percentage of cells and the fold-change between the groups.

# filtering and DE on the lognormalized data
sel = prefilter_markers(Y, lbls, logfc_threshold=0.25, min_pct=0.1, log=true, only_pos=true)
dx = find_markers(Y, lbls; method=:t, log=true, selection=sel) # t-test
dx = filter_rank_markers(dx)
dx[1:10, :]

10 rows × 5 columns

groupscorepvallogfcfeature
Int64Float64Float64Float64String
1130.64812.0335e-1521.22346IL32
2129.14765.85058e-1431.28312LTB
3115.42742.16513e-461.21357CD2
4114.46961.37475e-430.490875VIM
5114.00623.44367e-400.582623GIMAP7
6112.74673.65549e-331.23198AQP3
7111.47122.27143e-280.599281ANXA1
8110.64181.9189e-240.853336TRADD
919.212043.80925e-190.769177TNFAIP8
1018.679193.45651e-170.716185MAL
Severo.find_markersFunction
find_markers(X::Union{NamedCountMatrix, NamedDataMatrix}, idents::NamedVector{<:Integer};
    method=:wilcoxon, selection::Union{Nothing, NamedArray{Bool, 2}, AbstractArray{Bool,2}}=nothing, log::Bool=false, kw...)

Finds markers (differentially expressed genes) for each of the classes in a dataset.

Arguments:

-`X`: count or data matrix
-`idents`: class identity for each cell
-`method`: Which test to use, supported are: [wilcoxon, t]
-`selection`: a selection of features and groups that should be considered
-`log`: the data is in log-scale (default = false)
-`kw...`: additional parameters passed down to the method

Return values:

A DataFrame containing a list of putative markers with associated statistics (p-values and scores) and log fold-changes.

source
Severo.filter_rank_markersFunction
filter_rank_markers(de::DataFrame; pval_thresh::Real=1e-2, ngenes::Integer=typemax(Int64))

Filters and ranks a list of markers (differentially expressed genes).

Arguments:

-`de`: list of markers returned by [find_markers](@ref)
-`pval_thresh`: only keep markers with pval < pval_thresh
-`count`: the number of highest-ranked markers to keep
-`rankby_abs`: rank based on absolute value of the scores

Return values:

A DataFrame containing a ranked list of filtered markers.

source
Severo.prefilter_markersFunction
prefilter_markers(X::Union{NamedCountMatrix, NamedDataMatrix}, idents::NamedVector{<:Integer};
        logfc_threshold::Real=0.0, min_pct::Real=0.0, min_diff_pct::Real=-Inf, only_pos:Bool=false, log::Bool=false)

Filter features for each of the classes in a dataset.

Arguments:

-`X`: count or data matrix
-`idents`: class identity for each cell
-`logfc_threshold`: Limit testing to features which show, on average, at least X-fold difference (log-scale) between the two groups of cells
-`min_pct`: only test features that are detected in a minimum fraction of `min_pct` cells in either of the two populations
-`min_diff_pct`: only test features that show a minimum difference in the fraction of detection between the two groups.
-`only_pos`: only return features with positive log fold-change
-`log`: the data is in log-scale (default = false)

Return values:

Selection matrix for each feature and class

source
plot_violin(X, ["CST3", "NKG7", "PPBP", "S100A4"], lbls)
savefig("violin-plot.svg"); nothing # hide