Preprocessing and clustering 3k PBMCs

This showcase reproduces Seurat's Guided Clustering Tutorial.

Overview

Loading data
Preprocessing
Dimensionality reduction
Clustering the cells
Finding differentially expressed features

Loading data

The data consists of 3k peripheral blood mononuclear cells (PBMCs) from a healthy donor and is freely available from 10x Genomics.

You can download the data manually:

mkdir pbmc3k
wget http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz -O pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz
cd pbmc3k; tar -xzf pbmc3k_filtered_gene_bc_matrices.tar.gz --strip-components 2

The read_10X function reads the output of the 10x Genomics Cell Ranger pipeline and returns a labelled count matrix. Each entry indicates the number of molecules detected for a given cell (row) and feature/gene (column).

using Severo
X = read_10X("pbmc3k/")
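
For readers curious about what such a reader has to do, the following is a minimal sketch, not Severo's implementation. It assumes the usual Cell Ranger layout, in which matrix.mtx stores the counts in Matrix Market coordinate format with genes as rows and barcodes as columns. The helper name read_mtx_cells_by_features and the toy data are hypothetical.

```julia
using SparseArrays

# A tiny stand-in for matrix.mtx: 3 genes x 2 barcodes with 3 nonzero counts.
mtx = """
%%MatrixMarket matrix coordinate integer general
3 2 3
1 1 4
3 1 1
2 2 7
"""

# Hypothetical helper (not part of Severo): parse coordinate-format text and
# return a sparse matrix with cells as rows and features as columns.
function read_mtx_cells_by_features(io::IO)
    # Drop comment lines (starting with '%') and empty lines.
    lines = filter(l -> !isempty(l) && !startswith(l, "%"), readlines(io))
    # The first remaining line holds: number of genes, barcodes, nonzeros.
    ngenes, ncells, nnonzero = parse.(Int, split(lines[1]))
    I = Int[]; J = Int[]; V = Int32[]
    for line in lines[2:end]
        g, c, v = parse.(Int, split(line))
        push!(I, c)   # barcode index becomes the row
        push!(J, g)   # gene index becomes the column
        push!(V, v)
    end
    @assert length(V) == nnonzero
    return sparse(I, J, V, ncells, ngenes)
end

X_toy = read_mtx_cells_by_features(IOBuffer(mtx))
```

Swapping the row and column indices while reading is what yields the cells-by-features orientation described above. A real reader would also attach the barcode and gene names as labels.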

Warning

Rows are cells and columns are features here. This is the transpose of the representation used by other packages such as Seurat, where rows are features and columns are cells.
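
As an illustration of the two orientations, here is a small sketch using only Julia's SparseArrays standard library on a toy matrix. The variable names are hypothetical and nothing in it is specific to Severo.

```julia
using SparseArrays

# Toy counts in the cells x features orientation: 2 cells, 3 genes.
cells_by_genes = sparse([1, 1, 2], [1, 3, 2], Int32[4, 1, 7], 2, 3)

# Materialise the transpose to obtain the features x cells orientation.
genes_by_cells = sparse(transpose(cells_by_genes))

# The same per-cell totals are a row sum in one layout
# and a column sum in the other.
totals_rows = vec(sum(cells_by_genes, dims=2))
totals_cols = vec(sum(genes_by_cells, dims=1))
```

Keeping track of the orientation matters mainly when porting code: a per-cell statistic that is a column operation on a features-by-cells matrix becomes a row operation here.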

Alternatively, the dataset function loads a dataset from a predefined collection. For example, the PBMC collection is known to Severo and can be loaded as follows:

using Severo
X = dataset("PBMC", "3k")

# The matrix can be indexed using names or indices. For instance,
# we can look at specific genes in the first thirty cells
#X[1:30, ["CD3D", "TCL1A", "MS4A1"]]

2700×32738 Named sparse matrix with 2286884 Int32 nonzero entries:
	[AGATTCCTGACGAG-1,          "AL627309.1"]  =  1
	[CCAAAGTGCTACGA-1,          "AL627309.1"]  =  1
	[CCGTACACGTTGGT-1,          "AL627309.1"]  =  1
	[CGACTGCTTCCTCG-1,          "AL627309.1"]  =  1
	[CTATGTTGTCTCGC-1,          "AL627309.1"]  =  1
	[CTGTGAGACGAACT-1,          "AL627309.1"]  =  1
	[GATACTCTATCGGT-1,          "AL627309.1"]  =  1
	[GCTAGATGAGCTCA-1,          "AL627309.1"]  =  1
	[GCTTAACTACTGGT-1,          "AL627309.1"]  =  1
	                                           ⋮
	[TCTCAAACCTAAGC-1,            "SRSF10.1"]  =  1
	[TGAGACACTCAAGC-1,            "SRSF10.1"]  =  1
	[TGATCGGATATGCG-1,            "SRSF10.1"]  =  1
	[TGTAGGTGCTATGG-1,            "SRSF10.1"]  =  1
	[TTATCCGAGAAAGT-1,            "SRSF10.1"]  =  1
	[TTCAGTACCGACTA-1,            "SRSF10.1"]  =  1
	[TTGAATGATCTCAT-1,            "SRSF10.1"]  =  1
	[TTGAGGACTACGCA-1,            "SRSF10.1"]  =  1
	[TTTATCCTGTTGTG-1,            "SRSF10.1"]  =  1
	[TTTCCAGAGGTGAG-1,            "SRSF10.1"]  =  1

The count data is stored in a sparse matrix format that keeps only the nonzero elements of the matrix. Any values not shown are zero.
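
To give a sense of how sparse the data is, the sketch below computes the fraction of nonzero entries from the dimensions and nonzero count printed in the summary line above, then shows on a toy matrix that only nonzero entries are stored. It uses the SparseArrays standard library only; the toy matrix and variable names are hypothetical.

```julia
using SparseArrays

# Dimensions and nonzero count taken from the printed matrix summary.
ncells, nfeatures, nnonzero = 2700, 32738, 2286884

# Fraction of entries that are actually stored (roughly 2.6%).
density = nnonzero / (ncells * nfeatures)

# Toy 2 x 3 sparse matrix with three stored entries.
A = sparse([1, 2, 2], [1, 2, 3], Int32[3, 1, 5], 2, 3)

# findnz returns the row indices, column indices, and values
# of the stored entries only.
rows, cols, vals = findnz(A)

# Indexing an entry that is not stored yields zero.
missing_entry = A[1, 2]
```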

Severo.read_data — Function

read_data(path::AbstractString; kw...)

Tries to identify and read a count matrix in any of the supported formats.

Arguments:

path: path to the count matrix
kw: additional keyword arguments are passed on

Return values:

Returns a labeled sparse matrix containing the counts

Row  group  score    pval         logfc      feature
     Int64  Float64  Float64      Float64    String3…
1    1      15.9247  4.26806e-57  -1.70564   HLA-DRB1
2    1      15.0714  2.49925e-51  -2.51564   HLA-DRA
3    1      14.9186  2.49458e-50  -2.20198   TYROBP
4    1      14.8299  9.38941e-50  -1.10674   HLA-DRB5
5    1      14.6245  1.96067e-48  -1.66733   FCER1G
6    1      14.5823  3.64305e-48  -1.65854   HLA-DPA1
7    1      13.9825  1.99417e-44  -1.66391   HLA-DPB1
8    1      12.4352  1.68319e-35  -1.96046   CD74
9    1      12.3283  6.37965e-35  -0.700372  HLA-DMA
10   1      12.0139  3.00308e-33  -0.845171  CFD

Row  group  score    pval          logfc     feature
     Int64  Float64  Float64       Float64   String
1    1      30.3806  6.1892e-150   1.20543   IL32
2    1      29.6015  1.24438e-146  1.27376   LTB
3    1      15.2485  2.05504e-45   1.20156   CD2
4    1      13.6709  2.71139e-39   0.473947  VIM
5    1      13.6335  2.68983e-38   0.565327  GIMAP7
6    1      12.7573  3.69621e-33   1.23104   AQP3
7    1      12.064   5.70538e-31   0.613794  ANXA1
8    1      10.583   3.51297e-24   0.859993  TRADD
9    1      9.05947  1.7556e-18    0.751047  MAL
10   1      8.72365  2.06908e-17   0.744041  TNFAIP8