Preprocessing and clustering 3k PBMCs

This showcase reproduces Seurat's Guided Clustering Tutorial.

Loading data

The data consists of 3k PBMCs from a healthy donor and is freely available from 10x Genomics.

You can either download the data manually:

mkdir pbmc3k
wget http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz -O pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz
cd pbmc3k; tar -xzf pbmc3k_filtered_gene_bc_matrices.tar.gz --strip-components 2

The read_10X function reads the output of the cellranger 10X pipeline and returns a labelled count matrix. Each entry indicates the number of molecules detected for a given feature/gene (column) in a given cell (row).

using Severo
X = read_10X("pbmc3k/")

Warning

Rows correspond to cells and columns to features/genes. This differs from the representation used by other packages such as Seurat, which store features as rows and cells as columns.
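The difference in orientation can be illustrated with a small sketch that uses only Julia's SparseArrays standard library. The toy matrix and the variable names below are hypothetical and not part of Severo; the sketch only shows how a genes × cells matrix maps onto the cells × genes layout described here.

```julia
using SparseArrays

# Hypothetical toy counts in the genes × cells layout used by Seurat:
# 3 genes (rows) by 4 cells (columns).
genes_by_cells = sparse([1, 3, 2, 3], [1, 1, 3, 4], Int32[2, 1, 5, 1], 3, 4)

# Swap the two dimensions to obtain the cells × genes layout.
cells_by_genes = permutedims(genes_by_cells)

size(cells_by_genes)     # (4, 3): cells are rows, genes are columns
cells_by_genes[3, 2]     # 5: the count of gene 2 in cell 3
```

When moving data between tools, checking the matrix dimensions against the expected number of cells and genes is a quick way to confirm the orientation.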

Alternatively, the dataset function can be used to load a dataset from a predefined collection. For example, the PBMC collection is known by Severo and can be loaded as follows:

using Severo
X = dataset("PBMC", "3k")

# The matrix can be indexed using names or indices. For instance,
# we can look at specific genes in the first thirty cells
#X[1:30, ["CD3D", "TCL1A", "MS4A1"]]

2700×32738 Named sparse matrix with 2286884 Int32 nonzero entries:
	[AGATTCCTGACGAG-1,          "AL627309.1"]  =  1
	[CCAAAGTGCTACGA-1,          "AL627309.1"]  =  1
	[CCGTACACGTTGGT-1,          "AL627309.1"]  =  1
	[CGACTGCTTCCTCG-1,          "AL627309.1"]  =  1
	[CTATGTTGTCTCGC-1,          "AL627309.1"]  =  1
	[CTGTGAGACGAACT-1,          "AL627309.1"]  =  1
	[GATACTCTATCGGT-1,          "AL627309.1"]  =  1
	[GCTAGATGAGCTCA-1,          "AL627309.1"]  =  1
	[GCTTAACTACTGGT-1,          "AL627309.1"]  =  1
	                                           ⋮
	[TCTCAAACCTAAGC-1,            "SRSF10.1"]  =  1
	[TGAGACACTCAAGC-1,            "SRSF10.1"]  =  1
	[TGATCGGATATGCG-1,            "SRSF10.1"]  =  1
	[TGTAGGTGCTATGG-1,            "SRSF10.1"]  =  1
	[TTATCCGAGAAAGT-1,            "SRSF10.1"]  =  1
	[TTCAGTACCGACTA-1,            "SRSF10.1"]  =  1
	[TTGAATGATCTCAT-1,            "SRSF10.1"]  =  1
	[TTGAGGACTACGCA-1,            "SRSF10.1"]  =  1
	[TTTATCCTGTTGTG-1,            "SRSF10.1"]  =  1
	[TTTCCAGAGGTGAG-1,            "SRSF10.1"]  =  1

The count data is stored in a sparse matrix format, only storing non-zero elements of the matrix. Any values not shown are zero.
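To give a sense of how much the sparse format saves, the sketch below computes the fraction of stored entries from the dimensions and non-zero count reported in the matrix summary above. It relies only on Julia's SparseArrays standard library; the toy matrix and variable names are hypothetical illustrations, not Severo's internal representation.

```julia
using SparseArrays

# Figures taken from the matrix summary: 2700 cells, 32738 genes,
# 2286884 stored (non-zero) entries.
ncells, ngenes, stored = 2700, 32738, 2_286_884

# Fraction of all entries that are actually stored: roughly 2.6 %.
density = stored / (ncells * ngenes)

# Hypothetical toy matrix: only the three non-zero counts are kept in memory.
toy = sparse([1, 2, 4], [2, 3, 1], Int32[1, 3, 2], 4, 3)

nnz(toy)       # 3 stored entries out of 12
toy[3, 3]      # 0: entries that are not stored read as zero
```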

Severo.read_data — Function

read_data(path::AbstractString; kw...)

Tries to identify and read a count matrix in any of the supported formats.

Arguments:

path: path to the count matrix
kw: additional keyword arguments are passed on

Return values:

A labeled sparse matrix containing the counts

	group	score	pval	logfc	feature
	Int64	Float64	Float64	Float64	String31
1	1	15.5809	9.8217e-55	-1.67251	HLA-DRB1
2	1	15.2618	1.37483e-52	-2.51018	HLA-DRA
3	1	15.1851	4.43631e-52	-2.2133	TYROBP
4	1	14.5636	4.78877e-48	-1.66148	FCER1G
5	1	14.4178	3.99753e-47	-1.06897	HLA-DRB5
6	1	14.224	6.50439e-46	-1.64141	HLA-DPA1
7	1	13.7386	5.96083e-43	-1.64782	HLA-DPB1
8	1	12.4899	8.47673e-36	-1.94933	CD74
9	1	12.3375	5.68872e-35	-0.694127	HLA-DMA
10	1	12.2333	2.06408e-34	-0.850923	CFD

	group	score	pval	logfc	feature
	Int64	Float64	Float64	Float64	String
1	1	30.6481	2.0335e-152	1.22346	IL32
2	1	29.1476	5.85058e-143	1.28312	LTB
3	1	15.4274	2.16513e-46	1.21357	CD2
4	1	14.4696	1.37475e-43	0.490875	VIM
5	1	14.0062	3.44367e-40	0.582623	GIMAP7
6	1	12.7467	3.65549e-33	1.23198	AQP3
7	1	11.4712	2.27143e-28	0.599281	ANXA1
8	1	10.6418	1.9189e-24	0.853336	TRADD
9	1	9.21204	3.80925e-19	0.769177	TNFAIP8
10	1	8.67919	3.45651e-17	0.716185	MAL