Generating 2D Gating Hierarhcy from Clustered Cytometry Data

Our algorithm cluster-to-gates(C2G) can visualize a clustered cytometry data in 2D gating hierarchy. The overall input to C2G are two variables: "data" and "label". "data" is a N-by-M matrix where N is the number of cells and M is the number of markers. "label" is a N-by-1 matrix and this matrix represent which cluster each cell belongs to (0 for ungated cells). In this page, we will demonstrate how to use C2G to generate gating hierarchy on a simulated data and on a CyTOF dataset that are manually gated. The software and testing data can be downloaded here .

Contents

Initialization and load the simulated data

This simulated data set contains 20000 cells, which consists of 5 populations.Two of the populations are considered as target and labeled as 1 and 2. The cells in other populations are labeled 0 which means "unlabeled". A visualization of the original data is also shown in the following sections.

addpath('src')
addpath('libs')
load('testdata/simulated.mat','data','label');
markernames ={'Marker 1','Marker 2','Marker 3'};
fprintf('Size of "data" is %d-by-%d\n',size(data));
fprintf('Size of "label" is %d-by-%d\n',size(label));
Size of "data" is 20000-by-3
Size of "label" is 20000-by-1

Visualize the simulated data and pre-clustered simulated data in 3D

In real application, this section is not necessay.

col = [0.8 0.8 0.8;hsv(length(unique(label))-1)];
figure('Position',[680 478 560 420]);
scatter3(data(:,1),data(:,2),data(:,3),1,col(label+1,:));
xlabel('Marker 1')
ylabel('Marker 2')
zlabel('Marker 3')
axis([-5 15 -5 15 -5 15]);

Run the analysis on the simulated data

This section performs the analysis on the simulated data. If your data has no "unlabeled" cells, precluster step can be skipped. If you skip precluster, the second and third parameter in function "C2G" should be the same.

% Precluster the simulated data
rng(9464);
preclustered_label = cluster_ungated(data,label);

% Call main part of the program and return a object m that store the
% results.
m = C2G(data,preclustered_label,label,'markernames',markernames);
% Draw the obtained gating hierarchy
m.view_gates(data,markernames,'n_lines',1,'onepanel',true);
% Show statistics
outtable = m.show_f_score(label);
Step:  0	[100%]
[Selected] marker pair Marker 2 and Marker 3
Population	Gate	    TP	    FP	    FN	F-score
         1	   1	  5411	 14589	     0	  0.426
         2	   1	  9166	 10834	     0	  0.629
Normalized Mutual Information = 0.321
Step:  1	[100%]
[Selected] marker pair Marker 1 and Marker 2
Population	Gate	    TP	    FP	    FN	F-score
         1	   2	  5411	  9183	     0	  0.541
         2	   2	  9166	  5428	     0	  0.772
Normalized Mutual Information = 0.736
Step:  2	[100%]No further separation. Node 3 is gate for cell population 1
Step:  3	[100%]No further separation. Node 4 is gate for cell population 2
Population	Gate	    TP	    FP	    FN	F-score
         1	   3	  5411	    18	     0	  0.998
         2	   4	  9166	     1	     0	  1.000
Normalized Mutual Information = 0.996

Load a moderate-sized CyTOF dataset with ~10,000 cells

This is a dataset about T cell signaling (Krishnaswamy et al, Science, 2014). It can be downloaded from here. This section will load multiple fcs files. Each fcs file correspond to one cell population manually gated by the authors. C2G will automatically generate cell labels based on fsc files. FCS files are store in the "testdata" folder. "CD4_Effmen.fcs", "CD4_naive.fcs", and "CD8_naive.fcs" are cells of target populations (manually defined) and "ctr.fcs" contain all cells. In this example, we only use surface protein markers.

clear
close all
addpath('src')
addpath('libs')
fdname = 'testdata';
[ori_data,ori_l,ori_markers]=load_mul_fcs(fdname,'ctr.fcs');

surface_idx = [3 4 6 8 9 11 12 13 22 24 25 27];
data = ori_data(:,surface_idx);
markers = ori_markers(surface_idx);
n_markers = length(markers);

Generate gating hierarchy for manually gated populations

In this section, we'll use the data loaded from previous section and treat the three manually defined cell populations as target populations. Here, number of "unlabeled" cells is around two-fold of cells in target populations. The results (the table below) show that C2G can capture manually defined populations at very high accuracy.

% Precluster the ungated cells
rng(9464)
label = cluster_ungated(data,ori_l);
% Perform the anlysis
m_ori = C2G(data,label,ori_l,'markernames',markers);
% Visualize the results
m_ori.view_gates(data,markers,'n_lines',3,'ignore_small',0,'onepanel',true);
m_ori.show_f_score(ori_l);
Population	Gate	    TP	    FP	    FN	F-score
         1	   7	   216	    13	     8	  0.954
         2	   8	  2267	    30	    78	  0.977
         3	   6	   488	    46	    37	  0.922
Normalized Mutual Information = 0.893

Generate gating hierarchy for K-means defined populations (K=10)

In this section, we will use the same CyTOF data used in previous section. The cell labels are defined by k-means algorithm where k equal to 10. All of the 10 clusters will be considered as target populations, in other words,there're no "unlabeled" cells.

rng(9464)
km_l = kmeans(data,10);
new_km_l = km_l; % Since all populiations is known, no need to pre-cluster
% Perform the anlysis
m_km = C2G(data,km_l,km_l,'markernames',markers);
% Visualize the results
m_km.view_gates(data,markers,'n_lines',4,'ignore_small',300);
m_km.show_f_score(km_l);
Population	Gate	    TP	    FP	    FN	F-score
         1	  27	   746	   155	    42	  0.883
         2	  16	   798	   139	   115	  0.863
         3	  32	  1204	   176	    85	  0.902
         4	  24	  1853	   243	   277	  0.877
         5	  31	  1418	   161	    74	  0.923
         6	  18	  1163	   165	   208	  0.862
         7	  21	   714	    27	    38	  0.956
         8	  20	   138	    12	    18	  0.902
         9	  22	   609	    38	    46	  0.935
        10	  23	   192	    15	    26	  0.904
Normalized Mutual Information = 0.780

Apply C2G to large CyTOF dataset with ~140,000 cells

This is a larger CyTOF dataset with 140k cells and 21 protein markers (Spitzer, Matthew H., et al, Science, 2015 ). This data can be download for free from here. (The dataset is hosted on Cytobank and you need to register to download it) This dataset is clustered by k-means where k equal to 10. All of the 10 k-means defined clusters are treated as target populations. This part will take around 10 minutes on a desktop.

[ori_data, marker] = readfcs_v2('testdata/bigdata/TIN_BLD1_Untreated_Day3.fcs');
% Transform the CyTOF
data = flow_arcsinh(ori_data,5);
% Select protein markers
marker_idx = [10,16,17,18,19,21,22,24,29,31,32,33,39,40,41,46,47,49,50,51,52];
d = data(marker_idx,:)';
rng(9464);
label = kmeans(d,10);
% Perform the anlysis
tic;m = C2G(d, label, label,'markernames',marker(marker_idx));toc;
% Visualize the results
w = warning ('off','all');
m.view_gates(d,marker(marker_idx),'ignore_small',3000,'n_lines',4);
warning(w);
m.show_f_score(label);
Population	Gate	    TP	    FP	    FN	F-score
         1	  29	 15390	   534	  1172	  0.947
         2	  47	  6683	   255	   544	  0.944
         3	  26	 10863	 14070	  1338	  0.585
         4	  44	 19313	    85	   822	  0.977
         5	  53	  9833	   301	  1164	  0.931
         6	  26	 13421	 11512	   784	  0.686
         7	   4	 13324	    35	   442	  0.982
         8	  54	 29331	   833	  1526	  0.961
         9	  28	  6993	   460	   398	  0.942
        10	  56	  7804	   218	   726	  0.943
Normalized Mutual Information = 0.859

Reference

Krishnaswamy, Smita, Matthew H. Spitzer, Michael Mingueneau, Sean C. Bendall, Oren Litvin, Erica Stone, Dana Pe’er, and Garry P. Nolan. "Conditional density-based analysis of T cell signaling in single-cell data." Science 346, no. 6213 (2014): 1250689.

Spitzer, Matthew H., Pier Federico Gherardini, Gabriela K. Fragiadakis, Nupur Bhattacharya, Robert T. Yuan, Andrew N. Hotson, Rachel Finck et al. "An interactive reference framework for modeling a dynamic immune system." Science 349, no. 6244 (2015): 1259425.