Spatial Transcriptomics Analysis 02: Data Structures

Scanpy's AnnData Object

![anndata](index.zh-cn.assets/webp (1).jpg)

anndata

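Scanpy, a Python library widely used for transcriptome analysis, creates a dedicated analysis object when it reads transcriptome files: the AnnData object (comparable to the Seurat object in R). All downstream analysis operates on this object. AnnData is short for "Annotated Data", and it brings together the various kinds of information used in transcriptome analysis. Its main slots are: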

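adata.X: the expression matrix, usually stored as a sparse matrix (it can also be a dense NumPy array). [scipy sparse matrix]

adata.var: the gene names together with gene annotations, i.e. the gene metadata. [pandas DataFrame]

adata.obs: the names of the cells / sequencing units together with their metadata. [pandas DataFrame]

adata.var_names: the gene names. [pandas Index]

adata.obs_names: the spot names, i.e. the barcode of each spot. [pandas Index]

adata.obsm: multi-dimensional annotations of the observations, for example the 2D image coordinates of each spot, the 50-dimensional coordinates in principal component space, the UMAP coordinates, and so on.

adata.obsp: pairwise information between the i-th spot and the j-th spot, stored as matrices, such as connectivities and distances.

adata.varm: multi-dimensional annotations of the features, analogous to obsm.

adata.varp: pairwise annotations of the features, analogous to obsp.

adata.uns: unstructured data stored as a dictionary, such as the color palettes used for plotting.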

adata

'''
AnnData object with n_obs × n_vars = 2939 × 2000
    obs: 'in_tissue', 'array_row', 'array_col', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'n_counts', 'clusters', 'exhaustion_score'
    var: 'gene_ids', 'feature_types', 'genome', 'mt', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'spatial', 'log1p', 'hvg', 'pca', 'neighbors', 'umap', 'clusters', 'clusters_colors', 'rank_genes_groups'
    obsm: 'spatial', 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'
'''

Pandas DataFrame

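The obs and var slots of Scanpy's AnnData object both store their data as pandas DataFrame objects, so downstream analysis may involve manipulating DataFrames. This section describes the structure of that object and some basic DataFrame operations.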

pandas dataframe

Reading a DataFrame

import scanpy as sc

# Open a saved *.h5ad file
adata = sc.read_h5ad('prostate.data.anndata.h5ad')

# Extract one attribute of adata, e.g. obs, which is a DataFrame
df = adata.obs

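This DataFrame records the metadata of every sample point (spot) and is indexed by the corresponding barcode.

A DataFrame can also be read directly from a CSV file: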

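import pandas as pd

# index_col reads the given column in as the row labels; otherwise rows are labeled 0, 1, 2, ... by default. sep='\t' means the file is tab-separated.
df = pd.read_csv('XXX.csv', index_col=0, sep='\t')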
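
The effect of `index_col` is easiest to see side by side. The sketch below reads the same small tab-separated table twice from an in-memory string instead of a file; the barcodes and values are made up.

```python
import io
import pandas as pd

# A small tab-separated table held in memory (values are invented).
text = "barcode\tin_tissue\tclusters\nAAA-1\t1\t3\nBBB-1\t1\t0\n"

with_index = pd.read_csv(io.StringIO(text), index_col=0, sep="\t")
default_index = pd.read_csv(io.StringIO(text), sep="\t")

print(with_index.index.tolist())     # the first column becomes the row labels
print(default_index.index.tolist())  # default integer row labels
print(with_index.shape, default_index.shape)
```

With `index_col=0` the barcode column no longer counts as a data column, so the first frame has one column fewer than the second.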

DataFrame Operations

Viewing a DataFrame

# Show the first five rows of the DataFrame
df.head(5)

'''
                    in_tissue  array_row  ...  n_counts  clusters
AAACAATCTACTAGCA-1          1          3  ...   12054.0        12
AAACACCAATAACTGC-1          1         59  ...   18697.0         4
AAACAGAGCGACTCCT-1          1         14  ...    9192.0         0
AAACAGCTTTCAGAAG-1          1         43  ...   18037.0         7
AAACAGGGTCTATATT-1          1         47  ...   22535.0        10
'''

# Show summary information about the DataFrame (index, columns, dtypes, memory usage)
df.info()

'''
<class 'pandas.core.frame.DataFrame'>
Index: 2939 entries, AAACAATCTACTAGCA-1 to TTGTTTGTGTAAATTC-1
Data columns (total 16 columns):
 #   Column                       Non-Null Count  Dtype   
---  ------                       --------------  -----   
 0   in_tissue                    2939 non-null   int64   
 1   array_row                    2939 non-null   int64   
 2   array_col                    2939 non-null   int64   
 3   n_genes_by_counts            2939 non-null   int32   
 4   log1p_n_genes_by_counts      2939 non-null   float64 
 5   total_counts                 2939 non-null   float32 
 6   log1p_total_counts           2939 non-null   float32 
 7   pct_counts_in_top_50_genes   2939 non-null   float64 
 8   pct_counts_in_top_100_genes  2939 non-null   float64 
 9   pct_counts_in_top_200_genes  2939 non-null   float64 
 10  pct_counts_in_top_500_genes  2939 non-null   float64 
 11  total_counts_mt              2939 non-null   float32 
 12  log1p_total_counts_mt        2939 non-null   float32 
 13  pct_counts_mt                2939 non-null   float32 
 14  n_counts                     2939 non-null   float32 
 15  clusters                     2939 non-null   category
dtypes: category(1), float32(6), float64(5), int32(1), int64(3)
memory usage: 355.1+ KB
'''

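# Compute descriptive statistics for each numeric column (non-numeric columns such as 'clusters' are skipped by default)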
df.describe()

'''
       in_tissue    array_row  ...  pct_counts_mt      n_counts
count     2939.0  2939.000000  ...         2939.0   2939.000000
mean         1.0    29.536577  ...            0.0  17354.753906
std          0.0    16.150319  ...            0.0   6646.829590
min          1.0     0.000000  ...            0.0   5027.000000
25%          1.0    17.000000  ...            0.0  12212.000000
50%          1.0    30.000000  ...            0.0  16512.000000
75%          1.0    42.000000  ...            0.0  21702.000000
max          1.0    66.000000  ...            0.0  34988.000000

[8 rows x 15 columns]
'''

# The shape attribute gives the number of rows and columns of df
df.shape

'''
(2939, 16)
'''

# Show the row labels and the column labels
df.index

'''
Index(['AAACAATCTACTAGCA-1', 'AAACACCAATAACTGC-1', 'AAACAGAGCGACTCCT-1',
       'AAACAGCTTTCAGAAG-1', 'AAACAGGGTCTATATT-1', 'AAACCCGAACGAAATC-1',
       'AAACCGGGTAGGTACC-1', 'AAACCGTTCGTCCAGG-1', 'AAACCTCATGAAGTTG-1',
       'AAACGAAGAACATACC-1',
       ...
       'TTGTGGTAGGAGGGAT-1', 'TTGTGTATGCCACCAA-1', 'TTGTGTTTCCCGAAAG-1',
       'TTGTTAGCAAATTCGA-1', 'TTGTTCAGTGTGCTAC-1', 'TTGTTCTAGATACGCT-1',
       'TTGTTGTGTGTCAAGA-1', 'TTGTTTCATTAGTCTA-1', 'TTGTTTCCATACAACT-1',
       'TTGTTTGTGTAAATTC-1'],
      dtype='object', length=2939)
'''

df.columns

'''
Index(['in_tissue', 'array_row', 'array_col', 'n_genes_by_counts',
       'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts',
       'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes',
       'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes',
       'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'n_counts',
       'clusters'],
      dtype='object')
'''

Renaming Rows and Columns

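# The snippets below are independent of each other; each one assumes the original df.

# Use a column as the row index by passing its column name inside the brackets.
# inplace=True modifies the original DataFrame; with inplace=False (the default) the original is left unchanged and a new DataFrame is returned.
# Index values may contain duplicates.
df.set_index(['in_tissue'],inplace=True)

# Set the label of a single row to a custom value
rn = df.index
rn_list = rn.tolist()
rn_list[3] = 'Hello World!'
df.index = rn_list

# Alternatively, use df.rename() and pass a mapping from old label to new label
df.rename(index={'AAACAGAGCGACTCCT-1':'Hello World!'}, inplace=True)

# Rename a column. The other ways of renaming rows also work for columns.
df.rename(columns={'in_tissue':'In Tissue'}, inplace=True)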
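
The difference between `inplace=True` and the default behaviour can be checked on a small stand-in table. This is a sketch with invented barcodes and values, not the real obs table.

```python
import pandas as pd

toy = pd.DataFrame({"in_tissue": [1, 1], "clusters": [3, 0]},
                   index=["AAA-1", "BBB-1"])

# Without inplace=True, rename returns a new DataFrame and leaves toy untouched.
renamed = toy.rename(index={"AAA-1": "Hello World!"},
                     columns={"in_tissue": "In Tissue"})

print(toy.index.tolist(), toy.columns.tolist())
print(renamed.index.tolist(), renamed.columns.tolist())

# With inplace=True, toy itself changes and the call returns None.
result = toy.rename(columns={"clusters": "Clusters"}, inplace=True)
print(result, toy.columns.tolist())
```

When `df` comes from `adata.obs`, in-place changes may also show up in the AnnData object itself, so experimenting on `adata.obs.copy()` is the safer choice.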

Selecting Rows and Columns

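# Select a single column. Both forms work.
df['clusters']
df.clusters

# Select several columns
df[['clusters','in_tissue']]

# Select rows or columns with indexers: loc selects by label, iloc selects by integer position.
# The order is row label first, then column label.
df.loc['AAACGAGACGGTTGAT-1']
df.loc['AAACGAGACGGTTGAT-1',] # same as above
df.loc['AAACGAGACGGTTGAT-1',:] # same as above

df.loc[:,'in_tissue'] # select the 'in_tissue' column

# Select with iloc
df.iloc[0] # the first row
df.iloc[1:3,1:3] # rows and columns at positions 1 and 2 (the 2nd and 3rd); the end of a slice is excluded. The result keeps its row and column labels.

# Select rows by condition
# All rows where 'in_tissue' equals 1
df[df['in_tissue']==1]

# Filter with the str.startswith method. Very useful when filtering gene names.
df[df['XXX'].str.startswith('XXX')]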
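
The placeholder `'XXX'` filter above can be tried out on a toy gene table. The sketch below assumes a hypothetical `gene_name` column and an `MT-` prefix for the genes of interest; the names and numbers are invented.

```python
import pandas as pd

# Toy gene table loosely shaped like adata.var; names and numbers are invented.
genes = pd.DataFrame(
    {"gene_name": ["MT-CO1", "ACTB", "MT-ND1", "GAPDH"],
     "n_cells": [10, 25, 8, 30]},
    index=["g1", "g2", "g3", "g4"],
)

by_label = genes.loc["g2", "n_cells"]   # row label first, then column label
by_position = genes.iloc[1:3, 0:2]      # positions 1 and 2; the slice end is excluded
# A boolean mask keeps only the rows where the condition is True.
mito = genes[genes["gene_name"].str.startswith("MT-")]

print(by_label)
print(by_position.index.tolist())
print(mito["gene_name"].tolist())
```

If the gene names are stored in the index rather than in a column, the same idea works with `genes.index.str.startswith(...)` as the mask.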

Adding and Deleting Rows or Columns

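# Add a row (the list needs one value per column)
df.loc['new_row']=['1','2',...]

# Add a column (the list needs one value per row)
df['new_col'] = ['1','2',...]

# Delete a row: axis=0 deletes rows, axis=1 deletes columns.
# drop returns a new DataFrame; the original is unchanged unless inplace=True is passed.
df.drop('AAACGAGACGGTTGAT-1',axis = 0)

# Delete a column
df.drop('in_tissue',axis = 1)

# Pass a list to delete several rows/columns at once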
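
As a runnable counterpart to the elided snippets above, the following sketch adds a row and a column to a small invented table and then drops several labels at once by passing lists.

```python
import pandas as pd

toy = pd.DataFrame({"in_tissue": [1, 1, 1], "clusters": [3, 0, 7]},
                   index=["AAA-1", "BBB-1", "CCC-1"])

toy["new_col"] = ["a", "b", "c"]   # one value per existing row
toy.loc["new_row"] = [0, 5, "d"]   # one value per column

# drop returns a new DataFrame; a list removes several labels at once.
fewer_rows = toy.drop(["AAA-1", "BBB-1"], axis=0)
fewer_cols = toy.drop(["in_tissue", "new_col"], axis=1)

print(toy.shape)
print(fewer_rows.index.tolist())
print(fewer_cols.columns.tolist())
```

Because neither `drop` call uses `inplace=True`, `toy` still holds all four rows and three columns afterwards.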