生医计算·Biopython+KEGG – 夏小俊课题组的主页

前言：本文将对Biopython和KEGG API做简要的介绍，并提供一些Biopython调用KEGG API的生信实战DEMO。

What is Biopython?

顾名思义，Biopython旨在通过创建高质量，可重用的模块和类，使其尽可能容易地将Python用于生物信息学。其内容包括各种生物信息学文件格式（BLAST，Clustalw，ASTA，Genbank 等）的解析器，对在线服务（NCBI，Expasy 等）的访问，常见和不常见程序的接口（Clustalw， DSSP，MSMS 等），标准序列类，各种聚类模块，KD树数据结构，以及文档。 Biopython包的具体参数和安装方法可参考其官网和文档。

Biopython支持大部分生信重要数据库的主要操作，其中自然包括KEGG。KEGG旨在帮助研究人员从分子水平的信息（尤其是通过基因组测序和其他高通量生成的大规模分子数据集）中了解生物系统（例如细胞，生物体和生态系统）的高级功能和实用性实验技术，以其收录的大量生物信号通路（Pathways）闻名。目前，Biopython已实现KEGG化合物、酶和图谱的解析器和编写器，使用通用解析器来处理其他格式，并提供了调用API的接口。

KEGG API简介

KEGG API是KEGG数据库资源的REST型的应用程序接口，基本操作由URL实现。

URL基本形式

http://rest.kegg.jp/<operation>/<argument>[/<argument2[/<argument3> ...]]
<operation> = info | list | find | get | conv | link | ddi

API的相关各数据库与其标识符可参考其官方文档。以下介绍一些主要操作：

查询：find

http://rest.kegg.jp/find/<database>/<query>
<database> = pathway | brite | module | ko | genome | genes | <org> | vg | ag |
             ligand | compound | glycan | reaction | rclass | enzyme | network |
             variant | disease | drug | dgroup | environ | <medicus>

http://rest.kegg.jp/find/<database>/<query>/<option>
<database> = compound | drug
<option> = formula | exact_mass | mol_weight | nop

样例：http://rest.kegg.jp/find/compound/N8-Acetylspermidine

关联：link

http://rest.kegg.jp/link/<target_db>/<source_db>
<target_db> = <database>
<source_db> = <database>
<database> = pathway | brite | module | ko | genome | <org> | vg | ag | compound |
             glycan | reaction | rclass | enzyme | network | variant | disease |
             drug | dgroup | environ | atc | jtc | ndc | yj | pubmed

http://rest.kegg.jp/link/<target_db>/<dbentries>
<dbentries> = KEGG database entries of the following <database>
<database> = pathway | brite | module | ko | genome | <org> | vg | ag | compound |
             glycan | reaction | rclass | enzyme | network | variant | disease |
             drug | dgroup | environ | genes | atc | jtc | ndc | yj | pubmed

样例：http://rest.kegg.jp/link/ec/C01029

详细信息：get

http://rest.kegg.jp/get/<dbentries>[/<option>]
<dbentries> = KEGG database entries of the following <database>
<database> = pathway | brite | module | ko | genome | <org> | vg | ag | compound |
             glycan | reaction | rclass | enzyme | network | variant | disease |
             drug | dgroup | environ | disease_ja | drug_ja | dgroup_ja | environ_ja |
             compound_ja
<option> = aaseq | ntseq | mol | kcf | image | conf | kgml | json

样例： http://rest.kegg.jp/get/ec:1.5.3.15

Biopython调用KEGG API

导入KEGG API包

from Bio.KEGG import REST

find

REST.kegg_find("compound", cpd_name)

link

REST.kegg_link("ec",cpd_id)

get

REST.kegg_get(ec_id)

实战DEMO：从222种代谢物质中筛选symbol明确、与小鼠基因表达相关的化合物，并输出其正负效应。

from Bio.KEGG import REST
fo = open("CPD_ID.txt","w")
cpd=["Phosphorylcholine",...]
print(len(cpd))
for i in range(222):
    tmp=REST.kegg_find("compound", cpd[i]).read()
    list=tmp.split("\n")
    if list[0]!='':
        fo.write(list[0]+"\n")
    else:
        fo.write("\n")       
fo.close()

from Bio.KEGG import REST
fo = open("KEGG_ID.txt","w")
cpd=["C00588",...]
for i in range(47):
    ec_list=REST.kegg_link("ec",cpd[i]).read().split("\n")
    fo=open(cpd[i]+'.txt','w')
    if ec_list[0]!='':
        for j in range(len(ec_list)-1):
            try:
                ec=ec_list[j].split("\t")[1]
                tmp=REST.kegg_get(ec).read()
                r=tmp.index(cpd[i])
                if tmp.find('PRODUCT',0,r)!=-1:flag=1
                else:flag=-1
                l=tmp.index('MMU:')
                r=l;
                while tmp[r]!='\n':r=r+1
                gene=tmp[l:r]
                fo.write(gene+'\t'+str(flag)+'\n')
            except:continue
    fo.close()

输出结果样例：C00588.txt

MMU: 237928(Phospho1)	-1
MMU: 20597(Smpd1) 20598(Smpd2) 238011(Enpp7) 58994(Smpd3) 77626(Smpd4)	1
MMU: 66358(Adprm)	1
MMU: 12651(Chkb) 12660(Chka)	1
MMU: 13026(Pcyt1a) 236899(Pcyt1b)	-1