X-Message-Number: 18712
From: "Peter Christiansen" <>
Date: Wed, 06 Mar 2002 17:04:56 -0600


--------------------------------------------------------------------------------

ExPASy Home page Site Map Search ExPASy Contact us SWISS-PROT

Hosted by NCSC US Mirror sites: Canada China Korea Switzerland Taiwan



--------------------------------------------------------------------------------



The human proteomics initiative


Version of December 3, 2001



In the year 2000, the Swiss Institute of Bioinformatics (SIB) and the 
European Bioinformatics Institute (EBI) announced a major effort to 
annotate, describe and distribute to the life science community a large 
amount of highly curated information concerning human protein sequences. 
This initiative, hereafter known as the Human Proteomics Initiative (HPI), 
is tightly linked to an appeal to the user community to participate actively 
in this effort at various levels.



In 1857, a group of English lexicographers and philologists met and decided 
to undertake a major effort of collecting information concerning the meaning 
and usage of all words in the English language. This major collective effort 
spanned a number of decades and resulted in one of the most impressive 
monuments of knowledge on any given language: the Oxford English Dictionary
(OED). To create and maintain the dictionary, they built up a local team of
highly qualified linguistic experts and complemented this classical approach
with what was at that time an innovative concept. They made an appeal to
English speakers around the world to send them citations illustrating the 
use of particular words and how they evolved over time. Today we could use
their original appeal as well as the description of their goal almost
verbatim, only replacing "the English language" by "the human proteome"!



A few months ago, the combined efforts of a number of sequencing centers and
companies produced a first draft of the human genome sequence. This
endeavor is only a very preliminary step in the understanding of human
biological processes. The first pitfall to overcome is the detection of all
coding regions on the genomic sequence. Current algorithms, while being very 
powerful, are not capable of detecting with certainty all exons, are not 
well equipped to distinguish different splice variants and are unable to 
detect small proteins (which are numerous and crucial to many biological 
processes).



Even when all potential coding regions have been predicted, the user
community will have at its disposal the sequences of 20'000 to 35'000
"naked" proteins (the precise number of human genes is hotly debated!).
We call these proteins "naked" because genomic information does not allow
the efficient prediction of all the post-translational modifications (PTMs)
of which the majority of proteins are the target. Proteins, once
synthesized on the ribosomes, are subject to a multitude of modification
steps. They are cleaved (thus eliminating signal sequences, transit peptides
or propeptides, and initiator methionines); many simple chemical groups can
be attached to them (for example acetyl, methyl and phosphoryl groups), as
well as some more complex molecules, such as sugars and lipids. Finally,
they can be internally or externally cross-linked (for example by disulfide
bonds). More than a hundred different types of PTM are currently
known and many more are yet to be discovered. The complexity due to all 
these modifications is compounded by the high level of diversity that 
alternative splicing can produce at the level of sequence. Thus the number 
of different protein molecules expressed by the human genome is probably 
closer to a million than to the hundred thousand generally considered by 
genome scientists.



Another factor of complexity to take into account is the amount of 
polymorphism at the protein sequence level. While some of these 
polymorphisms are linked to disease states, most are not, yet have in many 
cases a direct or indirect effect on the activities of the proteins.



We therefore initiated a major project to annotate all known human sequences 
according to the quality standards of SWISS-PROT. This means providing, for 
each known protein, a wealth of information that includes the description of 
its function, domain structure, subcellular location, post-translational 
modifications, variants, similarities to other proteins, etc.



There are currently 7'600 annotated human sequences in SWISS-PROT. These
entries are associated with about 20'000 literature references, 19'500
experimental or predicted PTMs, 1'800 splice variants and 12'500
polymorphisms (most of which are linked with disease states).



The HPI project contains a number of sub-components, which are briefly 
described below:



Annotation of all known human proteins. We plan to fully annotate all human
protein sequences that are not yet in SWISS-PROT. These sequences are either
in the TrEMBL computer-annotated supplement or do not appear in any sequence
database, because the coding sequence has not been annotated as such in the
DNA databases or because the sequence has not been submitted. We will also
review and complete the annotation of the human sequences currently in
SWISS-PROT.

Annotation of mammalian orthologs of human proteins. We are making sure that
for any human protein, orthologs in other mammalian species (mainly mouse
or rat) are also annotated at a level equivalent to that of the cognate
human sequences.

Annotation of all known human polymorphisms at the protein sequence level.
These are now commonly termed "c-SNPs" (coding single nucleotide
polymorphisms) or "SAPs" (single amino-acid polymorphisms). As mentioned
above, SWISS-PROT already holds information on a sizeable number of such
polymorphisms, and it has significantly expanded its effort to store and
annotate all "small" variations at the protein level. Mutations that cause
major changes to a protein sequence (as is the case for most frameshift
mutations) are not and will not be considered relevant to SWISS-PROT, as
their deleterious effect on a given protein's function is usually obvious!

Annotation of all known post-translational modifications in human proteins.
A major effort has been made to supplement the already quite comprehensive
description of known post-translational modifications in human proteins
currently provided in SWISS-PROT.

Tight links to structural information. SWISS-PROT is tightly linked to the
PDB/RCSB 3D-structure database and already includes many features useful to
structural biologists (such as literature references concerning X-ray and
NMR papers, links to the HSSP database, DSSP-derived secondary structure
information, etc.). As fewer than 5% of all human proteins have been
characterized at the level of their 3D-structure, it is important to expand
the scope of experimentally-derived structural information by providing
homology-derived models for all human proteins for which such an approach is
scientifically relevant.


For all aspects of the HPI project, we would appreciate the help and
collaboration of the scientific community. Information concerning the human
proteome is critical to a large section of the life science community. We
therefore appeal to the user community to participate fully in this
initiative by providing all the information needed to help and to speed up
the comprehensive annotation of the human proteome.



The HPI project is a long-term challenge: it will take years to annotate and
periodically re-annotate all human proteins in such a way as to obtain a
full and useful compendium describing the function and, more specifically,
the role of these crucial actors, which are involved in most, if not all,
biological processes.



It should also be noted that the goals of the HPI project will not be 
achieved by the SWISS-PROT groups at SIB and EBI without the financial means 
now being provided by the yearly license fees paid by industrial companies 
for access to SWISS-PROT and related databases.



In ancient times, the Chinese are said to have used the sentence "May you 
live in interesting times!" as a form of curse. There is no doubt that the 
life science community is living in interesting times; but we need to make 
sure that this is not a curse but a benediction.





For more information on the HPI project you can consult the following Web 
pages:



http://www.expasy.org/sprot/hpi/

http://www.ebi.ac.uk/swissprot/hpi/



You can also download various non-redundant sets of human protein sequence 
entries from SWISS-PROT and TrEMBL from the following Web page:



http://www.ebi.ac.uk/proteome/HUMAN/



Human protein sequences from SWISS-PROT are integrated in the International 
Protein Index (IPI) available at:



http://www.ebi.ac.uk/IPI/IPIhelp.html



A short description of HPI has been published in:



O'Donovan C., Apweiler R., Bairoch A.
The human proteomics initiative (HPI).
Trends Biotechnol. 19:178-181(2001).



If you would like to participate in the HPI project, please send us email 
at:









