X-Message-Number: 18712
From: "Peter Christiansen" <>
Date: Wed, 06 Mar 2002 17:04:56 -0600

--------------------------------------------------------------------------------
ExPASy Home page Site Map Search ExPASy Contact us SWISS-PROT
Hosted by NCSC US Mirror sites: Canada China Korea Switzerland Taiwan
--------------------------------------------------------------------------------

The human proteomics initiative
Version of December 3, 2001

In the year 2000, the Swiss Institute of Bioinformatics (SIB) and the European Bioinformatics Institute (EBI) announced a major effort to annotate, describe and distribute to the life science community a large amount of highly curated information concerning human protein sequences. This initiative, hereafter known as the Human Proteomics Initiative (HPI), is tightly linked to an appeal to the user community to participate actively in this effort at various levels.

In 1857, a group of English lexicographers and philologists met and decided to undertake a major effort of collecting information concerning the meaning and usage of all words in the English language. This major collective effort spanned a number of decades and resulted in one of the most impressive monuments of knowledge on any given language: the Oxford English Dictionary (OED). To create and maintain the dictionary they locally built up a team of highly qualified linguistic experts and complemented this classical approach with what was at that time an innovative concept. They made an appeal to English speakers around the world to send them citations illustrating the use of particular words and how they evolved over time. Today we could use their original appeal as well as the description of their goal almost verbatim, only replacing the English language by the human proteome!

A few months ago, the combined efforts of a number of sequencing centers and companies produced a first draft of the human genome sequence.
Such an endeavor was only a very preliminary step in the understanding of human biological processes. The first pitfall to overcome is the detection of all coding regions on the genomic sequence. Current algorithms, while very powerful, cannot detect all exons with certainty, are not well equipped to distinguish different splice variants and are unable to detect small proteins (which are numerous and crucial to many biological processes). Even when all potential coding regions have been predicted, the user community will have at its disposal the sequences of 20'000 to 35'000 "naked" proteins (the precise number of human genes is hotly debated!). We call these proteins naked because genomic information does not allow the efficient prediction of all the post-translational modifications (PTM) of which the majority of proteins are the target.

Proteins, once synthesized on the ribosomes, are subject to a multitude of modification steps. They are cleaved (thus eliminating signal sequences, transit peptides, propeptides and initiator methionines); many simple chemical groups can be attached to them (for example acetyl, methyl, phosphoryl, etc.) as well as some more complex molecules, such as sugars and lipids. Finally, they can be internally or externally cross-linked (for example by disulfide bonds). More than a hundred different types of PTM are currently known and many more are yet to be discovered. The complexity due to all these modifications is compounded by the high level of diversity that alternative splicing can produce at the level of sequence. Thus the number of different protein molecules expressed by the human genome is probably closer to a million than to the hundred thousand generally considered by genome scientists.

Another factor of complexity to take into account is the amount of polymorphism at the protein sequence level.
While some of these polymorphisms are linked to disease states, most are not, yet in many cases they have a direct or indirect effect on the activities of the proteins.

We therefore initiated a major project to annotate all known human sequences according to the quality standards of SWISS-PROT. This means providing, for each known protein, a wealth of information that includes the description of its function, domain structure, subcellular location, post-translational modifications, variants, similarities to other proteins, etc.

There are currently 7'600 annotated human sequences in SWISS-PROT. These entries are associated with about 20'000 literature references, 19'500 experimental or predicted PTMs, 1'800 splice variants and 12'500 polymorphisms (most of which are linked with disease states).

The HPI project contains a number of sub-components, which are briefly described below:

Annotation of all known human proteins. We plan to fully annotate all human protein sequences that are not yet in SWISS-PROT. These sequences are either in the TrEMBL computer-annotated supplement or do not appear in any sequence database, either because the coding sequence has not been annotated as such in the DNA databases or because the sequence has not been submitted. We will also review and complete the annotation of the human sequences currently in SWISS-PROT.

Annotation of mammalian orthologs of human proteins. We are making sure that for any human protein, orthologs in other mammalian species (mainly mouse or rat) are also annotated at a level equivalent to that of the cognate human sequence.

Annotation of all known human polymorphisms at the protein sequence level. These are now commonly termed c-SNPs (coding single nucleotide polymorphisms) or SAPs (single amino-acid polymorphisms). As mentioned above, SWISS-PROT already holds information on a sizeable number of such polymorphisms, and it has significantly expanded its effort to store and annotate all small variations at the protein level.
Mutations that cause major changes to a protein sequence (as is the case for most frameshift mutations) are not and will not be considered relevant to SWISS-PROT, as their deleterious effect on a given protein's function is usually obvious!

Annotation of all known post-translational modifications in human proteins. A major effort has been made to supplement the already quite comprehensive description of known post-translational modifications in human proteins currently provided in SWISS-PROT.

Tight links to structural information. SWISS-PROT is tightly linked to the PDB/RCSB 3D-structure database and already includes many features useful to structural biologists (such as literature references concerning X-ray and NMR papers, links to the HSSP database, DSSP-derived secondary structure information, etc.). As fewer than 5% of all human proteins have been characterized at the level of their 3D-structure, it is important to expand the scope of experimentally-derived structural information by providing homology-derived models for all human proteins for which such an approach is scientifically relevant.

For all aspects of the HPI project, we would appreciate the help and collaboration of the scientific community. Information concerning the human proteome is critical to a large section of the life science community. We therefore appeal to the user community to participate fully in this initiative by providing all the information necessary to help and to speed up the comprehensive annotation of the human proteome.

The HPI project is a long-term challenge: it will take years to annotate and periodically re-annotate all human proteins in such a way as to obtain a full and useful compendium describing the function, and more specifically the role, of these crucial actors, which are involved in most, if not all, biological processes.
It should also be noted that the goals of the HPI project cannot be achieved by the SWISS-PROT groups at SIB and EBI without the financial means now being provided by the yearly license fees paid by industrial companies for access to SWISS-PROT and related databases.

In ancient times, the Chinese are said to have used the sentence "May you live in interesting times!" as a form of curse. There is no doubt that the life science community is living in interesting times; but we need to make sure that this is not a curse but a benediction.

For more information on the HPI project you can consult the following Web pages:

http://www.expasy.org/sprot/hpi/
http://www.ebi.ac.uk/swissprot/hpi/

You can also download various non-redundant sets of human protein sequence entries from SWISS-PROT and TrEMBL from the following Web page:

http://www.ebi.ac.uk/proteome/HUMAN/

Human protein sequences from SWISS-PROT are integrated in the International Protein Index (IPI) available at:

http://www.ebi.ac.uk/IPI/IPIhelp.html

A short description of HPI has been published in:

O'Donovan C., Apweiler R., Bairoch A. The human proteomics initiative (HPI). Trends Biotechnol. 19:178-181(2001).

If you would like to participate in the HPI project, please send us email at: