Prepared by the Working Group on Biomedical Computing
Advisory Committee to the Director
National Institutes of Health
June 3, 1999
National Programs of Excellence in Biomedical Computing will answer the question of who will do computation tomorrow by educating students at all levels, with an emphasis on bachelor's and master's students to fill today's critical need for people with cross-disciplinary knowledge. Programs may be housed at a university or they may be freestanding and link to several universities; they will provide some new faculty positions and integrate and coordinate existing resources. They will offer short courses to biomedical researchers, and encourage visiting scientists.
Computation is becoming an enabling technology for biomedicine; some of the most exciting, challenging, and hardest problems posed to computing and computational scientists are emerging from the biomedical field.
Examples of the scope of the problems (and the cognate opportunities) abound:
Advanced medical-imaging systems give surgeons a better view of a patient's anatomy than they can see in front of them on the operating table. With computers that create three-dimensional models from real-time MRI scans, and programs that incorporate those models into a video of the operation in progress, surgeons can cut and suture more precisely, knowing both the extent of a tumor and its relationship to adjacent healthy tissue.
In other work, researchers are exploring the use of computer models to help surgeons decide whether to recommend surgery for stenosis, the narrowing of an artery. MRIs measure the flow of blood around a blockage, but they cannot measure the pressure on artery walls. Working together, surgeons, experimentalists, and theoreticians are building mathematical models of the pressure in the artery based on fluid dynamics.
Other researchers are exploring a computer-based virtual-reality interface with tactile feedback that would allow remote control of microsurgical tools. Although that work is still in its early stages, it might eventually allow surgeons to perform minimally invasive microscopic surgery, checking their progress and effectiveness with remote sensing devices, and thus reducing trauma and promoting healing.
A National Program devoted to the application of computing to surgery would concentrate the skills and knowledge of a range of experts on developing the hardware and software tools that are needed to bring computing into the operating room. It would also educate and train the physicians, bioengineers, programmers, and technicians who will develop and apply the new computer-based surgical techniques.
In the not-too-distant future, clinicians will be able to match reconstructed images of a tumor deeply hidden in the body with a genetic characterization of a tumor, correlating the tumor's growth and metastatic involvement (the microcosm of the disease) with the patient's clinical response (the macrocosm of the disease). Imaging technologies might automate tissue-pathology analysis, leading to greater diagnostic accuracy.
Such work requires basic science research to amass the baseline data that allows that kind of exciting application of computationally based clinical medicine. A National Center focused on clinical practice could coordinate that kind of research and its direct application to human health.
It is worth noting that fundamental discovery is the foundation for such advances in medicine, but because of the diversity of diseases as complex as cancer, the ultimate impact of a discovery on the treatment of human disease almost always requires studies in human populations, that is, clinical studies.
Weaknesses in computing support for clinical research — quality assurance, varying capabilities for electronic data capture, connectivity on the Internet, security and privacy of data, and high-speed communication between laboratories, to name a few — pose enormously expensive problems. This Working Group has not attempted to deal directly with those problems, but recommends that when NIH Institutes fund clinical research they be sensitive to the need for computing, connectivity, and high-speed links between laboratories studying the bases of disease.
Neurobiologists working on the brain's ability to process information are limited not by their ideas, but by the tools to create realistic models of brain function. Until recently, neurobiologists have been able to record only the activity of single cells; new technological advances allow them to record from hundreds or even thousands of cells at the same time. With that breakthrough, the focus has turned to creating the techniques that will allow monitoring and analysis of the large numbers of neurons involved in specific behaviors. The data and the computational power are available; neurobiologists need to address the bigger issue of manipulating their data. A neurobiology Program of Excellence could bring together expertise to apply the latest data-management tools to the study of how the brain controls motor movements or how it forms memories.
Geneticists are running analyses of large numbers of subjects against the enormous amounts of data now being released about the human genome, utilizing the data from hundreds of subjects and their family members to map disease genes within a region of 30-40 megabases of DNA — more than 100 megabytes of information on each person. Those analyses can take as long as six months on routine laboratory computers. To gain the advantage of a two-day turnaround on a supercomputer, geneticists must adapt their programs to the more powerful systems. Good research should not be hurried, but delaying progress because software is not available could delay the discovery of new findings, new treatments and new cures.
Much of the information that comes out of clinical trials is statistical in nature. While some statisticians have been involved in helping to interpret those results, with the vast amounts of data now being generated, the issues are becoming more interesting to statisticians as data problems. The statistical community is only now beginning to realize that it may have much to contribute. A National Program directed towards the display and understanding of high-dimensional data from clinical trials would involve statisticians, physicians, and computer scientists in the attempt to deal with specialized data of enormous complexity and size.
Such a National Program would not be strictly computational. From the statistician's perspective, some problems that are labeled computational are really problems of the analysis of complex data. That analysis requires computational support, to be sure, but the challenge is to create appropriate analytical tools, whether algorithmic or programmatic. That is certainly the case with genetic-array data on tumor cells, or pattern-recognition problems in some image reconstruction — the kinds of problems that engage clinicians as well.
RATIONAL DRUG DESIGN AT THE CELLULAR LEVEL
Biological chemists attempting to model entire cells are waiting for the data to catch up to the technology. When the human genome has been fully sequenced, with all the genes identified, biological chemists hope they can test their theories of drug activities on computer models of cells. While researchers know a great deal about drugs that simply inhibit enzymes, they are much less sure about drugs that have subtle effects on cellular function. Once they can model an entire cell, researchers might chart the effect of drugs on genes themselves. Microarrays and complex genomic databases might be used to help biological chemists identify drug side effects with minimal human or animal testing. Sophisticated, linked databases of chemical substances, gene and protein structures, and reaction pathways and physiological effects will be required to make that kind of drug design possible. It is part of the idea behind National Programs of Excellence to find ways to coordinate those disparate kinds of data.
Why do some cells die, and others grow uncontrolled? In cells, what is aging, and what is cancer? Cell biologists believe the answer lies in the way proteins assemble in the cell. There, function seems to follow form: The shape of proteins determines what they will do. The secret of protein assemblies seems to be in the ability of adjacent proteins to pass enough information to reach a corporate consensus on action. To correlate the arrangement of the proteins with their functions, researchers need high-resolution images of protein structures, and they need to compare structures across functions. That is not a trivial task. It takes hundreds of thousands, maybe millions, of cross-sections of cell structures captured by microscopy (electron, light, MRI microscopy) to create a clear picture of the structure. That work is impossible without computational tools to collect, process, and interpret the data to help understand how biological systems work. A National Program might give researchers the computational equivalent of heavy machinery that they need to plow into such data-massive science. By bringing together the machinery; the people who know how to collect, curate, and manipulate that data; and the scientists who are familiar with cell biology, the NIH could move researchers forward in understanding the life cycle of the cell, and the diseases that affect it.
Sequencing the genome, image reconstruction, the simulation of biological systems, and other emerging areas have all led to increased opportunity for understanding biology while illuminating the alarming gap between the need for computation in biology and the skills and resources available to meet that need. Much of what needs to be done in this new approach to biology will have to be done by people who currently are not drawn into biology, who have little standing in biology, or whose career opportunities are better in industry or in other scholarly disciplines. The NIH should act to increase the number of people who are trained in both biology and computation, and dignify that expertise within the biomedical research community.
At the same time, the NIH needs to ensure that computer power is available. While most biomedical researchers have the desktop systems they need, they do not have up-to-date local clusters, they do not have sufficient access to regional computing centers, and they do not have a viable plan for using national computing centers — particularly those that promise teraflop computers by the next century.
Biomedical computing is on a cusp. Its growth is inevitable, but the timetable is still unknown. A small push by the NIH could result in great changes in a short time. If the NIH does not act, change could take another five, ten, or twenty years.
From the Principal Investigators who understand how to use computers to solve biomedical problems to the people who keep the computers running, there is a shortfall of trained, educated, competent people. The NIH needs a program of workforce development for biomedical computing that encompasses every level, from the technician to the Ph.D. The National Programs of Excellence in Biomedical Computing would provide a structure for developing expertise among biomedical researchers in using computational tools.
Today the disciplines of computer science and biology are often too far apart to help one another. A computer-science student often stops studying other sciences after freshman biology or chemistry; a biology student, even one knowledgeable about computers, may never have had formal computer-science classes. Biomedical computing needs a better — and more attractive — meld of those disciplines. Today computer-science students have little incentive to learn about biomedicine. The barrier is not just the rigorous demands of computer science; it is also the relative rewards: The $50,000 to $80,000 a year that professional programmers earn makes the compensation associated with many research positions in biology laughable. The disparity is sharpened by the fact that staff positions on NIH research grants are guaranteed for no longer than the grant award.
In the future, many biomedical scientists will have to be well educated in both biology and computer science. One-sided education will not work. The Department of Biological Structure at the University of Washington offers one of the few programs in biomedical computing. The computer-science side incorporates programming, data structures, simple computer architecture, databases, computer networks, basic artificial intelligence, knowledge representation, and qualitative modeling. On the biology side, the program emphasizes basic medical science with courses such as anatomy, histology, cell biology, biochemistry or molecular structure. Other courses provide the quantitative basis for the broad spectrum of biology, from basic mathematics through calculus, differential equations, linear algebra, and statistics.
Such cross-discipline education should be supported by the NIH grant system. Awards should be competitive with those for computer-science and physics education. Establishing such programs will not alone create an academic infrastructure for biomedical computing; research grants are needed to make a fundamental difference in academe. Grants to faculty members are more likely to change the focus of a Ph.D. program than any change in the job market for graduates.
Strong action by the NIH is required because the existing biomedical research and teaching structures of the universities and research institutions of this country inadequately value interdisciplinary efforts generally, and computation in particular. Few grant programs and fewer academic departments foster the kind of interdisciplinary work required to address biomedical challenges fully, let alone educate students about them. National Programs of Excellence would specifically include formal and informal instruction from the undergraduate through post-graduate levels, and incorporate a range of opportunities for scholars and researchers to participate.
Biomedical computing needs software tools to take advantage of the hardware. Often that software is cobbled together by graduate students with little programming knowledge, for use by those whose expectations are bound by the immediate problem. The application may be used once, then abandoned when the problem is solved, the graduate student moves on, or the technology changes. The publication goes out, but the tools remain in the laboratory.
That system worked for years only because computing had not yet become an important tool for biologists. Now that biomedical research is more dependent on computers, the discipline cannot afford to waste the effort to produce one-off software that is used once and discarded. Software can be shared if it is correctly conceived, properly developed, and effectively promulgated. Such a process offers two benefits: Needed software will be made available, and time spent reinventing the same processes in one laboratory after another will be freed for basic research.
One important element in the system is the creation of software-development groups: software and computer engineers who can take laboratory-based software and "harden" it: standardizing it for more general use, testing it under various conditions, documenting it, supporting it, and upgrading it as technology changes. Currently the NIH generally does not support such efforts; grants from the NIH are typically available only to develop a working model, a prototype. Completing that software and distributing it is not possible under today's funding programs. It is a generally accepted rule in the software business that producing a working prototype is only 20% of the cost of making a commercial product. NIH funding mechanisms finance only that first 20%. Where software has shown itself to be valuable to a range of researchers in biomedical computing, the NIH needs to find ways to support its full development. That might be done through public-private agreements between research centers and industry, or through direct NIH funding.
The need for numerical computation continues to challenge the most advanced computers, so the design and application of new algorithms continue to be of major importance. Good algorithms make computers more effective. Algorithms are the mathematical expression of information in a specialized environment. They are the bridge between data and understanding.
Discovering algorithms that advance scientific knowledge requires a thorough grounding in computer science and mathematics, as well as a keen understanding of the particular problem domain. In biology, algorithm development is now done only by the most knowledgeable computational biologists, a small fraction of the Ph.D.s in the field. Yet algorithms encapsulate the hypotheses that drive science, and their development should be an integral part of biomedical-computing research. More expertise is clearly needed as biological data increase and more computational power becomes available. To put complicated biological applications on tomorrow's teraflop machines will require teams of people working for several years. Without new algorithms and software, the power of such computers will be wasted, and mid-level machines will flounder in a sea of data. Algorithm development, the process by which researchers harness computing power, is as necessary in biomedical computing as computer power. The NIH should put resources into algorithm research if it is to advance biomedical research.
However, those with a bent for mathematics and computer science and the tenacity to seek a Ph.D. now see little reward in biomedical computing. There are few academic positions in that field; research grants tend to support the biological and not the computational aspects of their work; and their salaries are based on standards in biology, not computer science. A Ph.D. in computer science or mathematics carries more prestige, offers more job options, and guarantees more money than a Ph.D. in biology. If the NIH does not act to make biomedical research more attractive to those who are knowledgeable in computational areas, as biology increasingly becomes an information science, there will not be enough people who can create algorithms for biomedical research.
Biomedical computing is entering an age where creative exploration of huge amounts of data will lay the foundation of hypotheses. Much work must still be done to collect data and to create the tools to analyze it. Bioinformatics, which provides the tools to extract and combine knowledge from isolated data, gives us new ways to think about the vast amounts of information now available. It is changing the way biologists do science. Analyzing biological, physical, and chemical data is not new (mathematical biology has done that for more than a century), but the advent of extensive databases and the tools to manipulate them gives researchers the ability to tease knowledge about living systems from complex biological data. In large part because of the tools of bioinformatics, biology is becoming a data-driven science.
Researchers use bioinformatics tools to create models that help them understand data in large problem spaces — from whole systems to whole organisms. That new understanding of the data helps them form hypotheses about biological systems. Scientists whose research once encompassed a single gene or a single protein are using bioinformatics to study integrated functions among tens of thousands of genes. In a now-classic example of the changes wrought by bioinformatics, a team of scientists discovered previously unknown sets of interrelationships when they did a standard fibroblast experiment on thousands of genes instead of the handful of genes that had been studied previously. They found a system far more complex than anyone had imagined. As biomedical researchers develop ways of dealing with large data sets, they can make leaps in understanding those more-complex systems.
The Human Genome Project will require tools that can handle information on three billion base pairs — DNA units. The HGP, when it is completed early in the next century, will give biology the equivalent of a periodic table of the elements for human systems. Tomorrow's researchers will be Keplers to the Tycho Brahes who are today sequencing the human genome. But with three billion base pairs and 100,000 genes in the human genome that could be involved in disease, biomedicine needs better techniques to store and identify genes and gene groups, and better methods to analyze them.
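The storage arithmetic behind the three-billion-base-pair figure is easy to sketch. A minimal illustration, assuming a naive one-byte-per-base text encoding versus a packed two-bit encoding (both encoding choices are assumptions for illustration, not from this report):

```python
# Back-of-the-envelope storage for the ~3 billion base pairs of the human genome.
BASE_PAIRS = 3_000_000_000

one_byte_per_base = BASE_PAIRS            # naive text encoding: 1 byte per base
two_bits_per_base = BASE_PAIRS * 2 // 8   # packed encoding: A, C, G, T in 2 bits

print(one_byte_per_base / 1e9)  # 3.0 (gigabytes)
print(two_bits_per_base / 1e9)  # 0.75 (gigabytes)
```

Even uncompressed, the raw sequence is modest by storage standards; the hard problems are the indexing, annotation, and analysis discussed above.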
The study of the techniques and methods that allow researchers to collect and process data toward the understanding of the life and death of organisms is the essence of bioinformatics. It incorporates database creation, curation, and access. Some of the specific problems bioinformatics researchers are facing include:
The database issue is in part a computational issue. To store and manipulate databases that have answers to biomedical questions hidden in thousands or hundreds of thousands of data points requires a level of sophisticated manipulation that grows more difficult as the volume of data grows. Moreover, the information needs to be presented in a format that humans can use: Reducing ten million data points to ten thousand still presents more information than a human mind can encompass. Writing the software that will turn those data points into models is a conceptual challenge.
Database issues are also systems issues. Biomedical researchers increasingly need databases of images and software as well as databases of numeric data. Those databases need to be housed on computers powerful enough to manipulate all the data quickly for many researchers at the same time.
Finally, there are research and policy issues. When are specialized databases appropriate, and how is that decided? How long should they be maintained, and by whom? What standards should apply? How should they be interconnected?
The Information Storage, Curation, Analysis, and Retrieval program this Working Group has proposed would give the NIH a way to support and advance databases and database development directly, either through grants or by establishing National Programs of Excellence focused on the special problems of data and its use. It would allow the NIH to reward proposals for research aimed at gathering and testing data, not just for research intended to test hypotheses.
To deal with increasing amounts of biomedical data, the research community needs access to scalable computing systems. The need for computation is growing in bioinformatics analysis as well as in molecular dynamics and bioengineering simulations. The need is growing exponentially as the data from imaging and sequencing balloon and the use of computational simulations snowballs. Computational facilities are vital as biologists tackle more and more complex problems.
Researchers who five years ago spent little time on computers report that they now spend 90% of their research time in front of their monitors. Much of that change is because of the development of important biomedical databases such as those at the National Center for Biotechnology Information. Investigators have come to depend on those databases in their work. A study late last year showed that usage is increasing at 10% to 15% a month. In 1991 there were 195 searches a day. By 1994 that had increased to 5,000 a day. Last year there were 600,000 a day. At that rate, the NCBI databases will be used more than 25 million times a day by 2002. During the same period, the amount of determined DNA sequence had increased from 71 to 217 to 2,008 million base pairs. Sequencing the human genome (three billion base pairs) is expected to be completed sometime shortly after the turn of the century.
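Those usage figures imply the projection by simple compounding. A sketch of that arithmetic, in which the roughly 40-month horizon and the constant 10% monthly rate (the low end of the observed range) are assumptions for illustration:

```python
def project_daily_usage(searches_per_day: float, monthly_growth: float, months: int) -> float:
    """Compound a daily-usage figure forward at a constant monthly growth rate."""
    return searches_per_day * (1 + monthly_growth) ** months

# ~600,000 searches/day in late 1998, compounded at 10%/month for ~40 months.
projected = project_daily_usage(600_000, 0.10, 40)
print(round(projected / 1e6, 1))  # ≈ 27 million searches/day by early 2002
```

Even at the conservative 10% rate, usage passes 25 million searches a day, consistent with the projection above.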
Those large databases require that researchers have available both the hardware and the software to manipulate them, either remotely or — when the application is unique — on their desktops. They also need to handle large datasets such as those used for imaging or simulations. A 3-D image that has a resolution of 1024 by 1024 by 1024 pixels contains at least a gigabyte of data. At least eight gigabytes of data are required for an image that is 2048 by 2048 by 2048, and clinical researchers and clinicians are demanding resolution beyond what the technology can offer today.
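The volume sizes quoted above follow from a one-line calculation. A sketch assuming one byte per voxel, which matches the report's "at least" lower bound (deeper voxel formats only increase the totals):

```python
def volume_size_bytes(n: int, bytes_per_voxel: int = 1) -> int:
    """Raw size of an n x n x n image volume at a given bytes-per-voxel depth."""
    return n ** 3 * bytes_per_voxel

GIB = 1024 ** 3  # bytes in one gibibyte

print(volume_size_bytes(1024) // GIB)  # 1 (gigabyte for a 1024^3 volume)
print(volume_size_bytes(2048) // GIB)  # 8 (gigabytes for a 2048^3 volume)
```

Doubling the linear resolution multiplies the data eightfold, which is why clinical demands for resolution outrun the available technology so quickly.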
Biologists report problems finding funds for infrastructure support to maintain the computational resources in their laboratories: network routers, file servers, printers, and other facilities that are shared among many grantees. A great need is for people with the expertise to manage those systems and tailor them for biomedical uses. Those problems are exacerbated by the rapidly growing demand for local computer clusters where researchers can quickly turn around computational problems.
Some researchers have had to find novel ways to get the computational resources they need. One team used a major corporation's computers at night and on weekends to do its protein-folding analyses. In all, they used three times the computational resources that had been awarded for all their research projects for a year. Because the computing resources were made available, they were able to try new computational experiments, with good results. Unfortunately, such public-private partnerships are hard to put together, and so most research teams make do with inadequate equipment and power.
The unrelenting pressure on computational technology is evident in the increase in the usage of the nation's high-performance computing centers. At the National Science Foundation's supercomputer centers, for instance, biomolecular computation consumed more resources than any other scientific discipline among the 50 largest projects in fiscal 1998. That year the supercomputing cycles delivered doubled, yet two-thirds of the requested cycles were turned down for lack of resources. According to the NSF, 12% of all investigators who use their supercomputer centers are in biology, and they account for 25% of all cycles — an increase of 54% from fiscal 1997 to fiscal 1998. The biologists who used the NSF supercomputers used large amounts of time, not just a few hours, suggesting that for less-intensive applications researchers were able to find mid-size facilities to meet their needs. The pool of researchers changed, too: An analysis of the projects shows a 40% turnover in users. Together those facts suggest that supercomputers are broadly needed — and used — across biological disciplines.
For most supercomputer users, the access to computing cycles is only one of the benefits provided by a supercomputer center. The strength of the National Science Foundation's supercomputer centers is as much in their support staff as in their hardware, and in the collegial interactions among supercomputer users. The opportunity to discuss problems and solutions is an important part of the centers' gestalt. Most biomedical-computing researchers who use supercomputers have no colleagues doing similar work at their own institutions; today it is only at the supercomputer centers that they find colleagues — many of them in fields like physics, chemistry, and mathematics — with whom they can discuss their approaches. (National Programs, as they are developed, will also offer opportunities for biomedical researchers to work alongside colleagues in computer-rich environments, building new communities around common interests.)
The current levels of computing bring a variety of computational-biology problems within reach. However, to study those systems systematically — to really explore phase space, to understand not only how they work but how they can be manipulated — requires computation at adequate resolution over sufficiently long periods of time, as well as large numbers of related computations. For the biomedical promise of computation to be realized, tera-scale computing must become routine.
As more powerful computing becomes routinely available, more researchers will use it because the increased computing power will open up opportunities that did not previously exist, and biomedical researchers will move to exploit those opportunities. For that reason, any attempt to predict future needs based on current usage will result in a substantial underestimate.
Because of the importance of this initiative across the NIH, and because of the basic emphasis on scientific research as a means to train scientists across disciplines and provide the tools for their work in the 21st Century, funding for the four parts of the Biomedical Information Science and Technology Initiative might be shared among the Institutes. National Programs of Excellence in Biomedical Computing, in particular, are good candidates for shared funding. Their basic educational purpose should encourage institutions to provide support for National Programs associated with their campuses.
To help the reviewers and staff who will be awarding grants under this initiative, this Working Group suggests the following review criteria for National Programs of Excellence in Biomedical Computing:
The National Programs of Excellence in Biomedical Computing and the teams they bring together are important because biomedical computing needs cross-disciplinary expertise. The result of those Programs will be individuals with broad knowledge that can be applied to biomedical issues — knowledge that incorporates the strengths of biology, computer science, and mathematics. In the short term, biomedicine will benefit from the team approach. In the long term, there will be individual biomedical researchers who can apply much of the expertise that biomedical computing needs. The Biomedical Information Science and Technology Initiative (BISTI), and particularly its National Programs of Excellence in Biomedical Computing, is a bootstrapping approach to that next level.
The Initiative will bring smaller changes as well. NIH study sections may come to expect that a fair proportion of biomedical research will need computational resources, and may even suggest that researchers include provision of those resources in their grant applications. In academe, there inevitably will be some restructuring of academic departments of biology and biomedicine, and tenure and promotion decisions at universities may depend as much on computational achievements as on traditional biomedical research. Both changes will improve biomedical research.

Biomedical computing offers promise of profound advances in understanding and improving human health. Its advent is assured: Biomedical researchers are increasingly using computers to collect, store, access, and explore new data about the human condition, and that ripple of change will soon be a tidal wave. However, although it is inevitable, the promulgation of this critical enabling technology could face delays of five to ten years without action by the NIH. These recommendations are intended to shape the socio-technical aspects of biomedical computing to realize more quickly the anticipated benefits.
David Botstein, Ph.D.
Professor and Chair
Department of Genetics
Stanford University Medical School
Stanford, CA 94305
Larry Smarr, Ph.D.
Director, National Center for Supercomputing Applications
University of Illinois
Champaign, IL 61820
Biomedical Instrumentation, Imaging
David A. Agard, Ph.D.
Department of Biochemistry and Biophysics
University of California at San Francisco
San Francisco, CA 94143
Molecular Modeling and Simulation
Michael Levitt, Ph.D.
Department of Structural Biology
Stanford University Medical Center
Stanford, CA 94305
David Harrington, Ph.D.
Department of Biostatistics
Dana-Farber Cancer Institute
Boston, MA 02115
David J. Lipman, M.D.
National Center for Biotechnology Information
National Library of Medicine
National Institutes of Health
Bethesda, MD 20894
Gwen Ann Jacobs, Ph.D.
Center for Computational Biology
Montana State University at Bozeman
Bozeman, MT 59715
Charles S. Peskin, Ph.D.
Department of Mathematics
Courant Institute of Mathematical Sciences
New York University
New York, NY 10012
Surgical Decision Support
Christopher R. Johnson, Ph.D.
Center for Scientific Computing and Imaging
University of Utah
Salt Lake City, UT 84112
George Rose, Ph.D.
Department of Biophysics and
Johns Hopkins University School of Medicine
Baltimore, MD 21205-2196
Arthur Levinson, Ph.D.
President and Chief Executive Officer
So. San Francisco, CA 94080-4990
Gerald M. Rubin, Ph.D.
Department of Molecular and Cell Biology
University of California at Berkeley
Berkeley, CA 94720-3200
Hamilton O. Smith, M.D.
The Institute for Genomic Research
Rockville, MD 20850
M. Anne Spence, Ph.D.
Department of Pediatrics
University of California Irvine Medical Center
Orange, CA 92868-3298
Information-Based Biology [Bioinformatics]
Shankar Subramaniam, Ph.D.
Departments of Biochemistry, Biophysics and
University of Illinois at Urbana-Champaign
Senior Research Scientist
National Center for Supercomputing Applications
Urbana, IL 61801
Liaison Members of the Working Group:
Robert R. Borchers, Ph.D.
Division of Advanced Computational Infrastructure and Research
National Science Foundation
Arlington, VA 22230
Mary E. Clutter, Ph.D.
Assistant Director for Biological Sciences
National Science Foundation
Arlington, VA 22230
Alan S. Graeff
Chief Information Officer, NIH
Bethesda, MD 20892
Michael L. Knotek, Ph.D.
Program Advisor for Science and Technology
Office of the Secretary of Energy
U.S. Department of Energy
Washington, DC 20585
Margaret L. Simmons, Ph.D.
Associate Director, Program Development
National Partnership for Advanced
San Diego Supercomputer Center
University of California at San Diego
La Jolla, CA 92093