Universities offer ‘Google 101’
Every minute of the day, vast numbers of computers across the planet process huge gobs of information to pinpoint credit card fraud, locate possible terrorists or sort through immense piles of scientific data.
All those computing efforts are now part of what’s called cloud or cluster computing. In short, many tasks in science, business and the computer world rely on thousands of computers working in parallel, each handling a small part of the vast computing task at hand.
A large share of that cloud or parallel computing takes place inside huge data centers dotting the globe; inside those centers, thousands and thousands of processors churn away, digesting new information and shipping back results.
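To make the idea concrete, here is a toy sketch in Python of the split-apart-and-combine pattern behind this kind of parallel programming, a pattern popularized by Google’s MapReduce. It only simulates a cluster with worker processes on one machine, and the word-count task and process count are invented for the example, not drawn from any real data center.

```python
# A toy sketch of the cluster-computing idea: split a big job into
# small pieces, let many workers handle the pieces in parallel, then
# combine their partial results. Real clusters spread these steps
# across thousands of machines; the structure is the same.
from collections import Counter
from multiprocessing import Pool

def count_words(chunk):
    """The 'map' step: each worker counts words in its own slice of data."""
    return Counter(chunk.split())

def merge_counts(partials):
    """The 'reduce' step: combine the workers' partial results."""
    total = Counter()
    for part in partials:
        total.update(part)
    return total

if __name__ == "__main__":
    documents = [
        "the quick brown fox",
        "the lazy dog",
        "the quick dog barks",
    ]
    # Each document is handed to a separate worker process.
    with Pool(processes=3) as pool:
        partial_counts = pool.map(count_words, documents)
    print(merge_counts(partial_counts))
    # Counter({'the': 3, 'quick': 2, 'dog': 2, ...})
```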
As that change is taking place, two technology firms, IBM Corp. and Google Inc., say they’ll spend millions of dollars to teach advanced cluster-computing programming to the next generation of technology managers. That effort will start at the University of Washington and five other U.S. universities.
The trial run of what some have called “Google 101” began last year, when professors at the University of Washington offered two pilot classes in how to develop programs for clustered computers.
Ed Lazowska, a professor in UW’s computer science department, said the new courses reflect a recognition that schools must move from a curriculum focused on programming for isolated computers to one that teaches students to work across thousands of computers in a cluster.
“More and more fields are becoming data-rich, and extracting knowledge from massive amounts of data is exactly what this is about,” Lazowska said.
In addition to UW, Carnegie Mellon University, Stanford University, the University of Maryland, the University of California at Berkeley and the Massachusetts Institute of Technology will offer the classes.
It’s no surprise Google is pushing for changes in computer-science education. Its core business of serving search results relies on millions of computers spread across the world. To deliver relevant results quickly, industry experts say, Google adds roughly 500,000 processors to its grid each year.
The genesis of UW’s test courses came in 2006, when Christophe Bisciglia, a senior software engineer at Google who received a computer science degree from UW in 2003, was interviewing prospective Google employees in Seattle.
He discovered they were capable programmers when dealing with problems on a single computer. But when Bisciglia, a Gig Harbor native, asked what they would do if confronted with 1,000 times as much data, they were stumped.
Beyond search, other real-life uses of cluster computing include credit card processing, e-mail spam tracking and large-scale science research, said Lazowska. Those practices, plus the growing field of projecting consumer behavior, all rely on data mining: the computerized sifting of large databases to predict what will happen.
“The large number of computers that handle those millions and millions of MasterCard, Visa or American Express transactions are also mining those events to try to detect fraud. That ability is of enormous value to the companies and to everyday consumers,” he said. They do that by identifying purchase patterns that vary from the cardholder’s usual practice.
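For illustration, here is a minimal sketch of that pattern-deviation idea in Python. Real fraud systems use far richer models across many machines; the transaction history, threshold and statistical test here are invented for the example.

```python
# An illustrative sketch of fraud screening by deviation from a
# cardholder's usual pattern: flag a charge that strays far from
# the historical average. The 3-standard-deviation threshold is an
# assumption for the example, not any card network's actual rule.
from statistics import mean, stdev

def is_suspicious(history, new_amount, threshold=3.0):
    """Flag a charge more than `threshold` standard deviations
    away from the cardholder's historical average."""
    if len(history) < 2:
        return False  # not enough history to judge
    avg, spread = mean(history), stdev(history)
    if spread == 0:
        return new_amount != avg
    return abs(new_amount - avg) / spread > threshold

past_charges = [12.50, 8.99, 23.10, 15.75, 9.40]
print(is_suspicious(past_charges, 14.00))   # False: routine purchase
print(is_suspicious(past_charges, 950.00))  # True: far outside the pattern
```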
Tech companies are also turning to clustered computers to sort through the millions of daily e-mail messages flowing into Yahoo, Hotmail or Google mail. Clustered, coordinated data centers are programmed to scour those huge volumes of mail to identify spam and keep it out of users’ inboxes, Lazowska said. The key technique is to use past spamming patterns to spot new spam and distinguish it from normal e-mail.
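A toy version of that technique appears below: score a new message by comparing its words against frequencies seen in past spam and past legitimate mail, a simple naive Bayes approach. Production filters at the big mail providers are vastly more elaborate; the training messages and smoothing constant here are assumptions for the example.

```python
# A toy spam scorer built on past patterns: words common in previous
# spam push the score up, words common in legitimate mail push it down.
import math
from collections import Counter

spam_training = ["win free money now", "free prize click now"]
ham_training = ["meeting notes attached", "lunch at noon tomorrow"]

def word_counts(messages):
    return Counter(w for m in messages for w in m.split())

spam_words = word_counts(spam_training)
ham_words = word_counts(ham_training)

def spam_score(message, smoothing=1.0):
    """Log-odds that a message is spam, given word frequencies
    observed in past spam vs. past legitimate mail."""
    score = 0.0
    for word in message.split():
        p_spam = (spam_words[word] + smoothing) / (sum(spam_words.values()) + 2 * smoothing)
        p_ham = (ham_words[word] + smoothing) / (sum(ham_words.values()) + 2 * smoothing)
        score += math.log(p_spam / p_ham)
    return score  # positive: looks like spam; negative: looks legitimate

print(spam_score("free money now"))       # positive: matches spam patterns
print(spam_score("lunch meeting notes"))  # negative: matches normal mail
```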
Advanced cluster programming will also be vital as scientists tackle massive research projects, said Lazowska.
One example is the Neptune Project, a new National Science Foundation-funded effort to dive into and understand the essential chemistry, biology and physics of the oceans. The project’s initial test run is scheduled over the next several years, said Lazowska.
It will use thousands of undersea sensors linked by fiber-optic cables, tentatively planned for deployment off the Washington coast near what geologists call the Juan de Fuca Plate.
All that data will be gobbled up and processed in a vast network of clustered computers. The goal is to give scientists full and real-time access to all that information for their research.
“Now we know almost nothing about the biological and chemical processes of the ocean,” he said. “This use of data-rich collections of information will revise oceanography.”