Computational Biology in the 21st Century: Making Sense out of Massive Data

>>GOOD AFTERNOON, EVERYONE. IT’S MY PLEASURE TO WELCOME YOU TO A SPECIAL WEDNESDAY AFTERNOON LECTURE, BECAUSE THIS IS THE MARGARET PITTMAN LECTURE, A SPECIAL LECTURE GIVEN ONCE A YEAR TO HONOR MARGARET PITTMAN. SHE WAS THE FIRST WOMAN LABORATORY CHIEF AT THE NATIONAL INSTITUTES OF HEALTH, APPOINTED TO THAT POSITION IN 1957 AFTER A DISTINGUISHED CAREER HERE AT NIH AND IN OTHER PLACES LIKE ROCKEFELLER, WHERE SHE WAS VERY MUCH A PIONEER IN THE AREA OF INFECTIOUS DISEASE. IN FACT, DR. PITTMAN WAS THE FIRST TO ESTABLISH CAPSULAR TYPE B OF HAEMOPHILUS INFLUENZAE AS THE ONE OF SIX TYPES OF H. INFLUENZAE MOST RESPONSIBLE FOR CHILDHOOD MENINGITIS, SETTING THE STAGE FOR EFFORTS, SOME DONE HERE, WHICH HAVE NOW LED TO A REMARKABLE DECLINE IN THE INCIDENCE OF THAT TERRIBLE DISEASE BECAUSE OF THE AVAILABILITY OF A VACCINE. YOU CAN CONNECT ALL OF THAT STORY BACK TO MARGARET PITTMAN’S WORK IN THE 1930s. SHE ALSO WORKED ON SALMONELLA TYPE B, WORKED OUT ANOTHER PARTICULAR BACTERIUM, HAEMOPHILUS AEGYPTIUS, RESPONSIBLE FOR EPIDEMIC CONJUNCTIVITIS, AND MADE OTHER OBSERVATIONS IMPORTANT IN VACCINES. SO EVERY YEAR WE CHOOSE SOMEONE, AND THE CHOICE COMES FROM THE ADVICE OF THE SCIENTIFIC DIRECTORS AT NIH AND ON THE RECOMMENDATION OF THE NIH SCIENTIST ADVISORS, TO DELIVER THE PITTMAN LECTURE, SOMEONE WHO FOLLOWS IN THAT TRADITION OF BEING A REMARKABLE WOMAN SCIENTIST LEADER. TODAY WE’RE FORTUNATE THAT THAT ROLE IS BEING PLAYED BY PROFESSOR BONNIE BERGER, WHO IS PROFESSOR OF APPLIED MATHEMATICS AND COMPUTER SCIENCE AT M.I.T. AND AN ASSOCIATE MEMBER OF THE BROAD INSTITUTE. BONNIE GOT HER UNDERGRADUATE DEGREE AT BRANDEIS, GOT HER Ph.D. AND DID A POSTDOCTORAL FELLOWSHIP AT M.I.T., AND HAS BEEN THERE EVER SINCE, SINCE 1992 ON THE FACULTY, RATHER RAPIDLY ADVANCING FROM ASSISTANT TO ASSOCIATE TO FULL PROFESSOR, WHERE SHE IS NOW. ALONG THE WAY SHE’S BEEN HONORED BY AN NSF CAREER AWARD, BY THE BIOPHYSICAL SOCIETY’S DAYHOFF AWARD FOR RESEARCH, AND BY BEING CHOSEN AS A FELLOW OF THE ASSOCIATION FOR COMPUTING MACHINERY IN 2004. HER WORK IS VERY TIMELY FOR US HERE AT NIH, AS WE’RE ALL STRUGGLING WITH THE WONDERFUL PROBLEM OF HAVING TOO MUCH DATA. BIG DATA, AS IT’S FEATURED ON THE COVER OF NATURE MAGAZINE, BIG DATA AS WE TALK ABOUT IT AROUND THE TABLE AT DIRECTOR MEETINGS ON THURSDAY MORNING, BIG DATA AS I’M NOW BEING ASKED BY PEOPLE IN THE WHITE HOUSE, WHAT ARE YOU GOING TO DO ABOUT THIS? EVERYBODY RECOGNIZES THAT WE ARE IN A CIRCUMSTANCE OF NEEDING TO BE VERY THOUGHTFUL, AND CREATIVE, ABOUT HOW WE HANDLE THE VERY LARGE QUANTITIES OF BIOLOGICAL DATA THAT ARE POURING OUT OF MANY GENOMICS-BASED EFFORTS TO UNDERSTAND HOW BIOLOGY IN GENERAL, AND HOW DISEASE, WORKS. WE NEED INDIVIDUALS WHO ARE CREATIVE IN PUTTING TOGETHER ALGORITHMS TO ASSIST US IN MINING NUGGETS OUT OF THIS LARGE SEA OF INFORMATION. WE COULD NOT HAVE A BETTER PERSON TO DESCRIBE SOME OF THE APPROACHES THAT ARE CURRENTLY BEING TAKEN IN THAT REGARD, AND WHO IS A LEADER HERSELF IN THAT EFFORT, THAN TODAY’S SPEAKER. SO HER PRESENTATION TODAY IS CALLED COMPUTATIONAL BIOLOGY IN THE 21st CENTURY, MAKING SENSE OF DATA. JOIN ME IN WELCOMING PROFESSOR BONNIE BERGER. [APPLAUSE]
>>GOOD AFTERNOON. DR.
COLLINS ALMOST TOOK SOME OF MY INTRODUCTION, BUT THAT’S FINE. ANYWAY, THE MISSION OF OUR FIELD IS TO ANSWER BIOLOGICAL AND BIOMEDICAL QUESTIONS BY USING COMPUTATION IN SUPPORT OF OR IN PLACE OF LABORATORY PROCEDURES, WITH ONE GOAL BEING TO GET MORE ACCURATE ANSWERS AT A GREATLY REDUCED COST. WE ARE CURRENTLY GENERATING MASSIVE DATA SETS, SO MASSIVE THAT WITHOUT SMART ALGORITHMS WE WON’T BE ABLE TO ANALYZE THEM TO DISCOVER PATTERNS THAT MIGHT PROVIDE CLUES TO THE UNDERLYING BIOLOGICAL PROCESSES. THROUGHOUT MY TALK, THERE WILL BE A COMMON THEME OF TAKING A MACROSCOPIC VIEW OR PICTURE OF THE DATA, THROUGH WHICH WE CAN VIEW PROBLEMS LIKE MEDICAL GENOMICS AND BIOLOGICAL NETWORKS. BUT THERE IS A CHALLENGE HERE. AS DR. COLLINS SAID, THE SIZE OF THE DATABASES IS GROWING ASTRONOMICALLY. WE HAVE LOTS OF DATA. THE BAD NEWS IS THE PROBLEMS THREATEN TO BECOME COMPUTATIONALLY INTRACTABLE DUE TO THE SHEER ENORMITY OF THE DATABASES. THINGS WERE BAD ENOUGH WHEN I STARTED AROUND 1995 IN THIS AREA. BACK THEN THE DATABASE SIZE WAS HALF A MILLION SEQUENCES, THE PDB HAD ABOUT 3,800 PROTEIN STRUCTURES, AND WE USED 50,000 PROTEIN SEQUENCES FOR PAIRWISE RESIDUE CORRELATION. THINGS HAVE GOTTEN WORSE AT AN INCREDIBLE RATE. RECENTLY THERE’S BEEN AN EXPONENTIAL EXPLOSION IN THE AMOUNT OF SEQUENCING DATA. NOW, IT IS TRUE THAT COMPUTERS HAVE GOTTEN A LOT FASTER, AND ALSO MORE COST EFFECTIVE. AS YOU CAN SEE IN THE GREEN LOG SCALE PLOT HERE, THE AMOUNT OF PROCESSING YOU CAN DO PER DOLLAR OF COMPUTE HARDWARE HAS BEEN MORE OR LESS DOUBLING EVERY YEAR, WHICH IS KNOWN AS MOORE’S LAW. BACK IN THE 1990s, THIS WAS ENOUGH TO KEEP UP WITH THE PACE OF SEQUENCING DATA, WHICH IS SHOWN IN BLUE HERE. BUT LOOK WHAT HAPPENED AFTER THE ADVENT OF NEXT GEN SEQUENCING: THE SIZE OF DATABASES HAS BEEN GROWING BY A FACTOR OF TEN EVERY YEAR. NOW, IN THE PAST, WE WOULD
DEAL WITH SUCH PROBLEMS BY SAYING FUTURE COMPUTERS WILL BE FAST ENOUGH. BUT CLEARLY, THAT’S NOT THE CASE. SO THIS IS A BIG PROBLEM AND A CHALLENGE FOR THE FIELD. SO MUCH SO THAT THERE’S BEEN A RECENT NEW YORK TIMES ARTICLE, AND ALSO MANY OTHERS, IDENTIFYING THIS KIND OF PROBLEM. IN FACT THEY POINT OUT THAT BGI, THE LARGEST GENOME CENTER IN THE WORLD, WAS SEQUENCING SO MUCH IT OVERWHELMED THE INTERNET CONNECTION, AND THAT IT COSTS MORE TO ANALYZE A GENOME THAN TO SEQUENCE IT NOW. NOW, IT’S TEMPTING TO THINK THAT CLOUD COMPUTING WILL SOLVE THIS PROBLEM, AS THIS ARTICLE ITSELF SUGGESTS. BUT THAT’S SIMPLY NOT THE CASE. IT MAY SAVE SOME COST, BUT IT DOESN’T ADDRESS THE FUNDAMENTAL ISSUE. THAT IS, IT DOESN’T CHANGE THE PROBLEM THAT SEQUENCING DATA IS GROWING EXPONENTIALLY FASTER THAN COMPUTING POWER PER DOLLAR. SO THE ONLY THING THAT WILL ADDRESS THIS ISSUE IS FUNDAMENTALLY BETTER
ALGORITHMS TO MAKE A DIFFERENCE. WE NEED ALGORITHMS SO FAST THAT IN SOME CASES THEIR RUNNING TIME DOES NOT EVEN GROW LINEARLY WITH THE SIZE OF THE DATA. AND THAT’S WHAT WE DO. WE DEVISE ALGORITHMS THAT DO THESE CALCULATIONS FAST AND SCALE SO THE COST DOESN’T EXPLODE WITH THE SIZE OF THE DATABASE. ANOTHER THING WE DO IS DESIGN ALGORITHMS TO TAKE ADVANTAGE OF MASSIVELY GROWING DATA SETS TO DETECT NEW BIOLOGICAL INSIGHTS. SO DESIGNING EFFICIENT ALGORITHMS FOR PROCESSING MASSIVE DATA ALLOWS US TO PRODUCE SOFTWARE THAT CAN ANSWER SOME IMPORTANT BIOMEDICAL QUESTIONS IN PRACTICE. SO IN THIS TALK, I’LL SPEAK ABOUT THREE INSTANCES WHERE WE HAVE MASSIVE AMOUNTS OF DATA, AND HOW WE’RE RESPONDING TO THE CHALLENGE OF ANALYZING IT. I’LL TALK ABOUT ONE CHALLENGE IN LARGE SCALE GENOMICS, ONE CHALLENGE IN MEDICAL GENOMICS, AND ONE IN NETWORK BIOLOGY. THE SPOTLIGHT THERE WILL BE
ON HOW BETTER ALGORITHMS CAN MAKE THE PROBLEMS TRACTABLE AND GAIN INSIGHTS WE WOULDN’T OTHERWISE HAVE BEEN ABLE TO GAIN. LET’S FOCUS FIRST ON LARGE SCALE GENOMICS. SO CURRENTLY, MANY GENOMICS APPLICATIONS REQUIRE US TO STORE, ACCESS AND ANALYZE VERY LARGE LIBRARIES OF SEQUENCE DATA. BUT GIVEN THE GROWTH OF SUCH DATA THAT I JUST DESCRIBED, WE HAVE TO WONDER IF OUR FASTEST ALGORITHMS CAN KEEP PACE. CLEARLY IF WE JUST WANT TO STORE THE DATA WE COULD COMPRESS IT, WHICH SOME HAVE DONE, BUT THAT IS NOT GOING TO SOLVE ALL OF OUR PROBLEMS BECAUSE EVENTUALLY WE HAVE TO LOOK AT IT. SO THE KEY HERE IS THAT MUCH OF THE QUOTE/UNQUOTE NEW DATA IS ACTUALLY SIMILAR TO DATA WE ALREADY HAVE. SO THE QUESTION BECOMES HOW CAN WE TAKE ADVANTAGE OF THIS REDUNDANCY IN ALGORITHMS THAT STORE AND PROCESS THIS DATA AT THE SAME TIME? WE CALL THIS COMPRESSIVE GENOMICS. SO NOTICE THAT IN THE ORIGINAL SCENARIO HERE, WE COMPRESS THE DATA AND THEN DECOMPRESS IT TO ANALYZE IT, WHEREAS WITH COMPRESSIVE GENOMICS WE COMPRESS THE DATA AND OPERATE ON THAT DIRECTLY, WITH NO NEED TO DECOMPRESS. NOW, IN THE ALGORITHMS COMMUNITY, WHICH I’M FROM, THIS IS WHAT’S KNOWN AS SUCCINCT DATA STRUCTURES, FOR THE CASE OF EXACT MATCHING, BUT THINGS ARE RARELY THAT EXACT IN OUR FIELD. AS IT TURNS OUT, WE HAVE DATABASES OUT THERE SUCH AS WORMBASE THAT HOLD DATA FOR MANY CLOSELY RELATED AND NOT SO CLOSELY RELATED SPECIES. AND THE THOUSAND GENOMES PROJECT IS GENERATING LOTS AND LOTS OF HIGHLY SIMILAR HUMAN SEQUENCE DATA. SO HOW SIMILAR IS THIS DATA? WELL, HERE IS AN ILLUSTRATION OF A SUBTREE, AND THE AMOUNT OF NONREDUNDANT DATA FOR EACH LEVEL OF THE TREE IS IN BLACK, AND THE INDIVIDUAL GENOMES ARE COLORED. IF YOU LOOK UP HERE, YOU SEE THAT THE AMOUNT OF NONREDUNDANT DATA IS HALF THE SIZE OF THE TOTAL DATABASE. AND YOU WOULD EXPECT THAT FOR A COLLECTION OF HIGHLY SIMILAR GENOMES, YOU COULD GET THE AMOUNT OF NONREDUNDANT DATA DOWN TO SOMETHING PROPORTIONAL TO THE SIZE OF ONE OF THE GENOMES. HOW WE MAKE USE OF THIS REDUNDANCY IS AT THE HEART OF COMPRESSIVE GENOMICS.
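As a rough illustration of operating on compressed data without decompressing it, the sketch below stores each distinct sequence block once, searches only those unique blocks, and then expands every hit to all genomes that share the block. It is a toy under stated assumptions, not the method from the talk: the function names, the fixed `block_size`, and the restriction to exact matches that fall inside a single block are all hypothetical simplifications.

```python
from collections import defaultdict


def build_compressed_index(genomes, block_size=8):
    """Split each genome into fixed-size blocks and store each distinct block once."""
    unique_blocks = []                # non-redundant block contents
    block_id = {}                     # block text -> index into unique_blocks
    occurrences = defaultdict(list)   # block index -> [(genome name, offset), ...]
    for name, seq in genomes.items():
        for start in range(0, len(seq), block_size):
            block = seq[start:start + block_size]
            if block not in block_id:
                block_id[block] = len(unique_blocks)
                unique_blocks.append(block)
            occurrences[block_id[block]].append((name, start))
    return unique_blocks, occurrences


def search(query, unique_blocks, occurrences):
    """Scan only the unique blocks, then expand each hit to every genome sharing that block."""
    hits = []
    for idx, block in enumerate(unique_blocks):
        pos = block.find(query)
        while pos != -1:
            hits.extend((name, start + pos) for name, start in occurrences[idx])
            pos = block.find(query, pos + 1)
    return sorted(hits)


# Hypothetical toy data: three highly similar "genomes".
genomes = {
    "a": "ACGTACGTTTTTGGGG",
    "b": "ACGTACGTTTTTGGGA",
    "c": "ACGTACGTCCCCGGGG",
}
unique_blocks, occurrences = build_compressed_index(genomes)
print(len(unique_blocks), "unique blocks out of 6")
print(search("GTAC", unique_blocks, occurrences))
```

The scanning work in `search` grows with the number of unique blocks rather than with the total number of genomes, which is the scaling behaviour the passage describes. A realistic version would need approximate matching and matches that span block boundaries, both of which this sketch ignores.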
WE HAVE A NUMBER OF APPLICATION AREAS, AND I’M NOT GOING TO BE ABLE TO GET INTO THEM TODAY, BUT YOU SHOULD SEE THEM IN PRINT SHORTLY. THE KEY IS THAT THE RUN TIME IS PROPORTIONAL TO THE NONREDUNDANT INFORMATION THAT WE HAVE IN THE COLLECTION OF GENOMES WE CONSIDER, RATHER THAN THE FULL DATA SET. SO I’VE JUST TALKED ABOUT HOW SUBLINEAR TIME ALGORITHMS
THAT SCALE WITH THE NONREDUNDANT DATA RATHER THAN THE FULL SET CAN HELP US MANAGE THE ENORMOUS GROWTH IN BIOLOGICAL DATA. SO NOW WHAT I’M GOING TO DO IS TALK ABOUT HOW WE CAN GAIN MEDICAL INSIGHTS FROM LARGE SCALE DATA. THIS WORK IS IN PRESS, UNDER THE EMBARGO POLICY. SO IN THE OLD DAYS, IF YOU WERE INTERESTED IN SOME DISEASE, SAY BREAST CANCER, YOU WOULD MAP THE GENE EXPRESSION PROFILES FOR A VARIETY OF GENES ONTO AN EXPRESSION ARRAY, TO LOOK FOR PATTERNS OF INTEREST. IN THE LAB NEXT DOOR, SOMEONE MIGHT BE DOING THE SAME THING FOR, LET’S SAY, COLON CANCER. BUT YOU WOULD HAVE NO WAY TO COMBINE AND INTEGRATE THE SEPARATE DISCOVERIES. ALL THIS HAS NOW CHANGED. WE NOW HAVE DATABASES SUCH AS NCBI’S GENE EXPRESSION OMNIBUS, WHICH PULLS TOGETHER MANY DISPARATE GENE EXPRESSION STUDIES. SO NOW, BECAUSE COMPUTERS ARE MUCH FASTER AND COST LESS, AND WE HAVE LOTS AND LOTS OF THESE GENE EXPRESSION STUDIES PUBLICLY AVAILABLE, WE’RE NO LONGER CONFINED TO THE TENS OF SAMPLES WE CAN GENERATE IN OUR OWN WET LAB. AS I’LL SHOW YOU IN THIS TALK, WE HAVE BEEN ABLE TO ANALYZE THOUSANDS OF GENE EXPRESSION SAMPLES TO DERIVE NOVEL, MEANINGFUL BIOLOGICAL INSIGHTS. AND MORE IMPORTANTLY, MANY OF THESE INSIGHTS CAN ONLY BE GLEANED BY LOOKING AT HUNDREDS OF THOUSANDS OR TENS OF THOUSANDS OF GENE EXPRESSION SAMPLES AT THE SAME TIME. SO I HAVE JUST SHOWN YOU A PLOT OF THE WHOLE DATABASE. AS YOU SAW, IT CONSISTS OF HUNDREDS OF THOUSANDS OF SAMPLES AND IS INTRINSICALLY HIGH DIMENSIONAL. WE’RE GOING TO LOOK AT A SUBSET, 3,000 SAMPLES, 20,000 GENES, PROJECTED ONTO TWO DIMENSIONS. HERE IS THE TWO DIMENSIONAL PLOT WITH 3,000 SAMPLES. EACH IS A GRAY POINT, WITH COLORS WHICH I’LL SPEAK ABOUT IN A MOMENT. NOW, AMAZINGLY, ACROSS ALL THESE SAMPLES, WE CAN LEARN SOME REALLY INTERESTING THINGS. SO THE FIRST THING WE LEARNED IS THAT TISSUES OF SIMILAR TYPES LOCALIZE ON THIS LANDSCAPE. SO AS YOU CAN SEE, THERE ARE VERY CLEAR BLOOD, BRAIN, AND EPITHELIAL CLUSTERS HERE. IN FACT, WE CAN EVEN SEE SOMETHING MORE. WE FIND THAT MORE SPECIFIC TISSUE TYPES CO-LOCALIZE. SO IF WE JUST GO ONE LEVEL DOWN, TAKE THE EPITHELIAL CLUSTER AND ITS SAMPLES, AND PROJECT THEM ONTO A TWO DIMENSIONAL PCA, WE FIND THAT GASTROINTESTINAL SAMPLES, AND THOSE ASSOCIATED WITH REPRODUCTIVE HORMONES, CO-LOCALIZE, AND WE HAVE MANY EXAMPLES WHERE WE CAN GO FURTHER DOWN THE HIERARCHY AND SEE SAMPLES FROM SIMILAR TISSUE TYPES CO-LOCALIZE. OKAY. SO THE INTERESTING THING IS, IF YOU LOOK AT CANCER SAMPLES, THEY LIE IN THE SAME VICINITIES AS THEIR NONCANCEROUS COUNTERPARTS BUT ARE MORE SPREAD OUT ON THE LANDSCAPE. SO OUR OVERALL GOAL IS TO LEVERAGE THIS STRUCTURE IN ORDER TO MAP THE TRANSCRIPTOMIC LANDSCAPE. SO TO DO THIS, WE NEEDED A UNIFIED APPROACH WHERE WE COULD MAP SAMPLES TO THEIR CORRESPONDING BIOMEDICAL PHENOTYPES, LET’S SAY LUNG TISSUE OR DUCTAL BREAST TISSUE. AND FOR THAT WE CONSTRUCTED A CURATED, MACHINE READABLE DATABASE THAT ALLOWED US TO MAP A GIVEN GENE EXPRESSION SAMPLE TO ITS BIOMEDICAL PHENOTYPES. WE USED THE NLM UNIFIED MEDICAL LANGUAGE SYSTEM, AND THAT LETS US MAP GENE EXPRESSION SAMPLES UP THE HIERARCHY. SO HAVING SUCH A DATA STRUCTURE, WHERE WE CAN QUICKLY RETRIEVE GENE EXPRESSION SAMPLES, ALLOWED US TO DO A MACROSCOPIC ANALYSIS OF A LARGE AMOUNT OF DATA. NOW THAT WE HAVE THIS DATA STRUCTURE, WHICH MAPS GENE EXPRESSION SAMPLES ONTO THEIR BIOMEDICAL PHENOTYPES, WHAT CAN WE DO WITH IT? ONE THING WE’VE DONE, AND HERE IS OUR DATA STRUCTURE, IS TAKE NEW GENE EXPRESSION SAMPLES AND QUANTIFY HOW THEY MAP ONTO OUR TRANSCRIPTOMIC LANDSCAPE.
SO TO DO THIS, WE DEVELOPED A CONCEPT ENRICHMENT SCORE BASED ON KOLMOGOROV-SMIRNOV STATISTICS OVER THE CONCEPT DATABASE. SO THIS STATISTIC ALLOWS US TO ANSWER THE QUESTION: GIVEN A NEW GENE EXPRESSION SAMPLE, CAN WE ACCURATELY LABEL IT GIVEN THE OTHER SAMPLES IN THE DATABASE AND THEIR LABELS? AND IN FACT OUR ABILITY TO CORRECTLY LABEL IT IS QUITE STRONG. WHEN WE TESTED THIS IN LEAVE-ONE-SAMPLE-OUT CROSS VALIDATION, THE AVERAGE ACCURACY WAS 92.8%, AS MEASURED BY THE AREA UNDER THE CURVE, OVER THE 120 CONCEPTS IN THE DATABASE. SO WE CAN PLACE NEW GENE EXPRESSION SAMPLES ON THIS LANDSCAPE WITH CONFIDENCE. SO WE’VE DEVELOPED A WEB RESOURCE BASED ON THIS, CONCORDIA, THAT TAKES AS INPUT EXPRESSION DATA AND RETURNS A RANK ORDERED LIST OF THE CONCEPTS MOST ASSOCIATED WITH IT, AND IT ALSO RETURNS A PLOT OF WHERE THE NEW SAMPLE, WHICH IN THIS CASE COMES FROM THE BRAIN, FALLS ON THIS LANDSCAPE. SO THE SAMPLE WE’RE TRYING TO PLACE IS IN BLUE, AND WE’VE LABELED THE OTHER BRAIN SAMPLES IN THE DATABASE IN ORANGE. IT’S IN THE MIDDLE OF THE BRAIN RANGE. THIS IS OUR TRANSCRIPTOMIC LANDSCAPE, THE ONE GENERATED FROM 3,000 SAMPLES. SO AS YOU MIGHT IMAGINE, HAVING THE FULL TRANSCRIPTOMIC LANDSCAPE CAN BE HELPFUL IN DIAGNOSIS. JUST BECAUSE A CANCER IS IN THE BRAIN DOESN’T MEAN IT ORIGINATED IN THE BRAIN. KNOWING THE ORIGIN CAN BE HELPFUL. BY BEING ABLE TO PLACE A NEW EXPRESSION SAMPLE ON THIS TRANSCRIPTOMIC LANDSCAPE WE’RE ABLE TO DO SOMETHING REALLY IMPORTANT. AND THIS IS BECAUSE IN OUR FRAMEWORK SAMPLES TEND TO LOOK MORE LIKE THEIR TISSUE OF ORIGIN THAN THEY LOOK LIKE THE TISSUE THEY METASTASIZE TO. THESE ARE LUNG CANCER METASTASES, IN ORANGE, AND THEY SCORE FOR THE LUNGS, THEIR TISSUE OF ORIGIN, MUCH MORE HIGHLY THAN FOR THE BRAIN. HERE WE HAVE ANOTHER EXAMPLE, WHERE WE HAVE BREAST CANCER METASTASES. THEY LOOK MORE LIKE BREAST, THEIR TISSUE OF ORIGIN, THAN LUNG, WHICH IS CLOSE BY, OR BONE OR BRAIN, AND THE CONCEPT ENRICHMENT SCORES ARE HIGHER FOR BREAST. WHILE THESE ARE TWO EXAMPLES, WE SEE SIMILAR RESULTS ACROSS A VARIETY OF CANCERS. SO NOT ONLY CAN WE IDENTIFY THE TISSUE OF ORIGIN FOR METASTASES, WE CAN ALSO IDENTIFY WHICH GENES ARE MOST ASSOCIATED WITH THEM. SO WHAT WE WANT TO DO IS, GIVEN A PARTICULAR BIOMEDICAL PHENOTYPE, IDENTIFY MARKER GENES, LOOKING AT AN ENTIRELY DIFFERENT DIMENSION OF THE TRANSCRIPTOMIC LANDSCAPE. WHAT WE’RE ASKING IS, WE WANT TO PINPOINT THE GENES THAT ARE MOST RELATED TO A PARTICULAR PHENOTYPE AND NOT, LET’S SAY, RELATED TO A MORE GENERAL PHENOTYPE LIKE CANCER, SUCH AS GENES INVOLVED IN CELL CYCLE AND CELL ADHESION. AND IN FACT, WE’VE DEVELOPED AN APPROACH FOR IDENTIFYING MARKER GENES, WHICH I’M NOT GOING TO GET INTO, BUT WE BASICALLY USE A FINITE IMPULSE RESPONSE FILTER OVER EACH PHENOTYPE, WHICH ALLOWS US TO IDENTIFY THE MARKER GENES THAT ARE ENRICHED FOR EACH PARTICULAR BIOMEDICAL PHENOTYPE. SO IN SO DOING, FOR EXAMPLE, WE’RE ABLE TO FIND ONES MORE PARTICULAR TO BREAST CARCINOMA. THIS BRINGS US TO A MORE GENERAL STUDY, WHICH ADDRESSES WHAT MY COLLABORATOR CALLS THE INCIDENTALOME. WE LOOKED AT WHAT THE MARKER GENES WERE FOR CARCINOMA AND 13 SUBSETS, AND FOUND THAT A QUARTER OF THE MARKER GENES HAD HIGHER MARKER P VALUES FOR CARCINOMA THAN THEY DID FOR THE MORE PARTICULAR CONCEPTS HERE. AND THIS IS IMPORTANT WHEN YOU’RE DESIGNING CLINICAL TESTS BASED ON MARKER GENES. YOU DON’T WANT TO BE USING THE GENERAL CANCER ONES FOR, LET’S SAY, LOBULAR BREAST CARCINOMA. SO WE RAN CONCORDIA AND OUR CONCEPT ENRICHMENT SCORE TO IDENTIFY MARKER GENES ACROSS BREAST CANCER GENES, AND WE FOUND THAT THERE WERE 74 THAT WERE HIGHLY ENRICHED FOR BEING UNIQUE TO BREAST CANCER. THREE INTERESTING ONES ARE LISTED HERE, WHICH SOME OF YOU MAY BE FAMILIAR WITH. THEY WERE EXTREMELY HIGH SCORING AND ARE KNOWN TO BE ASSOCIATED WITH BREAST CANCER. AND WHEN WE LOOKED AT THE GO ENRICHMENT, MEANING THE FUNCTIONAL ENRICHMENT FOR THE DIFFERENT CONCEPTS, WE SAW THAT THEY DID NOT HAVE THE COMMON CANCER GO TERMS, SUCH AS CELL CYCLE AND CELL ADHESION, BUT HAD ONES PARTICULAR TO BREAST CANCER, AND IN ADDITION WE FOUND ONES RELATED TO CARBOHYDRATE AND LIPID METABOLISM.
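The talk says only that the concept enrichment score is based on Kolmogorov-Smirnov statistics, so the following is a generic sketch of that building block rather than the Concordia scoring itself. It computes a two-sample Kolmogorov-Smirnov statistic between the similarity values of database samples that carry a concept label and those that do not, then ranks concepts by that statistic. The function names, the `similarity` input, and the toy labels are all hypothetical.

```python
def ks_statistic(in_concept, background):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(in_concept), sorted(background)
    i = j = 0
    max_gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step both empirical CDFs past every value equal to x.
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        max_gap = max(max_gap, abs(i / len(a) - j / len(b)))
    return max_gap


def rank_concepts(similarity, labels):
    """similarity: sample id -> similarity to the new sample; labels: concept -> set of sample ids."""
    scores = {}
    for concept, members in labels.items():
        inside = [s for sid, s in similarity.items() if sid in members]
        outside = [s for sid, s in similarity.items() if sid not in members]
        scores[concept] = ks_statistic(inside, outside)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy input: how similar each database sample is to a new sample.
similarity = {"s1": 0.9, "s2": 0.8, "s3": 0.2, "s4": 0.1, "s5": 0.5}
labels = {"brain": {"s1", "s2"}, "mixed": {"s1", "s4"}}
print(rank_concepts(similarity, labels))
```

Note that this two-sided statistic is large whenever the labelled samples separate from the rest in either direction, so a practical score would also need to check that the labelled samples are the more similar ones.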
IT’S KNOWN THAT WOMEN WITH TYPE 2 DIABETES MAY HAVE HIGHER SUSCEPTIBILITY TO BREAST CANCER, SO THIS WAS NOT SURPRISING. SO WHAT WE WOULD LIKE TO BE ABLE TO DO IS USE THESE DATABASES AND THIS SYSTEM TO DEVELOP DATA MINING ALGORITHMS FROM WHICH WE CAN UNDERSTAND THE MACROSCOPIC SIGNALS IN THE DATA. SO IN ONE EXAMPLE HERE, WE TRIED TO DO THIS FOR STEM-CELL-LIKE GENES. AND SO WHAT WE DID HERE WAS MAKE A WHOLE NEW PCA, THIS TIME OVER 200 GENES, NOT THE 20,000 GENES THAT WE ORIGINALLY STARTED WITH. AND THESE 200 GENES WERE IDENTIFIED AS BEING THE HIGHEST SCORING MARKER GENES FOR STEM CELL LIKENESS. SO WE STILL HAVE OUR 3,000 SAMPLES, BUT WE’VE REDUCED TO A 200-DIMENSIONAL SPACE AS OPPOSED TO A 20,000-GENE DIMENSIONAL SPACE. AND THIS IS THE MAP THAT WE GET, AND THEN WE CAN ASK OURSELVES, WHERE DO NORMAL GENE EXPRESSION SAMPLES, MALIGNANT SAMPLES AND STEM-CELL-LIKE ONES LIE ON THE LANDSCAPE? THE STRIKING THING IS WE FIND THAT MALIGNANT TUMOR SAMPLES, SUCH AS, HERE FOR BLOOD, LEUKEMIA, LIE BETWEEN NORMAL AND STEM-CELL-LIKE ONES. WE FIND THAT MALIGNANT TUMOR SAMPLES RETAIN SOME CHARACTERISTICS CLOSE TO THE TISSUE OF ORIGIN BUT ADOPT STEM-CELL-LIKE PROGRAMMING. PEOPLE HAVE SUGGESTED THIS IN STUDIES, BUT HERE WE’RE FINDING IT BY BLINDLY ANALYZING A MASSIVE AMOUNT OF DATA. SO IN THE NEXT FEW SLIDES, THE RED AREA WILL BE THE NORMAL TISSUE SAMPLES, THE GREEN WILL BE THE MALIGNANT ONES, AND THERE WILL BE THE PLURIPOTENT STEM CELLS AND, IN BLUE, THE MESENCHYMAL ONES. THIS IS BLOOD. WE GET A SIMILAR PATTERN FOR COLON, WITH THE RED AND GREEN SHADED AREAS, NOW CORRESPONDING TO COLON INSTEAD OF BLOOD, SHIFTED, WHICH WE WOULD EXPECT. THE UNDIFFERENTIATED STEM CELLS REMAIN WHERE THEY ARE, WHICH WE WOULD ALSO HOPE FOR, AND WE’RE SHADING THEM ON EACH OF THE SLIDES. WE GET A SIMILAR PATTERN FOR BREAST TISSUE SAMPLES AND A SIMILAR PATTERN FOR PROSTATE SAMPLES, AND TAKEN ALL TOGETHER, WITH THE ADDITION OF BRAIN, WE CAN SEE AN OVERALL PATTERN WHERE MALIGNANT SAMPLES ARE BETWEEN STEM-CELL-LIKE AND NORMAL. IF YOU LOOK AT PC2, IT TURNS OUT THE SHADING OF THE RELATIVE TISSUES ACTUALLY REFLECTS THEIR PLACEMENT ON THE ORIGINAL WHOLE TRANSCRIPTOMIC LANDSCAPE OVER THE 20,000 GENES THAT I SHOWED YOU BEFORE. THESE ARE ONLY THE 200 STEM-CELL-RELATED GENES, AND THEY ARE RECAPITULATING THAT PLACEMENT. WE WOULD LIKE FOR THIS TO HAVE CLINICAL APPLICATIONS, AND WE’VE SHOWN THAT WE CAN ACTUALLY SHED SOME INSIGHT AS TO WHERE THE PRIMARY SITE IS FOR A METASTASIS. WE HOPE THIS APPROACH PROVIDES A COMPLEMENT TO OTHER METHODS. WE CAN IDENTIFY MARKER GENES SPECIFIC TO A DISEASE, IN PARTICULAR BREAST CANCER, AND WE’RE HOPING THAT THESE KINDS OF METHODS MAY BE HELPFUL IN CLINICAL TESTS SUCH AS MAMMAPRINT IN THE FUTURE. AND IN PRELIMINARY STUDIES, WE’VE ALSO BEEN ABLE TO SHOW THAT TUMOR GRADE IS CORRELATED WITH THE STEM CELL LANDSCAPE THAT I JUST SHOWED YOU. AND HOPEFULLY IN THE LONG RUN, THIS WILL BE HELPFUL IN DISAMBIGUATING MID-GRADE TUMORS, WHICH ARE SO DIFFICULT TO TREAT. OOPS, SORRY. OKAY. SO WE’VE SEEN THAT BY SYNTHESIZING A LARGE EXPRESSION DATABASE WE GAIN INSIGHT WE WOULD NOT BE ABLE TO GET FROM ANY ONE PARTICULAR MEMBER OF THE DATABASE. NOW WE’RE GOING TO LOOK AT HOW, BY USING NETWORK INFORMATION, WE’RE GOING TO BE ABLE TO DO CROSS SPECIES INFERENCE WHICH WE COULD NOT GET FROM SEQUENCE DATA ALONE. IN PARTICULAR, WE’RE GOING TO FOCUS ON PROTEIN-PROTEIN INTERACTION NETWORKS. SO THESE ARE THE SPECIES FOR WHICH WE HAVE THE MOST PPI DATA. AS YOU CAN SEE, THESE ARE THE NUMBERS OF PROTEINS IN EACH OF THE SPECIES, AND THESE ARE THE CURRENTLY KNOWN NUMBERS OF INTERACTIONS. OF COURSE FOR YEAST WE HAVE A LOT OF KNOWN INTERACTIONS, WHEREAS FOR MOUSE WE HAVE VERY FEW. IN THE PAST, THE WAY WE WOULD MODEL PPIs, WE WOULD TAKE A VERY LOW THROUGHPUT, STRUCTURE-BASED APPROACH, WHERE AT GREAT EXPENSE AND WITH LARGE AMOUNTS OF TIME WE WOULD BE ABLE TO ANALYZE THE STRUCTURE AND CHEMISTRY OF A PARTICULAR PROTEIN COMPLEX. BUT NOW, OVER THE LAST DECADE OR SO, THERE’S BEEN A HIGH THROUGHPUT, NETWORK-BASED APPROACH EMERGING, WHERE WE MODEL NETWORKS AT LOWER RESOLUTION BUT WE COME UP WITH A NETWORK WHICH COVERS THE ENTIRE SPECIES, OR AT LEAST ATTEMPTS TO COVER THE ENTIRE SPECIES. EACH PROTEIN IS A VERTEX AND EACH EDGE REPRESENTS AN INTERACTION BETWEEN TWO PROTEINS. THIS LOW RESOLUTION APPROACH ALLOWS US TO COME UP WITH INSIGHTS THAT WE COULDN’T NECESSARILY GET FROM THE LOW THROUGHPUT, MORE DETAILED STRUCTURAL APPROACH. SO HERE IS THE YEAST PPI NETWORK, THE EARLIEST ONE. AND IN SUCH A NETWORK, EVERY EDGE IS DETERMINED BY SOME HIGH THROUGHPUT TECHNIQUE. SO IN THIS CASE THIS EDGE IS DETERMINED BY YEAST TWO-HYBRID. FORTUNATELY, WE HAVE MORE TECHNIQUES COMING ALONG, AND MASS SPECTROMETRY IS A GOOD ONE, GIVING US NEW INTERACTION DATA. THERE’S A PROBLEM WITH THIS DATA.
AS YOU MAY HAVE GUESSED FROM MY PREVIOUS SLIDE, WHERE I TALKED ABOUT THE NUMBER OF PROTEINS AND INTERACTIONS,
COVERAGE IS NOT SO GREAT. THERE ARE PROTEINS FOR WHICH WE HAVE NO INFORMATION AT ALL AS TO ANY OF THEIR INTERACTIONS. AND THERE ARE LOTS OF FALSE POSITIVES. AND EACH OF THESE HIGH THROUGHPUT TECHNIQUES GIVES A CONFIDENCE FOR EACH OF THESE EDGES, AND WE WOULD LIKE TO ASSIGN BETTER CONFIDENCES. THAT’S A NICE OPEN PROBLEM THAT WE’RE WORKING ON. SO THOSE PROBLEMS LEND THEMSELVES TO COMBINATORIAL ALGORITHMS. TRADITIONAL ONES FROM THE PAST DON’T APPLY DIRECTLY WHEN THERE ARE LOTS OF ERRORS IN THE DATA. ONE THING WE CAN DO WITH THE NETWORK DATA IS WE
CAN COMPARE IT ACROSS SPECIES. SO THIS IS WHAT’S KNOWN AS COMPARATIVE GENOMICS, AND I’M SURE YOU’RE VERY FAMILIAR WITH IT IN TERMS OF SEQUENCES, WHERE WE LOOK AT BIOLOGICAL DATA ACROSS SPECIES WITH THE HOPE THAT AREAS OF HIGH CONSERVATION CORRESPOND TO FUNCTIONAL PARTS OR MODULES OF THE GENOMES. WHAT I’M GOING TO SHOW YOU HERE IS THAT BY LOOKING ONLY AT PROTEIN SEQUENCE INFORMATION WE CAN’T GET AS GOOD CORRESPONDENCES OF GENES ACROSS SPECIES AS WE CAN BY USING SEQUENCE INFORMATION TOGETHER WITH A NETWORK PERSPECTIVE. AND MY GROUP, I JUST GRABBED THIS PICTURE FROM SOMEWHERE, DID SOME OF THE EARLIEST WORK COMPARING MOUSE
GENOMES, AND INFORMATION ACROSS GENOMES, AND THEN TURNED TO COMPARING NETWORKS. ONE REASON WE WANT TO COMPARE NETWORKS IS WE WANT TO BE ABLE TO TRANSFER ANNOTATION FROM ONE SPECIES TO ANOTHER. YOU CAN KNOCK OUT A GENE IN MOUSE AND LEARN FROM THAT, BUT YOU WOULDN’T WANT TO DO THAT IN HUMANS. WE NEED A MECHANISM TO TRANSFER INFORMATION INTO HUMANS. AND ONE TERMINOLOGY FOR THIS IS ORTHOLOGY, THE CORRESPONDENCE BETWEEN GENES, AND PROTEINS, USED INTERCHANGEABLY HERE, ACROSS SPECIES. BUT WHAT WE WANT IS FUNCTIONAL ORTHOLOGY, AND WHAT THIS MEANS IS WE WANT PROTEINS WHICH ACTUALLY PERFORM THE SAME FUNCTION ACROSS SPECIES. AND THIS IS A VERY IMPORTANT PROBLEM. I’M WORKING WITH BIOLOGISTS WHO ARE FRUSTRATED WITH THE SEQUENCE-ONLY METHODS TRADITIONALLY USED FOR THIS. AND THEY WANT TO USE OTHER INFORMATION TO GET BETTER CORRESPONDENCES, BECAUSE THE SEQUENCE-BASED METHODS TEND TO GIVE LOTS OF FALSE POSITIVES, AND THEN THEY ARE NOT GETTING CORRECT ANSWERS. SO AS I SAID, I’M GOING TO SHOW YOU THAT BY USING SEQUENCE AND NETWORK INFORMATION, WE CAN GET MUCH BETTER MAPPINGS BETWEEN GENES ACROSS SPECIES. SO THE PROBLEM WE HAVE IS, GIVEN TWO PROTEIN-PROTEIN INTERACTION NETWORKS, WE WANT TO FIND FOR ONE NETWORK SOMETHING THAT HAS COMPARABLE STRUCTURE IN THE OTHER NETWORK. SO FOR ANY PARTICULAR PAIR OF NODES WE WANT TO SCORE HOW SIMILAR THEY ARE BASED ON SEQUENCE AND NETWORK. SO FOR A GIVEN NODE IN THE FLY WE WANT TO KNOW WHICH NODES HERE IN YEAST HAVE SIMILAR FUNCTIONS. SO THE WAY THAT WE DO THIS IS WE MATCH NEIGHBORHOOD TOPOLOGIES. THE HEART OF THE ALGORITHM
IS COMPUTING THE SIMILARITY SCORES. WE’RE GOING TO GET A HIGH SCORE IF THE TWO NODES, I AND J, ARE A GOOD MATCH. THE INTUITION WE PURSUE IS THAT I AND J ARE A GOOD MATCH IF THEIR SEQUENCES ALIGN AND IF THEIR NEIGHBORS ARE A GOOD MATCH. IN THE PAST, THIS QUOTE/UNQUOTE FUNCTIONAL SIMILARITY SCORE, RIJ, WAS BASED MERELY ON SEQUENCE SIMILARITY. AS I SAID, THAT LEADS TO A LOT OF FALSE POSITIVES. SO WHAT WE’RE GOING TO DO IS ADD A NETWORK COMPONENT TO THIS, NIJ, WHICH IS A SIMILARITY SCORE BETWEEN THE NEIGHBORS OF NODES I AND J. THEN RIJ IS GOING TO BE A CONVEX COMBINATION OF THE SEQUENCE AND NETWORK SIMILARITY SCORES. NOTICE THAT ALPHA IS USER DEFINED, ALTHOUGH WE RECOMMEND A SETTING. IF ALPHA IS ONE, YOU USE NO SEQUENCE DATA. IF ALPHA IS ZERO, YOU USE NO NETWORK DATA IN THIS EQUATION. SO IN SUM, THE ALGORITHM
TAKES TWO NETWORKS, BLUE AND GREEN, AND PRODUCES A MAPPING OF ALL THE NODES IN THE BLUE AND GREEN NETWORKS. THIS MAPPING IS JUST THIS MATRIX R, THE
SIMILARITY SCORES. NOTICE THIS MATRIX IS PRETTY EMPTY. THAT’S BECAUSE LOTS OF THE PAIRWISE SIMILARITY SCORES ARE ZERO. AS I SAID, WHAT WE WANT TO COMPUTE IS THE SEQUENCE SIMILARITY SCORE PLUS THE NEIGHBORHOOD SIMILARITY SCORE. WE WANT THE NEIGHBORS OF SIMILAR NODES TO ALSO BE SIMILAR. THIS IS MEASURED BY A WEIGHTED SUM OVER THE SIMILARITY SCORES OF THE NEIGHBORS OF I AND J, WHICH ARE RUV. WE DON’T KNOW THE RUV VALUES, THE SIMILARITY SCORES OF THE NEIGHBORS, UNTIL WE HAVE COMPUTED THEM. IT TURNS OUT THAT THAT’S NOT SO MUCH OF A PROBLEM, BECAUSE THIS IS A LINEAR SYSTEM OF EQUATIONS THAT WE CAN JUST SOLVE AS AN EIGENVALUE PROBLEM. IN FACT WE CAN SOLVE IT EFFICIENTLY BECAUSE THE MATRICES ARE SO SPARSE. THIS LENDS ITSELF TO A RANDOM WALK INTERPRETATION. WE HAVE A BLUE GRAPH G1 AND A GREEN GRAPH G2, AND OUR PROBLEM IS JUST TAKING A RANDOM WALK ON THE TENSOR PRODUCT GRAPH OF G1 AND G2, SUCH THAT THE TRANSITION PROBABILITY OUT OF ANY GIVEN PRODUCT NODE UV IS THE SAME FOR ALL THE OUT EDGES OF THAT NODE. THAT IS PRECISELY THE TERM IN OUR NETWORK SIMILARITY SCORE. HERE WE’RE LOOKING AT A SIMPLER CASE, WITHOUT THE SEQUENCE INFORMATION IN THE SIMILARITY SCORE, FOR
EASE OF COMPUTATION. IT TURNS OUT THE STATIONARY DISTRIBUTION OF THE RANDOM WALK IS THE EIGENVECTOR FOR THE LARGEST EIGENVALUE OF THE MATRIX, WHICH IS N SQUARED BY N SQUARED IN SIZE AND IS BUILT FROM THE TRANSITION PROBABILITIES HERE. THIS MAY REMIND YOU OF AN ALGORITHM THAT’S OUT THERE: GOOGLE’S PAGERANK ALGORITHM DOES A SIMILAR RANDOM WALK ON A SINGLE GRAPH, RATHER THAN A PRODUCT OF GRAPHS, TO RANK WEB PAGES IN ORDER OF IMPORTANCE. SO AN EVEN HARDER PROBLEM IS MULTIPLE NETWORK ALIGNMENT. AND THE REASON THIS IS SO HARD IS THE SAME REASON AS FOR MULTIPLE SEQUENCE ALIGNMENT: THE PROBLEM IS EXPONENTIAL IN THE NUMBER OF NETWORKS. SO BASICALLY, GIVEN MULTIPLE NETWORKS, WE WANT TO FIND SOME CONSERVED STRUCTURE BETWEEN THEM. SO AS IN THE CASE OF SEQUENCE ALIGNMENT, WE’RE GOING TO APPROXIMATE THIS WITH PAIRWISE NETWORK ALIGNMENTS. THE PAIRWISE
NETWORK ALIGNMENTS ARE THE R MATRICES. THE ORTHOLOGS WILL BE THE HIGHLY WEIGHTED SUBGRAPHS FOR THIS COMPUTATION. SO NOTICE THAT ONE SUCH GOOD ALIGNMENT WOULD BE ONE NODE FROM PURPLE, ONE FROM GREEN AND TWO FROM
YELLOW, BECAUSE WE CAN HAVE GENE DUPLICATION EVENTS. WE WANT CLUSTERS RATHER THAN A ONE-TO-ONE MAPPING. QUICKLY, THIS IS HOW THIS WORKS. WE COMPUTE A SIMILARITY GRAPH BETWEEN ALL PAIRS OF NETWORKS. AND THEN WHAT WE WANT TO DO IS FIND STRONGLY SIMILAR NEIGHBORS. SO WE START WITH A PARTICULAR NODE, LET’S SAY THE RED ONE HERE, AND WE WANT TO FIND STRONGLY SIMILAR NEIGHBORS TO THAT. SO THE IDEA THAT WE’RE USING HERE IS THAT IF MULTIPLE PAIRS OF NETWORKS AGREE THAT NODES OR PROTEINS ARE RELATED, THEN IN THE OTHER NETWORK, EVEN IF WE DON’T HAVE AN EDGE THERE, THEY ARE PROBABLY RELATED TOO, ALTHOUGH OF COURSE IN BIOLOGY THERE COULD BE EXCEPTIONS. BUT BASICALLY WE’RE HOPING THAT IN THIS CLUSTER MOST OF THE EDGES HAVE A HIGH SCORE. SO THEN WE FIND THE STRONGLY SIMILAR NEIGHBORS OF THE RED ONE, AND IN FACT WHAT WE REALLY WANT IS A HIGHLY WEIGHTED SUBSET OF THAT, BECAUSE THAT MEANS THAT MOST OF THE CORRESPONDENCES ACROSS SPECIES AGREE THAT THESE ARE IMPORTANT AND THAT THEY HAVE THE SAME FUNCTION. AND FOR THIS WE USE THE PAGERANK-NIBBLE ALGORITHM: STARTING WITH OUR RED NODE, A RANDOM WALK WITH A TELEPORT BACK TO THAT NODE. THE TECHNICAL PART WILL BE DONE SOON. SO WE GET A COUPLE OF SUCH PAGERANK-NIBBLE TYPE SUBGRAPHS, HIGHLY WEIGHTED SUBGRAPHS, AND THEN IF THEY HAVE A LOT OF EDGES IN COMMON WE MERGE THEM, AS THEY DO HERE. AND THEN WE REMOVE THEM FROM THE GRAPH, ON THE NEXT SLIDE, HENCE NIBBLE, AND THEN REPEAT. SO THAT’S THE ALGORITHM THAT ALLOWS US TO DO MULTIPLE NETWORK ALIGNMENT, OR AT LEAST APPROXIMATE IT WITH PAIRWISE NETWORK ALIGNMENTS. HOW DOES THIS DO? THE TROUBLE IN THIS FIELD IS THAT THERE’S NO GOLD STANDARD DATABASE FOR MEASURING ORTHOLOGY. IT’S ACTUALLY A HUGE PROBLEM, THERE ARE NO GOLD STANDARDS. WE CAME UP WITH OUR OWN MEASURE, NORMALIZED ENTROPY. WE SAID THINGS THAT ARE ORTHOLOGOUS SHOULD HAVE SIMILAR ENRICHMENT TERMS, THEY SHOULD BE DOING SIMILAR FUNCTIONS.
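The pairwise similarity computation described earlier, in which a score for nodes I and J is a convex combination of a neighborhood term and a sequence term and is solved as an eigenvalue problem, can be sketched with plain power iteration. This is a minimal sketch written from the spoken description, not the group's implementation: the function name, the dictionary-based graph format, and `alpha=0.6` are assumptions (the recommended alpha setting mentioned in the talk is not stated), and it assumes at least one nonzero sequence score.

```python
def isorank_scores(g1, g2, seq_sim, alpha=0.6, iterations=50):
    """Power iteration for R = alpha * (neighborhood term) + (1 - alpha) * (sequence term).

    g1, g2:  node -> set of neighbours (undirected graphs)
    seq_sim: (i, j) -> sequence similarity for i in g1, j in g2 (missing pairs count as 0)
    """
    pairs = [(i, j) for i in g1 for j in g2]
    total = sum(seq_sim.get(p, 0.0) for p in pairs)
    e = {p: seq_sim.get(p, 0.0) / total for p in pairs}   # normalised sequence scores
    r = {p: 1.0 / len(pairs) for p in pairs}              # uniform starting scores
    for _ in range(iterations):
        new_r = {}
        for i, j in pairs:
            # Each neighbour pair (u, v) splits its score equally among its own
            # neighbour pairs, as in the random walk on the product graph.
            network = sum(r[(u, v)] / (len(g1[u]) * len(g2[v]))
                          for u in g1[i] for v in g2[j])
            new_r[(i, j)] = alpha * network + (1 - alpha) * e[(i, j)]
        norm = sum(new_r.values())
        r = {p: s / norm for p, s in new_r.items()}
    return r


# Hypothetical toy networks: two three-node paths with matching end-to-end sequence hits.
g1 = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
g2 = {"x": {"y"}, "y": {"x", "z"}, "z": {"y"}}
seq_sim = {("a", "x"): 1.0, ("b", "y"): 1.0, ("c", "z"): 1.0}
scores = isorank_scores(g1, g2, seq_sim)
best_for_b = max(g2, key=lambda j: scores[("b", j)])
print(best_for_b)
```

With `alpha` at 0 the scores reduce to the normalized sequence similarities, matching the remark in the talk that alpha zero uses no network data. This dense version loops over every node pair; the talk notes the real matrices are sparse, which a practical implementation would exploit.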
SO WE CAME UP WITH AN ENTROPY TERM: WITHIN A CLUSTER YOU WANT MORE NODES TO HAVE THE SAME FUNCTION, RATHER THAN LOTS OF DIFFERENT FUNCTIONS. YOU WANT FEWER FUNCTIONS, IS WHAT I MEANT TO SAY. BY NORMALIZED ENTROPY WE DID BETTER, BOTH FOR ALL SPECIES AND
JUST FOR HUMAN AND FLY. WE WERE ALSO ABLE TO GET GOOD COVERAGE, ESPECIALLY FOR THREE OR MORE SPECIES. AS YOU CAN SEE HERE, THE BEST RESULTS ARE BOLD-FACED. THERE ARE SOME OTHER NETWORK ALIGNERS, SUCH AS GRAEMLIN, AND WE DID BETTER THAN THOSE. AND THIS DATABASE, ISOBASE, TAKES IN A GENE OR PROTEIN I.D., OR ALL SORTS OF DIFFERENT TYPES OF I.D.s, AND IT TELLS YOU THE ORTHOLOGS ACROSS THE FIVE SPECIES, OR SOME SUBSET OF THEM, AND GIVES YOU A LOT OF OTHER INFORMATION ABOUT THE ORTHOLOGS, AND LINKS TO OTHER DATABASES THAT CONTAIN INFORMATION ABOUT THE ORTHOLOGS. SO I’LL PUT UP ONE BIOLOGICAL APPLICATION WE’VE BEEN ABLE TO GET USING ISOBASE. WE WORK WITH THE SUE LINDQUIST LAB, WHICH USES YEAST MODELS TO UNDERSTAND PARKINSON’S OR ALZHEIMER’S. THESE ARE GENES INVOLVED IN TOXICITY IN PARKINSON’S. SO SUE GAVE US A LIST OF GENES IN YEAST AND WANTED TO KNOW WHAT ARE
GIANTS LIKELY HAVING THE SAME FUNCTION IN HUMANS? WE USED ISO-BASED TO FIND THIS GENE HERE, AND MANY OTHERS THAT I’LL TELL YOU ABOUT IN A MINUTE, BUT IN PARTICULAR THIS GENE HERE WE FOUND TO BE ON A PATHWAY THAT WAS INVOLVED IN MEDIATED TRANSPORT. IT TURNS OUT WHEN YOU KNOCK — WHEN YOU OVEREXPRESS A PROTEIN WHICH IS IMPORTANT IN PARKINSON’S DISEASE, THAT THIS PATHWAY IS DISRUPTED. SO IF SOME EVIDENCE THAT THAT GENE IS DOING SOMETHING RELATED TO PARKINSON’S, AND IN FACT USING ISO BASE WE WERE ABLE TO FIND 48 HUMAN ORTHO LOGS TO HER YEAST COUNTERPARTS, AND 24 OF THEM WERE ENTIRELY NEW. THEY WEREN’T FOUND BY ANY OF THE OTHER ORTHOLOGY PREDICTORS OR NETWORK-BASED ONES. SO WE HAVE LOTS OF APPLICATIONS OF ISO RANK AND ISO BASED. YOU SAW A FEW OF THESE ALREADY. PEOPLE HAVE ALSO USED IT FOR METABOLIC NETWORK ALIGNMENT, AND WE WERE ABLE TO DO GENETIC INTERACTION NETWORK ALIGNMENT, WE MAKE THAT AVAILABLE IN ISO BASE.u SO AS I’VE TALKED ABOUT TODAY, WE SAW HOW BETTER ALGORITHMS CAN MAKE PROBLEMS MORE TRACTABLE, ACROSS VARIOUS AREAS, AND THEY CAN ALLOW US TO GAIN INSIGHTS THAT WE OTHERWISE WOULD NOT HAVE BEEN ABLE TO GET. BUT THIS IS A VERY SMALL FRACTION ACTUALLY OF WHAT WE CURRENTLY WORK ON. AND I’M JUST GOING TO NAME A FEW OF OUR RECENT SOFTWARE THAT WE’VE PUT OUT TO GIVE YOU AN IDEA OF THE OTHER THINGS WE WORK ON. WE ALSO HAVE DONE A LOT OF WORK IN PROTEIN STRUCTURE PREDICTION. IN FACT WE DEVELOPED THIS PROGRAM, MATT, FOR PROTEIN STRUCTURE ALIGNMENT, AND IN AN INDEPENDENT REVIEW ARTICLE IT WAS DEEMED TO BE THE BEST PROGRAM FOR PROTEIN STRUCTURE ALIGNMENT. WE DO ENSEMBLE MODELING, TO PREDICT STRUCTURE OR FOLDING PATHWAYS OF STRUCTURES, WE WORK ON AM LLOY AMELOIDS, PREDICTING
MUTANTS AND THEIR STRUCTURES, AND CONTINUE TO WORK ON COILED-COILS. WE ALSO WORK ON PREDICTING NONCODING RNA STRUCTURE AND LOCATIONS OF MICRORNAS IN SEQUENCE DATA, SO WE HAVE A COUPLE OF PROGRAMS: RNAMUTANTS, PREDICTING MUTATIONAL EFFECTS ON THE STRUCTURE OF RNA, AND ANOTHER PREDICTING NONCODING RNA STRUCTURES. BUT TO HIGHLIGHT, WE’VE DONE A COUPLE OF PIECES OF WORK ON MICRORNA PREDICTION, AND ONE IS THE MINOTAR PROGRAM, WHICH LOOKS FOR MICRORNA TARGETS. SURPRISINGLY THEY ARE MORE PREVALENT IN ORFS, AND MORE PREVALENT IN SOME SITES. WE FOUND LAST SUMMER, IN ANOTHER PAPER IN GENOME RESEARCH, THAT MICRORNAS TARGET SEQUENCE REPEATS IN ORF REGIONS THAT WERE PREVIOUSLY NOT KNOWN TO BE TARGETED, WHICH MAY SUGGEST ROLES IN REGULATION. AND WE TEAMED UP WITH DAVID BARTEL FOR THIS; HE DID EXPERIMENTS TO CONFIRM THIS IN HUMANS. WE ALSO WORK ON BIOLOGICAL NETWORKS. WE TAKE PROTEIN SEQUENCES AND PREDICT THEIR STRUCTURE, AND WE WORK ON SIGNALING NETWORK RECONSTRUCTION, AND WE ALSO INTEGRATE STRUCTURE-BASED PREDICTIONS WITH SYSTEMS-WIDE SIGNALING NETWORK AND NETWORK ANALYSIS. WE’RE APPLYING COMPUTATIONAL TECHNIQUES TO BIOLOGICAL PROBLEMS. SO I WANT TO THANK THE PEOPLE OF MY GROUP. THE COMPRESSIVE GENOMICS WORK WAS DONE BY TWO OF MY GRAD STUDENTS AT THE TIME, INCLUDING MICHAEL BAYM, NOW A POSTDOC AT HMS, AND THE MEDICAL GENOMICS WORK WAS DONE BY NATHAN PALMER AND PATRICK SCHMID, BOTH STUDENTS (NATHAN IS NOW AT HMS, PATRICK IS GOING THERE), AND ZAK KOHANE. THE BIOLOGICAL NETWORK WORK WITH ISORANK WAS DONE WITH HELP FROM MICHAEL BAYM AND OTHERS, AND DANNY PARK HELPED DO THE ISOBASE DATABASE. I WANT TO THANK A BUNCH OF OTHERS WHO HAVE BEEN COLLABORATIVE AND INSTRUMENTAL IN THIS WORK. THANK YOU. [APPLAUSE] >>THANKS FOR A VERY STIMULATING AND BROAD-RANGING PRESENTATION. THE FLOOR IS OPEN FOR QUESTIONS; THE MICROPHONES ARE IN THE AISLES. PLEASE USE THOSE IF YOU HAVE A QUESTION TO POSE. YES, SIR? >>I WONDER IF YOU WOULD SPECULATE WHETHER COMPARING NETWORKS ASSOCIATED WITH TOXICOLOGY OR TOXICITY IN ANIMALS FOR TESTING PHARMACEUTICALS COULD BE APPLIED TO SHOW WHEN NETWORKS ARE SIMILAR, AND SO WHEN ANIMAL CASES WOULD PREDICT HUMAN TOXICITY. >>YOU SHOULD LOOK AT MODULE AND NODE CORRELATIONS, BUT THAT’S DOABLE. THE PROBLEM IS IF YOU’RE MISSING DATA YOU DON’T KNOW IT’S NOT SO. THERE’S PROBABLY A LOT OF MISSING DATA. >>THERE’S A HUGE ISSUE OF LATE FAILURES IN DRUG DEVELOPMENT, BECAUSE TOX ISSUES WERE NOT IDENTIFIED AT AN EARLY STAGE. THIS COULD BE A VALUABLE APPROACH. >>I WOULD LOVE TO TALK TO YOU ABOUT THAT. THAT SOUNDS INTERESTING. >>I LIKE THE PCA PLOTS WITH ALL THE DIFFERENT SAMPLES AND HOW YOU WERE ABLE TO STRATIFY. YOU ONLY SHOWED THE TWO PRINCIPAL COMPONENTS. THE OTHER PART: I WONDER IF YOU LOOKED AT INDEPENDENT COMPONENT ANALYSIS OR MULTIDIMENSIONAL SCALING, AND IF THAT GAVE YOU MORE OR LESS INFORMATION ON THE SAMPLES? >>WE DID LOOK AT MORE COMPONENTS. THEY ARE HARD TO PUT ON HERE. >>OF COURSE. >>FOR THE LEVEL THAT WE’RE WORKING AT RIGHT NOW WE DIDN’T NEED THEM, BUT THERE WAS DEFINITELY MORE DATA WHEN YOU WENT OUT TO A FEW MORE DIMENSIONS. I DON’T KNOW ABOUT THE MULTIDIMENSIONAL SCALING. >>GOTCHA. OKAY. >>BONNIE, IN THAT ANALYSIS WHERE YOU WERE DOING THE LEAVE-ONE-OUT EXPERIMENTS TO SEE IF THEY MAPPED TO WHERE THEY SHOULD BASED ON CELLULAR BASIS OF ORIGIN, YOU SAID YOU GOT IT RIGHT ABOUT 92.8% OF THE TIME. IT WOULD BE INTERESTING TO LOOK AT THE ONES THAT DIDN’T, BECAUSE THERE MIGHT BE INTERESTING BIOLOGY THERE IF A DATA SET DIDN’T LAND WHERE YOU EXPECT. DID YOU? >>YOU’RE SPEAKING LIKE A TRUE BIOLOGIST. THEY WANT TO KNOW THE CASES WHERE COMPUTATIONAL TECHNIQUES DON’T WORK. >>EXACTLY. >>WE DIDN’T. WE WERE JUST TRYING TO VALIDATE. THAT’S A VERY GOOD POINT. YEAH, IT ALSO COULD BE THAT A LOT OF THAT IS ERRONEOUS MAPPING. >>COULD BE, RIGHT. >>A LOT OF THE SAMPLES ARE PROBABLY MISLABELED, AND THERE MAY BE NOISE IN THE DATA. OH, YEAH, YEAH, YEAH. WE HAD TO CURATE THIS 3,000-SAMPLE SET. WE HAD TO REALLY LOOK AT IT TO GET OUR CURATED MACHINE-READABLE DATABASE. 
IT WAS KIND OF A NIGHTMARE. NOW THAT WE HAVE THE 3,000, WE’VE BEEN ABLE TO RUN IT AUTOMATICALLY TO CHARACTERIZE TENS OF THOUSANDS MORE SAMPLES. WE DON’T ONLY HAVE 3,000 NOW. ANYWAY. >>ANOTHER QUESTION? >>YES. SO AGAIN THE PCA MAP WAS SIMILAR TO BARABASI’S MAP. CAN YOU EXPLAIN SOME OF THE SIMILARITIES BETWEEN THAT NETWORK AND WHAT YOU’VE DONE AS WELL? >>WELL, I DON’T KNOW WHICH BARABASI NETWORK YOU’RE TALKING ABOUT. HE HAS A LOT. >>BASICALLY THE DISEASE NETWORK WHERE YOU SHOWED... >>WHAT DISEASE IS RELATED? WE’RE NOT SHOWING DISEASE RELATIONSHIPS. WE’RE KIND OF HIGHLIGHTING SIMILAR TISSUES AND THEN WE’RE PLACING DISEASE SAMPLES ON THOSE MAPS. >>YOU’RE BASING IT ON TISSUES. >>BASING IT ON TISSUES. >>I SEE, I SEE. >>AND WE’RE GIVING YOU THE PHENOTYPIC SAMPLES THEY ARE MOST ENRICHED FOR. >>HOW BIG A PROBLEM IS MISSING DATA FOR YOUR FUNCTIONAL ORTHOLOGY ANALYSIS? IT SEEMS LIKE YOU CAN’T GO THERE WITHOUT A COMPLETE DATA SET. >>YOU ARE SO RIGHT. IT IS A BIG PROBLEM, ESPECIALLY IF YOU’RE TRYING TO DO MOUSE DATA. FORTUNATELY, WE CAN ADJUST THE ALPHA PARAMETER AND WE CAN WEIGHT THE SEQUENCE DATA MORE IN THE CASES WHERE WE DON’T HAVE THE NETWORK DATA. BUT THAT’S WHY WE ALSO WANT TO USE GENETIC INTERACTION DATA. WE HAD A LOT MORE OF THAT. >>GOT IT. >>THANKS. >>WELL, IT’S BEEN A FASCINATING CONVERSATION. IF YOU WANT TO CONTINUE THE CONVERSATION WITH BONNIE, YOU’RE WELCOME TO SPEAK WITH THE PRESENTER HERE DOWN FRONT. LET US THANK THE PRESENTER ONE MORE TIME. THANK YOU, DR. BERGER. [APPLAUSE]
