RWF Consulting LLC
Advancing Business Processes Worldwide

Big Data

The Big Data Blog, Part II: Daniela Witten

In our second interview, we discuss Big Data with Daniela Witten, Assistant Professor of Biostatistics at the University of Washington. Daniela, who was featured three times in Forbes’ 30 Under 30, works on statistical machine learning to find ways of using Big Data to solve problems in genomics and the biomedical sciences.

17 March 2014

Nicholas Bashour



Center for Science, Technology, and Security Policy

Leading up to the AAAS Center for Science, Technology, and Security Policy (CSTSP) and Federal Bureau of Investigation’s public event on April 1, “Big Data, Life Sciences, and National Security,” CSTSP is bringing you a series of interviews with leading experts on Big Data. In our previous post, we talked with Subha Madhavan, Director of the Innovation Center for Biomedical Informatics at the Georgetown University Medical Center, about Big Data in the biomedical field.

Today’s expert is Daniela Witten, Assistant Professor of Biostatistics at the University of Washington. Daniela, who was featured three times in Forbes’ 30 Under 30, works on statistical machine learning to find ways of using Big Data to solve problems in genomics and biomedical sciences.

CSTSP: How is Big Data defined?

Daniela: Big Data is a catchphrase for the massive amounts of data being collected in a variety of fields, like astrophysics, genomics, sociology, and marketing. Big Data has gotten a lot of attention recently as people have become increasingly aware of this data explosion, and of the potential to perform statistical analyses on these data in order to answer really interesting, and previously unanswerable questions.

CSTSP: What, in your opinion, are the most innovative applications of Big Data?

Daniela: I’m particularly interested in Big Data in biomedical research. For instance, how can we use Big Data to determine the best way to treat a cancer patient, or to identify brain regions that are affected in certain diseases? In that context, the data are not only big, but also “high-dimensional” — they are characterized by having many more features or measurements (e.g. brain regions) than observations (e.g. patients with a particular disease).  This high dimensionality leads to a lot of statistical challenges.

CSTSP: You mentioned data that is “high-dimensional.” Can you expand a little more on what that means, what it does, and how it is applied?

Daniela: When we talk about "high-dimensional" data, we mean data in which there are more measured variables, or in the language of statisticians, "features," than observations on which those variables are measured, such as patients, rats, or whatever else we are taking measurements on.

For instance, suppose we are interested in predicting a kid's height on the basis of his/her shoe size and his/her father's height. We need to collect some data in order to build this predictive model. We might collect data from 200 kids; for each kid we measure the kid's shoe size, the father's height, and the kid's height. Then we have three measured variables and 200 observations. This is "low-dimensional" data, to which classical statistical methods, like linear regression, are applicable.

Now instead, suppose we want to predict a kid's height using his DNA sequence: that is, 3 billion variables, because a person's DNA sequence is 3 billion base pairs long. If we collect data from 200 kids, and measure height and DNA sequence for each kid, then we now have billions of measured variables and 200 observations. This is "high-dimensional" data -- and it leads to a lot of statistical challenges! Classical statistical methods like linear regression cannot be applied, and there is a great danger of overfitting the data.
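To make the danger concrete, here is a small sketch (in Python with NumPy, using invented numbers) of why classical least squares breaks down when there are more variables than observations: even when every feature is pure noise with no relation to height, a linear model can reproduce the observed heights essentially exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 200, 1000                 # 200 kids, 1,000 measured variables
X = rng.normal(size=(n, p))      # pure noise: NO real relation to height
y = rng.normal(loc=150, scale=10, size=n)   # kids' heights in cm

# With p > n the linear system is underdetermined, so least squares
# can fit the training heights essentially exactly.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
max_residual = float(np.max(np.abs(X @ beta - y)))

print(f"largest training error: {max_residual:.2e}")  # effectively zero
```

A model that "perfectly" fits noise has clearly learned nothing about height, which is exactly why classical methods cannot simply be applied in the high-dimensional setting.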

CSTSP: What does “overfitting the data” refer to?

Daniela: "Overfitting" refers to a model that performs well on the data for which it was developed but performs poorly on new data not used in model development. In the example of predicting kids' heights, if you develop a model that does GREAT on the 200 kids used to fit the model, but then does poorly at predicting the heights of the next 200 kids who walk into your office, then you have overfit the data!
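A hypothetical simulation of the height example shows what this looks like in practice. All numbers below are invented for illustration; the point is only the gap between performance on the 200 kids used to fit the model and on the next 200 kids.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_kids(n, rng):
    shoe = rng.normal(22, 2, n)           # shoe size (cm)
    dad = rng.normal(178, 7, n)           # father's height (cm)
    junk = rng.normal(size=(n, 998))      # 998 irrelevant measurements
    height = 50 + 2.0 * shoe + 0.3 * dad + rng.normal(0, 5, n)
    return np.column_stack([shoe, dad, junk]), height

X_train, y_train = simulate_kids(200, rng)   # the kids used to fit the model
X_test, y_test = simulate_kids(200, rng)     # the next 200 kids

# With 1,000 variables and 200 kids, least squares interpolates the
# training data: it looks GREAT on the kids it was fit to...
beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

def rmse(X, y):
    return float(np.sqrt(np.mean((X @ beta - y) ** 2)))

train_err = rmse(X_train, y_train)   # essentially 0
test_err = rmse(X_test, y_test)      # ...but predicts new kids poorly
print(f"train RMSE: {train_err:.2f}, test RMSE: {test_err:.2f}")
```

The near-zero training error and large test error are the signature of overfitting.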

CSTSP: In what ways do you think the Big Data field will change or grow in the near future?

Daniela: For the most part, the insights that are being drawn from Big Data right now are just the low-hanging fruit. This is because the statistical techniques needed to delve more deeply into this type of data are lagging a step behind.

There are two reasons that we need new statistical techniques to make sense of Big Data: (1) even simple scientific questions cannot be answered on the basis of Big Data using existing statistical techniques; and (2) many of the scientific questions that we wish to answer using Big Data are not simple.

Classical statistical techniques are intended for the setting in which we have a large number of observations, for instance, patients enrolled in a study, and a small number of features or measurements, for instance, height, weight, and blood pressure, per observation. But in the context of Big Data, we are often faced with “high-dimensional” situations where there are many more features than observations.

For instance, if we perform brain imaging on cancer patients, we might have hundreds of thousands, or even millions, of features (brain regions for which measurements are taken), but only hundreds of observations (cancer patients).  So many classical statistical techniques, such as linear regression, cannot be applied, and there is a need for new statistical methods that are well-suited for this “high-dimensional” scenario. This has been an active area of statistical research in recent years, but a lot more work is needed in order for valid statistical conclusions to be reliably drawn based on this type of Big Data.

Furthermore, in gleaning insights from Big Data, we face an additional problem: many of the scientific questions that we wish to answer on the basis of Big Data have not yet been well-studied from a statistical perspective, because those questions simply didn’t exist for old-fashioned "Small Data”!

Collaborative filtering is one example of a statistical method that has been newly-developed in the context of Big Data, in order to answer a question that didn't arise with Small Data. Collaborative filtering systems are used by companies like Amazon to suggest to a customer items that he or she might want to purchase, based on his or her past purchase history as well as purchases made by other customers.

These systems have potential applications in medicine. For instance, based on a patient's electronic medical record (EMR) as well as the EMRs of many other patients, can we identify diseases or conditions for which that patient is at risk? It is important to further develop and refine collaborative filtering techniques before we apply them in a clinical setting where they will play a role in patient care.
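As a sketch of the underlying idea (emphatically not a clinical tool), here is a minimal user-based collaborative filter. The 0/1 matrix and the patient/condition framing are invented for illustration; real systems use far richer data and more sophisticated models.

```python
import numpy as np

# Toy matrix: rows = patients, columns = conditions.
# 1 = condition recorded in the EMR, 0 = not recorded (yet).
R = np.array([
    [1, 1, 0, 1, 0],
    [1, 1, 1, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 1, 0],
    [0, 1, 1, 0, 1],
], dtype=float)

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def predict_scores(R, user):
    """Score unrecorded conditions for `user` via similarity-weighted
    votes from the other rows (other patients)."""
    sims = np.array([cosine_sim(R[user], R[v]) for v in range(len(R)) if v != user])
    others = np.array([R[v] for v in range(len(R)) if v != user])
    scores = sims @ others / (sims.sum() + 1e-12)
    scores[R[user] == 1] = -np.inf   # don't re-flag what's already recorded
    return scores

scores = predict_scores(R, user=0)
print("highest-risk unrecorded condition for patient 0:", int(np.argmax(scores)))
```

Here patient 0 is flagged for the condition most common among the patients whose records look most like theirs; the same weighted-neighbor logic underlies the "customers who bought X also bought Y" systems mentioned above.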

CSTSP: In regards to the new statistical techniques and challenges, you mentioned “collaborative filtering.” What are some of the other innovative statistical techniques that are being developed to better utilize Big Data?

Daniela: I am very interested in developing methods for "graphical modeling" on the basis of high-dimensional data. A "graphical model" is a representation of the dependence relationships among a set of variables. Graphical models have applications in many fields, but I am particularly interested in applications in biology. Picture a graphical model as a network diagram: in the context of biology, you can think of each dot as a gene, and each line as representing two genes that have some sort of relationship between them (for instance, maybe one gene acts upon another).

Why is this interesting? There are more than 20,000 genes in humans, and it is believed that genes work together in complex ways in order to perform important biological functions. It is hoped that understanding the relationships among genes will lead to important insights about biology. People would like to discover, for instance, that Gene X affects Genes Y, Z, and W, and that Gene W affects Genes A, B, C, D, E.

Understanding these relationships among genes could also lead to insights about human diseases. For instance, maybe in normal tissue, Genes A, B, C, and D interact, but in some particular disease, Gene A doesn't properly interact with Genes B, C, and D.  This could lead to a better understanding of the disease, as well as possible therapeutic targets.

But drawing these sorts of conclusions based on the available data is really challenging: the data are very high-dimensional, typically consisting of around 20,000 gene measurements on a few hundred patients. Better statistical methods are needed, and part of my research involves developing them.
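In the Gaussian case, the edges of a graphical model correspond to nonzero entries of the inverse covariance (precision) matrix. The toy simulation below (three made-up "genes" and many observations) shows the idea; with 20,000 genes and only a few hundred patients, the sample covariance cannot even be inverted, which is why sparse high-dimensional methods such as the graphical lasso are an active research area.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Toy "genes": gene A drives gene C; gene B is independent of both.
A = rng.normal(size=n)
B = rng.normal(size=n)
C = A + 0.5 * rng.normal(size=n)
X = np.column_stack([A, B, C])

# Edges of a Gaussian graphical model <=> nonzero off-diagonal entries
# of the precision (inverse covariance) matrix.
prec = np.linalg.inv(np.cov(X, rowvar=False))

def partial_corr(i, j):
    return -prec[i, j] / np.sqrt(prec[i, i] * prec[j, j])

print(f"A-C partial correlation: {partial_corr(0, 2):.2f}")  # large: draw an edge
print(f"A-B partial correlation: {partial_corr(0, 1):.2f}")  # near 0: no edge
```

The estimated graph recovers exactly the relationship that was built into the simulation: a line between A and C, and no line touching B.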

CSTSP: What are some of the biggest challenges or risks facing Big Data in the near future? How can groups working with Big Data avoid overlooking these challenges and risks?

Daniela: One of the biggest challenges about Big Data is that it is easy to confuse noise with signal — or, as a statistician would describe it, to overfit the data. Unfortunately, it is much easier to accidentally overfit with Big Data than with Small Data — and in fact, overfitting is virtually guaranteed unless one vigilantly guards against it! For instance, if overfitting has occurred, then an algorithm that seems to perfectly predict patient response to chemotherapy may give terrible results on new patients not used in developing the algorithm.

Some approaches, like cross-validation and the use of a separate test set, can and should always be used to reduce the risk of overfitting. However, these approaches have their limitations -- for instance, they result in a substantial reduction in the sample size, which can be a big problem in the context of biomedical data for which observations (e.g. patients) are quite limited.
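As a sketch of how cross-validation guards against overfitting: every observation is predicted by a model that never saw it, so the resulting error estimate reflects performance on new data. The data below are simulated purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

n, p = 200, 5
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(0, 1, n)  # noise sd = 1

def cv_rmse(X, y, k=5):
    """k-fold cross-validation for least squares: each fold is predicted
    by a model fit on the other k-1 folds."""
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    errs = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errs.append((X[fold] @ beta - y[fold]) ** 2)
    return float(np.sqrt(np.mean(np.concatenate(errs))))

print(f"5-fold CV RMSE: {cv_rmse(X, y):.2f}")  # close to the noise level of 1.0
```

Because each held-out fold plays the role of "new patients," the cross-validated error honestly tracks the noise level rather than the optimistic training error.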

To overcome this problem, the statistical community has been working hard in recent years to develop better ways to measure uncertainty and perform inference for models applied to Big Data.



Last Updated 10 Apr 2014


© 2017 American Association for the Advancement of Science


Cancer’s Big Data Problem


The US Department of Energy (DOE) and the National Cancer Institute will use DOE supercomputers to help fight cancer by building sophisticated models based on population, patient, and molecular data, explains “Cancer’s Big Data Problem,” from IEEE Computing in Science & Engineering.



Becoming a Data Scientist

Data Science, Machine Learning, Big Data Analytics, Cognitive Computing … all of us have been avalanched with articles, skills-demand infographics, and points of view on these topics (yawn!). One thing is for sure: you cannot become a data scientist overnight. It's a journey, and a challenging one for sure. But how do you go about becoming one? Where do you start? When do you start seeing light at the end of the tunnel? What is the learning roadmap? What tools and techniques do you need to know? How will you know when you have achieved your goal?

Given how critical visualization is for data science, it is ironic that I was not able to find (with a few exceptions) a pragmatic yet visual representation of what it takes to become a data scientist. So here is my modest attempt at creating a curriculum, a learning plan that one can use on the journey to becoming a data scientist. I took inspiration from metro maps and used them to depict the learning path. I organized the overall plan progressively into the following areas/domains:

  1. Fundamentals
  2. Statistics
  3. Programming
  4. Machine Learning
  5. Text Mining / Natural Language Processing
  6. Data Visualization
  7. Big Data
  8. Data Ingestion
  9. Data Munging
  10. Toolbox

Each area/domain is represented as a “metro line,” with the stations depicting the topics you must learn, master, or understand in a progressive fashion. The idea is that you pick a line, catch a train, and go through all the stations (topics) until you reach the final destination (or switch to the next line). I have marked each line 1 through 10 to indicate the order in which you travel. You can use this as an individual learning plan to identify the areas you most want to develop and then acquire those skills. By no means is this the end, but it is a solid start.


Delivering value from big data with Microsoft R Server and Hadoop


Businesses are continuing to invest in Hadoop to manage analytic data stores due to its flexibility, scalability, and relatively low cost. However, Hadoop’s native tooling for advanced analytics is immature; this makes it difficult for analysts to use without significant additional training and limits the ability of businesses to deliver value from growing data assets.

Microsoft R Server leverages open-source R, the standard platform for modern analytics. R has a thriving community of more than two million users who are trained and ready to deliver results.

Microsoft R Server runs in Hadoop. It enables users to perform statistical and predictive analysis on big data while working exclusively in the familiar R environment. The software uses Hadoop’s ability to apportion large computations, transparently distributing work across the nodes of a Hadoop cluster. Microsoft R Server works inside Hadoop clusters without the complex programming typically associated with parallelizing analytical computation.

By leveraging Microsoft R Server in Hadoop, organizations tap a large and growing community of analysts—and all of the capabilities in R—with true cross-platform and open standards architecture.



Accelerating R analytics with Spark and Microsoft R Server for Hadoop


Analysts predict that the Hadoop market will reach $50.2 billion USD by 2020. Applications driving these large expenditures include some of the most important workloads for businesses today:

• Analyzing clickstream data, including site-side clicks and web media tags.

• Measuring sentiment by scanning product feedback, blog feeds, social media comments, and Twitter streams.

• Analyzing behavior and risk by capturing vehicle telematics.

• Optimizing product performance and utilization by gathering data from built-in sensors.

• Tracking and analyzing people and material movement with location-aware systems.

• Identifying system performance issues and intrusion attempts by analyzing server and network logs.

• Enabling automatic document and speech categorization.

• Extracting learning from digitized images, voice, video, and other media types.

Predictive analytics on large data sets provides organizations with a key opportunity to improve a broad variety of business outcomes, and many have embraced Apache Hadoop as the platform of choice.

In the last few years, large businesses have adopted Apache Hadoop as a next-generation data platform, one capable of managing large data assets in a way that is flexible, scalable, and relatively low cost. However, to realize predictive benefits of big data, organizations must be able to develop or hire individuals with the requisite statistics skills, then provide them with a platform for analyzing massive data assets collected in Hadoop “data lakes.”

As users adopted Hadoop, many discovered that performance and complexity limited Hadoop’s usefulness for broad predictive analytics. In response, the Hadoop community has focused on the Apache Spark platform to provide Hadoop with significant performance improvements. With Spark atop Hadoop, users can leverage Hadoop’s big-data management capabilities while achieving new performance levels by running analytics in Apache Spark.

What remains is a challenge—conquering the complexity of Hadoop when developing predictive analytics applications.

In this white paper, we’ll describe how Microsoft R Server helps data scientists, actuaries, risk analysts, quantitative analysts, product planners, and other R users to capture the benefits of Apache Spark on Hadoop by providing a straightforward platform that eliminates much of the complexity of using Spark and Hadoop to conduct analyses on large data assets.