Exploratory data analysis

From Wikipedia, the free encyclopedia

Exploratory data analysis (EDA) is the part of statistical practice concerned with reviewing, communicating and using data when little is known about the system that generated them. The term was coined by John Tukey. Many EDA techniques have been adopted into data mining, and they are also taught to young students as an introduction to statistical thinking.

Tukey's books were notoriously opaque, and so several attempts were made to popularise his EDA ideas. Prominent among these was the Statistics in Society (MDST242) course of The Open University.

Tukey held that statistics placed too much emphasis on evaluating and testing given hypotheses (confirmatory data analysis), and that the balance needed to be redressed in favour of using data to suggest hypotheses to test. In particular, confusing the two types of analysis and applying both to the same set of data can lead to systematic bias, owing to the problems inherent in testing hypotheses suggested by the data.
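The bias that arises from testing a hypothesis on the data that suggested it can be illustrated with a small simulation. The sketch below is not taken from Tukey; the function name selection_bias_demo and all of its parameters are assumptions made for the example. Every variable is pure noise with a true mean of zero, yet the variable chosen because it looks most striking appears to have a much larger effect on the exploration data than on fresh data.

```python
import random
import statistics


def selection_bias_demo(n_vars=20, n_obs=30, n_trials=200, seed=0):
    """Average apparent effect of the 'best' noise variable, measured on
    the exploration data and on freshly drawn confirmation data."""
    rng = random.Random(seed)
    same, fresh = [], []
    for _ in range(n_trials):
        # Every variable is pure noise: the true mean is exactly 0.
        explore = [[rng.gauss(0, 1) for _ in range(n_obs)]
                   for _ in range(n_vars)]
        means = [statistics.fmean(col) for col in explore]
        # Exploration step: pick the variable that looks most striking.
        best = max(range(n_vars), key=lambda j: abs(means[j]))
        same.append(abs(means[best]))
        # Confirmation step: new, independent observations of that variable.
        confirm = [rng.gauss(0, 1) for _ in range(n_obs)]
        fresh.append(abs(statistics.fmean(confirm)))
    return statistics.fmean(same), statistics.fmean(fresh)


same_avg, fresh_avg = selection_bias_demo()
print(f"apparent effect on the exploration data: {same_avg:.3f}")
print(f"apparent effect on fresh data:           {fresh_avg:.3f}")
```

Keeping separate data for the exploratory and confirmatory steps, as the confirmation step does here, is one common safeguard; it is offered as an illustration rather than as a procedure prescribed by the text.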

The objectives of EDA are to:

The principal graphical tools used in EDA are:

The principal quantitative tools are:

  • Median polish
  • Letter values
  • Resistant line
  • Resistant smooth
  • Rootogram
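As an illustration of the first tool in the list, the following is a minimal sketch of median polish for a two-way table. It alternately sweeps row and column medians out of the table, so that each cell is decomposed as overall + row effect + column effect + residual. The function name median_polish, the max_iter parameter and the stopping rule are assumptions made for the example, not a reference implementation.

```python
import statistics


def median_polish(table, max_iter=10):
    """Decompose a two-way table as overall + row + column + residual."""
    resid = [list(r) for r in table]
    n_rows, n_cols = len(resid), len(resid[0])
    overall = 0
    row_eff = [0] * n_rows
    col_eff = [0] * n_cols
    for _ in range(max_iter):
        # Row sweep: move each row's median from the residuals
        # into that row's effect.
        row_med = [statistics.median(r) for r in resid]
        for i in range(n_rows):
            row_eff[i] += row_med[i]
            resid[i] = [x - row_med[i] for x in resid[i]]
        # Re-centre the column effects and absorb the shift into overall.
        shift = statistics.median(col_eff)
        col_eff = [c - shift for c in col_eff]
        overall += shift
        # Column sweep: the same step applied to columns.
        col_med = [statistics.median(col) for col in zip(*resid)]
        for j in range(n_cols):
            col_eff[j] += col_med[j]
        resid = [[x - col_med[j] for j, x in enumerate(r)] for r in resid]
        shift = statistics.median(row_eff)
        row_eff = [r - shift for r in row_eff]
        overall += shift
        # Stop once a full pass changes nothing.
        if all(m == 0 for m in row_med) and all(m == 0 for m in col_med):
            break
    return overall, row_eff, col_eff, resid


# An additive table (21 + row effect + column effect) with one
# contaminated cell: the middle entry has had 100 added to it.
table = [[10, 20, 30],
         [11, 121, 31],
         [12, 22, 32]]
overall, row_eff, col_eff, resid = median_polish(table)
print(overall, row_eff, col_eff)
for r in resid:
    print(r)
```

Because medians rather than means are swept out, an isolated aberrant cell tends to remain in the residuals instead of distorting the fitted row and column effects.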

Many EDA ideas can be traced back to earlier authors, for example:

  • Francis Galton, who emphasised order statistics and percentiles
  • Arthur Bowley, who used precursors of the stemplot and five-figure summary (Bowley actually used a "seven-figure summary", comprising the extremes, deciles and quartiles, along with the median)
  • Andrew Ehrenberg, who set out a philosophy of data reduction (see his book of the same name)
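A seven-figure summary of the kind attributed to Bowley above can be sketched as follows. The text names only "the extremes, deciles and quartiles, along with the median"; taking the deciles to be the first and ninth, and using the interpolation rule of Python's statistics.quantiles with method="inclusive", are both assumptions made for the example. The function name seven_figure_summary is hypothetical.

```python
import statistics


def seven_figure_summary(data):
    """Return (min, 1st decile, lower quartile, median,
    upper quartile, 9th decile, max) for a sequence of numbers."""
    values = sorted(data)
    # Nine cut points dividing the data into ten parts.
    deciles = statistics.quantiles(values, n=10, method="inclusive")
    # Three cut points dividing the data into four parts.
    quartiles = statistics.quantiles(values, n=4, method="inclusive")
    return (
        values[0],
        deciles[0],
        quartiles[0],
        statistics.median(values),
        quartiles[2],
        deciles[8],
        values[-1],
    )


print(seven_figure_summary(range(101)))
```

Dropping the two deciles from the result leaves the five-figure summary mentioned in the same list entry.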

The Open University course Statistics in Society (MDST242) took the above ideas and merged them with Gottfried Noether's work, which introduced statistical inference via coin-tossing and the median test.

For details of the above, see John Bibby's book HOTS: History of Teaching Statistics.

  • XLisp-Stat (free software and Lisp based EDA development framework for Mac, PC and X-Windows)
  • ViSta (free interactive software based on Xlisp-Stat for EDA)
  • DataDesk (free-to-try commercial EDA software for Mac and PC)
  • Orange (free component-based software for interactive EDA and machine learning)
  • GGobi (free interactive multivariate visualization software linked to R)
  • MANET (free Mac-only interactive EDA software)
  • Mondrian (free interactive software for EDA)
  • Fathom (for high-school and intro college courses)
  • TinkerPlots (for upper elementary and middle school students)
  • CMU-DAP (Carnegie-Mellon University Data Analysis Package, FORTRAN source for EDA tools with English-style command syntax, 1977)
  • Exploratory data analysis (Statgraphics Centurion)

  • Hoaglin, D C; Mosteller, F & Tukey, John Wilder (Eds) (1985). Exploring Data Tables, Trends and Shapes. ISBN 0-471-09776-4. 
  • Hoaglin, D C; Mosteller, F & Tukey, John Wilder (Eds) (1983). Understanding Robust and Exploratory Data Analysis. ISBN 0-471-09777-2. 
  • Tukey, John Wilder (1977). Exploratory Data Analysis. Addison-Wesley. ISBN 0-201-07616-0. 
  • Velleman, P F & Hoaglin, D C (1981). Applications, Basics and Computing of Exploratory Data Analysis. ISBN 0-87150-409-X.

  • Leinhardt, G & Leinhardt, S (1980). Exploratory Data Analysis: New Tools for the Analysis of Empirical Data. Review of Research in Education, Vol. 8, pp. 85-157.