Note

This course is no longer offered! You might want to check out a newer course, Stats 352.

Philosophy

Computing for Data Science presents computing tools and concepts for all stages of dealing with the modern data deluge: statistical computing at the center, but also the essential surrounding tasks, including data organization, presentation of results, and the user interface. This approach is needed to deal with the challenges posed by modern technology, challenges that are also opportunities for better use of data. The size and complexity of data sources have increased enormously, while the importance of learning from the data has been recognized as never before. New modes of computing such as large-scale parallelism and cloud computing can help, but they require new approaches to programming. The key challenge, however, is to use our own time effectively by choosing the best programming approach for each stage of a project.

To meet these challenges, we present a range of computing paradigms and corresponding languages, each designed for ease of use but also providing a rich set of tools. We use the R language and the thousands of packages written for it for core statistical computing.

Other languages are discussed for tasks where they excel. For example, Python provides a similarly strong language and a set of supporting packages for data processing, scientific computing and interactive interfaces. An approach through inter-system interfaces and interactive front ends allows you to add features from these languages without mastering all the details. Alternatively, a solution can be programmed largely in another system when appropriate, and then made available in R. Object-oriented programming techniques are particularly valuable. We discuss these both in the functional form found in R and also in the encapsulated form typical of languages such as Java and C++.
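As a rough illustration of the two object-oriented styles mentioned above, the following Python sketch contrasts a generic function with per-type methods (loosely analogous to the functional form in R) with a class whose methods belong to the object and change its state (the encapsulated form). All names here (Circle, Rectangle, area, Account) are hypothetical and chosen only for illustration; this is not course material.

```python
import math
from dataclasses import dataclass
from functools import singledispatch


@dataclass
class Circle:
    radius: float


@dataclass
class Rectangle:
    width: float
    height: float


# Functional style: the generic function lives outside the classes, and a
# method is registered for each argument type (hypothetical example names).
@singledispatch
def area(shape) -> float:
    raise TypeError(f"no area method for {type(shape).__name__}")


@area.register
def _(shape: Circle) -> float:
    return math.pi * shape.radius ** 2


@area.register
def _(shape: Rectangle) -> float:
    return shape.width * shape.height


# Encapsulated style: methods belong to the object and modify its own state.
class Account:
    def __init__(self, balance: float) -> None:
        self.balance = balance

    def deposit(self, amount: float) -> float:
        self.balance += amount
        return self.balance


if __name__ == "__main__":
    print(area(Rectangle(2.0, 3.0)))  # dispatches on the argument's type
    acct = Account(10.0)
    print(acct.deposit(5.0))          # the object updates itself in place
```

In the functional style, new methods can be added for existing classes without editing them; in the encapsulated style, behavior and mutable state travel together with the object.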

Description

This course covers programming and computing techniques to support projects in data analysis and related research. It is suitable for students at the graduate level in statistics or in other fields where substantial data analysis and development of associated computational software is part of the student’s research activity. Prerequisites are basic competence in computer programming and in statistics plus a serious interest in applying computing to data analysis. Experience with R will be an advantage.

The course will cover the major concepts in programming with R, emphasizing its use to implement and share research and applications of data analysis through R packages. It should be of interest to anyone involved in applying these tools in Statistical Computing, Bioinformatics and Data Mining.

To meet the challenges of modern data analysis, other languages and systems will be included to support data acquisition and management, data visualization and graphics, and user interfaces, especially via the Web. Python will be emphasized for its effective use in many of these areas. Techniques using Java and XML will be included. Discussions will emphasize inter-system interfaces. Examples from bioinformatics, databases, and distributed and web-based data sources will be used.

The course will include four homework exercise sets. A main requirement is a final project, either an R package or other software contribution of similar scope. Students may choose from a list of projects that will be provided or propose a project. (The latter is subject to the instructor’s approval.) Project proposal details will be forthcoming in class.

Prerequisites

  • Basic programming (at the level of CS106a/CS106b)

  • Basic statistics (at the level of Stat 110 or Stat 141)

Please note that hands-on programming will be necessary in this class, and that this is not an introductory R class.

Instructors

Teaching Assistants

  • See Canvas files for details.

Time & Location

  • Mon, Wed, Fri, 10:30-11:20 AM in Gates B1

Topics

Topics covered will include R programming at a fairly advanced level, objects, data manipulation, graphics, serialization formats, language interfaces, parallel computing basics, and big data resources in R. There will be one or two guest lectures, to be announced later.

Canvas

Canonical link to Stanford Canvas.

Textbook

There is no textbook that covers all the material for this class. Among the books we use are:

  • Software for Data Analysis by John Chambers, Springer (2008)
  • Extending R by John Chambers, CRC Press (2016)
  • R Graphics Cookbook by Winston Chang, O'Reilly (2012)
  • Advanced R by Hadley Wickham, CRC Press (2014)
  • R for Data Science by Hadley Wickham and Garrett Grolemund
  • R Packages by Hadley Wickham, O'Reilly (2015)
  • Deep Learning with R by Francois Chollet and J. J. Allaire

Other resources will be noted as appropriate in class.