The MAS in Data Science and Engineering is a 38-unit degree program that consists of eight 4-unit courses, a 2-unit case-studies course, and a 4-unit capstone project. The program begins in the Fall quarter and can be completed in two years of consecutive Fall, Winter, and Spring quarters; no courses are offered during the summer. The capstone gives students an opportunity to integrate the knowledge acquired over previous quarters in a written report and oral presentation.
Fall Quarter - Year One
Python for Data Analysis (4 units)
The goal of this course is to bring students with diverse backgrounds and experience to a common level of competency in programming in the context of complex and noisy data. Solid competency in Python programming gives students autonomy and independence in their work. Introduction to object-oriented programming using Python. Regular expressions. NumPy and numerical processing. IPython and plotting. Data analysis using pandas. Web page scraping using Scrapy. The Twitter API. NLTK.
Case Studies in Data Science (2 units)
Case studies discussed by speakers from industry, government, and academia expose students to the needs and uses of different technologies and their roles in model building.
Winter Quarter - Year One
Data Management Systems (4 units)
This course introduces the management of structured data, beginning with database models including relational, hierarchical, and network approaches. It will also cover topics in database system implementation including query languages and system architectures; parallel, column-oriented, and array-based database systems; advanced SQL features including user-defined functions (UDFs), triggers, and statistical functions; and support for spatial data.
Probability and Statistics for Data Science (4 units)
Probability and statistics for data science. Distributions over the real line; independence, expectation, variance, correlation. Central limit theorem. Chernoff/Hoeffding bounds. Statistical tests. Bonferroni correction.
Spring Quarter - Year One
Machine Learning (4 units)
This course provides a broad introduction to the practical side of machine learning and data analysis. Topics include supervised learning methods, such as k-nearest neighbor classifiers, decision trees, boosting, and perceptrons, and unsupervised learning methods, such as k-means, PCA, and Gaussian mixture models.
Scalable Data Analysis (4 units)
The course exercises the data scientist's scalability toolbox, covering such concepts as map-reduce, streaming analysis, and external memory algorithms, as well as their implementation options in popular frameworks (e.g. Hadoop and its ecosystem: HBase, Hive, Pig and Spark, etc.). The class will include assignments that involve analyzing large existing databases.
Fall Quarter - Year Two
Data Integration & ETL (4 units)
The course is designed to provide students with the fundamentals of data integration and includes: schema mapping and matching, entity disambiguation, ontology development and management, data provenance, and crowdsourcing and machine learning as strategies for integration. The course will also require hands-on projects in which students work on a problem requiring integration of two or more datasets taken from an application domain of their choice (e.g. geospatial data, healthcare, financial applications, bioinformatics, etc.).
Beyond Relational Data Models (4 units)
The course covers data models, query languages and models of computation beyond those employed in relational databases. It addresses new developments that have gained attention with the advent of the Web 2.0 and Big Data revolutions. The topics are presented in a unifying framework and include: key-value pairs as a data model, as used in Google's Bigtable; the object-oriented data model, with its practical support in relational databases via object-relational mapping (involving the ODMG standards ODL and OQL, and recent systems such as Ruby on Rails); semi-structured databases (data organized as a graph with labels on nodes and edges) and query languages based on reachability constraints between nodes (conjunctive regular path queries); XML databases, as a special case of semi-structured databases in which the graph is a tree (involving associated standards such as XML Schema, XPath and XQuery); and RDF databases (with the associated OWL and SPARQL standards).
Winter Quarter - Year Two
Data Visualization (SDSC: 4 units)
The goal of the course is to use visualization as a tool to explore trends and relationships, confirm hypotheses, communicate findings, and gain insight about data. This course will focus on teaching students the principles and techniques for creating visual representations from raw data. The course exercises will be based on publicly available datasets and will use freely available tools such as D3.js and VisIt. The course will be modeled on Stanford’s CS448 visualization course and will include an introduction to visualization, a review of visualization foundations, color, interaction, dashboards and heat maps, an introduction to D3.js, high-dimensional data, network data, geographic data, text data, scientific visualization (isosurfaces, volume rendering), and an introduction to VisIt.
Data Science Design Capstone Project (2 units)
A team design project in the final two quarters of the program culminates in a final report and an oral presentation of the capstone project. In addition, there may be a demonstration of the working prototype. The project will start by identifying a domain of interest and the available data sources that will be used to study the domain. From this starting point there will be two parallel and interdependent lines of work: data extraction, transformation, and loading (ETL), and statistical analysis and model building. The ultimate goal will be to present a processing pipeline that transforms the raw data into more usable forms, along with models that separate the predictable from the unpredictable aspects of the underlying system. Examples of previous capstone projects can be found here.
Spring Quarter - Year Two
Data Science Design Capstone Project - Continued (2 units)
A team design project in the final two quarters of the program culminates in a final report and an oral presentation of the capstone project. In addition, there may be a demonstration of the working prototype. The project will start by identifying a domain of interest and the available data sources that will be used to study the domain. From this starting point there will be two parallel and interdependent lines of work: data extraction, transformation, and loading (ETL), and statistical analysis and model building. The ultimate goal will be to present a processing pipeline that transforms the raw data into more usable forms, along with models that separate the predictable from the unpredictable aspects of the underlying system. Examples of previous capstone projects can be found here.
Supplemental Information
Prospective students often ask for links to resources that would be helpful to review in preparation for the DSE program. In response, the faculty have created a page of "brush up" materials covering math, programming and databases.