DSE Curriculum | Master of Advanced Study Degree UC San Diego

DSE Curriculum

The MAS in Data Science and Engineering is a 38-unit degree program which consists of eight 4-unit courses, a 2-unit case-studies course, and a 4-unit capstone project. The program begins Fall quarter and can be completed in two years of consecutive Fall, Winter and Spring quarters; no courses are conducted during the summer. The capstone provides an opportunity for students to integrate knowledge acquired over previous quarters in a written report and oral presentation.



Fall Quarter

Python for Data Analysis (4 units)

The goal of this course is to bring students with diverse backgrounds and experience to a common level of competency in programming in the context of complex and noisy data. Solid competency in Python programming gives students autonomy and independence in their work. Topics: introduction to object-oriented programming using Python; regular expressions; NumPy and numerical processing; IPython and plotting; data analysis using pandas; web page scraping using Scrapy; the Twitter API; NLTK.

Case Studies in Data Science (2 units)

Case studies discussed by speakers from industry, government and academia expose students to the needs and uses of different technologies and their roles in model building.

Winter Quarter

Data Management Systems (4 units)

This course will provide an introduction to the management of structured data, beginning with database models including relational, hierarchical, and network approaches. It will also cover topics in database system implementation including query languages and system architectures; parallel, column-oriented, and array-based database systems; advanced SQL features including user-defined functions (UDFs), triggers, and statistical functions; and support for spatial data.

Probability and Statistics Using Python (4 units)

The goal of this course is to give the student a foundation in probability and statistics, using Python. Topics: distributions over the real line; independence, expectation, variance, and correlation; the central limit theorem; Chernoff/Hoeffding bounds; statistical tests; the Bonferroni correction. Book: “Think Stats: Probability and Statistics for Programmers” by Allen Downey.
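
As an illustration of one listed topic, the Bonferroni correction can be sketched in a few lines of Python. This is a minimal sketch and not course material: the function name `bonferroni_reject`, the significance level of 0.05, and the p-values are all made up for the example. When m hypotheses are tested together, each p-value is compared against alpha / m rather than alpha.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True if it is rejected.

    Hypothetical helper for illustration. Each p-value is compared
    against alpha / m, where m is the number of tests performed.
    """
    m = len(p_values)
    if m == 0:
        return []
    threshold = alpha / m
    return [p <= threshold for p in p_values]


# Made-up p-values from four tests; the per-test threshold is 0.05 / 4.
example_p_values = [0.001, 0.02, 0.04, 0.30]
print(bonferroni_reject(example_p_values))
```

Without the correction, three of the four made-up p-values fall below 0.05; with it, only the smallest one does.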

Spring Quarter

Machine Learning (4 units)

This course provides a broad introduction to the practical side of machine learning and data analysis. Topics include supervised learning, such as k-nearest neighbor classifiers, decision trees, boosting and perceptrons, and unsupervised learning, such as k-means, PCA and Gaussian mixture models.
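
To give a flavor of the supervised methods listed, here is a minimal sketch of a k-nearest neighbor classifier using only the Python standard library. It is not taken from the course; the function name `knn_predict`, the choice of Euclidean distance, and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict a label for `query` by majority vote among the k closest points.

    `train` is a list of (point, label) pairs, where each point is a
    sequence of numbers with the same length as `query`.
    """
    # Sort training items by Euclidean distance to the query, keep the k nearest.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Toy data: two well-separated clusters (hypothetical values).
toy_train = [
    ((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
    ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b"),
]
print(knn_predict(toy_train, (0.5, 0.5)))
```

Sorting the whole training set is the simplest approach and is adequate for small examples; larger datasets usually call for an index structure or a library implementation.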

Data Analysis Using Hadoop and Spark (4 units)

Map-reduce, streaming analysis, and external memory algorithms and their implementation using Hadoop and its ecosystem: HBase, Hive, Pig and Spark. The class will include assignments analyzing large existing databases.
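
The map-reduce pattern named above can be sketched in plain Python, without Hadoop or Spark. This is a single-process illustration only, assuming a word-count task; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and do not correspond to any Hadoop or Spark API.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


print(word_count(["big data", "big graphs"]))
```

In a real cluster framework the map and reduce steps run in parallel on different machines and the shuffle moves data across the network; here all three run in one process to show only the data flow.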

Year Two

Data Integration & ETL (4 units)

The course is designed to provide students with the fundamentals of data integration and includes: schema mapping and matching, entity disambiguation, ontology development and management, data provenance, and crowdsourcing and machine learning as strategies for integration. The course will also require hands-on projects in which students work on a data integration problem requiring integration of two or more datasets taken from an application domain of their choice (e.g., geospatial data, healthcare, financial applications, bioinformatics).

Data Science Design Capstone Project (2 units), Winter Quarter, Year Two

A team design project in the final two quarters of the program culminates in a final report and an oral presentation of the capstone project. In addition, there might be a demonstration of the working prototype. The project will start by identifying a domain of interest and the available data sources that will be used to study the domain. From this starting point there will be two parallel and interdependent lines of work: data extraction, transformation and loading (ETL), and statistical analysis and model building. The ultimate goal will be to present a processing pipeline that transforms the raw data into more usable forms, together with models that separate the predictable from the unpredictable aspects of the underlying system.

Data Science Design Capstone Project - Continued (2 units), Spring Quarter, Year Two

A team design project in the final two quarters of the program culminates in a final report and an oral presentation of the capstone project. In addition, there might be a demonstration of the working prototype. The project will start by identifying a domain of interest and the available data sources that will be used to study the domain. From this starting point there will be two parallel and interdependent lines of work: data extraction, transformation and loading (ETL), and statistical analysis and model building. The ultimate goal will be to present a processing pipeline that transforms the raw data into more usable forms, together with models that separate the predictable from the unpredictable aspects of the underlying system.

Elective Courses

Data Analysis Using R (4 units)

R, an open-source software project with an extensive library of freely available packages and the capability to apply most modern statistical methods, has emerged as a leading statistical computing environment. This course will focus on providing the fundamental computing skills necessary for effective data analysis and machine learning by applying modern statistical methods implemented in R. The course covers practical issues in statistical computing including data preparation, manipulation, analysis and the generation of analytical, predictive and graphical results. Topics in statistical data analysis, machine learning and graphics applications will be presented along with practical working examples. Machine learning topics are introduced as needed when addressing real-world data mining case studies.

Performance Measurement (4 units)

This course will introduce practical considerations related to the performance of big data solution approaches, cover the fundamentals of computer performance measurement, especially as applied to database and big data systems, and provide an understanding of the primary determinants of performance in big data systems. The course will cover: tools and techniques for performance monitoring and performance tuning; the role of benchmarking and how to interpret benchmark results in context; how to read query plans and perform optimizations and database tuning; behavior of schedulers and governors, including systems like YARN, Mesos, Fair Scheduler, FIFO, and HPC schedulers; performance characteristics of data analytics operations; practical limits to performance scaling; and recent results in big data performance and ongoing work in performance optimizations.

Online Analytics Applications (4 units)

The course will cover the functionality of online analytics applications from the business analyst's point of view; the basics of application and data infrastructure architecture; data organizations for systematic data precomputation; the use of data warehouses, data cubes (with emphasis on ROLAP organizations) and materialized views; Fast Data use cases and relevant technologies: combining transactional and analytical databases; incremental maintenance of precomputed views; use of novel database systems (parallel, column-oriented, and Hadoop/MapReduce-enhanced); and visual interfaces and dashboards for custom analytics applications. It will also cover application development technologies and methodologies, with emphasis on Model-View frameworks that facilitate reflecting state in the view; an overview of web-based visualization libraries; custom web-based visualizations (D3); and achieving online performance with approximations.

Data Visualization (SDSC: 4 units)

The goal of the course is to use visualization as a tool to explore trends and relationships, confirm hypotheses, communicate findings and gain insight about data. This course will focus on teaching students the principles and techniques for creating visual representations from raw data. The course exercises will be based on publicly available datasets and will use freely available tools such as D3.js and VisIt. The course will be modeled on Stanford’s CS448 visualization course and will include an introduction to visualization, a review of visualization foundations, color, interaction, dashboards and heat maps, an introduction to D3.js, high-dimensional data, network data, geographic data, text data, scientific visualization (isosurfaces and volume rendering), and an introduction to VisIt.

Beyond Relational Data Models (4 units)

The course covers data models, query languages and models of computation beyond those employed in relational databases. It addresses new developments that have gained attention with the advent of the Web 2.0 and Big Data revolutions. The topics are presented in a unifying framework and include: key-value pairs as a data model, as used in Google's Bigtable; the object-oriented data model, with its practical support in relational databases via object-relational mapping (involving the ODMG standards ODL and OQL, and recent systems such as Ruby on Rails); semi-structured databases (data organized as a graph with labels on nodes and edges) and query languages based on reachability constraints between nodes (conjunctive regular path queries); XML databases, as a special case of semi-structured databases in which the graph is a tree (this involves associated standards such as XML Schema, XPath and XQuery); and RDF databases (with the associated OWL and SPARQL standards).

Managing Large-Scale Graph Data (4 units)

Large-scale graphs appear in many diverse applications including the World Wide Web, social networks, human communication (e.g., phone call graphs, email graphs), professional networks (who knows/follows whom), biological networks, and linked data graphs. The goal of this course is twofold: (a) to acquaint students with data management issues related to graphs, including storage, indexing, querying, and computing with large graph data, and (b) to give them hands-on experience with Neo4j and Gremlin. Prerequisite: successful completion of DSE250 or written permission of the instructor. The lecture portion of the course will cover: basic principles of graph data management; storage techniques for large-scale graphs; indexing graphs; query processing for graphs; computing graph functions; and special processing techniques for citation networks, social networks, and biological networks.