**Data Science Online Training Curriculum**

**Unit 1: Introduction to Data Science**

Introduction to Big Data

Roles played by a Data Scientist

Analysing Big Data using Hadoop and R

Different Methodologies used for analysis in Data Science

Architecture and methodologies used to solve Big Data problems, for example:

Data acquisition from various sources

Data preparation

Data transformation using MapReduce (RMR)

Application of Machine Learning Techniques

Data Visualization

Problem statements for a few data science problems that we shall solve during the course

**Unit 2: Basic Data Manipulation using R in Data Science**

Understanding vectors in R

Reading Data

Combining Data

Subsetting data

Sorting data

Basic data generation functions

**Unit 3: Machine Learning Techniques Using R Part-1**

Machine Learning Overview

ML Common Use Cases and techniques

Clustering and Similarity Metrics

Distance Measure Types: Euclidean and Cosine Measures

Creating predictive models

**Unit 4: Machine Learning Techniques Using R Part-2**

Understanding K-Means Clustering in Data Science

Understanding TF-IDF and Cosine Similarity and their application to Vector Space Model

Implementing Association Rule Mining in R

**Unit 5: Data Science Machine Learning Techniques Using R Part-3**

Understanding Process flow of Supervised Learning Techniques

Decision Tree Classifier

How to build Decision trees

Random Forest Classifier

The Random Forest concept in Data Science

Features of Random Forest

Out-of-Bag Error Estimate and Variable Importance

Naive Bayes Classifier

**Unit 6: Introduction to Hadoop Architecture**

Hadoop Architecture

Common Hadoop commands

MapReduce and data loading techniques (directly in R, and in Hadoop using Sqoop, Flume, and other data loading tools)

Removing anomalies from the data

**Unit 7: Integrating R with Hadoop**

Integrating R with Hadoop using RHadoop and the RMR package

Exploring RHIPE (R Hadoop Integrated Programming Environment)

Writing MapReduce Jobs in R and executing them on Hadoop

**Unit 8: Data Science Mahout Introduction and Algorithm Implementation**

Implementing Machine Learning Algorithms on larger Data Sets with Apache Mahout

**Unit 9: Additional Mahout Algorithms and Parallel Processing using R**

Implementation of different Mahout algorithms

Random Forest Classifier with a parallel processing library in R

**Unit 10: Project**

Project Discussion

Problem Statement and Analysis

Various approaches to solve a Data Science Problem

Pros and Cons of different approaches and algorithms