Our training will provide you with a clear road map to navigate the Big Data field

Big Data/ Data Science Foundation in New York

The Big Data/ Data Science Foundation course in New York offers participants a broad, non-technical introduction to Big Data concepts, tools, and real-world applications.

The foundation course is non-technical and is open to managers, professionals, and decision makers.

Big Data is a process to deliver decision-making insights. The process uses people and technology to quickly analyze large amounts of data of different types (traditional table-structured data and unstructured data such as pictures, video, email, transaction data, and social media interactions) from a variety of sources to produce a stream of actionable knowledge.

Organizations increasingly need to analyze information to make decisions for achieving greater efficiency, profits, and productivity. As relational databases have grown in size to satisfy these requirements, organizations have also looked at other technologies for storing vast amounts of information. These new systems are often referred to under the umbrella term “Big Data.”

Gartner has identified three key characteristics of big data: Volume, Velocity, and Variety. Traditional structured systems are efficient at dealing with high volumes and velocity of data; however, they are not the most efficient solution for handling a variety of unstructured or semi-structured data sources. Big Data solutions can process many different formats beyond traditional transactional systems. Definitions of Volume, Velocity, and Variety vary, but most definitions of big data are concerned with amounts of information that are too difficult for traditional systems to handle: the volume is too much, the velocity is too fast, or the variety is too complex.

Big Data/ Data Science Foundation

3 days

Our Big Data/ Data Science Foundation course is a good place to start if you have no prior experience with Big Data. It covers best practices for devising a Big Data solution for your organization.

The technical components of the modules are optional. These components will be covered as the last topic each day.

Big Data Foundation

Course Outline

Big Data Foundation

Day 1

1. Introduction to Big Data
  • What is Big Data?
  • Usage of Big Data in real world situations
2. Data Processing Lifecycle
  • Collection
  • Pre-processing
  • Hygiene
  • Analysis
  • Interpretation
  • Intervention
  • Visualisation
  • Sources of Data

Technical Components (Optional). The modules below will be covered at the end of the day.

Introduction to Python

  • Jupyter
  • Interactive computing
  • Functions, arguments in Python

Introduction to Pandas
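As a taste of these optional modules, here is a minimal sketch (assuming Python 3 and pandas are installed) showing a simple function with a default argument applied across a pandas DataFrame. The data and the `label_volume` helper are hypothetical:

```python
import pandas as pd

# A simple function with a default argument, as covered in the Python module
def label_volume(rows, threshold=1000):
    """Return a label describing whether a dataset is 'big' for our purposes."""
    return "big" if rows > threshold else "small"

# A minimal pandas DataFrame: one row per data source
sources = pd.DataFrame({
    "source": ["server logs", "tweets", "sensor feed"],
    "rows": [50_000, 1_200, 300],
})

# Apply the function to each row count to produce a label column
sources["label"] = sources["rows"].apply(label_volume)
print(sources)
```

Sessions like this are typically run interactively in Jupyter, where each cell's output is displayed inline.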

Day 2

3. Sources of Data

Data collection is expensive and time-consuming. In some cases you will be lucky enough to have existing datasets available to support your analysis: datasets from previous analyses, access to providers, or curated datasets from your organization. In many cases, however, you will not have access to the data you require, and you will have to find alternate mechanisms. Twitter data is a good example: depending on the options selected by the Twitter user, every tweet contains not just the message or content that most users are aware of, but also a view of the person's network, their home location, the location from which the message was sent, and a number of other features that can be very useful when studying networks around a topic of interest.

  • Network Data
  • Social Context Data
  • Sensor Data
  • Systems Data
  • Machine log data
  • Structured vs. Unstructured Data
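The Twitter example above can be sketched in Python. The payload below is a simplified, hypothetical stand-in for a real tweet object (actual Twitter API fields differ), showing how the message carries network and location signals alongside the text:

```python
# A hypothetical, simplified tweet payload; real Twitter API fields differ
tweet = {
    "text": "Enjoying the Big Data course!",
    "user": {"screen_name": "data_fan", "followers_count": 250, "location": "New York"},
    "coordinates": {"type": "Point", "coordinates": [-73.99, 40.73]},
}

def extract_features(tweet):
    """Pull out the network and location signals hidden alongside the message."""
    return {
        "message": tweet["text"],
        "author": tweet["user"]["screen_name"],
        "network_size": tweet["user"]["followers_count"],
        "home_location": tweet["user"]["location"],
        "sent_from": tweet.get("coordinates"),  # may be absent if the user opted out
    }

features = extract_features(tweet)
```

Most users are aware only of the `text` field; the rest is what makes tweets useful for studying networks around a topic.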
4. First Order Analysis and Exploration
  • Basic Statistics
  • Analyse your dataset and determine features
  • Data validation
  • Noise and bias
  • Random errors
  • Systematic errors
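A minimal sketch of first order analysis using Python's standard library: basic statistics over a hypothetical set of sensor readings, flagging values far from the mean as likely noise. The two-standard-deviation threshold is an illustrative choice, not a rule:

```python
import statistics

# Hypothetical sensor readings with one suspicious spike (noise)
readings = [20.1, 19.8, 20.3, 20.0, 98.7, 19.9, 20.2]

mean = statistics.mean(readings)
stdev = statistics.stdev(readings)

# Flag values more than 2 standard deviations from the mean as likely noise
outliers = [x for x in readings if abs(x - mean) > 2 * stdev]
```

Whether an outlier is a random error, a systematic error, or a genuine signal is exactly the kind of judgment this module covers.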
5. Graph Theory

Technical Components (Optional). The modules below will be covered at the end of the day.

Introduction to NetworkX

  • Adjacency Matrix
  • Clustering
  • Create a Graph
  • Measure centrality
  • Degree distribution
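The NetworkX topics above can be sketched as follows, assuming the `networkx` package is installed; the five-person graph is invented for illustration:

```python
import networkx as nx

# Build a tiny social graph: nodes are people, edges are "knows" relationships
G = nx.Graph()
G.add_edges_from([
    ("Ana", "Ben"), ("Ana", "Cai"), ("Ben", "Cai"),
    ("Cai", "Dee"), ("Dee", "Eli"),
])

# Degree distribution: how many connections each person has
degrees = dict(G.degree())

# Degree centrality: Cai bridges the triangle and the Dee-Eli tail
centrality = nx.degree_centrality(G)
most_central = max(centrality, key=centrality.get)

# Clustering coefficient: how tightly knit each node's neighbourhood is
clustering = nx.clustering(G)
```

The same graph's adjacency matrix can be obtained with `nx.adjacency_matrix(G)` for matrix-based analysis.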
6. Second order analysis

According to the SAS Institute, machine learning is a method of data analysis that automates analytical model building. Using algorithms that iteratively learn from data, machine learning allows computers to find hidden insights without being explicitly programmed where to look. There are two main classes of machine learning algorithms: (i) supervised and (ii) unsupervised learning. What exactly does learning entail? At its most basic, learning involves specifying a model structure f that can hopefully extract regularities in the data or problem at hand, together with an appropriate objective function to optimize using a specified loss function. Learning (or fitting) the model essentially means finding the optimal parameters of the model structure using the provided input/target data. This is also called training the model. It is common (and best practice) to split the provided data into at least two sets: a training set and a test set.

  • Machine Learning
  • Meta Data
  • Training data and test data
  • Identifying Features

Technical Components (Optional). The modules below will be covered at the end of the day.

  • Introduction to Scikit-learn
  • Introduction to Mlxtend
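A minimal supervised-learning sketch with scikit-learn, illustrating the training/test split described above. The tiny dataset and its two features are hypothetical:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Toy dataset: [hours_active, failed_logins] -> risky (1) or not risky (0)
X = [[1, 0], [2, 1], [8, 9], [9, 8], [1, 1], [7, 9], [2, 0], [8, 8]]
y = [0, 0, 1, 1, 0, 1, 0, 1]

# Best practice: hold out a test set the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = LogisticRegression()
model.fit(X_train, y_train)  # "learning" = finding optimal model parameters

# Evaluate on the held-out test set, not the training data
accuracy = model.score(X_test, y_test)
```

Evaluating on held-out data is what tells you whether the model learned a regularity or merely memorized the training set.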

Day 3

7. Rolling out Big Data projects

Hypothetical Big Data project use case:

Cybersecurity measures within a company in relation to insider threats. The company hosts thousands of applications for various business functions. The context will be User Behavior Analytics. Signals include login metadata for each application, location data, network data, employee data, performance appraisal data, travel data, and desktop activity data. The analytics is focused on determining a risk score for each user.

Technological component or trend:

The technology component in the insider threat context requires collection and processing of the following data:

  • User Data
  • Application logs
  • Access data
  • Business data
  • Assets, CMDB
  • User activity
  • Network data

A layered approach to data processing is ideal, starting with the implementation of an ETL (Extract, Transform, Load) pipeline. Each layer processes the data through dedicated tools.

  • Extract, Transform, Load
  • Data processing
  • Normalization
  • Correlations
  • Risk profiling
  • Data lake
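The layered processing above can be sketched in plain Python. The records, field names, and the off-hours rule are all hypothetical; a real pipeline would use dedicated ETL tooling:

```python
# Hypothetical raw login events from one application's logs
raw_logins = [
    {"user": "ana", "app": "HR", "hour": "23"},
    {"user": "ben", "app": "CRM", "hour": "09"},
    {"user": "ana", "app": "HR", "hour": "02"},
]

def extract(records):
    """Extract: pull raw events from a source (here, an in-memory list)."""
    return list(records)

def transform(records):
    """Transform: normalize types and flag off-hours access (a correlation signal)."""
    out = []
    for r in records:
        hour = int(r["hour"])  # normalization: string -> integer
        out.append({**r, "hour": hour, "off_hours": hour < 6 or hour > 20})
    return out

def load(records, lake):
    """Load: append the cleaned records into the 'data lake' store."""
    lake.extend(records)
    return lake

lake = []
load(transform(extract(raw_logins)), lake)
off_hours_events = [r for r in lake if r["off_hours"]]
```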

The last layer is the data lake, which stores all structured and unstructured data. It can be accessed through tools and libraries such as pandas, Hadoop, and graph databases.

The data lake will enable building algorithms that determine risky behavior and send alerts. The objective is to prioritize the alerts based on a risk score. For example, a user who accesses a certain application from a specific IP address, has a recent low rating on their performance appraisal, and has booked a long holiday will be flagged as high risk.
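A sketch of the risk-scoring idea in Python. The signals, weights, and alert threshold below are invented for illustration; a production system would derive them from the data:

```python
def risk_score(signals):
    """Hypothetical weighted risk score in [0, 100]; weights are illustrative only."""
    weights = {
        "unusual_ip": 30,
        "low_appraisal": 25,
        "long_leave_booked": 20,
        "sensitive_app_access": 25,
    }
    # Sum the weight of every signal that fired for this user
    return sum(w for k, w in weights.items() if signals.get(k))

# Example user matching the scenario described above
user = {
    "unusual_ip": True,
    "low_appraisal": True,
    "long_leave_booked": True,
    "sensitive_app_access": False,
}
score = risk_score(user)
alert_priority = "high" if score >= 60 else "low"
```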

  • Project Management
  • Different Phases
  • Technology components
  • Privacy
  • System architecture

Technical Components (Optional). The modules below will be covered at the end of the day.

  • K-Anonymity
  • Data Coarsening
  • Data Suppression
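The privacy techniques above can be sketched in Python. The records and generalization rules are hypothetical; real k-anonymity additionally requires checking that every combination of quasi-identifiers appears at least k times:

```python
# Hypothetical employee records; direct identifiers must not leak into analytics
records = [
    {"name": "Ana", "age": 34, "zip": "10001"},
    {"name": "Ben", "age": 36, "zip": "10002"},
    {"name": "Cai", "age": 35, "zip": "10001"},
]

def anonymize(record):
    """Suppress the direct identifier and coarsen the quasi-identifiers."""
    return {
        "name": "*",                            # suppression: drop the name entirely
        "age": f"{record['age'] // 10 * 10}s",  # coarsening: 34 -> "30s"
        "zip": record["zip"][:3] + "**",        # coarsening: keep only the zip prefix
    }

anon = [anonymize(r) for r in records]
```

After coarsening, all three records are indistinguishable, so no individual can be singled out from the released data.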
Final Exam

40 Questions

Pass mark: 65%

Master Nodes

  • Job Tracker
  • Name Node
  • Secondary Name Node

  • Attend 3 days class
  • Schedule Exam
  • Take Exam

There is obvious, visible information which one is conscious of, and there is information that comes off you. For example, from your phone one can determine which websites you visited, who you called, who your friends are, and what apps you use. Data science takes it further to reveal how close you are to someone, whether you are an introvert or an extrovert, when during the day you are most productive, how often you crave ice cream, what genre of movies you like, what aspects of social issues interest you the most, etc.

Sensors everywhere

With the possibility of adding sensors to everything, there is now deeper insight into what is going on inside your body. Spending 10 minutes with a doctor who gives you a diagnosis based on stated or observed symptoms is less useful than a system that has data about everything going on inside your body. Your health diagnosis is likely to be more accurate with analysis of data collected through devices such as Fitbits and implantables.

The amount of data available with wearables and other devices provides for rich insight about how you live, work with others and have fun.

Digital Breadcrumbs

Big Data and analytics are made possible by the digital breadcrumbs we leave. Digital breadcrumbs include things like location data, browsing habits, information from health apps, credit card transactions, etc.

The data lets us create mathematical models of how people interact, what motivates us, what influences our decision making process and how we learn from each other.

Big Data versus Information

One can think of Big Data as raw data available in sufficient volume, variety, and velocity. Volume here refers to terabytes of data. Variety refers to the different dimensions of data. Velocity refers to the rate of change.

A bank can use credit card information to develop models that are more predictive of future credit behavior. This provides better financial access. What you purchased, how frequently you purchase, how often you pay back, and where you spend money are better predictors of payment credibility than a simple one-dimensional credit score.


Big Data analytics is a cyclical process

Collection refers to getting your data together. One would look at multiple sources of data and ensure there is sufficient volume to justify useful analysis. For example, server logs could provide data about time of logins, resources accessed, frequency of requests, etc.

Pre-processing refers to normalizing data into comparable numeric ranges. Data needs to be normalized for it to be made useful. For example, if we are comparing the number of friends in your contacts with location data containing GPS coordinates, we would need both features normalized. We can then determine whether the number of friends has any link to mobility.
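A minimal sketch of the normalization step, using min-max scaling to bring two hypothetical features onto the same [0, 1] range:

```python
def min_max(values):
    """Rescale values to the [0, 1] range so different features are comparable."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Two features on wildly different scales
friend_counts = [50, 150, 250]       # number of contacts
distances_km = [1.0, 10.0, 100.0]    # daily mobility derived from GPS data

norm_friends = min_max(friend_counts)
norm_distance = min_max(distances_km)
```

Once both features live on the same scale, comparing or correlating them becomes meaningful.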

Hygiene involves separating the noise from the signal. This is to ensure that the data reflects reality and that no unusual patterns are lost.

Analysis involves a first order look at our data to determine patterns.

Visualization involves graphical representation of data to detect patterns. The data can show you things like increased spending at the end of a quarter reflecting a pattern.
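A sketch of the visualization step, assuming matplotlib is installed; the monthly spend figures are invented to show a quarter-end spike:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

# Hypothetical monthly spend; note the spike in the last month of each quarter
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
spend = [100, 110, 180, 105, 115, 190]

fig, ax = plt.subplots()
ax.bar(months, spend)
ax.set_xlabel("Month")
ax.set_ylabel("Spend")
ax.set_title("Quarter-end spending spike")
fig.savefig("spend.png")
```

A chart like this makes the quarter-end pattern visible at a glance, long before any formal model is fitted.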

Interpretation involves second order analysis to determine deeper insights. This is where machine learning comes into play. For example: are people having fewer children over the years, and if so, what factors seem to play a role?

Other Courses

Check Out Our Other Professional Courses

PMP Project Management Professional

Our Project Management Professional course in New York covers the best practices in the field of Project Management.

Call for monthly offer

iOS Application Development

We teach you everything you need to know to build great iOS apps for iPhone and iPad devices.

$ 3390

Android Application Development

We cover Java programming language and then teach you the skills to build apps for devices running Android OS.

$ 3390

Professional Cloud Developer

We cover tools and techniques for full stack development which includes front end, back end and business layer.

$ 4552

PMI-ACP Agile Certified Practitioner

Our Agile course covers Scrum, XP, and Lean. We teach you the most current Agile tools and techniques.

$ 1800

Develop iOS Mobile Applications - School Program

We teach you everything you need to know to build great iOS apps for iPhone and iPad devices.

Call for monthly offer

Copyright 2015 iKompass. All rights reserved.