Fun Java learning

Welcome to Data Mining!
In this course, you will learn about the concepts of data mining, its applications, and knowledge discovery process. You will also be introduced to various data mining techniques.
So, What is Data Mining?
Data Mining is the process of extracting valid, useful, unknown and comprehensible information from data and using it to make proactive knowledge-driven business decisions. Data mining uses statistical procedures to find unexpected patterns in data and identifies associations between variables.
The concept of data mining is growing in popularity, in the realm of commerce and business activities in general. But it's kind of a misconceived or misunderstood topic, and I want to give you an idea of what Data Mining is all about. Basically we're in the information economy and what you have is more and more data being generated in every aspect you can think of. Every time you swipe your grocery card, when you try to get a discount for buying whatever products, that data is being downloaded to a database, on most transactions you do, there is some sort of data download. Organizations are storing, processing and analyzing data more than any time in history and that trend is going to continue to grow.

So what is Data Mining?

Data mining is the incorporation of quantitative methods. Will call them mathematical methods, that may include mathematical equations, algorithms, some of the prominent methodologies like traditional logistic regression, neural networks, segmentation, classification, clustering. Those are all methods, that utilize mathematics. Data mining is applicable across industry sectors. Generally, wherever you have processes wherever data you have, it is the application of these powerful mathematical techniques incorporation with some statistical type of inference testing, that will extract trends and patterns.

I teach a course in Data Mining for managers and over the first half of the course, I give students a very good understanding of what data mining is. Because, to be honest, many people don't quite understand it. It takes a full half of course to provide that understanding. What are these mathematical techniques? But, just as important, in the second half of the course, I say, now you understand these techniques. Now let’s use them in the business world. Let’s apply it to advertising and marketing effectiveness. Let's apply it to the e-commerce initiatives. Let's apply it to even health care processes supply chain processes. There’s just a number of businesses that can be mined with these techniques. Simply put, any organization that has data and processes, can be analyzed with data mining. And the results are extracting information, actionable information from these data resources, that organizations can fine-tune their processes, increase productivity and increase efficiency.

So the data mining topic, this whole idea or the concept is going to grow in popularity. Why? Because data continues to grow. Think about social networking - LinkedIn. Twitter. Facebook. What is it? Its more data and its data to describe people. What they do, what they like, who they are, when you're out buying or doing whatever. As far as using services, just conducting your daily lives, more and more there's data gathering and data capturing. And in the information economy, the way to extract strategic information from that data is what is Data Mining.

Data Mining Tasks
Let's now move on to the common classes of Data Mining tasks - Anomaly Detection, Associate Learning, Cluster Detection, Classification and Regression

Anomaly Detection refers to identifying items, events or observations that do not adhere to the expected pattern or the other items in the dataset.

Anomaly Detection Example
A good example is how the tax department models typical tax returns and then identifies returns that differ from this model using anomaly detection. This is used for audits and reviews.

Association learning is the ability to learn and remember the relationship between unrelated items or stimuli or behavior.

Association Learning Example
Association learning is the type of data mining that drives the recommendation engines in major sites like Amazon and Netflix. This would let you know that customers who bought a particular item also bought another item.

Cluster Detection is a type of pattern recognition particularly useful in recognizing distinct clusters or sub-categories within the data.

Cluster Detection Example
The purchasing habits of hobbyists like gardeners, artists and model builders would look quite different. By analyzing the purchasing behavior using clustering algorithms, one can detect the various subgroups within the dataset.

Classification - If an existing structure is already known, you can use data mining to classify new cases into these pre-determined categories.

Classification Example
The algorithms can be trained to detect systematic differences between items in each group by learning from a large set of pre-classified examples. The algorithm can then apply these rules to the new classification problems. For instance, a classifier can predict borrowers who cheat on loan payments.

Regression/Prediction uses the historical relationship between a dependent and one or more independent variables to predict values of the dependent variable.

Regression Example
It is a common practice for businesses to use regression to predict stock prices, currency exchange rates, sales, productivity gains and so on. For example, a company might use regression to get insights on how the expenses in past advertising have impacted the sales. Here, the dependent variable is sales and the independent variable is advertising expenditures, number of sales reps and the commission paid.

Data Mining Tasks - Summary
Keep in mind that not all patterns inferred by the data mining algorithms are necessarily valid. The patterns detected during the data mining process are often tested against a test set of data, and then the accuracy is validated. Once it achieves the desired standard, these algorithms are used to predict outcomes. Data mining, in this way, can grant immense inferential power

Hello! My name is Thales Sehn Körting and I will present very briefly how data mining works. Most of the times when people search about Big data mining, what they are interested in is the whole process, in which data mining is just one step. This video could be called "How knowledge discovery in databases works”. This is the real title of this video.

To show how knowledge discovery in databases works, we present the steps that starts from raw data, until we get acknowledgement about the data we have, and when we do these using our tools - computational and algorithm tools, we start from the raw data and obtain what we can call knowledge.

The first step is, conversion from raw data to target data, and this is what we call as "Selection of Data”. Suppose we have lots of information about a certain phenomenon and we want to derive some knowledge about this, sometimes we have data that is not useful, data that is not ready to be used, and data in a different format. In these cases, very basic processing that we have to do is called selection, where we get data to the target data.

With the target data, we can do "pre-processing". One of the important operations that we do here is, to detect for example ‘Outliers’. Suppose we have two variables here in this data distribution, we can see or we can use algorithms to detect that red point is an outlier. In some cases, some algorithms may not work properly, if we have data which is very different from the entire distribution. So this is called an outlier. We can try to remove these and get this pre-processed data without outliers.

Another thing that we can do here is to ‘Detect Missing Values’. Suppose we have this data distribution here, we can use some algorithms to estimate what could be those two holes we have here. Suppose we have this estimation using that green line, we could try to interpolate the other data in order to see what could be the data in that holes. These are two of the most well-known pre-processing steps that are performed on data independent of the application we are doing.

After pre-processing, we have to apply the "Transformation of the data”. One thing that we do in this case is to normalize the date. Because, sometimes we have data that ranges from zero to one, another data that is a textual data, other data we have from 0 to infinite. Most of the algorithms are created to use data in a similar way. One of the steps is to "Normalize". Another step is to find correlated variables. Suppose we have two variables that have a high correlation, which means that using these two variables is useless.

So what we can do with these? We can apply some transformation on the data to make these variables uncorrelated and we can extract the most information that we have in the next step. Suppose we have the transformed data, we want to apply this main topic of these whole steps called the "Data mining" from the transformed data, we can get the patterns.

How we do these? We can apply several classification algorithms. You will be able to see in the description of this video, several data mining algorithms. But in this case, we can apply several algorithms such as the k-nearest neighbor or even a decision trees application or vector machines also. So these are possible data mining algorithms or classification algorithms that we apply to the data to obtain patterns. So the data will start to be divided into patterns in the last step. This process is the interpretation of these patterns. This not an automatic procedure. The user looks at the patterns and applies interpretation in order to obtain the knowledge given by that patterns. So the user can look at the discovered patterns and try to see if there is some redundant or irrelevant patterns and with this thing in mind, we obtain knowledge from the data.

It is important to say that we have all these green arrows which means that we can return to any / often the previous steps that we applied here in order to improve our notion of patterns and also our notion of knowledge. That's why we have such an interconnected procedure here. Is important to say that, this explanation about knowledge discovery in databases is based on these main reference from Fayyad, Shapiro and smith in 1996.Thanks for your invitation and this is how knowledge discovery in database works.

Knowledge Discovery Process
Now that you have an idea of how data is processed to create knowledge, let's learn about various stages of Knowledge Discovery Process: Problem Definition > Data Preparation > Data Mining > Data Analysis > Knowledge Assimilation

The problem definition stage is the initial phase of a data mining project, and it focuses on understanding the project objectives, requirements and defining the data mining problem. Based on this, you can identify the data requirement and models.

This stage involves three key activities and requires more than 70% of the total data mining effort.

'Data Selection': We identify the sources of information select a subset of data required for analysis.
'Data Pre-processing': We join data from various tables and resolve issues such as data conflicts, outliers, and missing data.
'Data Transformation': We use conversions and combinations to generate new data fields like ratios and discretized continuous values.

In this stage data mining technique, we identify the algorithm and tools to be used. Then, we apply the algorithm on the sample data set (also known as training data) and tune the control parameters of the algorithm till we get a satisfying result. Later, we validate the model by running the algorithm against the actual data (also known as test data).

In this stage, we evaluate the mined patterns with respect to the defined goals. We interpret the Data Mining output – in the form of rules or patterns to find new and potentially useful knowledge. This is the Holy Grail of the Knowledge Discovery!

In this stage, we implement the business insights derived from the Data Mining process in the organization’s system for further action. The knowledge becomes active, which means that we can make changes to the system, and measure the impact of the changes. The success of this step determines how effective the Knowledge Discovery process is.

Knowledge Discovery Process - Summary
The final deployment would involve building computerized systems to capture relevant data and to make real time recommendations to business. Also, Data Mining Models need to be continuously monitored and refined as several economic factors, business changes and competitor initiatives could impact the performance of the model.

Data Mining team
Let's understand the typical team composition required for Data Mining projects. These projects require people with not just great minds but those who have a great eye for data. A Data Mining team typically involves :

Domain Expert
Database Administrator
Statistician
Mining Specialist

Domain Experts
Domain Experts are usually people in higher business management functions who know the business environment, processes, customers, and competition.

Database Administrator
Database Administrators come with a good understanding of company data, where it is stored, how it is stored, how to access it and how to relate it to other data sources.

Statisticians
Statisticians validate and analyze datasets. Their key tasks include analysis, interpretation, and presentation of statistical outputs.

Data Miner
Data Miners apply data mining techniques and technically interpret the results. They usually have a background in data analysis and statistics.

Roles in Data Mining Summary
In a Data Mining projects, Data Miners play a central role in establishing relationships with Domain Experts for business guidance on their results, with DBAs for access to the data required for their activities and with Statisticians for validating analysis and interpreting statistical outputs.

Supervised Learning
Supervised learning is the most common technique for training neural networks and decision trees. Let us watch this video to learn about Supervised Learning.
This class is divided into three subclasses or three parts. They are, supervised learning, unsupervised learning and reinforcement learning.

P1. So, what do you think supervised learning is? P2. I think of supervised learning as being the problem of taking labelled data sets, gleaning information from it so that you can label new data sets. P1. That's fair. I call that function approximation. Here's an example of supervised learning. I'm going to give you an input and an output, and I'm going to give them to you as pairs, and I want you to guess what the function is. Okay? P2. Okay. P1. 1 -> 1 2 -> 4 P2. Wait, hang on, is 1 the input and 1 the output? P1. Yes. P2. And 2 the input, and 4 the output? P1. Correct. All right. P2. I think I am on to you. P1. 3 -> 9 4 -> 16 5 -> 25 6 -> 36 7 -> 49 P2. Nice. This is a very hip data set. P1. It is. What's the function? P2. It's hip to be squared. P1. Exactly. Maybe. So, if you believe that's true, then tell me if the input is 10, what's the output? P2. 100. P1. And that's right, if it turns out, in fact, that the function is x squared. But the truth is, we have no idea whether this function is x squared. P2. Not really. I have a pretty good idea. P1. You do? Well, where's that idea come from? P2. It comes from having spoken with you over a long period of time. And plus, you know, math. You can't say I'm wrong. P1. You're wrong. P2. You just said I was wrong. P1. No, you've talked to me for a long time, and plus math. I agree with that. P2. Okay. P1. But, I'm going to claim that you're making a leap of faith, Despite being a scientist, by deciding that the input is 10 and the output is 100. P2. Sure. I would agree with that. P1. What's that leap of faith? P2. Well, I mean, from what you told me, it's still consistent with lots of other mappings from input to output like, 10 gets mapped to 11. P1. Right or everything is x squared except 10. P2. Sure. P1. Or everything is x, x squared up to 10. P2. Right, that would be mean. P1. That would be mean, P2. But it's not logically impossible. P1. What would be the median? P2. A-ha. P1. Thank you very much. I was saving that one up.

Unsupervised Learning is a very powerful technique for learning from unlabeled data. Let us watch this video to learn about Unsupervised Learning.

P1. What about unsupervised learning? P2. Right, so unsupervised learning we don't get those examples. We have just essentially something like input, and we have to derive some structure from them, just by looking at the relationship between the inputs themselves. P1. Right . So, give me an example of that. P2. So, when you're studying different kinds of animals, say, even as a kid. You might start to say, oh, there's these animals that all look kind of the same. They're all four-legged. I'm going to call all of them dogs. Even if they happen to be horses, or cows, or whatever. But I have developed, without anyone telling me ,this sort of notion that all these belong in the same class. And it's different from things like trees. P1. Which don't have four legs. P2. Well some do, but I mean, they have, they both bark, is all I'm saying. P1. Did I really set you up for that? Not on purpose. I'm sorry, I want to apologize to each and every one of you for that. But that was pretty good. Michael is very good at word play. Which I guess is often unsupervised as well. P2. No, I get a lot of that. P1. You certainly get a lot of feedback. P2. Yeah, that's right. So I say, please stop doing that. P1. So, if supervised learning is about function approximation, then unsupervised learning is about description. It's about taking a set of data and figuring out how you might divide it up in one way or the other. P2. Or maybe even summarization, it's not just the description but it's a shorter description. It's usually concise, compressed, compact description. P1. So, I might take a bunch of pixels like I have here it might say, male. P2. Wait, wait, wait, wait, I’m pixels now? P1. As far as we can tell. P2. That's fine. P1. I however, am not pixels. I know I'm not pixels. I'm pretty sure the rest of you are pixels. That's right. So I have a bunch of pixels, and I might say male, or I might say female, or I might say dog, or I might say tree. But the point is, I don't have a bunch of labels that say dog, tree, male, or female. I just decide that pixels like this belong with pixels like this. As opposed to, pixels like something else that I'm pointing to behind me. P2. Yeah we're living in a world right now that is devoid of any other objects. Oh, chairs! P1. Chairs! Right. So these pixels are very different than those pixels because of where they are relative to the other pixels. Say, right? P2. I'm not sure that's helping me understand unsupervised learning. P1. Go out and look at a crowd of people and try to decide how you might divide them up. Maybe you'll divide them up by ethnicity, maybe you'll divide them up by whether they have purposefully shaven their hair in order to mock the bald or whether they have curly hair. Maybe you'll divide them up by whether they have goatees, or whether they have grey hair, there's lots of things that you might do in order. P2. Did you just point at me and say grey hair? P1. I was pointing and your head happened to be there. P2. All right. P1. Okay. So, imagine you're dividing the world up that way, you could divide it up male, female. You could divide it up short, tall, wears hats, doesn't wear hats, all kinds of ways you can divide it up. And no one's telling you the right way to divide it up. At least not directly. That's unsupervised learning. That's description, because now-rather than having to send pixels of everyone, or having to do a complete description of this crowd, you can say, there were 57 males and 23 females, or there are mostly people with beards, Or whatever. P1. I like summarization for that. Yeah. It's a nice concise description. P2. That's unsupervised learning. P1. Good. Very good. P2. And that's different from supervised learning in a couple of ways. One way that it's different is, all of those ways that we could have just divided up the world, In some sense are all equal either. So, I could divide up by sex, or I could divide up by height, or I could divide up by clothing, or whatever. And they're all equally good, absent some other signal later telling you, how you should be dividing up the world. But supervised learning directly tells you, there's a signal, this is what it ought to be, and that's how you train. P2. Now, but I could see ways that unsupervised learning could be helpful in the supervised setting, right? So, if I do get a nice description, and it's the right kind of description, it may help me do the function approximation better. P1. Right, so instead of taking pixels as input, and labels like, male or female. I could just simply take a summarization of you like how much hair do you have, your relative height, the weight, and various things like that might help me do it. That's right. And by the way, in practice this turns out to be things like density estimation. We do end up turning it into statistics at the end of the day. Often. P2. But it's statistics from the beginning. But when you say density estimation, are you saying I'm stupid? P1. No. P2. All right so what is density estimation? P1. Well they'll have to take the class to find out. P2. I see. Okay.

Introduction to Data Mining Techniques
We will now learn about some of the widely used Data Mining Techniques.

Classification
The classification technique is based on machine learning. Here, we classify each item in a dataset into one of the predefined sets of classes or groups. The classification methods incorporate mathematical techniques such as statistics, neural networks, decision trees and linear programming.

Classification
This video will help us learn details about Classification.

So, what kind of predictors are we going to look at, or what kind of tasks or what kind of prediction tasks will be covered in this course? We will basically cover, three - we will cover classification, we will cover regression and we will cover clustering tasks. So I'm gonna show them kind of pictorially, just to get everyone warmed up.

So in classification, what do you have? - You have a collection of individuals. So these are the people that we've seen before and they are represented in some way. And in a classification task, in addition to this collection of individuals, somebody comes along and gives us some labels for some of these individuals. So he says that, maybe this one is an F and this one is an M and now as a learning algorithm, you actually have no idea what those labels mean and you really have no idea what these things are. But somebody comes along and attaches labels to some data points. So, what does a classification algorithm try to do in a situation like that?

What it tries to do is, it builds that predictor. And in classification, the predictor takes on a particularly simple reduced form. So the whole predictor takes the form of something that we call a Decision Boundary. The decision boundary is an imaginary line that goes through our space and cuts the space into two parts. One part is going to be the part where our algorithm thinks the M's live and the other part is where the F's live. It tries to draw a boundary in the space of data points such that all that F's are on one side and all the M's are on another side. So that’s what it tries to do and that's what the decision boundaries is and it's a fundamental concept in classification. We will dwell deep on what decision boundaries are and what do they look like a geometrically. That’ll come in a few lectures down the line. So for now, you just have some sort of boundary, where M is on one side and F is on other side. So this red line is the function and there's nothing more to it and that is your predictor. And how does it predicts stuff?

Well, if you fall on one side of the boundary, it'll predict an M for you and if you fall on another side, it'll predict an F. So maybe an M is the market going up and F is the market going down or maybe M is the individual is male and F as an individual is female. So, you just build a predictor to detect gender based on however you represent that individual. So, that's Classification.

Now the important thing to keep in mind is, this classification or decision boundary only looks that way because of the labels we put on the data. So, we could take exactly the same dataset, exactly the same set of individuals, put some different labels, may be couple of yes and no, and what you hope for is that you're learning algorithm will produce a different prediction, a different decision. So, maybe this decision boundary reflects whether you are going to loan money to that particular individual or not. So, you had some examples of people who paid up and example of a person who didn't pay up. So, that's how you decide to draw a boundary. And again the function, the predictor is just the decision boundary. So the prediction is which side of the boundary you are falling on. So, that's Classification!

Regression Analysis
Regression is a predictive modeling technique. It explores the relationship between a dependent variable - which is referred to as a target and an independent variable(s) - which is referred to as the predictor. We use regression technique for forecasting, time series modeling and finding the causal effect relationship between the variables.

Regression Analysis
In this video we will learn more about Regression Analysis with an example.
This tutorial is an introduction to Regression. There's an X variable and a Y variable and in this case, the independent variable is on the x-axis and the dependent variable is on they-axis. And we try to form a relationship between these two variables and draw a line, in this case a straight line. Over the next series of videos, I’ll explain what all this means.

What we try to understand is, as the independent variable is moving or changing, what happens to the dependent variable? Does it go up? Does it go down? How does it change? If they move in the same direction, if the independent variable increases and the dependent variable increases as well like, we say there is a positive relationship. If on the other hand, as the independent variable increases and the dependent variable decreases like this, we say there is a negative relationship. The line would look like this, go downward. In linear regression, to make a line, the key is on line right there - a straight line. You can also do curved lines. But for this topic, is all straight lines to actually conduct regression.

I take observations and always plot some more observations in your random house to come in here like that. And, I try to find a line that will fit a straight line that fits to all these different points. This is called my regression line and it's based upon the least squares method. In the end, I want to minimize the difference between the estimated value and the actual value. I want to minimize my errors. This line will have a lot of errors, if I compare the actual to the estimated value. Again the point is to minimize these errors or make them as small as possible. Now, let's imagine I put study time on the x-axis or make that my independent variable, and the dependent variable becomes grades or GPA. As study time increases, grades should go up. There is a positive relationship. In regression, we develop these equations like this.

In this case, y hat is estimated grades and it's based upon or its equal to B0 plus B1 times X. Where X is study time, B0 we derive mathematically and it is the y-intercept. B1 is also derived mathematically and I’ll do it later video and it's the slope of the line. In this case, the slope is positive. In the next video, I'll discuss how you develop these equations. Now if I change the x-axis to time on Facebook, we see a negative relationship. More time on Facebook, grades will suffer and go down - a negative relationship. What we're estimating is still grades. Estimated grades is equal to B0 - B1 times X. Where X is time on Facebook. B0 is still the y intercept and it is a calculated value. The slope of the line is B1 because its downward sloping - negative relationship and as I said, before I will show you how to calculate this equation in the next video. The X is the independent variable and the Y is the dependent variable. X is what we control, what we manipulate, what we change and the dependent variable is the outcome. So, study time is the independent variable, that’s what we control and your grades are dependent upon how much you study. Now this looks really ugly and it's what I’ll talk about in the next video. But I’ll step you step by step through it and I hope we make it simple for you.

Decision Trees
Decision Tree is one of the widely used and easy-to-understand techniques. The root of the decision tree is a condition or a question which has several answers. Each answer points to a set of questions or conditions that help in determining the data that can help make the final decision.

Decision Trees
This video gives us some more details on Decision Trees.

In today's session we will talk about a very popular data mining technique called Decision Trees. This technique is liked by Data Miners and Analysts world over, because of its intuitive nature and user-friendly results. Let us take some time to understand how this technique works.

We will take the example of a credit card company that has a set of customers. Some of them are profitable. Some of them are not. Customers who do not use their credit cards frequently or those who use the card, but diligently pay their bills on time are examples of customers who are not profitable for credit card companies. Customers who carry balances on the cards, i.e. customers who do not make their card payments in full or on time are examples of customers who are profitable for the company.

On our slide, we will denote profitable customers with red dots and unprofitable customers with blue crosses. In our simplified example, let us assume the company has five profitable and five unprofitable customers. This box here represents the company's customer base. It has five red dots, i.e. five profitable customers and five blue crosses, i.e. five unprofitable customers. These are the company's existing customers. Outside this box is a large population of potential customers. Potential customers are people who are not the customers of this company, but the company can market to these customers so they have the potential to be its customers. These customers are denoted by green squares. The company doesn't yet know if these customers will be profitable once they become customers. Now the credit card company has a fixed marketing budget that allows it to market its products to a limited set of people out of this large population of potential customers. The company wants to utilize its marketing budget in such a way that it attracts the maximum number of profitable customers. In essence, the company is saying, I have 10 customers, 5 of whom are profitable and 5 are unprofitable. I want to add 10 more customers to my customer base. But I want all or most of them to be profitable. So in effect, the company wants to focus its marketing budget only on those people who are likely to be profitable, if they become the company's customers. This is an interesting problem.

How can the credit card company predict if a person will be a profitable customer or not, before the person even becomes a customer? This is where analytics and the power of historical data come in. The company has certain information available about its potential customers. For example age, gender, marital status and the number of credit cards they already own. It wants to see if any of these variables can help predict the profitability of a potential customer. How will the company find this out? For this, let us examine the company's existing customer base. The same information is available to the company about its current customer base also. It knows their age, gender, marital status and the number of cards already owned.

Please examine this table in some detail. In the existing customer base, 5 of the customers are profitable and 5 are unprofitable. Hence the profitability rate of the total customer base is 50%. Now, let us partition the data into two segments based on the age variable. Let us put those what 35 and above in the left segment and those below 35 in the right segment. Examine the profitability rate of the two segments.

The left segment has 4 profitable customers and 2 unprofitable customers. That is, a profitability rate of 66%. In other words, two-thirds of the customers who are above 35 are profitable customers. Compare this with the overall population profitability rate of 50%. And we have an important insight. People who are 35 and above tend to be more profitable customers for the credit card company than the average population. This means if the company markets its products only two people who are above 35 it will end up with a more profitable customer base.

Now, let's see if we can further segment this population into smaller segments. Some of which have an even higher profitability. We will segment this population of people over 35 by the marital status variable. That is whether a person is single or married. The population is segmented into two separate groups. One comprising of married people and the other made up of single people. Notice the left-hand box now. It has four customers all of whom are profitable. This segment of the population is, people who are over 35 and married has a profitability rate of 100%. We have now identified a small segment of population that is highly profitable for the credit card company. We have also learned that the credit card company needs to focus its marketing efforts on people who are above 35 and married, as these people are likely to be profitable customers. This is an example of a business using historical data of its existing customers to predict the behavior of potential future customers in order to build a more profitable customer base.

In particular, we have seen how the decision tree technique is used in predictive modelling. This is a simplified example with four variables and 10 records. In this example, we first segment the data on the age variable. Further we used age greater than or equal to 35 as the splitting criteria. How do we know which variable to use for the split at what time and how do we know what level to split the variable at? In a real business situation, you will be dealing with hundreds of variables and thousands or millions of records. How do you make these decisions in such a scenario? This is where, Decision Trees come in. There are various Decision Tree algorithms that allow the analyst to choose the right variable from thousands of available variables and split the variable at the most optimal value. In the following slides we will learn more about the decision tree technique and the various algorithms underlying this technique.

Neural Networks
Neural Network is well suited for identifying patterns and forecasting. It is a set of connected input/output units where each connection has a weight associated. In the learning phase, for the network to predict the class of the input tuples, it learns by adjusting weights.

Neural Networks
This video gives us some more details about Neural Networks.

We all use computers every day. But sometimes computers fail us and this upsets a lot of people. What we'd like is for our computers to be smarter and more user friendly. So, some people think we should try to make them more human. This would involve making computers think more people. But to do this, we first need to understand, how humans think and how our brains work.

First though, let's look at computers as they are today. Despite everything they can do, they're pretty simple. They take some inputs, perform calculations and produce outputs. The human brain however is extremely complicated and a lot of very smart sciences are still struggling to understand how the whole thing works. One thing we have known for a while is that the tiniest components of your brain that makes it think and do smart things are special cells called neurons. Your brain has billions of these neurons and they talk to each other using electrical impulses to create what are called synapses. This massive synapses is what is responsible for making your brain think and have a consciousness.

Some computer scientists had the idea that we can make a computer that is modelled after this system of neuron connections. They called their idea Neural Networks. The idea behind neural networks or neurons for short is that we have nodes that has some connection between them. This is similar to the neurons in your brain and the synapses a form to get a neuron to do something, we trigger a node with some input and that node in turn triggers the nodes it is connected to. But this alone is not very useful. So we usually organize the neurons the way that makes it easy to produce good results. Since we're used to the computer model of computation, we'd like to have well-defined input and output nodes. We also like to have directed connections so that we know which way information is going. Not only that, we want our connections to have different values. That is some connection should be more important than others. Here the connection values called weights are represented by the thickness of their arrows. The purpose of having different connection weights is that it allows our nodes to behave more like real neurons. When a node stimulated by two different nodes, It can decide which of the two is more important to it by their connection weights.

Here, nodes A and B have been given the values green and orange. And they try to pass on these values to node C since they are connected to it. Since the connection weights between B and C is much larger than the connection weight between A and C, node C decides B is more important to it and takes its value. More often though, we design nodes to take the sum or an average of the nodes true unit. Here, node C takes a sort of yellow color. But notice the shade is much closer to orange than it is too green since node B's connection weight is large compared to node A's.

In some cases, we'd like for our nodes to be able to decide whether they want to accept their triggers at all. So each node gets to think about what it will do. To decide, each node is given what is called a transfer function to judge its inputs. Since in the real world, computers treat all data as numbers, the transfer function is a math equation. It's usually not that complicated. After the node makes a decision, it set its value and then it can trigger the next set of nodes with that value. Choosing whether or not to accept the trigger value is most useful for the output nodes. Since these are the node that produce the result that we actually want. Usually though, the transfer function will return a value that is a combination of the nodes current value and the trigger value. So using the connection with and transfer function, the neural network takes inputs and produces outputs. This is the same test that a computer would do. But it's done in a way that is similar to the way neurons work. Since the input and output nodes are the ones that matter to us, we consider the nodes in the middle as hidden nodes. They do most of the work, but get the least credit.

Now that we know the basics, we need to ask some questions. One of the most important questions is, how are the connections determined?

Well, it turns out that the neural networks can learn them. Does this make neural network smart? Sort of. But it also turns out that the neural networks are very slow learners, as we will soon see. But the question is, how do they learn? It's done through a process called back propagation. We start with random connection weights, then for a given set of inputs, we decide on a set of desired outputs. Using the random weights, we first let the network calculate some outputs. Then we compare the output that the neural net calculated to the desired output that we defined. Since we gave the network random weights, we obviously cannot expect two outputs to be equal. So we find their difference. We call this difference the error in the network. This is difficult to illustrate with colors. But you have to trust me that I did it correctly.

For an easier to follow example, I've given each node a numerical value. You can see that we will find the error by simply subtracting and then we can have negative error as well. Now that we have the errors, we need to adjust the connection to produce small errors. This is where the back propagation comes in. The output nodes tell the hidden nodes they are connected to, about their error and together they decide on how to adjust the connection weights between them. The new weight is calculated using an equation based on the old weight. The nodes input value, the error and something called the learning rates. We'll get back to the learning rate later. With the weights adjusted, the hidden nodes calculate their own error using a similar formula to before. Then these nodes with their newly calculated errors, push the errors back through the hidden nodes and adjust the weight behind them. This goes on until all the weights have been adjusted and all the nodes have been assigned an error. The idea is to determine which nodes are most to blame for the error in the output and try to adjust their weights the most.

Now that all the weights have been updated, the network tries out the original inputs again and tries to calculate some outputs. The calculated output should be closer to the desired outputs than before. But there will still be some error. So the whole process is repeated again and again and again. Remember, how I said that the neural nets are slow learners. Well, the neural network has to do all that for each different input set and there's usually a lot of those. But the idea works and eventually the network will produce the desired outputs. To try to produce the desired outputs more quickly, we can try adjusting the learning rate and we can change the number of nodes as well. But it will still take millions of attempts for the neural network to get the desired output for even a simple problem. So, can your networks make computers think more like humans? Probably not. But it's a good baby step.

Well that's my presentation on neural nets. I hope you liked it and maybe learnt something as well. Feel free to email me if you have any questions. Thanks for watching!

Clustering
Clustering is a data mining technique that identifies a cluster of objects having similar characteristics. At a simple level, clustering uses one or more attributes as the basis for identifying a cluster of correlating results.

Clustering
This video gives us some more details about Clustering.

Clustering is the process of breaking down a largest population or a large dataset into smaller groups. As an analyst, you will often face this question that you need to organize the data that you’re observing into some kind of a meaningful structure or pattern. And this is where, clustering comes in handy.

Clustering allows you to break a population into smaller groups where, each observation within every group is more similar to each other than it is to an observation in another group. So the idea is to group together similar kind of observations into smaller groups and thus breakdown that large heterogeneous population that you're seeing into smaller more homogeneous groups.

Let's take an example to understand how clustering works exactly. Imagine that you own a chain of ice cream shops. You have a number of ice cream shops spread across the country. Say you have 8 of them and you sell two flavors of ice cream. You sell chocolate ice cream and you sell vanilla ice-cream. Now in this table here you can see the sales of both chocolate and vanilla ice cream across your 8 stores. The units are not important. The timeframe is not important for what you're doing. But just imagine that this is the data that you're looking at.

Now there are many different ways you can make sense of this data. You can look at summary statistics. You can calculate the mean, median, spread of the variables and dispersion in order to get a better sense of this data. One very intuitive way of doing this is to plot this data on a graph. So here we have plotted the sales of both chocolate and vanilla ice cream for each of these 8 stores. So you can see 8 dots here. Each of these dots represents a store and on the y-axis you have chocolate sales, on the x axis you have vanilla sales. So you've mapped these 8 stores by their chocolate and vanilla sales and you've created a scatterplot. This is a very intuitive way of looking at this data to understand what this data is saying.

Now when you look at this graph, there is one very clear insight that has come from this and that is that, you can divide your stores into two distinct groups. You have one large group of stores, a group of 5 stores here and you have another group of stores which has 3 stores. So essentially, your 8 stores can be divided into two different groups that behave slightly differently in terms of their chocolate and vanilla sales. The difference essentially is in terms of the magnitude of the sales. In group 1, you can see that sales of both chocolate and vanilla ice cream are lower than in group 2. So, what we've done is, we've just looked at sales of 8 stores for these two flavors of ice cream and we have plotted them on a graph and then we have just divided the stores into 2 groups based on where they were on the graph and their proximity to each other. So, this is essentially how clustering works. This is a very simple two dimensional example of how clustering works. But this accurately explains how really the algorithm works.

Now to better understand this algorithm, let's just look at one more thing that we've done here quite intuitively, actually without even realizing. When we were grouping these stores, when we've created these two groups, what have we done here, at a very intuitive level, what we've done is between these three stores. If I look at the cluster on the top, we've taken one imaginative point somewhere in the center and we've drawn a circle around it. Similarly, for group 2 we've taken an imaginary point and we've drawn a circle around it and all the observations that fall within that circle are grouped together into one cluster. So that's essentially how clustering works.

Now this is an example where we have 2 flavors of ice cream and we have 8 stores. Now imagine that you have expanded your chain of ice cream stores and instead of 2 flavors, you're selling 30 different flavors. You're selling banana ice cream, dark chocolate ice cream, you're selling Belgian chocolate. You’re selling all kinds of flavors and you're selling 30 different flavors. So, how will you plot this information on graph now? You can’t draw a 30-dimensional graph. There’s no way we can visualize a 30-dimensional graph and imagine if instead of 8 stores now you've grown to 500 stores. So instead of 8 points, you will have 500 different points on the graph. Now that’s still easier to visualize, but if you have something like you've got a million records, then you have a million different points and if you have thousands of variables, then you have thousands of different dimensions. So, there is a mathematical dealing with such complexity and that is what cluster analysis works. Let's understand how to clustering algorithms works in a little more detail.

Association Rule Mining
Association Rule Mining is one of the best-known data mining techniques. Here, we discover a pattern based on the relationship between items in the same transaction. In market basket analysis, we use this association technique to identify products that customers frequently purchase together

Association Rule Mining
This video will help us learn more about Association Rule Mining.

The following is a production on the Metro applied research club and the Department of Computer Information Systems at the Metropolitan State University of Denver. With special thanks to our corporate sponsor Nebraska Aluminium Castings. Quality, Custom Aluminium die-casting in Hastings Nebraska.

Welcome to the lecture Association Rules the basics. The goals of this lecture are to introduce the student to key concepts of association rules including, What association rules are, How they can be applied and How they can be interpreted. To introduce the student to important terms and definitions central to Association rules and to demonstrate how associations can be created and assessed using a simple dataset. Some of the key words and terms you will want to pay special attention to include -Association Rules, Support, Confidence, If-then, Antecedent, Consequent and Item set.

Today we're going to be explore association rules and to follow along with me, you are probably going to want to have 2 files. First one is an excel file called credit risk association 2 workbook final. That’s an excel file should be available to you. If not, you can go to ww.joehasley.com. The other file which you might want to be able to find is a lecture from the MIT website. It's called Discovering association rules in transaction database. To explore association rules, you are going to be looking at a relatively simple problem. Data in the file "credit_risk_assn2_workbook_final" basically allows us to determine "Credit Standing". Credit Standing is going to be our dependent variable and then we'll be looking at several independent variables and trying to figure out, if we can predict the outcome of dependent variable based on the independent variables.

So for example, independent variable "Checking Acct" data tells us, "Do they even have checking account?”. This person has no checking account and this person has no balance. So they have an account but no balance. These are all relative. We are not given a dollar amount. Just told relative values. "Credit history" basically, are they current on their bills? Does the bank have to pay something or are they behind on it? Those types of things. We're told "Purpose" of previous loan which was for small appliance or furniture or maybe a new car. "Savings Acct" tells us as opposed to a checking account, to look at a savings account and the amount that they have into it. They may have no account. Their "Employment" - how long have they been at their job, their “Gender", "Marital Status" of the applicant, "Housing" status of the application, what type of job they have, do they have a telephone, foreign-born, their current age and ultimately base issue - we give them loan not. So, "Credit Standing" is the dependent variable that we're looking at. We're going to try and predict that based on these independent variables.

In the case of this dataset, I have 425 instances or examples and again our task is to determine some rules based on this data that will help us make decisions about who to loan money to or not. So, let's talk a little bit about Association Rules. Association Rules provide information in formal “if - then” statements. These rules are computed from the data and unlike the “if - then” rules of logic, association rules are probabilistic in nature. In addition to the antecedent, that’s the - if part, and the consequent - that's then part. An association rule has two numbers that express the degree of uncertainty about the rule. What this means is, rules are not hard-and-fast. Instead of stating a rule such as, "It is cloudy so it will definitely rain today", a probabilistic rule would say "It is cloudy so it will rain today is true about 40% of the time. “Or whatever percent.

In association analysis, the antecedent and the consequent are sets of items called the item sets that are disjoint. That means, they do not have any items in common. For example, if we consider marital status, we see that the options are single, divorced or married. Now in real life, you could be separated or there could be other options. But in this dataset, we have three values and those values are both mutually exclusive. You can only be in one category and they are collectively exhaustive. Meaning those are all possible options for this dataset.

Let's go ahead and examine a few simple association rules together. For example, a simple Association rule would be, if a person is single, then they have good credit. Another Association rule will be, if the person is single, then they have bad credit. Our task is to somehow examine and say "are these good rules". When we examine rule, we will judge it by two numbers. The first number is called the support for the rule. The support is simply the number of transactions that include all items in antecedent and consequent parts of the rule. The support is sometimes expressed as a percentage of the total number of records in the database. For example, I can run a simple excel formula COUNTIF and I can say out of my 420 records, tell me how many of them have a marital status that is single. Well it turned out to be 233 times. So that's just out of the 425 records, how many have a marital status of single.

The next thing I want to calculate is, of the individuals with the marital status single, how many of them have good credit. Well that's my confidence. Put simply, confidence tells us, of the 233 people in our dataset with the marital status single, 130 of them had a marital status of single and also had credit standing of "Good". So about 55.8% of the 233 single people and up also having great credit. Also you can see of the 233 single people, 103 of them ended up having bad credit. So about 55% of the single people had good credit and 44% of the single people had bad credit. Support tells us, out of the whole dataset, how often this rule holds true. Whether single people had good credit or whether single people had bad credit. This says out of the 425 people, about 30% of them were single people with good credit and about 24% of them were single people with bad credit.

Let's consider a second rule. Let’s consider the rule - if divorced, then credit is good. The confidence for this rule s roughly 41.66%. What that says is,if you are divorced, then there is 41.6% chance that your credit standing is good. On the flip side, if you are divorced, then the probability that your credit standing is bad is, 0.5833 or about 58 .3% of divorced people had bad credit. Now it's worth noting, if someone is married, I can't really tell much about them. They have almost 50-50 chance of being a good or bad credit. Finding out that someone is married doesn't give me much of a statistical advantage over just flipping a coin. That is to say, if you had access to this information and someone else didn’t and that someone offered you a bet off, I'll bet you ten dollars that a married person as a bad bet. The information is not giving you much of an advantage. Compare this to the information that a person is divorced. If you had access to this information and someone else did not, and that someone offered you a bet, I'll bet you ten dollars that a divorced person is a good bet. You would know that if you take the bet, so that you are betting in that a divorce person is a bad bet. Then the odds are significantly in your favor.

This is a clear way to quantify the information utility or information value or the association. This is a direct application of the value of perfect information that you learn about in your management science class. This marks the end of part 1. Please join us in part two as we do another example and wrap up lecture.

Common Data Mining Problems
We have almost reached the last mile! This video will help us understand some of the Common Data Mining Problems that we come across in real life.

There are too many different data mining algorithms. But there are only a small circle problem types. In this presentation I will explain 9 common data mining problem types. The information in this presentation is mostly based on a great book called "Data Science for Business" written by Provost and Fawcett.

A very common problem type is classification problems. In these problems, you try to classify something. For example, take sentiment analysis of twitter messages. You want to know - How positive are Twitter messages related to your brand? Are the consumers happy with your brand or not? You can go over every Twitter message and decide whether each message is positive or negative. But this will take too much time. Instead you could better use data mining methods to decide whether message is positive or negative. This is an example of classification problems. In this case we have three classes - positive, negative and neutral. The classes are mutually exclusive. But, usually it is not quite right to classify a specific message into one of these classes. A message might have some positive feeling and some negative feelings. Maybe, the intent of the message writer is not clear. Maybe, his words are ambiguous.

There is a related problem type to classification problem. It is called as scoring or class probability estimation. This method does not clearly decide if the messages is positive or negative. Instead it assigns probability to each class. For example, it says that some messages are positive with 50% probability or negative with 20% probability. This is actually called as score. It somehow represents the probability or likeliness. Classification and scoring methods are very close to each other. Same method can be used in both cases.

Regression is another common problem type. For example, you want to know how much sales are you going to make next year or you want to predict how many people will come to visit your site. If you are trying to estimate or predict a quantity of a variable, then this is a regression problem. Both regression and classification problems can be used for predicting future events. For example, if you ask whether a customer will renew his subscription, this is classification problem. If you ask how much a customer will pay for some service, then it is a regression problem. The difference is here. Classification problems ask whether something will happen. Regression problems will ask, how much something will happen.

Similarity matching tries to identify similar entities. For example, you have a shopping cart website, whenever a customer buys something, you want to present the customer all related products. This is called as recommendation. The recommendation software are usually based on similarity matching. These method find customers who are similar to the current customer. For example, who are the customers that have liked or purchase similar items? Another example, you have lots of customers. But you want to focus all customers based on their similarities. So that you can assign similar customers to the same sales representative. Clustering tries to find groups of similar entities. The key difference between clustering and similarity matching is here. Clustering tries to group entities. Similarity matching tries to identify similar entities to some specific entity. In similarity matching, you start with some entity. In clustering, you don't have a specific entity. You want to explore whether there are any mutual groups in the set.

Co-occurrence grouping - this is also called item set mining, association rule discovery and market basket analysis. This method is similar to clustering because it also tried to group similar entities. The difference is that this method groups entities based on the actions done to them. For example, we have lots of products in our market. We can group the product in two ways. First, we can group objects based on their attributes such as colors, size, price, etc. This is a clustering problem. Second, we group objects based on the purchasing decision. Which products are purchased together? Which products are put into the same basket? This is a co-occurrence grouping problem. This is different that the attributes of the product. There is an action which is done by the customer. The grouping of the products is not based inherent attribute of the product. Grouping is based on the purchasing action done by the customers. For example, tomato sauce is purchased together with spaghetti. These two products do not have similarity in their attributes. But the customer purchase them together because they are complementary products.

Profiling - Profiling is also called as behavior description. This method tries to describe the profile of a person or agent. For example, what is the expected usage behavior of a certain group of people? Behavior description is not a simple statement of some statistical qualities like mean or standard deviation. The profile of the users depends on several factors such as time of the day, day of the week, season of the year. Also, there are several levels of user groups. Are we going to describe the behavior of all the firm users or just some subset of it? How specific should be the subset. So, behavior profiling can be done based on such various factors. Profiling can be used for lots of different business use cases. For example you might want to take credit card frauds based on profile of a customer. If the credit card of some user who has never purchased anything from Airport is being used in some firm named shopping, then, there is high likelihood that the card is being used by some thief.

Link Prediction tries to predict connections between entities. For example, Facebook suggests an everyday some fried connections or maybe they suggest me to connect with some new people. How do they do it? They use link prediction methods. These methods study existing links and find out common groupings. For example, Rebecca and Helen share, 10 connections. Then, there is a high likelihood that Rebecca and Helen might know each other. Data reduction problems try to replace some set of data with some smaller set of data. Most of the time we collect much more data than what we actually need. Having too much data is not bad this per say. But, while we want to get some insight into the data, we usually need to separate useless data from useful data.

Casual modelling problems, try to separate the cause and effect. For example let's assume that we are ice cream sellers. We collect lots of data that such as weather data, demographic data, advertisement data. We observe that there is an increase in sales for last one month. What happened in the last month? Multiple event might have occurred. First, the summer season has started. Second, we started TV advertisements. Now, the question is, which of these two events is the actual cause of the sales increase? Are they both effective or is there any other reason that might have caused the increase in sales? These are the common types of data mining problems. Each of these problems can be solved using various techniques. In fact the number of data mining techniques is much higher than the problem types. It is good to keep these common problem types in mind to obtain perspective in analytical problems.

Data Mining Methods Basics Summary
In this course you have learnt:

What is Data Mining
Knowledge Discovery Process
Roles involved in Data Mining
Commonly used Data Mining Techniques
Applications of Data Mining

Fun Java learning

Saturday, 20 October 2018

Azure essential MCQ

Data Cleansing with R

Friday, 19 October 2018

Data Mining

Data Mining MCQ

Monday, 7 August 2017

Fun Java Learning Part 1

Java

OOPS Concept Inheritance