Missing values can be filled in with average, modal, or default values. Data elements should be comparable; they may need to be transformed from one unit to another. Data elements may also need to be adjusted to make them comparable over time. For example, currency values may need to be adjusted for inflation; they would need to be converted to the same base year for comparability.

They may need to be converted to a common currency. Data should be stored at the same granularity to ensure comparability.

For example, sales data may be available daily, but salesperson compensation data may only be available monthly. To relate these variables, the data must be brought to the lowest common denominator, in this case monthly. Continuous values may need to be binned into a few buckets to help with some analyses.

For instance, work experience could be binned as low, medium, and high. Outlier data elements need to be removed after careful review, to avoid the skewing of results. For example, one big donor could skew the analysis of alumni donors in an educational setting. Ensure that the data is representative of the phenomena under analysis by correcting for any biases in the selection of data.
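
To make these preparation steps concrete, here is a minimal sketch in Python using pandas, assuming a small, made-up table of alumni donors; the column names, bin edges, and outlier threshold are illustrative choices, not prescriptions from the text.

```python
import numpy as np
import pandas as pd

# Hypothetical alumni-donor data with one missing value and one very large donor.
df = pd.DataFrame({
    "work_experience": [2, 5, np.nan, 12, 7, 30],
    "donation": [100, 250, 150, 300, 200, 50000],
})

# Fill the missing value with the column average (a mode or default could be used instead).
df["work_experience"] = df["work_experience"].fillna(df["work_experience"].mean())

# Bin the continuous work-experience values into low / medium / high buckets.
df["experience_level"] = pd.cut(
    df["work_experience"], bins=[0, 5, 15, np.inf], labels=["low", "medium", "high"]
)

# Flag likely outliers (here, more than 2 standard deviations from the mean)
# so they can be reviewed before deciding whether to remove them.
z = (df["donation"] - df["donation"].mean()) / df["donation"].std()
df["possible_outlier"] = z.abs() > 2

print(df)
```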

For example, if the data includes many more members of one gender than is typical of the population of interest, then adjustments need to be applied to the data. Data may need to be selected to increase information density. Some data may not show much variability, because it was not properly recorded or for other reasons. This data may dull the effects of other differences in the data and should be removed to improve the information density of the data. The outputs of data mining will reflect the objective being served.

There are many ways of representing the outputs of data mining. One popular form of data mining output is a decision tree. It is a hierarchically branched structure that helps visually follow the steps to make a model-based decision. The tree may have certain attributes, such as probabilities assigned to each branch.

A related format is a set of business rules, which are if-then statements that show causality. A decision tree can be mapped to business rules. If the objective function is prediction, then a decision tree or business rules are the most appropriate mode of representing the output.

The output can be in the form of a regression equation, or a mathematical function that represents the best-fitting curve for the data. This equation may include linear and nonlinear terms.

Regression equations are a good way of representing the output of classification exercises. These are also a good representation of forecasting formulae. Centroid definitions, which might be specified in a multidimensional space, are typical representations of the output of a cluster analysis exercise. Business rules are an appropriate representation of the output of a market basket analysis exercise. These rules are if-then statements with some probability parameters associated with each rule.

For example, those that buy milk and bread will also buy butter with 80 percent probability. In supervised learning, a decision model can be created using past data, and the model can then be used to predict the correct answer for future data instances.

Classification is the main category of supervised learning activity. There are many techniques for classification, decision trees being the most popular one. Each of these techniques can be implemented with many algorithms.

A common metric for all classification techniques is predictive accuracy. Suppose a decision tree model has been created using a relevant set of variables and data instances, and the model is then used to predict other data instances. When a true-positive data point is classified by the model as positive, that is a correct prediction, called a true positive (TP). Similarly, when a true-negative data point is classified as negative, that is a true negative (TN).

On the other hand, when a true-positive data point is classified by the model as negative, that is an incorrect prediction, called a false negative (FN). Similarly, when a true-negative data point is classified as positive, that is an incorrect prediction, called a false positive (FP). This is represented using the confusion matrix (Figure 4).
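
As a small illustration of these counts, separate from the Figure 4 example, the sketch below computes a confusion matrix and the resulting predictive accuracy from made-up actual and predicted labels, assuming scikit-learn is available.

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical true class labels and the labels predicted by a classification model.
y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]

# Rows are the true classes, columns are the predicted classes.
# With labels=[1, 0] the layout is:  [[TP, FN],
#                                     [FP, TN]]
tp, fn, fp, tn = confusion_matrix(y_true, y_pred, labels=[1, 0]).ravel()
print(f"TP={tp} FN={fn} FP={fp} TN={tn}")

# Predictive accuracy = correct predictions / all predictions.
print("Accuracy:", accuracy_score(y_true, y_pred))
```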

There are no good objective measures to judge the accuracy of unsupervised learning techniques such as cluster analysis. There is no single right answer for the results of these techniques. For example, the value of a segmentation model depends upon the value the decision maker sees in those results. Unsupervised learning may also be used to explore the data to find interesting associative patterns. The right technique depends upon the kind of problem being solved (Figure 4). Classification techniques are called supervised learning because there is a way to supervise whether the model is providing the right or wrong answers.

These are problems where data from past decisions is mined to extract the few rules and patterns that would improve the accuracy of the decision-making process in the future. The data of past decisions is organized and mined for decision rules or equations that are then codified to produce more accurate decisions. Decision trees are the most popular data mining technique for many reasons.

Decision trees are easy to understand and easy to use, by analysts as well as executives. They also show a high predictive accuracy.

Decision trees select the most relevant variables automatically out of all the available variables for decision making. Even non-linear relationships can be handled well by decision trees. There are many algorithms to implement decision trees.
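
As a minimal sketch, assuming scikit-learn and an invented loan-approval style dataset, the snippet below trains a decision tree on past decisions, predicts a new instance, and prints the tree as if-then rules.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical past decisions: [age, income] and an approve/decline outcome.
X_train = [[25, 30000], [40, 80000], [35, 60000], [23, 20000],
           [50, 90000], [30, 40000], [45, 75000], [28, 25000]]
y_train = ["decline", "approve", "approve", "decline",
           "approve", "decline", "approve", "decline"]

# Fit a small tree on the past decisions (supervised learning).
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X_train, y_train)

# Predict the decision for a new data instance.
print(tree.predict([[33, 55000]]))

# The tree can also be mapped to if-then business rules.
print(export_text(tree, feature_names=["age", "income"]))
```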

Regression is one of the most popular statistical data mining techniques. The goal of regression is to derive a smooth, well-defined curve that best fits the data. Regression analysis techniques, for example, can be used to model and predict energy consumption as a function of daily temperature. Simply plotting the data may show a non-linear curve. Applying a non-linear regression equation will fit the data well, with high accuracy. Once such a regression model has been developed, the energy consumption on any future day can be predicted using this equation. The accuracy of the regression model depends entirely upon the dataset used, and not at all on the algorithm or tools used.
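
One way to sketch the energy example is a simple quadratic (non-linear) regression fitted with numpy's polynomial fitting; the temperature and consumption figures below are invented for illustration.

```python
import numpy as np

# Hypothetical daily temperature (°C) and energy consumption (kWh).
# Consumption is high on cold and hot days and lowest at mild temperatures,
# so a straight line fits poorly while a quadratic curve fits well.
temperature = np.array([-5, 0, 5, 10, 15, 20, 25, 30, 35])
consumption = np.array([42, 35, 29, 25, 24, 26, 31, 38, 47])

# Fit a degree-2 polynomial (a simple non-linear regression equation).
coeffs = np.polyfit(temperature, consumption, deg=2)
model = np.poly1d(coeffs)

# Predict the consumption for a future day at 18 °C.
print("Fitted equation:\n", model)
print("Predicted consumption at 18 °C:", model(18.0))
```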

An artificial neural network (ANN) mimics the behavior of the human neural structure: neurons receive stimuli, process them, and communicate their results to other neurons successively, and eventually a neuron outputs a decision. A decision task may be processed by just one neuron, and the result may be communicated soon. Alternatively, there could be many layers of neurons involved in a decision task, depending upon the complexity of the domain.

The neural network can be trained by making a decision over and over again with many data points. It will continue to learn by adjusting its internal computation and communication parameters based on feedback received on its previous decisions.

The intermediate values passed within the layers of neurons may not make any intuitive sense to an observer. Thus, neural networks are considered black-box systems. At some point, the neural network will have learned enough and will begin to match the predictive accuracy of a human expert or alternative classification techniques.

The predictions of some ANNs that have been trained over a long period of time with a large amount of data have become decisively more accurate than those of human experts. At that point, the ANNs can begin to be seriously considered for deployment in real situations in real time. ANNs are popular because they are eventually able to reach a high predictive accuracy. ANNs are also relatively simple to implement and do not have any issues with data quality. However, ANNs require a lot of data to train them to develop good predictive ability.
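
As a minimal sketch, scikit-learn's MLPClassifier (one of many possible ANN implementations) can be trained on a tiny, made-up non-linear (XOR-style) dataset; real applications would need far more data and tuning.

```python
from sklearn.neural_network import MLPClassifier

# Tiny, made-up XOR-style training data, repeated to give the network enough points.
X = [[0, 0], [0, 1], [1, 0], [1, 1]] * 25
y = [0, 1, 1, 0] * 25

# One hidden layer of neurons; the internal weights are adjusted repeatedly
# based on feedback from the network's prediction errors (the "black box" part).
ann = MLPClassifier(hidden_layer_sizes=(8,), solver="lbfgs",
                    max_iter=2000, random_state=0)
ann.fit(X, y)

# Ideally class 1 then class 0, once the XOR pattern has been learned.
print(ann.predict([[0, 1], [1, 1]]))
```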

Cluster analysis is a technique used for the automatic identification of natural groupings of things. Data instances that are similar to or near each other are categorized into one cluster, while data instances that are very different or far away from each other are categorized into separate clusters. Any number of clusters could be produced from the data. The K-means technique is a popular one, and it gives the user guidance in selecting the right number (K) of clusters from the data.

Clustering is also known as the segmentation technique. It helps divide and conquer large data sets. The technique shows the clusters of things from past data. The output is the centroids for each cluster and the allocation of data points to their cluster. The centroid definitions are then used to assign new data instances to their cluster homes.
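
A short K-means sketch with scikit-learn, using invented two-dimensional points and an assumed K of 2, shows the centroids and how a new instance is assigned to its cluster home.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data points with two natural groupings.
points = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
                   [8.0, 8.5], [8.3, 8.0], [7.8, 9.0]])

# The user supplies K, the number of clusters to look for.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

print("Cluster assignments:", labels)
print("Cluster centroids:\n", kmeans.cluster_centers_)

# The centroid definitions are used to assign a new data instance to its cluster home.
print("New point belongs to cluster:", kmeans.predict([[1.1, 1.9]]))
```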

Clustering is also a part of the artificial intelligence family of techniques. Association rules are a popular data mining method in business, especially where selling is involved. Also known as market basket analysis, this technique helps answer questions about cross-selling opportunities.

This is the heart of the personalization engine used by ecommerce sites like Amazon. The technique helps find interesting relationships (affinities) between variables (items or events). A form of unsupervised learning, it has no dependent variable, and there are no right or wrong answers; there are just stronger and weaker affinities. Thus, each rule has a confidence level assigned to it. A part of the machine learning family, this technique achieved legendary status when a fascinating relationship was found in the sales of diapers and beer.
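
To make the milk-and-bread example concrete, here is a small, self-contained sketch that computes the support and confidence of one candidate rule from a list of made-up baskets; a dedicated library (for example, mlxtend's apriori) could generate many such rules automatically.

```python
# Made-up market baskets (each set is one customer's transaction).
baskets = [
    {"milk", "bread", "butter"},
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "bread", "butter", "eggs"},
    {"milk", "eggs"},
]

antecedent = {"milk", "bread"}   # "if the basket contains milk and bread..."
consequent = {"butter"}          # "...it also contains butter"

n_antecedent = sum(antecedent <= b for b in baskets)
n_both = sum((antecedent | consequent) <= b for b in baskets)

support = n_both / len(baskets)      # how often the full rule occurs overall
confidence = n_both / n_antecedent   # probability of butter given milk and bread

print(f"support = {support:.2f}, confidence = {confidence:.2f}")
```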

Data mining tools have recently become more important as the value of data has grown and the field of big data analytics has come into prominence.

There is a wide range of data mining platforms available in the market today. Stand-alone or embedded: there are stand-alone tools, and there are tools embedded in an existing transaction processing, data warehousing, or ERP system. Open source or commercial: there are open-source, freely available tools such as Weka, and there are commercial products.

User interface: there are text-based tools that require some programming skills, and there are GUI-based, drag-and-drop tools. Data formats: there are tools that work only on proprietary data formats, and there are those that directly accept data from a host of popular data management formats.

Here we compare three platforms that we have used extensively and effectively for many data mining projects (Table 4). It can get quite versatile once the Analyst Pack and some other add-on products are installed on it. It offers a powerful set of tools and algorithms for the most popular data mining capabilities. It has a colorful GUI format with drag-and-drop capabilities. It can accept data in multiple formats, including reading Excel files directly.

Weka is an open-source, GUI-based tool that offers a large number of data mining algorithms. ERP systems include some data analytic capabilities, too.

Skilled data mining combines business and IT aspects. The business aspects help one understand the domain and the key questions; they also help one imagine possible relationships in the data and create hypotheses to test. The IT aspects help fetch the data from many sources, clean up the data, assemble it to meet the needs of the business problem, and then run the data mining techniques on the platform. An important element is to go after the problem iteratively.

It is better to divide and conquer the problem with smaller amounts of data, and get closer to the heart of the solution in an iterative sequence of steps. There are several best practices learned from the use of data mining techniques over a long period of time. These have been condensed into a data mining process with six essential steps (Figure 4).

Business Understanding: The first and most important step in data mining is asking the right business questions. A question is a good one if answering it would lead to large payoffs for the organization, financially and otherwise.

There should be strong executive support for the data mining project, which means that the project aligns well with the business strategy. A related important step is to be creative and open in proposing imaginative hypotheses for the solution. Thinking outside the box is important, both in terms of the proposed model as well as in the data sets available and required. Data Understanding: A related important step is to understand the data available for mining. One needs to be imaginative in scouring for many elements of data through many sources to help address the hypotheses and solve the problem.

Without relevant data, the hypotheses cannot be tested. Data Preparation: The data should be relevant, clean and of high quality. It may be desirable to continue to experiment and add new data elements from external sources of data that could help improve predictive accuracy. Modeling: This is the actual task of running many algorithms using the available data to discover if the hypotheses are supported. Patience is required in continuously engaging with the data until the data yields some good insights.

A host of modeling tools and algorithms should be used. A tool could be tried with different options, such as running different decision tree algorithms. Model Evaluation: One should not accept what the data says at first.

It is better to triangulate the analysis by applying multiple data mining techniques, and conducting many what-if scenarios, to build confidence in the solution. When the accuracy has reached some satisfactory level, then the model should be deployed.

Dissemination and rollout: It is important that the data mining solution is presented to the key stakeholders and is deployed in the organization. Otherwise, the project will be a waste of time and will be a setback for establishing and supporting a data-based decision-process culture in the organization. Data mining is a mindset that presupposes a faith in the ability of data to reveal insights.

By itself, data mining is not too hard, nor is it too easy. It does require a disciplined approach and some cross-disciplinary skills. Myth 1: Data Mining is about algorithms. Data mining is used by businesses to answer important and practical business questions. Formulating the problem statement correctly and identifying imaginative solutions for testing are far more important before the data mining algorithms get called in.

Understanding the relative strengths of various algorithms is helpful but not mandatory. Myth 2: Data Mining is about predictive accuracy.

While important, predictive accuracy is a feature of the algorithm. As in myth 1, the quality of output is a strong function of the right problem, right hypothesis, and the right data. Myth 3: Data Mining requires a data warehouse. While the presence of a data warehouse assists in the gathering of information, sometimes the creation of the data warehouse itself can benefit from some exploratory data mining.

Some data mining problems may benefit from clean data available directly from the DW, but a DW is not mandatory. Myth 4: Data Mining requires large quantities of data.

Many interesting data mining exercises are done using small or medium-sized data sets, at low cost, using end-user tools. Myth 5: Data Mining requires a technology expert. Many interesting data mining exercises are done by end-users and executives using simple everyday tools like spreadsheets.

Data mining does require a lot of preparation and patience, however, to pursue the many leads that data may provide. Much domain knowledge, along with tools and skill, is required to find such patterns. Here are some of the more common mistakes in doing data mining, which should be avoided. Mistake 1: Selecting the wrong problem for data mining: Without the right goals, or with no goals at all, data mining leads to a waste of time. Getting the right answer to an irrelevant question could be interesting, but it would be pointless from a business perspective.

A good goal would be one that would deliver a good ROI to the organization. Mistake 2: Buried under mountains of data without clear metadata: It is more important to be engaged with the data, than to have lots of data. The relevant data required may be much less than initially thought. There may be insufficient knowledge about the data, or metadata. Examine the data with a critical eye and do not naively believe everything you are told about the data.

Mistake 3: Disorganized data mining: Without clear goals, much time is wasted. Doing the same tests using the same mining algorithms repeatedly and blindly, without thinking about the next stage, without a plan, would lead to wasted time and energy.

This can come from being sloppy about keeping track of the data mining procedure and results. Not leaving sufficient time for data acquisition, selection, and preparation can lead to data quality issues and garbage in, garbage out (GIGO). Similarly, not providing enough time for testing the model, training the users, and deploying the system can make the project a failure. Mistake 4: Insufficient business knowledge: Without a deep understanding of the business domain, the results would be gibberish and meaningless.

Be open to surprises. Even when insights emerge at one level, it is important to slice and dice the data at other levels to see if more powerful insights can be extracted. Mistake 5: Incompatibility of data mining tools and datasets: All the tools, from data gathering and preparation to mining and visualization, should work together. Use tools that can work with data from multiple sources in multiple industry-standard formats.

Mistake 6: Looking only at aggregated results and not at individual records: It is possible that the right results at the aggregate level provide absurd conclusions at the individual record level. Diving into the data at the right angle can yield insights at many levels of data. Mistake 7: Measuring your results differently from the way your sponsor measures them: If the data mining team loses its sense of business objectives and begins to mine data for its own sake, it will lose respect and executive support very quickly. While the technique is important, domain knowledge is also important to provide imaginative solutions that can then be tested with data mining.

The business objective should be well understood and should always be kept in mind to ensure that the results are beneficial to the sponsor of the exercise.

Review questions:
What is data mining?
What are supervised and unsupervised learning techniques?
Describe the key steps in the data mining process. Why is it important to follow these processes?
What is a confusion matrix?
Why is data preparation so important and time consuming?
What are some of the most popular data mining techniques?
What are the major mistakes to be avoided when doing data mining?
What are the key requirements for a skilled data analyst?
What data mining techniques would you use to analyze and predict sales patterns?

The ideal visualization shows the right amount of data, in the right order, in the right visual form, to convey the high-priority information.

The right visualization arises from a complete understanding of the totality of the situation. One should use visuals to tell a true, complete and fast-paced story.

Data visualization is the last step in the data life cycle. This is where the data is processed for presentation in an easy-to-consume manner to the right audience for the right purpose. The data should be converted into a language and format that is best preferred and understood by the consumer of data. The presentation should aim to highlight the insights from the data in an actionable manner. If the data is presented in too much detail, then the consumer of that data might lose interest and the insight.

Hans Rosling is a master at data visualization. He has perfected the art of showing data in novel ways to highlight unexpected truths. He has become an online star by using data visualizations to make serious points about global health policy and development.

Using novel ways to illustrate data obtained from UN agencies, he has helped demonstrate the progress that the world has made in improving public health on many dimensions. The best way to grasp the power of his work is to see his TED video, in which Life Expectancy is mapped along with Fertility Rate for all countries over time (Figure 5).

Dr. Rosling's mesmerizing graphics have been impressing audiences on the international lecture circuit, from the TED conferences to the World Economic Forum at Davos. His aim is ambitious: "I have the idea that if they have a proper road map and know what the global realities are, they'll make better decisions."

Q1: What are the business and social implications of this kind of data visualization?
Q2: How could these techniques be applied in your organization and area of work?

Small amounts of data can be presented well in tables; however, as the amount of data grows, graphs are preferable. Graphics help give shape to data. Tufte, a pioneering expert on data visualization, presents the following objectives for graphical excellence:

1. Show, and even reveal, the data: The data should tell a story, especially a story hidden in large masses of data. However, reveal the data in context, so the story is correctly told.

2. Induce the viewer to think of the substance of the data: The format of the graph should be so natural to the data that it hides itself and lets the data shine.

3. Avoid distorting what the data have to say: Statistics can be used to lie. In the name of simplifying, some crucial context could be removed, leading to distorted communication.

4. Make large data sets coherent: By giving shape to data, visualizations can help bring the data together to tell a comprehensive story.

5. Encourage the eyes to compare different pieces of data: Organize the chart in ways the eyes would naturally move to derive insights from the graph.

6. Reveal the data at several levels of detail: Graphs lead to insights, which raise further curiosity, and thus presentations should help get to the root cause.

7. Serve a reasonably clear purpose: informing or decision-making.

8. Closely integrate with the statistical and verbal descriptions of the dataset: There should be no separation of charts and text in presentation. Each mode should tell a complete story.

Context is important in interpreting graphics. Perception of the chart is as important as the actual chart. Do not ignore the intelligence or the biases of the reader. Keep the template consistent, and only show variations in the data. There can be many excuses for graphical distortion. Leaving out contextual data can be misleading. A lot of graphics are published because they serve a particular cause or point of view. Many related dimensions can be folded into a graph.

The more dimensions that are represented in a graph, the richer and more useful the chart becomes. Time-series data is the most popular form of data. It helps reveal patterns over time.

However, data could also be organized as an alphabetical list of things, such as countries, products, or salespeople. Line graph: This is the most basic and popular way of displaying information. It shows data as a series of points connected by straight line segments. When working with time-series data, time is usually shown on the x-axis. Multiple variables can be represented on the same y-axis scale to compare the line graphs of all the variables.

Scatter plot: This is another very basic and useful graphic form. It helps reveal the relationship between two variables. In the above caselet, it shows two dimensions: Life Expectancy and Fertility Rate. Unlike in a line graph, there are no line segments connecting the points.
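
A brief matplotlib sketch of both chart types, a line graph of a variable over time and a scatter plot of two variables; all numbers are invented for illustration.

```python
import matplotlib.pyplot as plt

# Line graph: time on the x-axis, a measured variable on the y-axis.
months = list(range(1, 13))
sales = [10, 12, 13, 15, 14, 18, 21, 20, 22, 25, 24, 28]

# Scatter plot: two variables, with no line segments connecting the points.
fertility_rate = [6.1, 5.4, 4.2, 3.1, 2.5, 1.9, 1.6]
life_expectancy = [48, 53, 58, 64, 69, 74, 79]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.plot(months, sales, marker="o")
ax1.set_title("Line graph: monthly sales")
ax1.set_xlabel("Month")
ax1.set_ylabel("Sales")

ax2.scatter(fertility_rate, life_expectancy)
ax2.set_title("Scatter plot: fertility vs. life expectancy")
ax2.set_xlabel("Fertility rate")
ax2.set_ylabel("Life expectancy")

plt.tight_layout()
plt.show()
```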
