Appendix A. Proposal Review Guide

Effective data analytic thinking should allow you to assess potential data mining projects systematically. The material in this book should give you the necessary background to assess proposed data mining projects, and to uncover potential flaws in proposals. This skill can be applied both as a self-assessment for your own proposals and as an aid in evaluating proposals from internal data science teams or external consultants.

What follows contains a set of questions that one should have in mind when considering a data mining project. The questions are framed by the data mining process discussed in detail in Chapter 2, and used as a conceptual framework throughout the book. After reading this book, you should be able to apply these conceptually to a new business problem. The list that follows is not meant to be exhaustive (in general, the book isn’t meant to be exhaustive). However, the list contains a selection of some of the most important questions to ask.

Throughout the book we have concentrated on data science projects where the focus is to mine some regularities, patterns, or models from the data. The proposal review guide reflects this. There may be data science projects in an organization where these regularities are not so explicitly defined. For example, many data visualization projects initially do not have crisply defined objectives for modeling. Nevertheless, the data mining process can help to structure data-analytic thinking about such projects; they simply resemble unsupervised data mining more than supervised data mining.

Business and Data Understanding

What exactly is the business problem to be solved?
Is the data science solution formulated appropriately to solve this business problem? NB: sometimes we have to make judicious approximations.
What business entity does an instance/example correspond to?
Is the problem a supervised or unsupervised problem? If supervised, is a target variable defined? If so, is it defined precisely? Think about the values it can take.
Are the attributes defined precisely? Think about the values they can take.
For supervised problems: will modeling this target variable actually improve the stated business problem? An important subproblem? If the latter, is the rest of the business problem addressed?
Does framing the problem in terms of expected value help to structure the subtasks that need to be solved?
If unsupervised, is there an “exploratory data analysis” path well defined? (That is, where is the analysis going?)
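
The expected-value framing mentioned above can be made concrete with a few lines of arithmetic. The following is a minimal sketch, not a method prescribed by the guide: the function name `expected_value` is hypothetical, and the response rate and dollar amounts are invented purely for illustration.

```python
def expected_value(outcomes):
    """Return the sum of probability * value over (probability, value) pairs.

    The probabilities are expected to sum to 1.
    """
    total_p = sum(p for p, _ in outcomes)
    if abs(total_p - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * v for p, v in outcomes)


# Hypothetical targeted offer (all numbers are assumptions):
# a 5% chance the customer responds, worth $100 in profit, and
# a 95% chance of no response, which loses the $1 cost of the offer.
offer = [(0.05, 100.0), (0.95, -1.0)]
ev = expected_value(offer)
print(f"Expected value per targeted customer: ${ev:.2f}")
```

Splitting the problem this way shows which subtasks a proposal has to cover: one model or estimate for each probability, and one for each value.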

Data Preparation

Will it be practical to get values for attributes and create feature vectors, and put them into a single table?
If not, is an alternative data format defined clearly and precisely? Is this taken into account in the later stages of the project? (Many of the later methods/techniques assume the dataset is in feature vector format.)
If the modeling will be supervised, is the target variable well defined? Is it clear how to get values for the target variable (for training and testing) and put them into the table?
How exactly will the values for the target variable be acquired? Are there any costs involved? If so, are the costs taken into account in the proposal?
Are the data being drawn from a population similar to that to which the model will be applied? If there are discrepancies, are any selection biases noted clearly? Is there a plan for how to compensate for them?

Modeling

Is the choice of model appropriate for the choice of target variable? Classification, class probability estimation, ranking, regression, clustering, etc.
Does the model/modeling technique meet the other requirements of the task? Generalization performance, comprehensibility, speed of learning, speed of application, amount of data required, type of data, missing values?
Is the choice of modeling technique compatible with prior knowledge of the problem (e.g., is a linear model being proposed for a definitely nonlinear problem)?
Should various models be tried and compared (in evaluation)?
For clustering, is there a similarity metric defined? Does it make sense for the business problem?

Evaluation and Deployment

Is there a plan for domain-knowledge validation? Will domain experts or stakeholders want to vet the model before deployment? If so, will the model be in a form they can understand?
Is the evaluation setup and metric appropriate for the business task? Recall the original formulation.
Are business costs and benefits taken into account?
For classification, how is a classification threshold chosen?
Are probability estimates used directly?
Is ranking more appropriate (e.g., for a fixed budget)?
For regression, how will you evaluate the quality of numeric predictions? Why is this the right way in the context of the problem?
Does the evaluation use holdout data? Cross-validation is one technique.
Against what baselines will the results be compared? Why do these make sense in the context of the actual problem to be solved? Is there a plan to evaluate the baseline methods objectively as well?
For clustering, how will the clustering be understood?
Will deployment as planned actually (best) address the stated business problem?
If the project expense has to be justified to stakeholders, what is the plan to measure the final (deployed) business impact?
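
Several of the questions above (costs and benefits, threshold choice, holdout data, baselines) can be answered together. The sketch below shows one possible way to do this, not the only one: it picks the score threshold that maximizes profit on holdout data and compares it with a simple "act on everyone" baseline. The function names, scores, labels, and dollar values are all hypothetical.

```python
def profit_at_threshold(scores, labels, threshold, benefit_tp, cost_fp):
    """Total profit when acting on every instance scored >= threshold.

    benefit_tp: profit from acting on a true positive (label 1).
    cost_fp: loss from acting on a negative (label 0).
    Instances below the threshold are left alone and contribute 0.
    """
    profit = 0.0
    for score, label in zip(scores, labels):
        if score >= threshold:
            profit += benefit_tp if label == 1 else -cost_fp
    return profit


def best_threshold(scores, labels, benefit_tp, cost_fp):
    """Try each distinct score as a threshold; return (threshold, profit)."""
    candidates = sorted(set(scores))
    return max(
        ((t, profit_at_threshold(scores, labels, t, benefit_tp, cost_fp))
         for t in candidates),
        key=lambda pair: pair[1],
    )


# Hypothetical holdout scores and true labels (invented for illustration).
holdout_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
holdout_labels = [1, 1, 0, 1, 0, 0]

threshold, profit = best_threshold(
    holdout_scores, holdout_labels, benefit_tp=10.0, cost_fp=4.0
)
# Baseline: act on everyone (a threshold of 0.0 admits every score here).
baseline = profit_at_threshold(
    holdout_scores, holdout_labels, 0.0, benefit_tp=10.0, cost_fp=4.0
)
print(f"best threshold={threshold}, profit={profit}, baseline={baseline}")
```

Because the threshold is tuned on this data, a careful proposal would confirm the resulting profit on a further held-out set or with cross-validation.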

Overview

Developers are turning to machine learning and deep learning methods to make machines more intelligent. A person learns a task by practicing it repeatedly until the skill becomes automatic; once that happens, the relevant neurons in the brain fire without conscious effort, and the task can be carried out quickly. Deep learning is loosely similar. It uses different neural network architectures for different kinds of problems, for instance object detection, image segmentation, object recognition, and sound and image classification.

Problem

Handwritten digit recognition is the ability of a computer to recognize handwritten digits. It is a difficult task for a machine because handwritten digits are imperfect and can be written in many different styles. A handwritten digit recognizer takes an image of a digit as input and identifies which digit the image contains.

Predictive Modeling

We will use the MNIST dataset to build a handwritten digit recognition application. The model will be a convolutional neural network (CNN), a specialized type of deep neural network. Finally, we will create a GUI that lets you draw a digit and immediately see which digit the model predicts.
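
To give a feel for what a convolutional layer computes, here is a minimal pure-Python sketch of the core operation: sliding a small kernel over an image. It is an illustration only, not the network used in this project; a real model would rely on a deep learning library and learn its kernels from data. The function name `conv2d_valid`, the toy image, and the kernel are all assumptions made for the example.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding.

    Like most deep learning libraries, this computes cross-correlation:
    the kernel is applied as-is, without being flipped.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        output.append(row)
    return output


# Toy 4x4 "image" with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# 2x2 kernel that responds to a dark-to-bright step from left to right.
kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = conv2d_valid(image, kernel)
for row in feature_map:
    print(row)
```

The feature map is largest where the kernel sits on the edge and zero over flat regions, which is the sense in which a convolutional layer detects local patterns. With no padding, a 28 by 28 input and a 3 by 3 kernel give a 26 by 26 output.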

Data

MNIST is one of the most popular datasets among machine learning and deep learning practitioners. It contains 60,000 training images and 10,000 test images of handwritten digits from 0 to 9, so there are 10 classes. Each digit is represented as a 28 by 28 matrix in which every cell holds a grayscale pixel value.
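
The representation described above is commonly prepared for a neural network by scaling pixel values and one-hot encoding the labels. The sketch below illustrates both steps on a synthetic placeholder image rather than real MNIST data. It assumes pixel values are integers from 0 to 255; the helper names `scale_pixels` and `one_hot` are hypothetical.

```python
NUM_CLASSES = 10   # digits 0 through 9
IMAGE_SIZE = 28    # each image is 28 by 28 pixels


def scale_pixels(image):
    """Map integer pixel values in 0..255 to floats in 0.0..1.0."""
    return [[pixel / 255.0 for pixel in row] for row in image]


def one_hot(label, num_classes=NUM_CLASSES):
    """Encode a digit label as a vector with a single 1 at its index."""
    if not 0 <= label < num_classes:
        raise ValueError(f"label must be in 0..{num_classes - 1}")
    vector = [0] * num_classes
    vector[label] = 1
    return vector


# Synthetic placeholder image (not real MNIST data): values cycle 0..255.
fake_image = [
    [(r * IMAGE_SIZE + c) % 256 for c in range(IMAGE_SIZE)]
    for r in range(IMAGE_SIZE)
]
scaled = scale_pixels(fake_image)
target = one_hot(7)
print(len(scaled), len(scaled[0]), target)
```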

Use

Recognizing handwritten characters is one of the most important problems in pattern recognition. Applications of digit recognition include reading filled-out forms, processing bank checks, and sorting mail.

Evaluation

Although MNIST is effectively a solved problem, it is a good starting point for developing and practicing an approach to image classification with convolutional neural networks. We can use the dataset’s existing, well-defined train and test sets. To estimate a model’s performance for a given training run, we can further split the training set into a training set and a validation set. Plotting performance on both sets over the course of each run produces learning curves, which give insight into how well the model is learning the problem.
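
The train/validation split described above can be sketched as follows. This is one simple way to do it, assuming a random 80/20 split over example indices with a fixed seed for repeatability; the function name, the fraction, and the seed are illustrative choices, not values taken from the project.

```python
import random


def train_validation_split(n_examples, validation_fraction=0.2, seed=0):
    """Shuffle the indices 0..n_examples-1 and split them in two.

    Returns (train_indices, validation_indices). The test set is not
    involved: it stays untouched until the final evaluation.
    """
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    n_validation = round(n_examples * validation_fraction)
    return indices[n_validation:], indices[:n_validation]


# The MNIST training set has 60,000 images.
train_idx, val_idx = train_validation_split(60_000, validation_fraction=0.2)
print(len(train_idx), len(val_idx))
```

Splitting indices rather than the images themselves keeps the images and labels aligned, since the same index list selects both.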

Limitation

The fundamental challenge in recognizing handwritten digits is the variety of handwriting styles. Handwriting is highly personal, so the same digit can take many forms that differ in the angles and lengths of its strokes, the emphasis on particular parts of the digit, and so on.

[Figure: various samples of handwritten digits from the MNIST dataset.]
