
18. A data scientist within an insurance company is training a model to predict the probability of claims on a motor insurance book. The training data set has 5000 samples. One of the variables in the training data is the location. The experts in the company have advised the data scientist that the location is an important variable in increasing or decreasing the chances of claiming. Upon analyzing the data, the data scientist observed that there are 550 samples where the location has missing values. Which of the following can the data scientist do to deal with the problem that he has observed?
  - drop the location column since it has more than 10% missing values
  - drop all the rows with missing values
  - impute missing values using the most frequent location
  - use KNN imputer


Solution

The data scientist has several options to deal with the missing values in the location column:

  1. Drop the location column: This is not advisable since the experts in the company have advised that location is an important variable in predicting the probability of claims.

  2. Drop all the rows with missing values: This is workable, but it would discard 550 of the 5000 samples, or 11% of the data. Those rows also carry valid values for the other variables, so that information is lost too, and the model may be biased if the locations are not missing at random.

  3. Impute missing values using the most frequent location: This is simple and keeps all 5000 samples, and it works best when one location is clearly more frequent than the others. However, it assumes that the missing locations mostly belong to the most common category, which might not be the case, and it inflates that category's share of the data.

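  4. Use KNN imputer: This is a more sophisticated method of imputation that fills each missing value from the samples that are most similar on the other variables (its 'neighborhood'). It could potentially provide a more accurate imputation than simply using the most frequent location. Note that scikit-learn's KNNImputer works on numeric features, so a categorical location column would need to be encoded first.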

In conclusion, the best method would depend on the specific characteristics of the data. The data scientist could try different methods and see which one results in the best performance of the predictive model.
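Options 3 and 4 above can be sketched in Python with pandas and scikit-learn. This is a minimal illustration, not the data scientist's actual pipeline: the toy table, the column names (`driver_age`, `vehicle_value`, `location`) and the choice of `n_neighbors=1` are all assumptions made for the example. Because scikit-learn's `KNNImputer` works on numeric arrays, the sketch encodes location as integer codes first, and uses a single neighbor so that the imputed value is always a real category code and not an average of codes.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; names and values are illustrative only.
df = pd.DataFrame({
    "driver_age": [22, 25, 24, 27, 58, 61, 60, 23, 59],
    "vehicle_value": [9000, 10000, 9500, 11000, 30000, 32000, 31000, 9800, 30500],
    "location": ["urban", "urban", "urban", "urban",
                 "rural", "rural", "rural", np.nan, np.nan],
})

# Percentage of missing values per column.
missing_pct = df.isnull().mean() * 100

# Option 3: fill every gap with the most frequent location.
most_frequent = df["location"].mode()[0]
df["location_mode"] = df["location"].fillna(most_frequent)

# Option 4: KNN imputation on numeric data.
# Encode location as category codes; pandas marks missing values as -1.
cat = df["location"].astype("category").cat
codes = cat.codes.astype(float).replace(-1.0, np.nan)

# Scale the numeric features so neither dominates the distance.
scaled = StandardScaler().fit_transform(df[["driver_age", "vehicle_value"]])
X = np.column_stack([scaled, codes.to_numpy()])

# One neighbor: each gap takes the code of the single closest complete row.
imputed = KNNImputer(n_neighbors=1).fit_transform(X)
imputed_codes = imputed[:, 2].round().astype(int)
df["location_knn"] = cat.categories.to_numpy()[imputed_codes]

print(missing_pct.round(1))
print(df[["location", "location_mode", "location_knn"]])
```

On this toy table the two methods disagree on the last row: the most-frequent rule assigns the majority category, while the nearest neighbor assigns the category of the most similar policyholder. Which behaves better on real data is an empirical question, as the conclusion above notes.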


Similar Questions

In which of the following steps are missing values addressed?  A. Data Cleaning  B. Data Collection  C. Data Arrangement  D. Data Gathering

Question 6. According to the Module 2 reading, “Data Mining”, when data is missing in a systematic way, you should determine the impact of the missing data on the results and whether missing data can be excluded from what?
  - The study
  - The data set
  - The analysis
  - The evaluation

Explain what should be done with suspected or missing data.

21. A student on attachment is preparing the dataset to be used for training a linear regression model in Scikit Learn. During exploratory data analysis, he has detected multiple feature columns that have missing values. The percentage of missing data across the whole training dataset is about 15%. The Specialist is worried that this might cause bias to his model that can lead to inaccurate results. Which approach will MOST likely yield the best result in reducing the bias caused by missing values?
  - Compute the mean of non-missing values in the same column and use the result to replace missing values.
  - Use supervised learning methods to estimate the missing values for each feature
  - Compute the mean of non-missing values in the same row and use the result to replace missing values.
  - Drop the columns that include missing values because they only account for 10% of the training data.

23. A data analyst is cleaning data in preparation of training a machine learning model. Whilst cleaning the data, she has observed that there are missing values in the data. Which of the following lines of code can she write to find the percentage of missing values in each column?
  - i. data.isnull().sum(axis = 1) / len(data) * 100
  - ii. data.isnull().sum(axis = 0) / len(data) * 100
  - iii. data.isnull().mean(axis = 1) * 100
  - iv. data.isnull().mean(axis = 0 ) * 100

  Answer choices:
  - ii. and iv
  - i. and iii.
  - i only
  - ii. only
  - iii. only
  - iv. only
  - None of the above
