In this project, I will apply unsupervised learning techniques to identify segments of the population that form the core customer base for a mail-order sales company in Germany. These segments can then be used to direct marketing campaigns towards the audiences with the highest expected rate of return. The data has been provided by Bertelsmann Arvato Analytics and represents a real-life data science task.
# import libraries here; add more as necessary
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
# magic word for producing visualizations in notebook
%matplotlib inline
There are four files associated with this project (not including this one):

- Udacity_AZDIAS_Subset.csv: Demographics data for the general population of Germany; 891211 persons (rows) x 85 features (columns).
- Udacity_CUSTOMERS_Subset.csv: Demographics data for customers of a mail-order company; 191652 persons (rows) x 85 features (columns).
- Data_Dictionary.md: Detailed information file about the features in the provided datasets.
- AZDIAS_Feature_Summary.csv: Summary of feature attributes for demographics data; 85 features (rows) x 4 columns.

Each row of the demographics files represents a single person, but also includes information outside of individuals, including information about their household, building, and neighborhood. I will use this information to cluster the general population into groups with similar demographic properties. Then, I will see how the people in the customers dataset fit into those clusters. The hope here is that certain clusters are over-represented in the customers data, as compared to the general population; those over-represented clusters will be assumed to be part of the core user base. This information can then be used for further applications, such as targeting for a marketing campaign.
# Load in the general demographics data.
azdias = pd.read_csv('Udacity_AZDIAS_Subset.csv',sep=';')
# Load in the feature summary file.
feat_info = pd.read_csv('AZDIAS_Feature_Summary.csv', sep = ';')
# Load in the customer demographics data.
customers = pd.read_csv("Udacity_CUSTOMERS_Subset.csv", sep=";")
# Check the structure of the data after it's loaded
print("Shape of the demographics data file for the general population: {}".format(azdias.shape))
print("Shape of the features file: {}".format(feat_info.shape))
print("\nPrint a sample of the feature file:\n{}".format(feat_info.head(5)))
print("\nPrint unique values in the missing_or_unknown column:")
print(feat_info['missing_or_unknown'].unique())
print("\nPrint a sample of the data file")
azdias.head(5)
The feature summary file contains a summary of properties for each demographics data column. I will use this file to make cleaning decisions during this stage of the project.
The fourth column of the feature attributes summary (loaded in above as feat_info) documents the codes from the data dictionary that indicate missing or unknown data. Although the file encodes this as a list (e.g. [-1,0]), it is read in as a string object. I will parse these strings to identify missing or unknown value codes in the data and convert any matching values to numpy NaN.
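As a quick, purely illustrative check of the parsing approach (the sample string below is made up), splitting the bracketed string yields a list of string codes, which is why numeric columns need the codes matched as numbers as well:

# Toy illustration (not project data): parse one missing_or_unknown entry
sample = '[-1,X]'
print(sample[1:-1].split(','))  # -> ['-1', 'X']; the parsed codes are strings, not numbers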
# Identify naturally occurring missing values
print("10 features having the highest number of naturally missing values")
print(azdias.isnull().sum().sort_values(0,ascending=False).head(10))
# Identify missing or unknown data values and convert them to NaNs.
def convert_missing_nan(data):
    for i in range(len(feat_info)):
        col = feat_info.iloc[i]['attribute']
        # Parse the missing/unknown string for this feature, e.g. '[-1,0]' -> ['-1', '0']
        codes = str(feat_info.iloc[i]['missing_or_unknown'])[1:-1].split(',')
        if codes == ['']:
            continue
        # Match the codes both as strings and as integers, since column dtypes vary
        codes = codes + [int(c) for c in codes if c.lstrip('-').isdigit()]
        # Replace any value that matches a missing/unknown code with NaN
        data[col] = data[col].where(~data[col].isin(codes), np.nan)
    return data
azdias = convert_missing_nan(azdias) # convert missing values in azdias data to nan
azdias.head()
# This feature has a different number of unique values in the general population and the customer data
print(azdias['GEBAEUDETYP'].unique())
print(customers['GEBAEUDETYP'].unique())
# Remove 'GEBAEUDETYP' because it has a different number of unique values in the demographic
# and customer data, which would make one-hot encoding produce mismatched columns
azdias = azdias.drop(labels='GEBAEUDETYP',axis=1)
feat_info = feat_info[feat_info['attribute']!='GEBAEUDETYP']
Now I will identify a few columns that are outliers in terms of the proportion of missing values, and remove any column whose share of missing values exceeds a chosen threshold.
# Heatmap to show the frequency and distribution of nan values
# Comment this cell out if the heatmap takes too long to render
plt.figure(figsize=(18,10))
sns.heatmap(azdias.isnull(), cbar=False)
# Perform an assessment of how much missing data there is in each column of the
# dataset.
nrow, ncol = azdias.shape
def find_outliers_nan(data,ax,threshold,data_count):
sorted_by_nan = data.isnull().sum(axis=ax).sort_values(0,ascending=False).reset_index()
sorted_by_nan_renamed = sorted_by_nan.rename(columns={sorted_by_nan.columns[0]: "Feature/Index", sorted_by_nan.columns[1]: "Nan Count" })
sorted_by_nan_renamed['Nan Percentage'] = sorted_by_nan_renamed['Nan Count']*100/data_count
return(sorted_by_nan_renamed[sorted_by_nan_renamed['Nan Percentage']>threshold])
col_outliers_nan = find_outliers_nan(azdias,0,33,nrow)
col_outliers_nan
plt.figure(figsize=(8,5))
sns.barplot(y='Feature/Index',x='Nan Percentage',data=col_outliers_nan,palette="Blues_d")
# Investigate patterns in the amount of missing data in each column.
pd.merge(col_outliers_nan,feat_info,left_on='Feature/Index',right_on='attribute',how='left')
# Remove the outlier columns from the dataset
# Azdiaz minus the outlier columns
azdias_less_out_col = azdias.drop(labels=col_outliers_nan['Feature/Index'].values,axis=1)
azdias_less_out_col.head()
Now, I will perform a similar assessment for the rows of the dataset. I will divide the data into two subsets: one for data points that are above some threshold for missing values, and a second subset for points below that threshold.
In order to decide what to do with the outlier rows, I will check whether the distributions of values in columns with no (or very little) missing data are similar or different between the two groups. I will compare the distributions of values for at least five such columns.
If the distributions of non-missing features look similar between the data with many missing values and the data with few or no missing values, then simply dropping those points from the analysis won't present a major issue. On the other hand, if the data with many missing values looks very different from the data with few or no missing values, then I will make a note of those data as special.
# How much data is missing in each row of the dataset?
row_outliers_nan = find_outliers_nan(azdias_less_out_col,1,50,ncol)
print("Percentage of rows marked as outliers: {}".format(len(row_outliers_nan)*100/nrow))
row_outliers_nan.tail()
# Write code to divide the data into two subsets based on the number of missing
# values in each row.
print("Number of rows before dropping: {}".format(len(azdias_less_out_col)))
azdias_regular = azdias_less_out_col.drop(labels=row_outliers_nan['Feature/Index'],axis=0)
print("Number of rows before dropping: {}".format(len(azdias_regular)))
azdias_outliers = azdias_less_out_col.loc[row_outliers_nan['Feature/Index'].values,:]
print("Outlier data set length: {}".format(len(azdias_outliers)))
# Compare the distribution of values for at least five columns where there are
# no or few missing values, between the two subsets.
dist_features = ['ALTERSKATEGORIE_GROB','FINANZTYP','ONLINE_AFFINITAET','LP_STATUS_FEIN','HH_EINKOMMEN_SCORE']
print("Percent missing values in the data set with minimum outliers")
print(azdias_regular[dist_features].isnull().sum()*100/len(azdias_regular))
print("\nPercent missing values in the data set with more outliers")
print(azdias_outliers[dist_features].isnull().sum()*100/len(azdias_outliers))
for feature in dist_features:
f, axes = plt.subplots(1, 2, figsize=(12,4))
sns.countplot(x=feature, data=azdias_regular, ax = axes[0])
axes[0].set_title("Regular Data distribution")
sns.countplot(x=feature, data=azdias_outliers, ax = axes[1])
axes[1].set_title("Outliers Data Distribution")
For the five columns analyzed, the distributions of values in the two subsets look different.
The outlier rows seem to contain a lot of high-income households.
Since the unsupervised learning techniques to be used only work on data that is encoded numerically, I will make a few encoding changes. In addition, while almost all of the values in the dataset are encoded using numbers, not all of them represent numeric values. The third column of the feature summary (feat_info) summarizes the type of measurement for each feature.
Imputing choices: categorical and ordinal features will be imputed with their most frequent value, and numeric features with their median.
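As a small illustrative sketch of those two imputation strategies (the toy DataFrame below is made up for demonstration only):

# Toy demonstration of the imputation strategies used in this project
from sklearn.impute import SimpleImputer
toy = pd.DataFrame({'cat': [1, 1, 2, np.nan], 'num': [10.0, np.nan, 30.0, 40.0]})
imp_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imp_median = SimpleImputer(missing_values=np.nan, strategy='median')
print(imp_mode.fit_transform(toy[['cat']]).ravel())    # [1. 1. 2. 1.]
print(imp_median.fit_transform(toy[['num']]).ravel())  # [10. 30. 30. 40.]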
# How many features are there of each data type?
feat_info.groupby('type')['type'].count()
# Assess categorical variables: which are binary, which are multi-level, and
# which ones need to be re-encoded?
def cat_feat_investigation(feat_info,col_outliers_nan,df):
feat_less_outliers = feat_info[~feat_info['attribute'].isin(col_outliers_nan['Feature/Index'].values)] #Remove outlier columns from feat_info
cat_feat = feat_less_outliers[feat_less_outliers['type']=='categorical']['attribute'] # Get categorical features
multi_level_cat = []
binary_cat = []
for feature in df[cat_feat]:
print("Feature: {} , Unique values: {}".format(feature,df[feature].unique()))
        if df[feature].nunique() > 2:  # collect the multi-level categorical variables
multi_level_cat.append(feature)
else:
binary_cat.append(feature) #create a list of binary variables
    # Remove CAMEO_DEU_2015: its very large number of unique values would create an excessive number of one-hot encoded columns
multi_level_cat.remove('CAMEO_DEU_2015')
cat_feat = cat_feat[cat_feat.iloc[:]!='CAMEO_DEU_2015']
return(multi_level_cat,cat_feat,binary_cat,feat_less_outliers)
multi_level_cat,cat_feat,binary_cat,feat_less_outliers = cat_feat_investigation(feat_info,col_outliers_nan,azdias_regular)
# Impute and Encode categorical features
def impute_and_encode(df,cat_feat,multi_level_cat):
# Encode 'OST_WEST_KZ' from strings to numbers
df_imputed = df.copy()
le = LabelEncoder()
df_imputed['OST_WEST_KZ'] = le.fit_transform(df['OST_WEST_KZ'].astype(str))
    # Impute the most frequent value for the categorical variables
    imp_cat = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
    df_imputed[cat_feat] = imp_cat.fit_transform(df_imputed[cat_feat])
    # Re-encode (one-hot) the multi-level categorical variables kept in the analysis,
    # using the imputed values so that no 'nan' dummy columns are created
    df_encoded = pd.get_dummies(df_imputed[multi_level_cat].astype(str), drop_first=True)
return(df_imputed,df_encoded)
azdias_imputed,azdias_encoded = impute_and_encode(azdias_regular,cat_feat,multi_level_cat)
azdias_encoded.head()
There are a handful of features marked as "mixed" in the feature summary that require special treatment in order to be included in the analysis.
# Check all mixed features
feat_less_outliers[feat_less_outliers['type']=='mixed']['attribute']
# Investigate "PRAEGENDE_JUGENDJAHRE", "CAMEO_INTL_2015" and PLZ8_BAUMAX for missing values
mixed = ['PRAEGENDE_JUGENDJAHRE','CAMEO_INTL_2015','PLZ8_BAUMAX']
# Percentage and distribution of missing values
def mixed_feature_investigation(df,df_imputed,feature_name):
print("Percentage of missing values in {}:".format(feature_name))
print(df[feature_name].isnull().sum()*100/len(df))
#Impute missing values with the most frequent values
df_imputed[feature_name] = df[feature_name].fillna(df[feature_name].mode().iloc[0])
# Show before and after distributions
fig, ax = plt.subplots(1,2,figsize = (15,5))
ax[0].hist(df[feature_name].dropna())
ax[1].hist(df_imputed[feature_name])
# Add titles
ax[0].set_title("Distribution before Imputing")
ax[1].set_title("Distribution after Imputing")
plt.suptitle("{}".format(feature_name))
plt.show()
return(df_imputed)
def all_mixed_investigation(df, df_imputed):
    # Impute and plot each of the selected mixed-type features
    for feature in mixed:
        df_imputed = mixed_feature_investigation(df, df_imputed, feature)
    return df_imputed
azdias_imputed = all_mixed_investigation(azdias_regular, azdias_imputed)
# Engineer new features from the mixed features
# Lookup for PRAEGENDE_JUGENDJAHRE: code -> [decade, movement] (Mainstream = 1, Avantgarde = 0)
PRAEGENDE_JUGENDJAHRE = {"1": [40,1], "2": [40,0], "3": [50,1], "4": [50,0],\
"5": [60,1], "6": [60,0], "7": [60,0], "8": [70,1],\
"9": [70,0], "10": [80,1], "11": [80,0], "12": [80,1],\
"13": [80,0], "14": [90,1], "15": [90,0]}
def eng_new_feat(df_imputed,df_encoded):
new_feat = ['DECADE','MOVEMENT','WEALTH','LIFE_STAGE','PLZ8_BAUMAX_TYPE','WOHNLAGE']
#Engineer new features
df_encoded['DECADE'] = df_imputed['PRAEGENDE_JUGENDJAHRE'].astype(int).astype(str).map(lambda x: PRAEGENDE_JUGENDJAHRE[x][0])
df_encoded['MOVEMENT'] = df_imputed['PRAEGENDE_JUGENDJAHRE'].astype(int).astype(str).map(lambda x: PRAEGENDE_JUGENDJAHRE[x][1])
df_encoded['WEALTH'] = df_imputed['CAMEO_INTL_2015'].astype(str).str[0].astype(int)
df_encoded['LIFE_STAGE'] = df_imputed['CAMEO_INTL_2015'].astype(str).str[1].astype(int)
    # Collapse PLZ8_BAUMAX into a binary building type: codes 1.0-4.0 -> 1, code 5.0 -> 0
    df_encoded['PLZ8_BAUMAX_TYPE'] = df_imputed['PLZ8_BAUMAX'].replace(to_replace=[1.0,2.0,3.0,4.0],value=1).replace(to_replace=5.0,value=0)
df_encoded['WOHNLAGE'] = df_imputed['WOHNLAGE'].replace(to_replace=8.0, value=7.0)
# Impute missing values with the most frequent value for the new features
    imp_new = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
df_encoded[new_feat] = imp_new.fit_transform(df_encoded[new_feat])
return(df_encoded)
azdias_encoded = eng_new_feat(azdias_imputed,azdias_encoded)
The final data set will include the one-hot encoded multi-level categorical features, the newly engineered features from the mixed-type columns, and the imputed ordinal, numeric, and binary categorical features.
For the new columns that I engineered, I have excluded the original columns from the final dataset; otherwise, their values would interfere with the analysis later in the project.
# Create final dataframe
print("Types of features:")
print(feat_info.groupby('type')['type'].count())
def create_final_df(df_imputed,df_encoded,feat_less_outliers,binary_cat):
#Find ordinal and numerical features
ord_feat = feat_less_outliers[feat_less_outliers['type'].isin(['ordinal'])]['attribute'].values
num_feat = feat_less_outliers[feat_less_outliers['type'].isin(['numeric'])]['attribute'].values
# Initialize imputer
    imp_ord = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
    imp_num = SimpleImputer(missing_values=np.nan, strategy='median')
# Fit and transform imputer for ordinal and numerical features
df_imputed[ord_feat] = imp_ord.fit_transform(df_imputed[ord_feat])
df_imputed[num_feat] = imp_num.fit_transform(df_imputed[num_feat])
df_encoded = df_encoded.assign(**df_imputed[ord_feat])
df_encoded = df_encoded.assign(**df_imputed[num_feat])
#Include binary categorical features
df_encoded = df_encoded.assign(**df_imputed[binary_cat])
print("\nShape of the dataframe: {}".format(df_encoded.shape))
return(df_encoded)
azdias_encoded = create_final_df(azdias_imputed,azdias_encoded,feat_less_outliers,binary_cat)
I will need to perform the same cleaning steps I applied on the general population, again on the customer demographics data. In this substep, I will create a function to execute the main feature selection, encoding, and re-engineering steps I performed above, which can be used with the customer demographics data.
def clean_data(df,feat_info):
"""
Perform feature trimming, re-encoding, and engineering for demographics
data
INPUT: Demographics DataFrame
OUTPUT: Trimmed and cleaned demographics DataFrame
"""
nrows, ncols = df.shape
# Put in code here to execute all main cleaning steps:
# convert missing value codes into NaNs, ...
df = convert_missing_nan(df)
# remove selected columns and rows, ...
col_outliers_nan = find_outliers_nan(df,0,33,nrows)
df_less_out_col = df.drop(labels=col_outliers_nan['Feature/Index'].values,axis=1)
row_outliers_nan = find_outliers_nan(df_less_out_col,1,50,ncols)
df_regular = df_less_out_col.drop(labels=row_outliers_nan['Feature/Index'],axis=0)
# select, re-encode, and engineer column values.
# Encode categorical features
multi_level_cat,cat_feat,binary_cat,feat_less_outliers = cat_feat_investigation(feat_info,col_outliers_nan,df_regular)
df_imputed,df_encoded = impute_and_encode(df_regular,cat_feat,multi_level_cat)
# Encode mixed features
    df_imputed = all_mixed_investigation(df_regular, df_imputed)
df_encoded = eng_new_feat(df_imputed,df_encoded)
# Return the cleaned dataframe.
df_encoded = create_final_df(df_imputed,df_encoded,feat_less_outliers,binary_cat)
return(df_encoded)
# Apply feature scaling to the general population demographics data.
def feature_scaling(df_encoded):
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df_encoded)
df_scaled = pd.DataFrame(scaled_features, index=df_encoded.index, columns=df_encoded.columns)
return(scaler,df_scaled)
scaler_azdias,azdias_scaled = feature_scaling(azdias_encoded)
azdias_scaled.head()
I will now apply dimensionality reduction techniques to the data. I will use sklearn's PCA to perform principal component analysis, finding the vectors of maximal variance in the data. To start, I will set the number of components to at least half the number of features, so there are enough components to see the general trend in variability.
# Apply PCA to the data.
n_comp = 74
pca = PCA(n_components=n_comp)
pca.fit(azdias_scaled)
variance = pca.explained_variance_ratio_
print("Total variance explained by selected components : {}".format(variance.sum()))
# Investigate the variance accounted for by each principal component.
def plot_pca_variance(variance,n_comp):
plt.figure(figsize=(14,7))
plt.title("PCA Variance by components")
plt.bar(np.arange(n_comp),variance*100, width=.2, color = '#00A000', label="Feature Variance")
plt.bar(np.arange(n_comp)+.2,np.cumsum(variance*100), width=.2, color= '#00A0A0', label="Cumulative Feature Variance")
plt.ylabel("Explained Variance Percentage")
plt.xlabel("Feature Index")
plt.yticks(np.arange(0,100,5))
plt.ylim(0,(variance.sum()*100+10))
plt.legend(loc = 'upper center')
plt.tight_layout()
plt.show()
plot_pca_variance(variance,n_comp)
# Re-apply PCA to the data while selecting for number of components to retain.
def apply_pca(n_comp,df):
pca = PCA(n_components=n_comp)
pca.fit(df)
variance = pca.explained_variance_ratio_
print("Total variance explained by selected components : {}".format(variance.sum()))
plot_pca_variance(variance,n_comp)
return(pca, pca.transform(df))
pca_azdias, X_azdias = apply_pca(55,azdias_scaled)
Now that we have our transformed principal components, it's a nice idea to check out the weight of each variable on the first few components to see if they can be interpreted in some fashion.
As a reminder, each principal component is a unit vector that points in the direction of highest variance (after accounting for the variance captured by earlier principal components). The further a weight is from zero, the more the principal component points in the direction of the corresponding feature. If two features have large weights of the same sign (both positive or both negative), then an increase in one can be expected to be associated with an increase in the other. In contrast, features with weights of opposite signs can be expected to show a negative correlation: an increase in one should correspond to a decrease in the other.
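As a small sketch of another way to read these weights, the hypothetical helper below (not part of the project code; the analysis that follows uses pca_dim_feat_map instead) lists the strongest positive and negative weights for a given component, assuming the fitted pca_azdias and azdias_scaled from the earlier cells:

# Illustrative sketch: strongest positive and negative feature weights for one component
def top_pca_weights(pca, columns, component, k=5):
    # components_ has shape (n_components, n_features); row component-1 holds the weights
    weights = pd.Series(pca.components_[component - 1], index=columns).sort_values()
    return weights.tail(k), weights.head(k)  # (most positive, most negative)

# Example usage:
# positive, negative = top_pca_weights(pca_azdias, azdias_scaled.columns, component=1)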
# Map weights for the first principal component to corresponding feature names
# and then print the linked values, sorted by weight.
def pca_dim_feat_map(dimension,pca,df):
return(pd.DataFrame(pca.components_[dimension-1:dimension]*100,columns = df.columns,\
index = np.arange(dimension,dimension+1,1)).sort_values(by=dimension,axis=1,ascending=False))
pca_dim_feat_map(1,pca_azdias,azdias_scaled)
# Map weights for the second principal component to corresponding feature names
# and then print the linked values, sorted by weight.
pca_dim_feat_map(2,pca_azdias,azdias_scaled)
# Map weights for the third principal component to corresponding feature names
# and then print the linked values, sorted by weight.
pca_dim_feat_map(3,pca_azdias,azdias_scaled)
The dimensions below mainly carry information about their highly weighted features and contain comparatively little information about the negatively weighted ones.
Dimension 1:
High weights: Number of family houses in the PLZ8 region (6-10+), estimated household net income, wealth, size of community
Negative weights: Financial habits, movement patterns, number of family houses (1-3), number of buildings in the microcell, number of houses in the microcell (1-2)
Dimension 2:
High weights: Age, financial preparedness, fairly supplied energy consumer, event oriented and sensual minded personality
Negative weights: Decade, money saver, inconspicuous spender, religiosity
Dimension 3:
High weights: Dreamful, family mindedness, social mindedness, cultural mindedness
Negative weights: Gender, combativeness, dominance, critical, rationality
Note: a high weight does not mean that the dimension shows high values of that feature; it means that the dimension carries more information about that feature than about the others, and vice versa for the negatively weighted features.
We have assessed and cleaned the demographics data, then scaled and transformed them. Now, it's time to see how the data clusters in the principal components space. In this substep, I will apply k-means clustering to the dataset and use the average within-cluster distances from each point to their assigned cluster's centroid to decide on a number of clusters to keep.
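One note on the metric used below: KMeans.score returns the opposite of the k-means objective (the negative sum of squared distances from each point to its assigned centroid), so an average per-point value can be obtained by dividing by the number of points. The helper below is only an illustrative sketch of that idea, not part of the project code:

# Illustrative sketch: average within-cluster squared distance per point
def avg_within_cluster_distance(X, n_clusters, random_state=42):
    km = KMeans(n_clusters=n_clusters, random_state=random_state).fit(X)
    # score(X) is the negative inertia, so negate it and normalize by the sample count
    return -km.score(X) / X.shape[0]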
# Over a number of different cluster counts...
kmeans_score = []
for n in np.arange(10,30,5):
kmeans = KMeans(n_clusters=n)
kmeans.fit(X_azdias)
kmeans_score.append(kmeans.score(X_azdias))
kmeans_score
# Investigate the change in within-cluster distance across number of clusters.
# HINT: Use matplotlib's plot function to visualize this relationship.
clusters = np.arange(10,30,5)
# Plot the number of clusters (x) against the k-means score (negative inertia, y)
plt.plot(clusters, kmeans_score)
# Re-fit the k-means model with the selected number of clusters and obtain
# cluster predictions for the general population demographics data.
def do_kmeans(n,X):
kmeans = KMeans(n_clusters=n)
kmeans_fit = kmeans.fit(X)
kmeans_pred = kmeans.predict(X)
kmeans_score = kmeans.score(X)
return(kmeans_fit,kmeans_pred,kmeans_score)
kmeans_fit_azdias,kmeans_pred_azdias,kmeans_score_azdias = do_kmeans(15,X_azdias)
kmeans_score_azdias
Now that I have clusters and cluster centers for the general population, it's time to see how the customer data maps onto those clusters. This is NOT a re-fit of the models to the customer data. Instead, I will use the fits from the general population to clean, transform, and cluster the customer data. In the last step, I will interpret how the general population fits apply to the customer data.
# Check the structure of the data after it's loaded
print("Shape of the demographics data file for the customer population: {}".format(customers.shape))
nrow, ncol = customers.shape
print("\nPrint a sample of the data file")
customers.head(5)
# Remove 'GEBAEUDETYP' because it has a different number of unique values in the demographic
# and customer data, which would make one-hot encoding produce mismatched columns
customers = customers.drop(labels = 'GEBAEUDETYP',axis=1)
# Apply preprocessing, feature transformation, and clustering from the general
# demographics onto the customer data, obtaining cluster predictions for the
# customer demographics data.
customers_encoded = clean_data(customers,feat_info)
customers_scaled = scaler_azdias.transform(customers_encoded)
X_customers = pca_azdias.transform(customers_scaled)
kmeans_pred_customers = kmeans_fit_azdias.predict(X_customers)
At this point, I have clustered data based on demographics of the general population of Germany, and seen how the customer data for a mail-order sales company maps onto those demographic clusters. In this final substep, I will compare the two cluster distributions to see where the strongest customer base for the company is.
Consider the proportion of persons in each cluster for the general population, and the proportions for the customers. If the company's customer base were universal, then the cluster assignment proportions should be fairly similar between the two. If only particular segments of the population are interested in the company's products, then we should see a mismatch from one to the other. If there is a higher proportion of persons in a cluster for the customer data compared to the general population (e.g. 5% of persons are assigned to a cluster for the general population, but 15% of the customer data is closest to that cluster's centroid), that suggests the people in that cluster are a target audience for the company. On the other hand, if the proportion of data in a cluster is larger in the general population than in the customer data (e.g. only 2% of customers are closest to a population centroid that captures 6% of the data), that suggests that group of persons is outside of the target demographics.
# Compare the proportion of data in each cluster for the customer data to the
# proportion of data in each cluster for the general population.
cluster_prop = pd.DataFrame(columns=['General','Customers'], index=np.arange(0,15))
for i in range(15):
    # Percentage of each population assigned to cluster i
    cluster_prop.loc[i,'General'] = (kmeans_pred_azdias==i).sum()*100/len(kmeans_pred_azdias)
    cluster_prop.loc[i,'Customers'] = (kmeans_pred_customers==i).sum()*100/len(kmeans_pred_customers)
plt.figure(figsize=(12,6))
plt.title('General vs Customer Cluster Proportions', fontsize=14)
plt.bar(x=np.arange(0,15),height=cluster_prop['General'],width=.4,color='#0000A0',label='General')
plt.bar(x=np.arange(0,15)+.4,height=cluster_prop['Customers'],width=.4,color='#00A000',label='Customers')
plt.ylabel('Cluster percentage',fontsize=14)
plt.xlabel('Cluster Id',fontsize=14)
plt.xticks(np.arange(0,15,1))
plt.legend(fontsize=12)
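As an illustrative follow-up to the bar chart (not part of the original analysis, and assuming the cluster_prop frame built above), clusters can also be ranked by an over-representation ratio, where values well above 1 mark potential target segments:

# Ratio of customer share to general-population share per cluster
representation_ratio = (cluster_prop['Customers'].astype(float)
                        / cluster_prop['General'].astype(float)).sort_values(ascending=False)
print(representation_ratio)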
# What kinds of people are part of a cluster that is overrepresented in the
# customer data compared to the general population?
# Get inverse PCA transform for general population
inv_pca_azdias = pca_azdias.inverse_transform(X_azdias)
all_clusters_azdias = pd.DataFrame(inv_pca_azdias, index=np.arange(0,len(inv_pca_azdias)), columns=azdias_encoded.columns)
# Function to get pre-PCA data for a given cluster
def cluster_data(pca,X,kmeans_pred,cluster_id,data_encoded):
cluster_data = pca.inverse_transform(X[np.where(kmeans_pred==cluster_id)])
cluster_df = pd.DataFrame(cluster_data, index=np.arange(0,len(cluster_data)), columns=data_encoded.columns)
return(cluster_df)
cluster_1_azdias = cluster_data(pca_azdias,X_azdias,kmeans_pred_azdias,1,azdias_encoded)
# Ratio of feature medians between the given cluster and all of the data
cluster_1_azdias_high_median = (cluster_1_azdias.median()/all_clusters_azdias.median()).reset_index().sort_values([0],ascending=False)
# Top 10 features having high values in the given cluster as compared to all data
cluster_1_azdias_high_median.head(10)
# What kinds of people are part of a cluster that is underrepresented in the
# customer data compared to the general population?
# Get pre-PCA data for a given cluster
cluster_10_azdias = cluster_data(pca_azdias,X_azdias,kmeans_pred_azdias,10,azdias_encoded)
# Ratio of feature medians between the given cluster and all of the data
cluster_10_azdias_high_median = (cluster_10_azdias.median()/all_clusters_azdias.median()).reset_index().sort_values([0],ascending=False)
# Top 10 features having high values in the given cluster as compared to all data
cluster_10_azdias_high_median.head(10)
Plotting the proportion of each cluster's population relative to the total, for both the general and the customer data sets, shows that cluster 1 is overrepresented among customers while cluster 10 is underrepresented.