Data understanding and Preparation

In this notebook we demonstrate the typical steps a responsible data scientist takes to understand and prepare their data.

Importing Data files

In [ ]:
#Import Product DataSet here
import types
import pandas as pd
from botocore.client import Config
import ibm_boto3

# Stub grafted onto the botocore streaming body below so that pandas
# accepts it as a file-like object.
def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in a IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share the notebook.
# NOTE(security): the API key below is hard-coded into the notebook. Anyone
# with this file can access the bucket -- rotate this key and load it from an
# environment variable or a secrets manager before sharing.
client_5ae689904bd0473ea24ad0236f6cf6bb = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='BDofu6biX9qVa-nFQ-cf3avUI_ZYjZQqbkp8B7Hg7NKu',
    ibm_auth_endpoint="https://iam.eu-gb.bluemix.net/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3.eu-geo.objectstorage.service.networklayer.com')

# Stream the product CSV from Cloud Object Storage.
body = client_5ae689904bd0473ea24ad0236f6cf6bb.get_object(Bucket='project1-donotdelete-pr-lawgcttpb9ne6t',Key='Product Data Set - Student 2 of 3.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

# Note the pipe delimiter -- the product file is '|'-separated, unlike the
# other two data sets, which use the default comma.
product_data = pd.read_csv(body,sep='|')
product_data.head()
In [ ]:
#Import Transaction DataSet Here
body = client_5ae689904bd0473ea24ad0236f6cf6bb.get_object(Bucket='project1-donotdelete-pr-lawgcttpb9ne6t',Key='Transaction Data Set - Student 3 of 3.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

transaction_data = pd.read_csv(body)
# Several later cells (shape check, null check, merge) refer to this frame as
# `transactions_data`, which would raise NameError on Restart & Run All.
# Alias it so both spellings resolve to the same DataFrame.
transactions_data = transaction_data
transaction_data.head()
In [ ]:
#Import Customer Dataset Here
body = client_5ae689904bd0473ea24ad0236f6cf6bb.get_object(Bucket='project1-donotdelete-pr-lawgcttpb9ne6t',Key='Customer Data Set - Student 1 of 3.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

# Customer file is comma-separated, so the default sep works here.
customer_data = pd.read_csv(body)
customer_data.head()

Quick Data Exploration

In [ ]:
# (rows, columns) of each frame.
product_data.shape
In [ ]:
# NOTE(review): this references `transactions_data`, but the import cell above
# defines `transaction_data` -- one of the two must be renamed/aliased or this
# cell raises NameError on a fresh Restart & Run All.
transactions_data.shape
In [ ]:
customer_data.shape
We can conclude from the above that Retailer X sells 30 products and served 500 customers in a total of 10,000 recorded transactions.
In [ ]:
# Confirm the loaded object is a pandas DataFrame.
type(customer_data)
In [ ]:
# A single column is a pandas Series.
type(customer_data.AGE)
In [ ]:
# Per-column dtypes; INCOME starts out as object (string) because of the
# '$' sign and thousands separators.
customer_data.dtypes
In [ ]:
# Strip the '$' currency symbol from INCOME using the vectorized .str accessor
# (idiomatic pandas; replaces the element-wise .map(lambda ...) call).
# regex=False is required: as a regex, '$' would match end-of-string instead.
customer_data['INCOME'] = customer_data['INCOME'].str.replace('$', '', regex=False)
In [11]:
customer_data.head(2)
Out[11]:
CUSTOMERID GENDER AGE INCOME EXPERIENCE SCORE LOYALTY GROUP ENROLLMENT DATE HOUSEHOLD SIZE MARITAL STATUS
0 10001 0 64 133,498 5 enrolled 06-03-2013 4 Single
1 10002 0 42 94,475 9 notenrolled NaN 6 Married
In [12]:
# Remove thousands separators and convert INCOME to integer dollars in one
# vectorized step (replaces the element-wise map + int() conversion).
customer_data['INCOME'] = customer_data['INCOME'].str.replace(',', '', regex=False).astype(int)
In [13]:
customer_data.head(2)
Out[13]:
CUSTOMERID GENDER AGE INCOME EXPERIENCE SCORE LOYALTY GROUP ENROLLMENT DATE HOUSEHOLD SIZE MARITAL STATUS
0 10001 0 64 133498 5 enrolled 06-03-2013 4 Single
1 10002 0 42 94475 9 notenrolled NaN 6 Married
In [14]:
customer_data.dtypes
Out[14]:
CUSTOMERID           int64
GENDER               int64
AGE                  int64
INCOME               int64
EXPERIENCE SCORE     int64
LOYALTY GROUP       object
ENROLLMENT DATE     object
HOUSEHOLD SIZE       int64
MARITAL STATUS      object
dtype: object

Now running the “dtypes” method reveals that data type conversion of INCOME was successful

In [15]:
customer_data["MARITAL STATUS"].describe()
Out[15]:
count         500
unique          4
top       Married
freq          267
Name: MARITAL STATUS, dtype: object
In [16]:
customer_data["INCOME"].describe()
Out[16]:
count       500.000000
mean      85792.482000
std       37157.766304
min       20256.000000
25%       52429.000000
50%       86846.500000
75%      118381.000000
max      149999.000000
Name: INCOME, dtype: float64
In [22]:
customer_data["MARITAL STATUS"].unique()
Out[22]:
array(['Single', 'Married', 'Divorced', 'Widow/Widower'], dtype=object)
In [20]:
from datetime import datetime  # kept: a later cell calls datetime.strptime directly
# Parse day-month-year strings to datetime64. pd.to_datetime propagates the
# nulls (customers not enrolled in the loyalty program) as NaT, so no explicit
# notnull() mask is needed, unlike the element-wise strptime approach.
customer_data['ENROLLMENT DATE'] = pd.to_datetime(
    customer_data['ENROLLMENT DATE'], format='%d-%m-%Y')
In [21]:
customer_data.dtypes
Out[21]:
CUSTOMERID                   int64
GENDER                       int64
AGE                          int64
INCOME                       int64
EXPERIENCE SCORE             int64
LOYALTY GROUP               object
ENROLLMENT DATE     datetime64[ns]
HOUSEHOLD SIZE               int64
MARITAL STATUS              object
dtype: object

Data Quality

Data used in this tutorial is mostly free from data quality issues; however, in real life, data scientists deal with data sets that need to be cleaned and corrected for quality issues.

In [23]:
# Check each frame for any missing values (fixes the "transactoins" typo in
# the printed message).
print('null values for transactions ?',transactions_data.isnull().values.any())
print('null values for products ?',product_data.isnull().values.any())
print('null values for customers ?',customer_data.isnull().values.any())
null values for transactions ? False
null values for products ? False
null values for customers ? True
In [24]:
# List the columns containing at least one missing value.
customer_data.columns[customer_data.isna().any()].tolist()
Out[24]:
['ENROLLMENT DATE']

It turns out that ENROLLMENT DATE is the only column with null values. The reason is that not all customers are enrolled in the loyalty program, and hence they have no enrolment date.

Analysis of the distribution of variables using graphs

In [25]:
import matplotlib.pyplot as plt

Univariate Analysis (Single variable analysis)

In [26]:
# Frequency of each marital status.
customer_data['MARITAL STATUS'].value_counts().plot(kind='bar')
plt.xlabel("Marital Status")
plt.ylabel("Frequency Distribution")
plt.show()
In [27]:
# Age distribution in 10 equal-width bins.
customer_data['AGE'].hist(bins=10)  
plt.show()
In [28]:
# Box plot of AGE; 'rs' renders outliers as red squares.
plt.figure(figsize=(8,8))
plt.boxplot(customer_data.AGE,0,'rs',1)
plt.grid(linestyle='-',linewidth=1)
plt.show()
In [29]:
customer_data['AGE'].describe()
Out[29]:
count    500.000000
mean      42.316000
std       17.567509
min       18.000000
25%       30.000000
50%       39.000000
75%       50.250000
max       90.000000
Name: AGE, dtype: float64

Constructing new features and generating Insights

Remember our business understanding objectives 1-Understanding the factors associated with loyalty program participation 2-Understanding the factors associated with increased spending

In [30]:
# Join each transaction to its product attributes (inner join on product code).
trans_products=transactions_data.merge(product_data,how='inner', left_on='PRODUCT NUM', right_on='PRODUCT CODE')
In [31]:
trans_products.head()
Out[31]:
CUSTOMER NUM PRODUCT NUM QUANTITY PURCHASED DISCOUNT TAKEN TRANSACTION DATE STOCKOUT PRODUCT CODE PRODUCT CATEGORY UNIT LIST PRICE
0 10114 30011 4 0.0 1/2/2015 0 30011 APPAREL $25.46
1 10086 30011 6 0.0 1/2/2015 0 30011 APPAREL $25.46
2 10174 30011 10 0.0 1/2/2015 0 30011 APPAREL $25.46
3 10401 30011 12 0.0 1/2/2015 0 30011 APPAREL $25.46
4 10216 30011 12 0.1 1/2/2015 0 30011 APPAREL $25.46
In [32]:
# Strip the '$' sign and convert unit price to float in one vectorized step
# (replaces the element-wise .map(lambda ...)). regex=False is required:
# as a regex, '$' would match end-of-string instead of the literal character.
trans_products['UNIT LIST PRICE'] = trans_products['UNIT LIST PRICE'].str.replace('$', '', regex=False).astype(float)
In [33]:
trans_products.dtypes
Out[33]:
CUSTOMER NUM            int64
PRODUCT NUM             int64
QUANTITY PURCHASED      int64
DISCOUNT TAKEN        float64
TRANSACTION DATE       object
STOCKOUT                int64
PRODUCT CODE            int64
PRODUCT CATEGORY       object
UNIT LIST PRICE       float64
dtype: object
In [34]:
# Revenue per line item: quantity x list price, net of the discount fraction.
trans_products['Total_Price']=trans_products['QUANTITY PURCHASED'] * trans_products['UNIT LIST PRICE'] * (1- trans_products['DISCOUNT TAKEN'])
In [35]:
trans_products.head()
Out[35]:
CUSTOMER NUM PRODUCT NUM QUANTITY PURCHASED DISCOUNT TAKEN TRANSACTION DATE STOCKOUT PRODUCT CODE PRODUCT CATEGORY UNIT LIST PRICE Total_Price
0 10114 30011 4 0.0 1/2/2015 0 30011 APPAREL 25.46 101.840
1 10086 30011 6 0.0 1/2/2015 0 30011 APPAREL 25.46 152.760
2 10174 30011 10 0.0 1/2/2015 0 30011 APPAREL 25.46 254.600
3 10401 30011 12 0.0 1/2/2015 0 30011 APPAREL 25.46 305.520
4 10216 30011 12 0.1 1/2/2015 0 30011 APPAREL 25.46 274.968
In [36]:
# Total revenue per product category, largest first.
Income_by_product = trans_products.groupby('PRODUCT CATEGORY').agg({'Total_Price':'sum'}).sort_values('Total_Price',ascending=False)
In [37]:
Income_by_product
Out[37]:
Total_Price
PRODUCT CATEGORY
ELECTRONICS 1607192.422
APPAREL 936757.914
FOOD 96044.610
HEALTH & BEAUTY 54776.312
In [38]:
# Rename the aggregate column for presentation.
Revenue_by_product=Income_by_product.rename(columns={'Total_Price':'Revenue Per Product'})
In [39]:
# Share of total revenue by category as a pie chart.
Revenue_by_product['Revenue Per Product'].plot(kind='pie',autopct='%1.1f%%',legend = True)
Out[39]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa6e12a75c0>
For each customer, we will calculate total spend, total spend per category, and the most recent transaction date.
In [40]:
# Total spend per (customer, category) pair.
customer_prod_categ=trans_products.groupby(['CUSTOMER NUM','PRODUCT CATEGORY']).agg({'Total_Price':'sum'})
In [41]:
customer_prod_categ.head()
Out[41]:
Total_Price
CUSTOMER NUM PRODUCT CATEGORY
10001 APPAREL 4022.430
ELECTRONICS 1601.315
FOOD 68.688
HEALTH & BEAUTY 1134.337
10002 APPAREL 2312.509
In [42]:
# Only Total_Price is a column; CUSTOMER NUM and PRODUCT CATEGORY live in the
# MultiIndex after the groupby.
customer_prod_categ.columns
Out[42]:
Index(['Total_Price'], dtype='object')
In [45]:
customer_prod_categ.reset_index().head()
Out[45]:
CUSTOMER NUM PRODUCT CATEGORY Total_Price
0 10001 APPAREL 4022.430
1 10001 ELECTRONICS 1601.315
2 10001 FOOD 68.688
3 10001 HEALTH & BEAUTY 1134.337
4 10002 APPAREL 2312.509
In [46]:
# Flatten the MultiIndex into regular columns so the frame can be pivoted.
customer_prod_categ=customer_prod_categ.reset_index()
In [47]:
# Wide format: one row per customer, one column per product category.
customer_pivot=customer_prod_categ.pivot(index='CUSTOMER NUM',columns='PRODUCT CATEGORY',values='Total_Price')
In [48]:
customer_pivot.head()
Out[48]:
PRODUCT CATEGORY APPAREL ELECTRONICS FOOD HEALTH & BEAUTY
CUSTOMER NUM
10001 4022.430 1601.315 68.688 1134.337
10002 2312.509 2473.163 276.779 NaN
10003 2887.382 5414.418 260.640 NaN
10004 3637.213 1840.211 45.270 NaN
10005 213.512 NaN NaN NaN
In [49]:
# Parse month/day/year transaction dates; vectorized pd.to_datetime replaces
# the element-wise datetime.strptime map.
trans_products['TRANSACTION DATE'] = pd.to_datetime(trans_products['TRANSACTION DATE'], format='%m/%d/%Y')
In [50]:
# Per customer: most recent transaction date and lifetime spend.
customer_totals = trans_products.groupby('CUSTOMER NUM').agg(
    {'TRANSACTION DATE': 'max', 'Total_Price': 'sum'})
recent_trans_total_spend = customer_totals.rename(
    columns={'TRANSACTION DATE': 'RECENT TRANSACTION DATE',
             'Total_Price': 'TOTAL SPENT'})
recent_trans_total_spend.head()
Out[50]:
TOTAL SPENT RECENT TRANSACTION DATE
CUSTOMER NUM
10001 6826.770 2015-12-24
10002 5062.451 2015-12-21
10003 8562.440 2015-12-31
10004 5522.694 2015-12-17
10005 213.512 2015-12-22
In [51]:
# Join the per-category pivot with the recency/total-spend KPIs; both frames
# are indexed by customer number.
customer_KPIs=customer_pivot.merge(recent_trans_total_spend,how='inner',left_index=True, right_index=True )
In [52]:
customer_KPIs.head()
Out[52]:
APPAREL ELECTRONICS FOOD HEALTH & BEAUTY TOTAL SPENT RECENT TRANSACTION DATE
CUSTOMER NUM
10001 4022.430 1601.315 68.688 1134.337 6826.770 2015-12-24
10002 2312.509 2473.163 276.779 NaN 5062.451 2015-12-21
10003 2887.382 5414.418 260.640 NaN 8562.440 2015-12-31
10004 3637.213 1840.211 45.270 NaN 5522.694 2015-12-17
10005 213.512 NaN NaN NaN 213.512 2015-12-22
In [53]:
# A NaN in a category column means the customer bought nothing in that
# category, so 0 is the correct fill value.
customer_KPIs=customer_KPIs.fillna(0)
customer_KPIs.head()
Out[53]:
APPAREL ELECTRONICS FOOD HEALTH & BEAUTY TOTAL SPENT RECENT TRANSACTION DATE
CUSTOMER NUM
10001 4022.430 1601.315 68.688 1134.337 6826.770 2015-12-24
10002 2312.509 2473.163 276.779 0.000 5062.451 2015-12-21
10003 2887.382 5414.418 260.640 0.000 8562.440 2015-12-31
10004 3637.213 1840.211 45.270 0.000 5522.694 2015-12-17
10005 213.512 0.000 0.000 0.000 213.512 2015-12-22
In [54]:
# Single customer view: demographics joined with behavioural KPIs.
customer_all_view=customer_data.merge(customer_KPIs,how='inner', left_on='CUSTOMERID', right_index=True)
In [56]:
customer_all_view.head()
Out[56]:
CUSTOMERID GENDER AGE INCOME EXPERIENCE SCORE LOYALTY GROUP ENROLLMENT DATE HOUSEHOLD SIZE MARITAL STATUS APPAREL ELECTRONICS FOOD HEALTH & BEAUTY TOTAL SPENT RECENT TRANSACTION DATE
0 10001 0 64 133498 5 enrolled 2013-03-06 4 Single 4022.430 1601.315 68.688 1134.337 6826.770 2015-12-24
1 10002 0 42 94475 9 notenrolled NaT 6 Married 2312.509 2473.163 276.779 0.000 5062.451 2015-12-21
2 10003 0 40 88610 9 enrolled 2010-09-02 5 Married 2887.382 5414.418 260.640 0.000 8562.440 2015-12-31
3 10004 0 38 84313 8 enrolled 2015-04-06 1 Single 3637.213 1840.211 45.270 0.000 5522.694 2015-12-17
4 10005 0 30 51498 3 notenrolled NaT 1 Single 213.512 0.000 0.000 0.000 213.512 2015-12-22

Bivariate Analysis (2-variable analysis) – Loyalty as a target variable

Gender

In [57]:
# Contingency table: gender vs. loyalty enrolment.
table=pd.crosstab(customer_all_view['GENDER'],customer_all_view['LOYALTY GROUP'])
table
Out[57]:
LOYALTY GROUP enrolled notenrolled
GENDER
0 131 120
1 133 116
In [58]:
# Stacked bars: enrolment split within each gender.
table.plot(kind='bar', stacked=True,figsize=(6,6))
plt.show()

Experience Score

In [59]:
# Contingency table: experience score vs. loyalty enrolment.
table=pd.crosstab(customer_all_view['EXPERIENCE SCORE'],customer_all_view['LOYALTY GROUP'])
table
Out[59]:
LOYALTY GROUP enrolled notenrolled
EXPERIENCE SCORE
1 0 28
2 0 19
3 0 18
4 0 22
5 43 23
6 48 32
7 49 22
8 42 21
9 44 28
10 38 23
In [60]:
# Stacked bars: enrolment split by experience score.
table.plot(kind='bar', stacked=True,figsize=(6,6))
plt.show()

Marital Status

In [61]:
# Marital status vs. loyalty enrolment, as stacked bars.
table=pd.crosstab(customer_all_view['MARITAL STATUS'],customer_all_view['LOYALTY GROUP'])
table.plot(kind='bar', stacked=True,figsize=(6,6))
plt.show()

Age

In [62]:
# Discretize AGE into 10 equal-width intervals for cross-tabulation.
customer_all_view['AGE_BINNED'] = pd.cut(customer_all_view['AGE'],10) # 10 bins of age
In [63]:
customer_all_view['AGE_BINNED'].value_counts()
Out[63]:
(32.4, 39.6]      94
(39.6, 46.8]      91
(25.2, 32.4]      86
(17.928, 25.2]    78
(46.8, 54.0]      51
(54.0, 61.2]      24
(82.8, 90.0]      23
(61.2, 68.4]      23
(75.6, 82.8]      16
(68.4, 75.6]      14
Name: AGE_BINNED, dtype: int64
In [64]:
# Enrolment split per age bin.
table=pd.crosstab(customer_all_view['AGE_BINNED'],customer_all_view['LOYALTY GROUP'])
table.plot(kind='bar', stacked=True,figsize=(6,6))
plt.show()
In [65]:
# Mean age per loyalty group.
customer_all_view.groupby("LOYALTY GROUP").agg({'AGE':'mean'})
Out[65]:
AGE
LOYALTY GROUP
enrolled 44.723485
notenrolled 39.622881
In [66]:
fig = plt.figure(1, figsize=(9, 6))
ax = fig.add_subplot(111)
# Compare age distributions of enrolled vs. not-enrolled customers.
plot1=customer_all_view['AGE'][customer_all_view['LOYALTY GROUP'] == "enrolled"]
plot2=customer_all_view['AGE'][customer_all_view['LOYALTY GROUP'] == "notenrolled"]
# Pass raw ndarrays (.values) to boxplot: silences the "reshape is deprecated"
# FutureWarning this cell emitted when Series were passed directly.
list1=[plot1.values, plot2.values]
ax.boxplot(list1,0,'rs',1)
ax.set_xticklabels(['Enrolled', 'Not Enrolled'])
plt.grid( linestyle='-', linewidth=1)
plt.show()
/opt/conda/envs/DSX-Python35/lib/python3.5/site-packages/numpy/core/fromnumeric.py:57: FutureWarning: reshape is deprecated and will raise in a subsequent release. Please use .values.reshape(...) instead
  return getattr(obj, method)(*args, **kwds)

Total Spend

In [67]:
customer_all_view['TOTAL SPENT BINNED'] = pd.cut(customer_all_view['TOTAL SPENT'],10) # 10 bins of total spend
In [68]:
# Enrolment split per spend bin.
table=pd.crosstab(customer_all_view['TOTAL SPENT BINNED'],customer_all_view['LOYALTY GROUP'])
table.plot(kind='bar', stacked=True,figsize=(6,6))
plt.show()

Bivariate Analysis (2-variable analysis) – Customer spend as a target variable

Age

In [69]:
# Scatter of age vs. lifetime spend.
plt.scatter(customer_all_view['AGE'],customer_all_view['TOTAL SPENT'])
plt.xlabel("AGE")
plt.ylabel("Total Spent")
plt.show()
In [70]:
# Pearson (correlation coefficient, p-value) between age and total spend.
from scipy.stats import pearsonr
pearsonr(customer_all_view['AGE'],customer_all_view['TOTAL SPENT'])
Out[70]:
(0.57601706772592709, 1.5608217502782303e-45)

Income

In [71]:
# Scatter of income vs. lifetime spend.
plt.scatter(customer_all_view['INCOME'],customer_all_view['TOTAL SPENT'])
plt.xlabel("Income")
plt.ylabel("Total Spent")
plt.show()
In [72]:
# Per the outputs, income correlates with spend more strongly than age does.
pearsonr(customer_all_view['INCOME'],customer_all_view['TOTAL SPENT'])
Out[72]:
(0.68803110846251181, 2.3226326963813968e-71)

Experience Score

In [73]:
table = customer_all_view.groupby(['EXPERIENCE SCORE']).agg({'TOTAL SPENT':'mean'}).reset_index()
In [74]:
table['TOTAL SPENT'].plot(kind='bar')
plt.xlabel("Experience Score")
plt.ylabel("Average Total Spent per Score")
plt.xticks([0,1,2,3,4,5,6,7,8,9],[1,2,3,4,5,6,7,8,9,10])    
plt.show()
In [ ]: