Data Analysis on Wine Data Sets with Python



Introduction


The purpose of this project was to do a preliminary exploration of wine data, with the goal of selecting the right attributes for classifying wine into quality categories (very good, outstanding, and classic). Since the data was already in a tidy structure, not much attention was paid to data wrangling; the focus was on finding relationships among the attributes, and between the attributes and wine quality.


Loading the Dataset


The dataset "wine.csv" is a collection of wine reviews, with attributes including country, description, designation, points, price, province, regions, variety, and winery. Since the data was already in a tidy structure, little wrangling was done on it beyond dropping the extra index column, removing duplicate rows, and later adding a scoring_scale column.

In [171]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
In [45]:
wine_df = pd.read_csv("wine.csv")
wine_df.head()    #look at the head of our df
Out[45]:
  Unnamed: 0 country                                        description                           designation  points  price        province           region_1           region_2             variety                   winery
0          0      US  This tremendous 100% varietal wine hails from ...                     Martha's Vineyard      96  235.0      California        Napa Valley               Napa  Cabernet Sauvignon                    Heitz
1          1   Spain  Ripe aromas of fig, blackberry and cassis are ...  Carodorum Selección Especial Reserva      96  110.0  Northern Spain               Toro                NaN       Tinta de Toro  Bodega Carmen Rodríguez
2          2      US  Mac Watson honors the memory of a wine once ma...         Special Selected Late Harvest      96   90.0      California     Knights Valley             Sonoma     Sauvignon Blanc                 Macauley
3          3      US  This spent 20 months in 30% new French oak, an...                               Reserve      96   65.0          Oregon  Willamette Valley  Willamette Valley          Pinot Noir                    Ponzi
4          4  France  This is the top wine from La Bégude, named af...                             La Brûlade      95   66.0        Provence             Bandol                NaN  Provence red blend     Domaine de la Bégude

Brief summary of the Wine dataframe

In [40]:
print('Shape\n',wine_df.shape)
print('--------------------------------------------------------------------------')
print('Number of rows\n', len(wine_df))
print('--------------------------------------------------------------------------')
print('Column headers\n', wine_df.columns)
print('--------------------------------------------------------------------------')
print('Data types\n', wine_df.dtypes)
print('--------------------------------------------------------------------------')
print('Index\n', wine_df.index)
print('--------------------------------------------------------------------------')
print('Missing Data\n',wine_df.isnull().sum())
print('--------------------------------------------------------------------------')
print('Duplicated Rows\n', wine_df.duplicated().sum())
print('--------------------------------------------------------------------------')
Shape
 (1103, 10)
--------------------------------------------------------------------------
Number of rows
 1103
--------------------------------------------------------------------------
Column headers
 Index(['country', 'description', 'designation', 'points', 'price', 'province',
       'region_1', 'region_2', 'variety', 'winery'],
      dtype='object')
--------------------------------------------------------------------------
Data types
 country         object
description     object
designation     object
points           int64
price          float64
province        object
region_1        object
region_2        object
variety         object
winery          object
dtype: object
--------------------------------------------------------------------------
Index
 RangeIndex(start=0, stop=1103, step=1)
--------------------------------------------------------------------------
Missing Data
 country          0
description      0
designation    269
points           0
price           57
province         0
region_1       173
region_2       611
variety          0
winery           0
dtype: int64
--------------------------------------------------------------------------
Duplicated Rows
 42
--------------------------------------------------------------------------
In [27]:
# dropping the extra index column

wine_df.drop(columns='Unnamed: 0', inplace=True)
In [98]:
# delete rows that have identical values in every column

wine_df.drop_duplicates(inplace=True) 
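The summary above reported 42 duplicated rows; a quick sanity check (a sketch, not part of the original run) could confirm they were removed:
In [ ]:
# after drop_duplicates, no duplicated rows should remain
assert wine_df.duplicated().sum() == 0
len(wine_df)   # 1103 - 42 = 1061 rows expected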
  • THE 5 MOST EXPENSIVE WINES AND THEIR COUNTRIES
In [51]:
wine_df.loc[:,['country','designation', 'price']].sort_values('price', ascending=False).head().reset_index()
Out[51]:
   index country        designation  price
0    898   Italy        Sorì Tildin  500.0
1     16      US    Rainin Vineyard  325.0
2     18  France      Le Pigeonnier  290.0
3    954   Italy                NaN  240.0
4      0      US  Martha's Vineyard  235.0

Wine Spectator's 100-Point Scale

Wine Spectator tasters review wines on the following 100-point scale:

  • 95-100 Classic: a great wine
  • 90-94 Outstanding: a wine of superior character and style
  • 85-89 Very good: a wine with special qualities
  • 80-84 Good: a solid, well-made wine
  • 75-79 Mediocre: a drinkable wine that may have minor flaws
  • 50-74 Not recommended

https://www.winespectator.com/articles/scoring-scale

SO WE'RE GOING TO GROUP THE POINTS COLUMN ACCORDINGLY

In [54]:
wine_df.points.describe()
Out[54]:
count    1103.000000
mean       89.701723
std         2.390405
min        85.000000
25%        88.000000
50%        90.000000
75%        91.000000
max        96.000000
Name: points, dtype: float64
In [195]:
wine_df['scoring_scale'] = pd.cut(wine_df.points, (84,89,94,100), labels=['very_good', 'outstanding', 'classic'])   # bins: 85-89, 90-94, 95-100
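Since the minimum points value in this data is 85 (see the describe output above), only the top three bins are needed. For reference, the full Wine Spectator scale could be encoded the same way (a sketch; full_scale is a hypothetical name, and the bin edges come from the list above):
In [ ]:
# hypothetical mapping of the entire 100-point scale to category labels
full_scale = pd.cut(wine_df.points,
                    bins=[49, 74, 79, 84, 89, 94, 100],
                    labels=['not_recommended', 'mediocre', 'good',
                            'very_good', 'outstanding', 'classic'])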
  • NOW LET'S SEE HOW THE POINTS OR THE SCORING SCALE INFLUENCE THE PRICE OF WINES
In [196]:
wine_df.loc[wine_df.price > 200,['designation','scoring_scale']].head(10)
Out[196]:
               designation scoring_scale
0        Martha's Vineyard       classic
16         Rainin Vineyard       classic
18           Le Pigeonnier       classic
22              Termanthia       classic
898            Sorì Tildin   outstanding
954                    NaN     very_good
  • LET'S GROUP THE DF BY SCORING SCALE AND COUNT TO SEE HOW THE COMPANIES ARE DOING IN TERMS OF PRODUCING QUALITY WINE
In [197]:
print('not too bad')
wine_df.groupby("scoring_scale").designation.count()
not too bad
Out[197]:
scoring_scale
very_good      297
outstanding    512
classic         25
Name: designation, dtype: int64
In [151]:
sns.countplot(x='scoring_scale', data=wine_df)
Out[151]:
<matplotlib.axes._subplots.AxesSubplot at 0xeff046d2b0>
  • IF I HAD TO RECOMMEND YOU SOME WINE FROM THIS DF, BELOW ARE THE TOP 5 CLASSIC WINES
In [96]:
wine_df.loc[wine_df['scoring_scale'] == 'classic', ['country','designation', 'price','variety','points']].head()
Out[96]:
  country                           designation  price             variety  points
0      US                     Martha's Vineyard  235.0  Cabernet Sauvignon      96
1   Spain  Carodorum Selección Especial Reserva  110.0       Tinta de Toro      96
2      US         Special Selected Late Harvest   90.0     Sauvignon Blanc      96
3      US                               Reserve   65.0          Pinot Noir      96
4  France                            La Brûlade   66.0  Provence red blend      95
  • LET'S HAVE A QUICK LOOK AT THE PRICE DISTRIBUTION
In [129]:
sns.distplot(wine_df.price.dropna()) # THE PRICE COLUMN HAS MISSING VALUES, SO WE DROP THEM FIRST
Out[129]:
<matplotlib.axes._subplots.AxesSubplot at 0xefefe91fd0>
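Note that sns.distplot has since been deprecated; on seaborn 0.11+ the equivalent call would be something like:
In [ ]:
# histogram with a KDE overlay, the modern replacement for distplot
sns.histplot(wine_df.price.dropna(), kde=True)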

TOP 5 WINE-PRODUCING COUNTRIES

In [193]:
top_pro = wine_df.country.value_counts()
top_5_pro = top_pro.iloc[0:5]
top_5_pro.plot(kind='barh', color=['r','b','g','r','y'])
Out[193]:
<matplotlib.axes._subplots.AxesSubplot at 0xeff43c5668>
In [ ]:
## TOP 5 WINE VARIETIES
In [162]:
top_5_variety = wine_df.variety.value_counts().head()
top_5_variety.plot(kind='barh', color=['r', 'w','r','r','r'], edgecolor = 'black')
Out[162]:
<matplotlib.axes._subplots.AxesSubplot at 0xeff0c84940>
In [100]:
wine_df.isnull().sum()
Out[100]:
Unnamed: 0         0
country            0
description        0
designation      269
points             0
price             57
province           0
region_1         173
region_2         611
variety            0
winery             0
scoring_scale      0
dtype: int64

ALTHOUGH THERE'S A GOOD AMOUNT OF MISSING DATA, IT DID NOT GREATLY AFFECT OUR EDA.

But if we had to fix the columns with missing values, I think k-nearest neighbors would be appropriate in this case, because columns like region_1 and region_2 have a very close relationship with country, province, description, and some other columns. For the missing values in the price column, I was thinking of the interpolate() method, or filling with the average price of the corresponding scoring_scale, as sketched below.
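The two price ideas could look like this (a minimal sketch, assuming we pick one of the two options; the KNN imputation of the region columns would first require encoding the categorical variables, so it is left as an idea):
In [ ]:
# Option 1: linear interpolation down the price column
# wine_df['price'] = wine_df['price'].interpolate()

# Option 2: fill each missing price with the mean price of its scoring_scale group
wine_df['price'] = wine_df['price'].fillna(
    wine_df.groupby('scoring_scale')['price'].transform('mean'))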

In [ ]:
# LET'S GENERATE SOME WORD CLOUD FOR PROVINCES AND VARIETIES
In [199]:
province = list(wine_df.province) #CREATE A LIST OF PROVINCES
province = ' '.join(province)     # JOIN THE PROVINCES TO GET ONE STRING


plt.figure(figsize=(12,18))       #SET THE CANVAS TO SUIT OUR VIZ

wordcloud = WordCloud().generate(province)      #CREATE A WORDCLOUD AND LET THE MODULE DECIDE THE WIDTH AND HEIGHT
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.margins(x=0, y=0)
In [190]:
variety = list(wine_df.variety)   # SAME STEPS AS ABOVE, FOR THE VARIETY COLUMN
variety = ' '.join(variety)

plt.figure(figsize=(12,18))
wordcloud = WordCloud(background_color='white').generate(variety)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.margins(x=0, y=0)