Data Analysis on Wine Data Sets with Python



Introduction


The purpose of this project was to do a preliminary exploration of wine data, with the goal of selecting the right attributes for classifying wine into quality categories (very good, outstanding, and classic). Since the data was already in a tidy structure, not much attention was paid to data wrangling; the focus was on finding relationships among the attributes, and between the attributes and wine quality.


Loading the Dataset


The dataset "wine.csv" is a collection of wine reviews, with attributes including country, description, designation, points, price, province, regions, variety, and winery. Since the data was already in a tidy structure, little wrangling was done on it beyond dropping the extra index column, removing duplicate rows, and later adding a scoring_scale column.

In [171]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
In [45]:
wine_df = pd.read_csv("wine.csv")
wine_df.head()    #look at the head of our df
Out[45]:
  Unnamed: 0 country                                        description                           designation  points  price        province           region_1           region_2             variety                   winery
0          0      US  This tremendous 100% varietal wine hails from ...                     Martha's Vineyard      96  235.0      California        Napa Valley               Napa  Cabernet Sauvignon                    Heitz
1          1   Spain  Ripe aromas of fig, blackberry and cassis are ...  Carodorum Selección Especial Reserva      96  110.0  Northern Spain               Toro                NaN       Tinta de Toro  Bodega Carmen Rodríguez
2          2      US  Mac Watson honors the memory of a wine once ma...         Special Selected Late Harvest      96   90.0      California     Knights Valley             Sonoma     Sauvignon Blanc                 Macauley
3          3      US  This spent 20 months in 30% new French oak, an...                               Reserve      96   65.0          Oregon  Willamette Valley  Willamette Valley          Pinot Noir                    Ponzi
4          4  France  This is the top wine from La Bégude, named af...                             La Brûlade      95   66.0        Provence             Bandol                NaN  Provence red blend     Domaine de la Bégude

Brief summary of the Wine dataframe

In [40]:
print('Shape\n',wine_df.shape)
print('--------------------------------------------------------------------------')
print('Number of rows\n', len(wine_df))
print('--------------------------------------------------------------------------')
print('Column headers\n', wine_df.columns)
print('--------------------------------------------------------------------------')
print('Data types\n', wine_df.dtypes)
print('--------------------------------------------------------------------------')
print('Index\n', wine_df.index)
print('--------------------------------------------------------------------------')
print('Missing Data\n',wine_df.isnull().sum())
print('--------------------------------------------------------------------------')
print('Duplicated Rows\n', wine_df.duplicated().sum())
print('--------------------------------------------------------------------------')
Shape
 (1103, 10)
--------------------------------------------------------------------------
Number of rows
 1103
--------------------------------------------------------------------------
Column headers
 Index(['country', 'description', 'designation', 'points', 'price', 'province',
       'region_1', 'region_2', 'variety', 'winery'],
      dtype='object')
--------------------------------------------------------------------------
Data types
 country         object
description     object
designation     object
points           int64
price          float64
province        object
region_1        object
region_2        object
variety         object
winery          object
dtype: object
--------------------------------------------------------------------------
Index
 RangeIndex(start=0, stop=1103, step=1)
--------------------------------------------------------------------------
Missing Data
 country          0
description      0
designation    269
points           0
price           57
province         0
region_1       173
region_2       611
variety          0
winery           0
dtype: int64
--------------------------------------------------------------------------
Duplicated Rows
 42
--------------------------------------------------------------------------
In [27]:
# dropping the extra index column

wine_df.drop(columns='Unnamed: 0', inplace=True)
In [98]:
# delete rows that have identical values in every column

wine_df.drop_duplicates(inplace=True) 
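The summary above reported 42 duplicated rows; a quick sanity check (a sketch, not part of the original run) could confirm they were removed:
In [ ]:
# after drop_duplicates, no duplicated rows should remain
assert wine_df.duplicated().sum() == 0
len(wine_df)   # 1103 - 42 = 1061 rows expected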
  • THE 5 MOST EXPENSIVE WINES AND THEIR COUNTRIES
In [51]:
wine_df.loc[:,['country','designation', 'price']].sort_values('price', ascending=False).head().reset_index()
Out[51]:
   index country        designation  price
0    898   Italy        Sorì Tildin  500.0
1     16      US    Rainin Vineyard  325.0
2     18  France      Le Pigeonnier  290.0
3    954   Italy                NaN  240.0
4      0      US  Martha's Vineyard  235.0

Wine Spectator's 100-Point Scale

Wine Spectator tasters review wines on the following 100-point scale:

  • 95-100 Classic: a great wine
  • 90-94 Outstanding: a wine of superior character and style
  • 85-89 Very good: a wine with special qualities
  • 80-84 Good: a solid, well-made wine
  • 75-79 Mediocre: a drinkable wine that may have minor flaws
  • 50-74 Not recommended

https://www.winespectator.com/articles/scoring-scale

SO WE'RE GOING TO GROUP THE POINTS COLUMN ACCORDINGLY

In [54]:
wine_df.points.describe()
Out[54]:
count    1103.000000
mean       89.701723
std         2.390405
min        85.000000
25%        88.000000
50%        90.000000
75%        91.000000
max        96.000000
Name: points, dtype: float64
In [195]:
wine_df['scoring_scale'] = pd.cut(wine_df.points, (84,89,94,100), labels=['very_good', 'outstanding', 'classic'])   # bins: 85-89, 90-94, 95-100
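Since the minimum points value in this data is 85 (see the describe output above), only the top three bins are needed. For reference, the full Wine Spectator scale could be encoded the same way (a sketch; full_scale is a hypothetical name, and the bin edges come from the list above):
In [ ]:
# hypothetical mapping of the entire 100-point scale to category labels
full_scale = pd.cut(wine_df.points,
                    bins=[49, 74, 79, 84, 89, 94, 100],
                    labels=['not_recommended', 'mediocre', 'good',
                            'very_good', 'outstanding', 'classic'])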
  • NOW LET'S SEE HOW THE POINTS OR THE SCORING SCALE INFLUENCE THE PRICE OF WINES
In [196]:
wine_df.loc[wine_df.price > 200,['designation','scoring_scale']].head(10)
Out[196]:
               designation scoring_scale
0        Martha's Vineyard       classic
16         Rainin Vineyard       classic
18           Le Pigeonnier       classic
22              Termanthia       classic
898            Sorì Tildin   outstanding
954                    NaN     very_good
  • LET'S GROUP THE DF BY SCORING SCALE AND COUNT TO SEE HOW THE COMPANIES ARE DOING IN TERMS OF PRODUCING QUALITY WINE
In [197]:
print('not too bad')
wine_df.groupby("scoring_scale").designation.count()
not too bad
Out[197]:
scoring_scale
very_good      297
outstanding    512
classic         25
Name: designation, dtype: int64
In [151]:
sns.countplot(x='scoring_scale', data=wine_df)
Out[151]:
<matplotlib.axes._subplots.AxesSubplot at 0xeff046d2b0>
  • IF I HAD TO RECOMMEND YOU SOME WINE FROM THIS DF, BELOW ARE THE TOP 5 CLASSIC WINES
In [96]:
wine_df.loc[wine_df['scoring_scale'] == 'classic', ['country','designation', 'price','variety','points']].head()
Out[96]:
  country                           designation  price             variety  points
0      US                     Martha's Vineyard  235.0  Cabernet Sauvignon      96
1   Spain  Carodorum Selección Especial Reserva  110.0       Tinta de Toro      96
2      US         Special Selected Late Harvest   90.0     Sauvignon Blanc      96
3      US                               Reserve   65.0          Pinot Noir      96
4  France                            La Brûlade   66.0  Provence red blend      95
  • LET'S HAVE A QUICK LOOK AT THE PRICE DISTRIBUTION
In [129]:
sns.distplot(wine_df.price.dropna()) # THE PRICE COLUMN HAS MISSING VALUES, SO WE DROP THEM FIRST
Out[129]:
<matplotlib.axes._subplots.AxesSubplot at 0xefefe91fd0>
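Note that sns.distplot has since been deprecated; on seaborn 0.11+ the equivalent call would be something like:
In [ ]:
# histogram with a KDE overlay, the modern replacement for distplot
sns.histplot(wine_df.price.dropna(), kde=True)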

TOP 5 WINE-PRODUCING COUNTRIES

In [193]:
top_pro = wine_df.country.value_counts()
top_5_pro = top_pro.iloc[0:5]
top_5_pro.plot(kind='barh', color=['r','b','g','r','y'])
Out[193]:
<matplotlib.axes._subplots.AxesSubplot at 0xeff43c5668>
In [ ]:
## TOP 5 WINE VARIETIES
In [162]:
top_5_variety = wine_df.variety.value_counts().head()
top_5_variety.plot(kind='barh', color=['r', 'w','r','r','r'], edgecolor = 'black')
Out[162]:
<matplotlib.axes._subplots.AxesSubplot at 0xeff0c84940>
In [100]:
wine_df.isnull().sum()
Out[100]:
Unnamed: 0         0
country            0
description        0
designation      269
points             0
price             57
province           0
region_1         173
region_2         611
variety            0
winery             0
scoring_scale      0
dtype: int64

ALTHOUGH THERE'S A GOOD AMOUNT OF MISSING DATA, IT DID NOT GREATLY AFFECT OUR EDA.

But if we had to fix the columns with missing values, I think k-nearest neighbors would be appropriate in this case, because columns like region_1 and region_2 have a very close relationship with country, province, description, and some other columns. For the missing values in the price column, I was thinking of the interpolate() method, or filling with the average price of the corresponding scoring_scale, as sketched below.
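The two price ideas could look like this (a minimal sketch, assuming we pick one of the two options; the KNN imputation of the region columns would first require encoding the categorical variables, so it is left as an idea):
In [ ]:
# Option 1: linear interpolation down the price column
# wine_df['price'] = wine_df['price'].interpolate()

# Option 2: fill each missing price with the mean price of its scoring_scale group
wine_df['price'] = wine_df['price'].fillna(
    wine_df.groupby('scoring_scale')['price'].transform('mean'))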

In [ ]:
# LET'S GENERATE SOME WORD CLOUD FOR PROVINCES AND VARIETIES
In [199]:
province = list(wine_df.province) #CREATE A LIST OF PROVINCES
province = ' '.join(province)     # JOIN THE PROVINCES TO GET ONE STRING


plt.figure(figsize=(12,18))       #SET THE CANVAS TO SUIT OUR VIZ

wordcloud = WordCloud().generate(province)      #CREATE A WORDCLOUD AND LET THE MODULE DECIDE THE WIDTH AND HEIGHT
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.margins(x=0, y=0)
In [190]:
variety = list(wine_df.variety)   # SAME STEPS AS ABOVE, FOR THE VARIETY COLUMN
variety = ' '.join(variety)

plt.figure(figsize=(12,18))
wordcloud = WordCloud(background_color='white').generate(variety)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.margins(x=0, y=0)