We have taken full advantage of the longitude and latitude data to generate some informative geographic plots.
Thanks to the data processor of this dataset, we had very little missing data and very few inconsistencies.
The name column, which holds the restaurant names, had a considerable number of inconsistent entries. The city column also had some inconsistency, although it wasn't that bad.
So, to clean up those inconsistent entries I used a module called FuzzyWuzzy (a Python library for fuzzy string matching). That method was helpful but did not clean everything, so I combined it with regular expressions (special sequences of characters that match strings, or sets of strings, using a pattern syntax) to finish the job.
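Below is a minimal sketch of the kind of FuzzyWuzzy matching used later in this notebook (the candidate names here are made up for illustration):
from fuzzywuzzy import fuzz, process
candidates = ["mcdonald's", 'mc donalds', "macdonald's", 'burger king']
# token_sort_ratio scores two strings from 0 to 100, ignoring word order
matches = process.extract('mcdonalds', candidates, limit=3, scorer=fuzz.token_sort_ratio)
print(matches)  # each candidate paired with its similarity score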
Again, thanks to the data processor we didn't have to deal with missing values. The only column that had missing values was the websites column, which I removed from the data because it was not relevant and was not going to help us answer any of the questions we were interested in.
The overall purpose of this study was to identify the fast food land invasion in each state, which brands are leading, and how they drive that invasion. This is particularly important for new investors. Fast food contributes to the economy of the US: the more fast food we have, the more jobs are created. We don't want that benefit to reach only one particular territory (in this case the northern states); we want to be able to influence and encourage more investment in areas where there is little business.
Furthermore, this can help to study the proportion of fast food restaurants across neighborhoods to determine if there is a higher accessibility to fast food locations in more vulnerable populations. The basis for vulnerable populations will be viewed through socioeconomic demographics such as average household income and age.
Throughout our exploration and visualization of this dataset we come to a geographic understanding of the fast food industry. As you look at the map below, notice how the northern states are completely covered compared to the western states, and notice how McDonald's leads the industry compared to Burger King and Yum! Brands (which is composed of Taco Bell, KFC, Pizza Hut and WingStreet).
import re
import os
os.environ['PROJ_LIB'] = r"C:\ProgramData\Anaconda3\pkgs\proj4-5.2.0-ha925a31_1\Library\share"
from mpl_toolkits.basemap import Basemap
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.image as pic
from wordcloud import WordCloud as wc
import fuzzywuzzy
from fuzzywuzzy import process, fuzz  # fuzz is used below for token_sort_ratio
food_df = pd.read_csv(r"C:\Users\byamu\Downloads\fast-food-restaurants\FastFoodRestaurants.csv")
food_df.head()
a = pic.imread(r"C:\Users\byamu\Downloads\fast-food-restaurants\Capture.PNG")
plt.figure( figsize=(20,4))
plt.imshow(a)
plt.axis("off")
plt.show()
print('|=================================================================================================|')
print(food_df.info())
print('|=================================MISSING DATA====================================================|')
print('websites',food_df.websites.isnull().sum())
print('|=================================TOTAL OBSERVATIONS==============================================|')
print(food_df.shape)
print('|=================================TOTAL OF FAST FOOD RESTAURANTS==================================|')
print(len(food_df.name.unique()))
print('|=================================CITIES AND PROVINCES============================================|')
print(f"TOTAL OF PROVINCEs:{len(food_df.province.unique())}\nTOTAL OF CITIES:{len(food_df.city.unique())}" )
print('|=================================ALL COLUMN HEADERS==============================================|')
print(food_df.columns)
print('|=================================INDEX===========================================================|')
print(food_df.index)
print('|=================================DUPLICATED ROWS=================================================|')
print(food_df.duplicated().sum())
print('|=================================================================================================|')
Environmental Factors Affecting the Fast Food Industry
In 1970, the fast food industry was worth only $6 billion. Today, according to statista.com, it is worth $198.9 billion and is set to grow to over $223 billion by 2020. In less than 50 years, this industry has evolved a great deal. https://smallbusiness.chron.com/types-economic-factors-can-affect-fast-food-industry-36923.html
top_10_cities = pd.DataFrame(food_df['city'].value_counts(ascending=False).head(10).reset_index())
top_10_cities.columns=['City','total of restaurant']
top_10_cities
top_pro = pd.DataFrame(food_df['province'].value_counts(ascending=False).head(10).reset_index())
top_pro.columns=['province','total of restaurant']
top_pro
top_restaurant = pd.DataFrame(food_df['name'].value_counts(ascending=False).head(10).reset_index())
top_restaurant.columns=['restaurant','total of shops']
top_restaurant
THE ABOVE RESULTS MIGHT OR MIGHT NOT BE TRUE, BECAUSE THERE MIGHT BE DATA INCONSISTENCY IN THE DATAFRAME.
#OVER HERE WE LOWER AND STRIP EXTRA WHITE SPACE TO PREPARE THE COLUMN FOR CLEANSING
food_df['name'] = food_df['name'].str.lower()
food_df['name'] = food_df['name'].str.strip()
food_df['name'].sort_values().unique()
def replace_matches_in_column(df, column, string_to_match, min_ratio=90):
    # get a list of unique strings
    strings = df[column].unique()
    # get the top 10 closest matches to our input string
    matches = fuzzywuzzy.process.extract(string_to_match, strings,
                                         limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)
    # only keep matches with a ratio >= min_ratio (90 by default)
    close_matches = [match[0] for match in matches if match[1] >= min_ratio]
    # get the rows of all the close matches in our dataframe
    rows_with_matches = df[column].isin(close_matches)
    # replace all rows with close matches with the input string
    df.loc[rows_with_matches, column] = string_to_match
print(len(food_df.name.unique()), 'UNIQUE RESTAURANT NAMES BEFORE CLEANING')
# over here we loop through the entire column and clean one name after another
for e in food_df['name'].unique():
    replace_matches_in_column(df=food_df, column='name', string_to_match=e)
len(food_df['name'].sort_values().unique()), 'UNIQUE VALUES LEFT AFTER CLEANING'
food_df['name'].sort_values().unique()[0:10]
AFTER THIS FIRST PASS WE REALIZE THAT WE STILL HAVE SOME REPEATED NAMES IN THE COLUMN, SO I DECIDED TO HELP FUZZYWUZZY DECIDE THE MATCHING NAMES.
from colorama import Fore
print(Fore.RED + '\n\n\n\n\nYOU MIGHT NOT HAVE TO RUN THE CELL BELOW, SEEING THAT YOU WILL BE REQUIRED TO CLEAN/CONFIRM MORE THAN 400 NAMES through input\nI HAVE SAVED A COPY OF A CLEAN DATASET, HASSLE FREE')
for each in food_df['name'].sort_values().unique():  # WE LOOP THROUGH THE UNIQUE VALUES
    # get the top 10 closest matches to our ITERATOR
    matches = fuzzywuzzy.process.extract(each, food_df['name'].unique(),
                                         limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)
    # TO CHECK THAT THE FUNCTION IS WORKING, WE PRINT THE CURRENT LENGTH OF UNIQUE VALUES IN THE COLUMN
    print(len(food_df.name.unique()), '\n')
    # AND WE ALSO DISPLAY THE ITERATOR AND THE 10 STRINGS THAT FUZZYWUZZY THINKS ARE THE CLOSEST
    print(each, '\n\n', matches)
    # TO PREVENT OUR FLOW FROM CRASHING ON BAD INPUT, WE CATCH IT WITH TRY/EXCEPT
    try:
        # THIS LIST WILL SAVE ALL THE STRINGS THAT I THINK MATCH THE ITERATOR
        good = []
        # THEN WE ASK THE USER (IN THIS CASE ME) TO ENTER THE SCORES (AS NUMBERS) OF ALL THE NAMES THAT I THINK MATCH THE ITERATOR
        percent = input('enter the tag (percentage)')
        # THEN WE SPLIT THOSE NUMBERS ON A COMMA AND USE A LIST COMPREHENSION TO CONVERT EACH ONE TO AN INTEGER
        percent = percent.split(',')
        percent = [int(i) for i in percent]
        # NOW IT'S FUN TIME! WE LOOP THROUGH THE MATCHED NAMES, WHICH COME AS TUPLES OF (MATCH NAME, SCORE)
        for e in matches:
            # IF THE SCORE (e[1], SINCE e IS A TUPLE) IS IN percent,
            # WE APPEND THE NAME TO THE LIST good AND PRINT IT TO CONFIRM THE TRANSACTION
            if e[1] in percent:
                good.append(e[0])
                print(good, '\n')
    except:
        print('no match\n\n')
    # get the rows of all the close matches in our dataframe
    rows_with_matches = food_df['name'].isin(good)
    # replace all rows with close matches with the ITERATOR
    food_df.loc[rows_with_matches, 'name'] = each
len(food_df.name.unique()), 'UNIQUE NAMES AFTER THE MANUAL CLEANING HASSLE'
food_df.name.sort_values().unique()
SO FAR, WE ARE ABLE TO SEE THE NUMBER OF FAST FOOD RESTAURANTS IN EACH PROVINCE, TO TELL WHICH STATE HAS THE MOST FAST FOOD, AND WHERE FAST FOOD RESTAURANTS ARE SCARCE.
BUT WE CAN'T TELL WHICH STATE HAS THE MOST FAST FOOD PER SQUARE MILE. (This is important: say you want to find a city where you can start a fast food business; you will most likely go where the competition is not high. But compare a 2,000 sq-mile state with 30 restaurants to a 1,000 sq-mile state with 20. If you don't take the surface area into consideration, you might make the mistake of investing in the state with 20 restaurants, even though it is actually the more crowded one, as the quick sketch below shows.)
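To make the arithmetic concrete, here is a tiny sketch using the hypothetical numbers from the example above:
# hypothetical densities: ignoring area makes the smaller state look safer than it is
state_a = 30 / 2000   # 30 restaurants over 2,000 sq miles -> 0.015 per sq mile
state_b = 20 / 1000   # 20 restaurants over 1,000 sq miles -> 0.020 per sq mile
print(state_a < state_b)  # True: the 30-restaurant state is actually the less saturated market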
HAVING SAID THAT, WE'RE GOING TO GO AHEAD AND BRING IN A PIECE OF ANOTHER DATASET THAT WILL GET US THE STATES' AREAS.
mile_sq = pd.read_csv(r"https://people.sc.fsu.edu/~jburkardt/datasets/states/state_area.txt", delim_whitespace=True,header=None, names = ['province','mile_sq']).set_index('province')
mile_sq.head()
#food_df.province['Co Spgs']
print(len(food_df.province.unique()))
pro = food_df.province.unique()
pro.sort()
print(pro)
print(mile_sq.shape)
sta = mile_sq.index.unique()
print(sta)
Below is a map that proves that "Co Spgs" and "CO" are the same place.
If we set "Co Spgs" to "CO", we will be left with 51 states.
Now the question is: what are we going to do with the other df, where we have 53 states? Well, let's first identify those extra states, then we can decide what to do with them.
So we have "PR" for Puerto Rico and "US" for the United States. We know we do not need "US", and as for Puerto Rico, below is some useful info and the reason why we should drop it, and why we will be left with 51 states instead of 50.
51st state refers to a place or territory that is not one of the 50 states of the United States, but people think about making it the 51st state. ... It is usually said about the possibility of Puerto Rico or other U.S. territories becoming part of the United States. https://simple.m.wikipedia.org/wiki/51st_state
The United States of America. The District of Columbia is a federal district, not a state. Many lists include DC and Puerto Rico, which makes for 52 "states and other jurisdictions". ... The flag has 50 stars, one for each state.
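A quick sketch to surface those mismatches programmatically (assuming the pro and sta arrays computed above):
# provinces present in one table but not the other
print(set(sta) - set(pro))  # expected: {'PR', 'US'} -- only in mile_sq
print(set(pro) - set(sta))  # expected: {'Co Spgs'} -- only in food_df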
#draw a map to prove that "CO" and "Co Spgs" are the same.
word = Basemap(projection='mill', llcrnrlat=20, urcrnrlat=50,
llcrnrlon=-130, urcrnrlon=-60, resolution='c')# over here we draw the map limit
# in this case USA limit
#then we draw the necessary lines, boundaries and colors
word.drawcoastlines()
word.fillcontinents(color='tan',lake_color='aqua',alpha=0.5)
word.drawmapboundary(fill_color='lightblue')
word.drawstates()
word.drawcountries()
#over here we get the longitude and latitude for "Co Spgs"; we are just going to use one value
lat = list(food_df.loc[food_df.province == 'Co Spgs', 'latitude'])[0]
lon = list(food_df.loc[food_df.province == 'Co Spgs', 'longitude'])[0]
#then we set the x,y coordinates based on the "Co Spgs" data we have collected
x,y= word(lon,lat)
#now we plot one of the restaurants and set the color red
word.plot(x,y,'ro',markersize=20, alpha=1)
#we do the same for "CO" as we did with "Co Spgs", but here we use green for the color
lat = list(food_df.loc[food_df.province == 'CO', 'latitude'])[0]
lon = list(food_df.loc[food_df.province == 'CO', 'longitude'])[0]
x,y= word(lon,lat)
word.plot(x,y,'go',markersize=20, alpha=1)
#now that we have no doubt that they are the same, it's time to rename them
food_df.loc[food_df.province=="Co Spgs", 'province'] = 'CO'
d = food_df.province.unique()
d.sort()
d
# as agreed above we're going to drop the "US" and "PR"
mile_sq.drop(["US","PR"], axis=0, inplace=True)
#now it's time to join the mile_sq df to food_df
#basically what we're doing here is taking the mile_sq df and joining it to food_df on province
food_df = food_df.join(mile_sq, on='province', rsuffix='_mile_sq')
food_df.head()
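As a quick sanity check on the join (a sketch): every row should have picked up an area, so any province printed here failed to match an abbreviation in mile_sq.
# provinces whose rows got no area from the join (should be an empty array)
print(food_df.loc[food_df['mile_sq'].isnull(), 'province'].unique())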
WE'RE ALSO GOING TO DROP THE COLUMNS THAT WE'RE NOT GOING TO USE.
food_df.drop(['postalCode', 'websites'], axis=1, inplace=True)
print('duplicates\n',food_df.duplicated().sum())
print('missing values\n',food_df.isnull().sum())
#cool let's move on
#now, just in case, we're going to save these changes as a checkpoint in a CSV file and take it from there
food_df.to_csv(r"C:\Users\...\fast_food_clean.CSV", index=False)
new_df = pd.read_csv(r"C:\Users\...\fast_food_clean.CSV")
new_df.head()
print('TOTAL RESTAURANTS(shops)\n\n', ' ',new_df['keys'].count(),'\n')
print('TOTAL BRANDS\n\n',' ', len(new_df.name.unique()))
top_10_restaurant = pd.DataFrame(new_df.name.value_counts(ascending=False).head(10)).reset_index()
top_10_restaurant.columns = ['restaurant' , 'total']
top_10_restaurant
THE CITY COLUMN LOOKS CLEAN, BUT JUST FOR ARGUMENT'S SAKE WE'RE GOING TO DOUBLE-CHECK IT WITH OUR FUNCTION.
new_df['city'] = new_df['city'].str.lower()
new_df['city'] = new_df['city'].str.strip()
len(new_df.city.unique()),'CITY COLUMN LENGTH BEFORE CLEANING'
# over here we loop through the entire column and clean one city after another
for e in new_df['city'].unique():
    replace_matches_in_column(df=new_df, column='city', string_to_match=e)
len(new_df['city'].sort_values().unique()), 'UNIQUE VALUES LEFT AFTER CLEANING'
list(new_df['city'].sort_values().unique()) #the city column is clean as well
But we know that the restaurant names are not completely clean (the "name" column).
names = new_df['name'].unique()
names.sort()
names
# we can see that 7-eleven is repeated, and so are a&w family restaurant and some other names that fuzzywuzzy did not detect
import re
# over here we check whether the name column contains a given string, then replace it with one common name
new_df.loc[new_df.name.str.contains("seven|7",flags=re.I, regex=True), 'name'] = '7-eleven'
new_df.loc[new_df.name.str.contains("seven|7",flags=re.I, regex=True), 'name']
#for a&w all-american foods we gonna specify that it need to start with 'a&w' because for other shop that conains
#a&W it problably an special arrangement type of brand. like the yum brands or the c'mon one we can be familiar with in
# SA is dunk and dougnut & burgerking now if you have a shop like that you cannot say is dunk & doughnut or is burger.
#same apply with the arrangement like kfc/ tako bell. two brand form one brand
new_df.loc[new_df.name.str.contains("^a&w",flags=re.I, regex=True), 'name'] = 'a&w all-american foods'
new_df.loc[new_df.name.str.contains("checker",flags=re.I, regex=True), 'name'] = "checker's hamburgers"
new_df.loc[new_df.name.str.contains("chipot",flags=re.I, regex=True), 'name'] = 'chipotle mexican grill'
new_df.loc[new_df.name.str.contains("|canes",flags=re.I, regex=True), 'name'] = "raising cane's chicken fingers"
new_df.loc[new_df.name.str.contains("raising|canes",flags=re.I, regex=True), 'name'] = "raising cane's chicken fingers"
#now let's see the difference
#from last check 395
print(new_df.name.unique().shape)
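As a tidier alternative to the one-off lines above, the same fixes can be driven from a single pattern-to-canonical-name mapping; this sketch just replays the patterns already used:
# map regex pattern -> canonical brand name, then apply them all in one loop
fixes = {
    r"seven|7": '7-eleven',
    r"^a&w": 'a&w all-american foods',
    r"checker": "checker's hamburgers",
    r"chipot": 'chipotle mexican grill',
    r"raising|canes": "raising cane's chicken fingers",
}
for pattern, canonical in fixes.items():
    mask = new_df['name'].str.contains(pattern, flags=re.I, regex=True)
    new_df.loc[mask, 'name'] = canonical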
province = food_df.province.value_counts(ascending=False)
print('THE TOP 5 PROVINCES WITH THE MOST FAST FOOD RESTAURANTS IN THE US\n\n\n',
      province.head(),'\n\n\n',
      'THE TOP 5 PROVINCES WITH THE LEAST FAST FOOD IN THE US\n\n\n', province.tail())
new_df
descending_order = new_df.province.value_counts(ascending=False).index
plt.figure( figsize=(18,6))
plt.subplot(211)
sns.countplot(x='province', data=new_df, order=descending_order)
plt.title('TOTAL OF FAST FOOD RESTAURANTS IN EACH PROVINCE')
plt.subplot(212)
pd.DataFrame(new_df['name'].groupby(new_df['province']
).value_counts(ascending=False).unstack(
)).loc[:,"mcdonald’s"].sort_values(ascending=False).plot.bar()
plt.ylabel("total of mcdonald's")
top_restaurant = pd.DataFrame(new_df['name'].value_counts(ascending=False).head(10).reset_index())
top_restaurant.columns=['restaurant','total of shops']
top_restaurant
We see that McDonald's is not only the biggest fast food brand in the US, but it also tends to lead the industry state by state.
We can also measure the two brands that follow it and see how they are doing compared to McDonald's.
# pandas creates its own figure when figsize is passed to .plot.bar, so no extra plt.figure is needed
pd.DataFrame(food_df['name'].groupby(food_df['province']).value_counts(ascending=False).unstack()
).loc[:,['burger king',"mcdonald’s",'kfc/taco bell']].sort_values(by ="mcdonald’s",
ascending=False).plot.bar(width=0.8,figsize=(18,6))
plt.title("MCDONALD'S COMPARED TO BURGER KING AND KFC/TACO BELL")
name = list(new_df.name)
# join the words of each name with '~' so multi-word brands stay as one token in the word cloud
resto = [a.replace(' ', '~') for a in name]
word = ' '.join(resto)
# collocations=False stops WordCloud from pairing tokens up on its own
t = wc(width=1200, height=1000, collocations=False).generate(word)
plt.figure( figsize=(26,12), dpi=300)
plt.imshow(t)
plt.axis('off')
plt.title('FAST FOOD LEADING BRANDS IN THE US')
top_10_cities = pd.DataFrame(new_df['city'].value_counts(ascending=False).head(10).reset_index())
top_10_cities.columns=['city','total of restaurant']
top_10_cities
plt.figure(figsize=(15,10),dpi=100)
sns.barplot(x='city',y='total of restaurant',data=top_10_cities)
plt.title('TOP 10 CITIES IN THE US WITH THE HIGHEST FAST FOOD RATE')
# TO MEASURE DENSITY, WE'RE GOING TO BRING THE SQUARE-MILE TABLE BACK IN AND JOIN IT TO THE PROVINCE COUNTS
mile_pro = pd.DataFrame(new_df.province.value_counts().sort_values())
mile_pro = mile_pro.join(mile_sq, rsuffix='_province')
mile_pro.rename(columns={'province' : 'total_restaurants', 'mile_sq':'land_area'}, inplace=True)
mile_pro.head(10)
mile_pro.sort_values(by='total_restaurants', ascending=False).head(10)
mile_pro.describe()
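With counts and areas now in one frame, we can compute the density that motivated this join (a sketch; rest_per_sq_mile is a new column name introduced here):
# restaurants per square mile of land, most saturated states first
mile_pro['rest_per_sq_mile'] = mile_pro['total_restaurants'] / mile_pro['land_area']
mile_pro.sort_values(by='rest_per_sq_mile', ascending=False).head(10)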
We're going to use a histogram to understand the relationship, if there is any. But first, let's scale the data so that both variables fit within one scale.
from mlxtend.preprocessing import minmax_scaling #we import the necessary tool
#then scale the data to fit within 0-1
scaled_rest = minmax_scaling(mile_pro['total_restaurants'], columns = [0])
scaled_mile = minmax_scaling(mile_pro['land_area'], columns = [0])
scaled_rest[0:5]
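If mlxtend is not available, the same 0-1 scaling can be done directly in pandas (an equivalent sketch; scaled_rest_alt is just an illustrative name):
# min-max scaling by hand: (x - min) / (max - min) maps a column into [0, 1]
col = mile_pro['total_restaurants']
scaled_rest_alt = (col - col.min()) / (col.max() - col.min())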
plt.figure(figsize=(10,10))
sns.distplot(scaled_mile, color='r')
sns.distplot(scaled_rest)
plt.title('RELATIONSHIP BETWEEN RESTAURANTS AND SIZE OF LAND IN EACH STATE')
WELL, IT DEPENDS ON HOW YOU LOOK AT IT.
WHAT IS TRUE, THOUGH, IS THAT THESE TWO VARIABLES ARE INDEPENDENT; ALTHOUGH IT MIGHT LOOK LIKE THERE'S A RELATIONSHIP, IT IS VERY WEAK.
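We can put a number on how weak it is (a sketch, assuming the mile_pro frame from above):
# Pearson correlation between restaurant count and land area;
# a value near 0 would back up the claim that the two variables are (nearly) independent
print(mile_pro['total_restaurants'].corr(mile_pro['land_area']))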
z=new_df.set_index('province')
z.loc['OH',['longitude','latitude']].mean()
plt.figure(figsize=(18,18))
mapa = Basemap(projection='mill', llcrnrlat=20, urcrnrlat=50,
llcrnrlon=-130, urcrnrlon=-60, resolution='c',
width = 90000, height=120000)# over here we draw the map limit
# in this case USA limit
#then we draw the necessary lines, boundaries and colors
mapa.drawcoastlines()
mapa.fillcontinents(color='white',lake_color='aqua',alpha=0.5)
mapa.drawmapboundary(fill_color='lightblue')
mapa.drawstates()
mapa.drawcountries()
#OVER HERE WE LOOP THROUGH THE INDEX OF THE MILE_PRO DF, WHICH HAS THE STATES AS ITS INDEX,
#AND WE USE THE SCALED RESTAURANT DATA TO SET THE SIZE OF EACH MARKER
z = new_df.set_index('province')  # index by province once, outside the loop
for i, each in zip(mile_pro.index, scaled_rest):
    #WE USE THE MEDIAN TO GET THE CENTRAL LOCATION OF EACH STATE'S RESTAURANTS
    lat = z.loc[i,'latitude'].median()
    lon = z.loc[i,'longitude'].median()
    #then we set the x,y coordinates based on that central location
    x,y= mapa(lon,lat)
    #now we plot one marker per state, in red, sized by the scaled restaurant count
    mapa.plot(x,y,'ro',markersize=(each+1)**4*4, alpha=1)
plt.title('FAST FOOD LAND AREA INVASION')
m = Basemap(projection='mill', llcrnrlat=20, urcrnrlat=50, llcrnrlon=-130,\
urcrnrlon=-60, resolution='c')
plt.figure(num=None,figsize=(18,18))
m.drawcoastlines()
m.fillcontinents(color='tan',lake_color='aqua',alpha=0.5)
m.drawmapboundary(fill_color='lightblue')
m.drawstates()
m.drawcountries()
#finally, we plot every restaurant in the cleaned dataset as a small translucent red dot
lat = list(new_df['latitude'])
lon = list(new_df['longitude'])
x,y = m(lon,lat)
m.plot(x,y,'ro',markersize=5, alpha=0.3)
plt.title('EVERY FAST FOOD RESTAURANT IN THE DATASET')