The Health Department of New York City conducts unannounced inspections of restaurants at least once a year. Inspectors check for compliance in food handling, food temperature, personal hygiene and vermin control. Each violation of a regulation gets a certain number of points. At the end of the inspection, the inspector totals the points, and this number is the restaurant's inspection score. The lower the score, the better the grade.
For various reasons, inspection data may be missing for some restaurants, and new restaurants that have not yet been inspected have no score at all.
This project uses the Foursquare API to explore neighborhoods in New York City and find the most common restaurant categories in each neighborhood. Several other datasets are then combined with the restaurant inspection data to build a model that predicts a score for a restaurant that does not yet have an inspection score.
Such a model would give a restaurant that has not yet been inspected an idea of its likely inspection score, so that it can take measures to improve that score before its first inspection by the NYC Health Department.
Data Used
Below are the data sets used for this project, along with their sources.
a) DOHMH New York City Restaurant Inspection Results
The dataset contains every
sustained or not yet adjudicated violation citation from every full or special
program inspection conducted up to three years prior to the most recent
inspection for restaurants and college cafeterias in an active status on the
RECORD DATE (date of the data pull). When an inspection results in more than
one violation, values for associated fields are repeated for each additional
violation record. Establishments are uniquely identified by their CAMIS (record
ID) number. Keep in mind that thousands of restaurants start business and go
out of business every year; only restaurants in an active status are included
in the dataset.
Records are also included for each restaurant that has applied for a permit but has not yet been inspected, and for inspections resulting in no violations.
Website: https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j
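Because the score is repeated on every violation record of an inspection, the rows need to be collapsed before a per-restaurant score can be used. The following is a minimal sketch of one way to do that with pandas. The column names (CAMIS, ZIPCODE, INSPECTION DATE, SCORE) are assumed to match the CSV export of the dataset, the function name is hypothetical, and the sample rows are made up for illustration.

```python
import pandas as pd


def latest_score_per_restaurant(inspections: pd.DataFrame) -> pd.DataFrame:
    """Collapse violation-level rows to one row per restaurant.

    Assumed columns: CAMIS, ZIPCODE, INSPECTION DATE, SCORE.
    Adjust the names if your copy of the data differs.
    """
    df = inspections.copy()
    df["INSPECTION DATE"] = pd.to_datetime(df["INSPECTION DATE"])
    # Restaurants that have not been inspected carry no score; drop them here.
    df = df.dropna(subset=["SCORE"])
    # The score repeats on every violation record, so keep one row per inspection.
    df = df.drop_duplicates(subset=["CAMIS", "INSPECTION DATE"])
    # Keep the most recent scored inspection for each restaurant.
    df = df.sort_values("INSPECTION DATE").groupby("CAMIS", as_index=False).last()
    return df[["CAMIS", "ZIPCODE", "INSPECTION DATE", "SCORE"]]


# Made-up sample: restaurant 1 has two violation rows for the same inspection
# plus an older inspection; restaurant 3 has no score yet.
sample = pd.DataFrame({
    "CAMIS": [1, 1, 1, 2, 3],
    "ZIPCODE": ["10001", "10001", "10001", "10002", "10003"],
    "INSPECTION DATE": ["01/05/2019", "01/05/2019", "06/10/2018",
                        "03/02/2019", "01/01/1900"],
    "SCORE": [12, 12, 30, 7, None],
})
scores = latest_score_per_restaurant(sample)
print(scores)
```

Keeping only the latest inspection is one possible choice; averaging over all inspections of a restaurant would be an equally reasonable alternative.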
b) NYC boroughs and neighborhoods data
New York City has a total of 5 boroughs and 306 neighborhoods. To segment the neighborhoods and explore them, we need a dataset that contains the 5 boroughs and the neighborhoods that exist in each borough, as well as the latitude and longitude coordinates of each neighborhood.
This dataset is freely available on the web at: https://geo.nyu.edu/catalog/nyu_2451_34572
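As an illustration, the neighborhood file can be flattened into a pandas data frame with one row per neighborhood. The sketch below assumes the file is a GeoJSON-style feature collection in which each feature has "borough" and "name" properties and a point geometry stored as [longitude, latitude]. That layout, the function name, and the hand-written sample are assumptions, not taken from the dataset documentation.

```python
import pandas as pd


def neighborhoods_to_frame(geojson: dict) -> pd.DataFrame:
    """Flatten a GeoJSON-style FeatureCollection into one row per neighborhood.

    Assumes each feature has properties 'borough' and 'name' and a Point
    geometry with [longitude, latitude] coordinates.
    """
    rows = []
    for feature in geojson["features"]:
        lon, lat = feature["geometry"]["coordinates"][:2]
        rows.append({
            "Borough": feature["properties"]["borough"],
            "Neighborhood": feature["properties"]["name"],
            "Latitude": lat,
            "Longitude": lon,
        })
    columns = ["Borough", "Neighborhood", "Latitude", "Longitude"]
    return pd.DataFrame(rows, columns=columns)


# Hand-written sample in the assumed layout (coordinates are approximate).
sample_geojson = {
    "features": [
        {"properties": {"borough": "Manhattan", "name": "Marble Hill"},
         "geometry": {"type": "Point", "coordinates": [-73.91, 40.88]}},
        {"properties": {"borough": "Bronx", "name": "Wakefield"},
         "geometry": {"type": "Point", "coordinates": [-73.85, 40.89]}},
    ]
}
neighborhoods = neighborhoods_to_frame(sample_geojson)
# Restrict to one borough, as is done later for Manhattan.
manhattan = neighborhoods[neighborhoods["Borough"] == "Manhattan"]
print(manhattan)
```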
c) Foursquare API data
The Foursquare API is used to explore neighborhoods in New York City. Its explore function returns the venues around a location, from which the most common venue/restaurant categories in each neighborhood are obtained.
d) NYC Rat Inspection Data
The Rat Information Portal (RIP) is a web-based mapping application where users can view rat inspection data. Findings from the Health Department's inspections are searchable by address, or by borough, block and lot (BBL).
Information about the most recent inspections, compliance, baitings, and cleanups on any given property is available by city, incident zip, and other fields.
This project analyzes whether the number of rat sighting incidents in a zip code correlates with the restaurant inspection scores in that zip code.
e) NYC Rolling Sales Data
The Department of Finance’s Rolling Sales files list properties that sold in the last twelve-month period in New York City for all tax classes. These files include:
· the neighborhood;
· building type;
· square footage;
· other data.
The per-unit sale prices, aggregated by zip code, will be used to check for any correlation with the restaurant inspection score.
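A per-unit sale price can be derived by dividing each sale price by the number of units sold and then summarizing by zip code. The sketch below is one possible way to do this. The column names (ZIP CODE, TOTAL UNITS, SALE PRICE) are assumed to match the Rolling Sales files, the use of the median as the summary is an assumption since the text does not say which aggregate is used, and the sample rows are made up.

```python
import pandas as pd


def median_unit_price_by_zip(sales: pd.DataFrame) -> pd.DataFrame:
    """Compute the median per-unit sale price for each zip code.

    Assumed columns: 'ZIP CODE', 'TOTAL UNITS', 'SALE PRICE'.
    """
    df = sales.copy()
    # Non-numeric placeholders become NaN and are filtered out below.
    df["SALE PRICE"] = pd.to_numeric(df["SALE PRICE"], errors="coerce")
    df["TOTAL UNITS"] = pd.to_numeric(df["TOTAL UNITS"], errors="coerce")
    # Keep only sales with a positive price and at least one unit.
    df = df[(df["SALE PRICE"] > 0) & (df["TOTAL UNITS"] > 0)].copy()
    df["UNIT PRICE"] = df["SALE PRICE"] / df["TOTAL UNITS"]
    return (
        df.groupby("ZIP CODE", as_index=False)["UNIT PRICE"]
        .median()
        .rename(columns={"ZIP CODE": "zip", "UNIT PRICE": "unit_price"})
    )


# Made-up sample rows, including a zero-price sale and a placeholder value.
sample_sales = pd.DataFrame({
    "ZIP CODE": ["10001", "10001", "10002", "10002", "10002"],
    "TOTAL UNITS": [2, 1, 3, 4, 1],
    "SALE PRICE": ["1000000", "300000", "0", "800000", "-"],
})
unit_prices = median_unit_price_by_zip(sample_sales)
print(unit_prices)
```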
Methodology
INDEPENDENT/DEPENDENT VARIABLES:
For the analysis, the dependent
variable is Restaurant Inspection Score and everything else is considered to be
an independent variable.
Note that a lower score indicates a better inspection result, and a higher score a worse one.
Because all the variables used contain continuous numerical data and the goal is to predict the Inspection Score from given input parameters, regression analysis in Python is the natural choice.
Part A: Explore neighborhoods in New York City to get the most common restaurant categories in a particular neighborhood
a) The project starts with the NYC boroughs and neighborhoods data. The JSON data is transformed into a pandas data frame, and the geopy library is used to get the latitude and longitude values of New York City.
b) Next, the Foursquare API is used to explore the venues in Manhattan neighborhoods, using a pre-registered Foursquare client ID.
c) Once the Manhattan data is available, the top 100 venues within a radius of 500 meters of ‘Marble Hill’ are collected.
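The venue query in steps b) and c) can be sketched as follows. This assumes the project used the legacy Foursquare v2 venues/explore endpoint with client ID and secret authentication, which the text does not state explicitly. The helper names, the version date, the placeholder credentials, and the sample response are all illustrative. No network request is made here: the first helper only builds the request URL, and the second counts categories in a response of the assumed shape.

```python
from collections import Counter
from urllib.parse import urlencode

# Legacy v2 endpoint; assumed to be the one used in the project.
EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, lat, lng,
                      radius=500, limit=100, version="20180605"):
    """Build the explore request URL for venues around a coordinate."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,          # API version date (placeholder value)
        "ll": f"{lat},{lng}",  # latitude,longitude of the neighborhood
        "radius": radius,      # search radius in meters
        "limit": limit,        # maximum number of venues to return
    }
    return f"{EXPLORE_ENDPOINT}?{urlencode(params)}"


def most_common_categories(explore_json, top_n=5):
    """Count the primary category of each venue in an explore response."""
    items = explore_json["response"]["groups"][0]["items"]
    counts = Counter(
        item["venue"]["categories"][0]["name"]
        for item in items
        if item["venue"].get("categories")  # skip venues without a category
    )
    return counts.most_common(top_n)


url = build_explore_url("YOUR_ID", "YOUR_SECRET", 40.88, -73.91)

# Made-up response in the assumed shape, for illustration only.
sample_response = {"response": {"groups": [{"items": [
    {"venue": {"name": "A", "categories": [{"name": "Pizza Place"}]}},
    {"venue": {"name": "B", "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "C", "categories": [{"name": "Pizza Place"}]}},
    {"venue": {"name": "D", "categories": []}},
]}]}}
top_categories = most_common_categories(sample_response)
print(top_categories)
```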
Part B: Data Wrangling & Grouping Necessary Data Sets
d) All the other data sets are collected, and data wrangling is applied to clean null values, group the data, and keep only the necessary attribute columns.
e) The data sets are joined and compared based on zip codes.
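The wrangling and joining steps above might look like the sketch below. It normalizes zip codes to five-character strings, drops nulls, groups each source by zip code, and joins the results with pandas merge. The lowercase column names (zip, score, unit_price), the helper names, the inner join, and the sample values are assumptions made for illustration.

```python
import pandas as pd


def normalize_zip(series: pd.Series) -> pd.Series:
    """Convert zip codes stored as floats, ints or strings to 5-character strings."""
    numeric = pd.to_numeric(series, errors="coerce")
    return numeric.map(lambda z: f"{int(z):05d}" if pd.notna(z) else None)


def build_zip_table(scores, rats, prices):
    """Join mean inspection score, rat inspection count and unit price by zip."""
    scores = scores.assign(zip=normalize_zip(scores["zip"]))
    scores = scores.dropna(subset=["zip", "score"])
    rats = rats.assign(zip=normalize_zip(rats["zip"])).dropna(subset=["zip"])
    prices = prices.assign(zip=normalize_zip(prices["zip"])).dropna(subset=["zip"])

    # Group each source down to one row per zip code.
    mean_score = scores.groupby("zip", as_index=False)["score"].mean()
    rat_count = rats.groupby("zip").size().reset_index(name="rat_count")

    # Inner joins keep only zip codes present in every source.
    merged = mean_score.merge(rat_count, on="zip", how="inner")
    return merged.merge(prices[["zip", "unit_price"]], on="zip", how="inner")


# Made-up inputs with mixed zip formats and a missing zip.
scores_df = pd.DataFrame({"zip": [10001.0, 10001.0, "10002", None],
                          "score": [10, 20, 30, 5]})
rats_df = pd.DataFrame({"zip": ["10001", "10001", "10002", "10003"]})
prices_df = pd.DataFrame({"zip": [10001, 10002],
                          "unit_price": [400000, 200000]})
zip_table = build_zip_table(scores_df, rats_df, prices_df)
print(zip_table)
```

An inner join discards zip codes that are missing from any one source; a left join on the inspection data would keep them at the cost of null values in the other columns.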
Part C: Exploratory Data Analysis
f) Numerical variables such as unit sales price and rat sighting counts are explored to find the characteristics that have the most impact on the Inspection Score for different locations, based on zip code.
g) Scatter plots with fitted lines, drawn with the "Matplotlib" and "Seaborn" packages, are used to visualize the relationship between the Inspection Score and the different continuous numeric variables.
h) Then the Pearson correlation coefficient and its p-value are used to measure the linear dependence between the Inspection Score and the independent variables.
i) All the above steps give a better idea of what the data looks like and which variables are important to take into account when predicting the Inspection Score.
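The correlation step can be illustrated with scipy.stats.pearsonr, which returns both the Pearson correlation coefficient and its p-value. The sketch below assumes a zip-level table with hypothetical columns score, rat_count and unit_price; the numbers are invented for illustration and are not results from the project.

```python
import pandas as pd
from scipy import stats


def correlation_table(df, target, predictors):
    """Pearson r and p-value between the target and each predictor column."""
    rows = []
    for col in predictors:
        pair = df[[col, target]].dropna()  # pearsonr does not accept NaN
        r, p = stats.pearsonr(pair[col], pair[target])
        rows.append({"variable": col, "pearson_r": r, "p_value": p})
    return pd.DataFrame(rows)


# Invented zip-level data, for illustration only.
zip_data = pd.DataFrame({
    "rat_count": [1, 2, 3, 4, 5, 6],
    "unit_price": [600000, 500000, 400000, 300000, 200000, 100000],
    "score": [10, 12, 15, 15, 19, 21],
})
corr = correlation_table(zip_data, "score", ["rat_count", "unit_price"])
print(corr)
```

A coefficient close to 1 or -1 together with a small p-value suggests a strong linear relationship; a coefficient near 0 suggests the variable is a weak linear predictor of the score.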
Part D: Model Development
j) In this section, we develop a predictive linear regression model that predicts the Inspection Score of a restaurant in a particular zip code that does not have an inspection score yet.
k) Using a simple linear regression model, we create a linear function with a single predictor variable and the "Inspection Score" as the response variable.