Bob’s Music Venue – Machine Learning Capstone

Where to open Bob’s awesome underground alternative music venue.

Machine Learning Capstone Project for
IBM’s Data Science Professional Certificate

1. Description of the problem

Bob Smith has recently come into a modest sum of money and would like to fulfill his dream of opening a mid-sized music venue where he can book both local and larger performance artists, while also providing a safe and interesting hangout for himself and for adults of all ages. Although he has money to invest, his budget is limited, so he needs to be prudent with it. He also wishes to open specifically within the city of Fargo, North Dakota (for what reason, only the gods may speculate). Three factors drive the choice of location:

1. Overhead/Rent – He needs a space large enough to host events and to house a small kitchen, a bar, a barista bar, and a seating section, while keeping rent overhead to a minimum.

2. Crime – He needs a location where rent is economical but violent crime is low, in order to cut down on venue security costs and provide a safer environment for his patrons.

3. Accessibility to Desired Demographic – As a music venue, it will need to be accessible to younger, college-aged crowds who may not have reliable transportation. Nearby hotels might be desirable, and proximity to places of congruous interest may also be valuable.

2. Background, Data, and Approach

Pricing:

There are a total of 38 neighborhoods within the city of Fargo itself (as defined in Zillow’s OpenDataSoft dataset), and the median property value (ZHVI) can also be pulled in CSV format from Zillow’s site at https://www.zillow.com/research/data/ . While home values are not commercial property prices, they can be used to make general assumptions about the relative costs likely associated with each neighborhood.

Crime:

While official crime statistics are available, I thought it might be interesting to run keyword searches on Google and scrape the sites indexed there in order to create a crime index. Using Python’s requests module to fetch pages and Beautiful Soup to parse their content, I search each page for a selection of keywords associated with violent crime and count the number of articles that reference both crime and the neighborhood in question. (This is actually my first attempt at a crawler/scraper, despite coding for quite a few years now.)

Categories:

The category index is derived from the Foursquare API’s category attribute. First, a list of unique venue categories is generated. From that list, a weighted list is built by hand, based on the types of venues that would indicate a good area to open shop. Iterating through the venues, an index is then computed from the number of relevant venues within a given neighborhood.

3. Methodology and Exploratory Process:

Neighborhoods and Median Home Value:

For this data, I used Zillow/OpenDataSoft resources. Since no other neighborhood data was readily available, I had to parse out neighborhood names along with their geo coordinates (both 2D center points and geometric boundary shapes). After merging the ZHVI data into the initial data frame, I formatted the geo data from the neighborhood dataset into a GeoJSON structure that Folium could consume in order to draw the neighborhood boundaries.
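As a rough illustration of that formatting step, the sketch below wraps neighborhood records in a GeoJSON FeatureCollection using only the standard library. The record keys (name, median_value, boundary) and the sample coordinates are assumptions made for the example, not the actual column names or data from the Zillow/OpenDataSoft files.

```python
import json


def to_feature_collection(neighborhoods):
    """Wrap neighborhood records in a GeoJSON FeatureCollection.

    Each record is assumed to be a dict with the hypothetical keys
    'name', 'median_value', and 'boundary' (a list of [lon, lat] pairs).
    """
    features = []
    for hood in neighborhoods:
        ring = [list(point) for point in hood["boundary"]]
        # GeoJSON polygon rings must be closed: first point == last point.
        if ring[0] != ring[-1]:
            ring.append(ring[0])
        features.append({
            "type": "Feature",
            "properties": {
                "name": hood["name"],
                "median_value": hood["median_value"],
            },
            "geometry": {"type": "Polygon", "coordinates": [ring]},
        })
    return {"type": "FeatureCollection", "features": features}


# Made-up sample record, not real Fargo data.
sample = [{
    "name": "Example Neighborhood",
    "median_value": 200000,
    "boundary": [[-96.80, 46.87], [-96.78, 46.87],
                 [-96.78, 46.88], [-96.80, 46.88]],
}]
geojson = to_feature_collection(sample)
geojson_text = json.dumps(geojson)
```

A structure like this can be passed to Folium's Choropleth as geo_data, with key_on set to something like "feature.properties.name" so that each polygon is matched to its row in the data frame.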

And together this data was enough to generate the following choropleth map:

Crime Reports:

As mentioned, I thought it might be interesting to derive a basic “violent crime index” by running a hacked-together crawler/scraper, relying heavily on the Google Search Console and the Beautiful Soup module. I set a timeout so the program would not stall on slow servers or other issues that might arise. (Disclaimer: I only performed a limited set of calls for this, within the fairly small number of free Google API calls permitted, since I did not wish to adversely affect anyone’s site.) I then searched the page content for a small list of keywords directly associated with violent crime to create the index.
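The counting step can be sketched as follows. This is a minimal version that assumes the page text has already been fetched and stripped of markup (for example with requests.get(url, timeout=...) followed by Beautiful Soup's get_text()). The keyword list and the sample pages are hypothetical, since the actual list lives in the notebook.

```python
import re

# Hypothetical keyword list; the one used in the notebook is not shown here.
VIOLENT_CRIME_KEYWORDS = ["assault", "robbery", "shooting", "stabbing", "homicide"]


def mentions(text, phrase):
    """True if `phrase` appears in `text` as a whole word or phrase, ignoring case."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def crime_index(page_texts, neighborhood, keywords=VIOLENT_CRIME_KEYWORDS):
    """Count pages that mention the neighborhood and at least one keyword."""
    count = 0
    for text in page_texts:
        has_hood = mentions(text, neighborhood)
        has_crime = any(mentions(text, kw) for kw in keywords)
        if has_hood and has_crime:
            count += 1
    return count


# Made-up page snippets for illustration only.
pages = [
    "Police respond to an assault in Downtown Fargo late Friday.",
    "Downtown farmers market returns this weekend.",
    "Robbery reported near West Acres mall.",
]
downtown_score = crime_index(pages, "Downtown")
west_acres_score = crime_index(pages, "West Acres")
```

Counting pages rather than raw keyword hits keeps one long article from inflating a neighborhood's score, though it remains a crude proxy either way.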

This approach, while not very scientific or practical, was nonetheless an interesting experiment. I could use this data to create another choropleth map, this time coding “crime” instead of “median_value”:

Venue Category Listings:

Without knowing anything about Fargo, ND, I had to rely entirely on a cursory Google search and the data/APIs listed above to work out my approach. The Foursquare API was invaluable for learning what unique categories of venues exist in Fargo. After exploring crime-related news and median house values, the next step was to explore the neighborhoods in order to derive a list of unique venue categories.

From this list, I was able to select a sublist by hand containing the categories most relevant to the type of venue I’d like to open. I then assigned weights to each of the values, and could then perform a weighted assessment of the relevance of venues within a given neighborhood.
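A minimal sketch of that weighted assessment is shown below. The categories, weights, and neighborhood names are placeholders chosen for illustration, and the notebook's hand-built weights may differ.

```python
# Hypothetical weights; the hand-picked list in the notebook is not shown here.
CATEGORY_WEIGHTS = {
    "Music Venue": 3.0,
    "Brewery": 2.0,
    "Coffee Shop": 1.0,
    "Hotel": 1.0,
}


def category_index(venue_categories, weights=CATEGORY_WEIGHTS):
    """Sum the weight of every venue in a neighborhood.

    Categories that are not on the weighted list contribute 0.
    """
    return sum(weights.get(category, 0.0) for category in venue_categories)


# Made-up venue listings, one category string per venue.
venues_by_hood = {
    "Hood A": ["Music Venue", "Brewery", "Brewery", "Bank"],
    "Hood B": ["Coffee Shop", "Gas Station"],
}
scores = {hood: category_index(cats) for hood, cats in venues_by_hood.items()}
```

Because each venue is counted separately, a neighborhood with two breweries scores higher than one with a single brewery, which matches the idea of rewarding density of relevant venues.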

This allowed me to generate another choropleth map, to which I was also able to attach each neighborhood’s list of categories:

The category list was far too long and full of irrelevant categories, so I pared it down by intersecting each neighborhood’s category list with the list of desirable categories.
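That paring-down is a plain set intersection. A small sketch, again with a hypothetical desirable list and made-up venue categories:

```python
def relevant_categories(neighborhood_categories, desirable_categories):
    """Keep only the neighborhood's categories that are on the desirable list.

    The result is de-duplicated and sorted so map labels stay stable.
    """
    return sorted(set(neighborhood_categories) & set(desirable_categories))


desirable = {"Music Venue", "Brewery", "Coffee Shop", "Hotel"}  # hypothetical list
found = ["Bank", "Brewery", "Gas Station", "Coffee Shop", "Brewery"]
label = ", ".join(relevant_categories(found, desirable))
```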

K-Means Clustering:

For statistical analysis, a clustering technique appeared to be most appropriate in order to provide some level of neighborhood segmentation based upon the available data. Given the wildly different magnitudes of data domains (the indexes being low while the median_values were relatively massive by comparison), some preprocessing and standardization of the data was necessary to prevent one variable from completely dominating the others.
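One common way to do this is z-score standardization, sketched here with the standard library. The numbers are made up, and the notebook may have used a different scaler (scikit-learn's StandardScaler performs the same rescaling, for instance).

```python
from statistics import mean, pstdev


def standardize(column):
    """Rescale a list of numbers to mean 0 and population standard deviation 1."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


# Made-up values on very different scales.
median_value = [150000, 200000, 250000]
crime = [1, 2, 3]
scaled_value = standardize(median_value)
scaled_crime = standardize(crime)
```

After scaling, both columns span the same range, so a distance-based method such as K-Means no longer treats a $50,000 difference in home value as vastly more important than a one-point difference in an index.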

I then ran a distortion test in order to determine the ideal number of clusters for this model:

This seemed to indicate that between 4 and 6 clusters would be ideal, so I chose conservatively and went with 4.
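For readers unfamiliar with the distortion (elbow) test, the sketch below shows the idea with a small hand-written K-Means on made-up points. It is not the notebook's code, and in practice a library implementation such as scikit-learn's KMeans, which reports the same quantity as inertia_, would be the more usual choice.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_distortion(points, k, seed=0, iterations=50):
    """Run plain Lloyd's K-Means and return the final sum of squared distances."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


# Made-up, already-standardized-looking points in three loose groups.
points = [
    (0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
    (5.0, 5.0), (5.1, 5.2), (5.2, 5.1),
    (10.0, 0.0), (10.1, 0.2), (10.2, 0.1),
]
distortions = {k: kmeans_distortion(points, k) for k in range(1, 6)}
```

Plotting distortions against k and looking for the point where the curve flattens gives the candidate cluster counts. A single random start can land in a poor local optimum, so real runs usually repeat each k with several seeds.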

4. Results

The clusters produced by K-Means segmented such that three of the four were each dominated by one of the input variables, while the fourth was a mediocre hodgepodge of neighborhoods. Only one neighborhood fell in the cluster centered on the Category Index variable: Downtown.

The distribution of the aggregated values vs. the category index (please note that the cluster 3 point for Downtown is obscured by its group’s centroid):

5. Discussion

Overview:

Based upon the data, criteria, and analytic results discussed above, the Downtown neighborhood is likely the best neighborhood in which to open an alternative music venue. While the area does have some crime, the rent is likely cheap and it is close to a wide variety of interesting venues, allowing some level of demographic cross-pollination.

Runners-up would be anything from cluster 1 (the Hodgepodge cluster) and cluster 2 (High Crime Index). Cluster 0 (High Property Value) would likely feature very high rent and little access to other venues of interest:

Personal Observations:

The crime metric was probably the least dependable data available, mostly due to the collection process. I think a news/keyword analysis technique might have some application though, perhaps in providing a low-weight augmentation for a more dependable metric.

The category index approach could use refinement, I think, but it might be useful overall as a way to quantify subjective values in decision-making. Building a list of similar venues, or ones that might be congruous with or complementary to the venue being proposed, and then weighting those categories provides a useful way to check that the neighborhood you are choosing is likely a good fit. While having a hotel or coffee shop nearby would be positive, it’s not always indicative of a good location for the demographic you are catering to: a skate park, gaming cafe, and brewery might be better indicators.

For the clustering results, I probably should have chosen 5 clusters, since that might have split Cluster 1 more effectively, making it less of a hodgepodge while producing a definitive runner-up cluster for a good location. Based on my personal exploration of the data before clustering, I would have given Roosevelt/NDSU and West Acres my votes for 2nd and 3rd choice respectively, since both feature good category values and likely have decent rent (Roosevelt/NDSU takes second place because it is directly adjacent to Downtown, not because of rent):

The relevant notebook can be viewed here: Visualize_FARGO_ND
