projA1
March 10, 2023

[1]: # Initialize Otter
     import otter
     grader = otter.Notebook("projA1.ipynb")

1 Project A.1: Exploring Cook County Housing

1.1 Due Date: Thursday, March 16th, 11:59 PM PDT

You must submit this assignment to Gradescope by the on-time deadline, Thursday, March 16th, 11:59 PM. Please read the syllabus for the grace period policy. No late submissions beyond the grace period will be accepted. While course staff are happy to help you if you encounter difficulties with submission, we may not be able to respond to last-minute requests for assistance (TAs need to sleep, after all!). We strongly encourage you to plan to submit your work to Gradescope several hours before the stated deadline. This way, you will have ample time to reach out to staff for submission support.

1.1.1 Collaboration Policy

Data science is a collaborative activity. While you may talk with others about the homework, we ask that you write your solutions individually. If you do discuss the assignments with others, please include their names in the collaborators cell below.

Collaborators: list names here

1.2 Introduction

This project explores what can be learned from an extensive housing dataset that is embedded in a dense social context in Cook County, Illinois.

Here, in project A.1, we will guide you through some basic Exploratory Data Analysis (EDA) to understand the structure of the data. Next, you will add a few new features to the dataset, cleaning the data in the process.

In project A.2, you will specify and fit a linear model for the purpose of prediction. Finally, we will analyze the error of the model and brainstorm ways to improve the model’s performance.

1.3 Grading

Grading is broken down into autograded answers and free response.

For autograded answers, the results of your code are compared to provided and/or hidden tests.
For free response, readers will evaluate how well you answered the question and/or fulfilled the requirements of the question.

Question  Manual  Points
1a        Yes     1
1b        Yes     1
1c        Yes     1
1d        Yes     1
2a        Yes     1
2b        No      1
3a        No      1
3b        No      1
3c        Yes     1
4         No      2
5b        No      1
5c        Yes     2
5d        No      2
6a        No      1
6b        No      2
6c        Yes     1
6d        No      2
6e        No      1
7a        No      1
7b        No      2
Total     8       28

[2]: import numpy as np
     import pandas as pd

     %matplotlib inline
     import matplotlib.pyplot as plt
     import seaborn as sns

     import warnings
     warnings.filterwarnings("ignore")

     import zipfile
     import os

     # Plot settings
     plt.rcParams['figure.figsize'] = (12, 9)
     plt.rcParams['font.size'] = 12

2 The Data
The dataset consists of over 500,000 records from Cook County, Illinois, the county where Chicago is located. The dataset has 61 features in total; the 62nd is sales price, which you will predict with linear regression in the next part of this project. An explanation of each variable can be found in the included codebook.txt file. Some of the columns have been filtered out to ensure this assignment doesn’t become overly long when dealing with data cleaning and formatting.

The data are split into training and test sets with 204,792 and 68,264 observations, respectively, but we will only be working on the training set for this part of the project.

Let’s first extract the data from cook_county_data.zip. Notice we didn’t leave the csv files directly in the directory because they take up too much space without some prior compression.

[3]: with zipfile.ZipFile('cook_county_data.zip') as item:
         item.extractall()

Let’s load the training data.

[4]: training_data = pd.read_csv("cook_county_train.csv", index_col='Unnamed: 0')

As a good sanity check, we should at least verify that the data shape matches the description.

[5]: # 204,792 observations and 62 features in training data
     assert training_data.shape == (204792, 62)
     # Sale Price is provided in the training data
     assert 'Sale Price' in training_data.columns.values

The next order of business is getting a feel for the variables in our data. A more detailed description of each variable is included in codebook.txt (in the same directory as this notebook). You should take some time to familiarize yourself with the codebook before moving forward.

Let’s take a quick look at all the current columns in our training data.

[6]: training_data.columns.values

[6]: array(['PIN', 'Property Class', 'Neighborhood Code', 'Land Square Feet',
            'Town Code', 'Apartments', 'Wall Material', 'Roof Material',
            'Basement', 'Basement Finish', 'Central Heating', 'Other Heating',
            'Central Air', 'Fireplaces', 'Attic Type', 'Attic Finish',
            'Design Plan', 'Cathedral Ceiling', 'Construction Quality',
            'Site Desirability', 'Garage 1 Size', 'Garage 1 Material',
            'Garage 1 Attachment', 'Garage 1 Area', 'Garage 2 Size',
            'Garage 2 Material', 'Garage 2 Attachment', 'Garage 2 Area',
            'Porch', 'Other Improvements', 'Building Square Feet',
            'Repair Condition', 'Multi Code', 'Number of Commercial Units',
            'Estimate (Land)', 'Estimate (Building)', 'Deed No.', 'Sale Price',
            'Longitude', 'Latitude', 'Census Tract', 'Multi Property Indicator',
            'Modeling Group', 'Age', 'Use', "O'Hare Noise", 'Floodplain',
            'Road Proximity', 'Sale Year', 'Sale Quarter', 'Sale Half-Year',
            'Sale Quarter of Year', 'Sale Month of Year', 'Sale Half of Year',
            'Most Recent Sale', 'Age Decade', 'Pure Market Filter',
            'Garage Indicator', 'Neigborhood Code (mapping)', 'Town and Neighborhood',
            'Description', 'Lot Size'], dtype=object)

[7]: training_data['Description'][0]

[7]: 'This property, sold on 09/14/2015, is a one-story houeshold located at 2950 S LYMAN ST.It has a total of 6 rooms, 3 of which are bedrooms, and 1.0 of which are bathrooms.'

3 Part 1: Contextualizing the Data

Let’s try to understand the background of our dataset before diving into a full-scale analysis.

3.1 Question 1a

Based on the columns in this dataset and the values that they take, what do you think each row represents? That is, what is the granularity of this dataset?

Type your answer here, replacing this text.

SOLUTION: Each row represents one sale of a house in Cook County.

3.2 Question 1b

Why do you think this data was collected? For what purposes? By whom?

This question calls for your speculation and is looking for thoughtfulness, not correctness.

Type your answer here, replacing this text.

SOLUTION: Answers will vary and should be assessed based on whether they provide a possible motivation for data collection and a person or organization who conceivably could have collected it. Answers need not correctly identify that it was collected by CCAO for the purpose of property taxation.

3.3 Question 1c

Certain variables in this dataset either directly contain demographic information (data on people) or could reveal demographic information when linked to other datasets. Identify at least one demographic-related variable and explain the nature of the demographic data it embeds.

Type your answer here, replacing this text.

SOLUTION: Answers should identify at least one of the following:

1. Census Tract could be linked to data from the US Census, which contains tract-level statistics regarding household size, ethnicity, income, etc.
2. Neighborhood Code and Town Code could conceivably be linked to neighborhood- and town-level statistics that would be similar to the Census demographic data.
3. Some other variable, with a description of the demographic data it directly embeds or could reveal when joined with another data set.

3.4 Question 1d

Craft at least two questions about housing in Cook County that can be answered with this dataset and provide the type of analytical tool you would use to answer each one (e.g. “I would create a ___ plot of ___ and ___” or “I would calculate the [summary statistic] for ___ and ___”). Be sure to reference the columns that you would use and any additional datasets you would need to answer that question.

Type your answer here, replacing this text.

SOLUTION: Answers will vary, with some possibilities listed below.

1. What is the average home price in Cook County over this time period? I would calculate the mean and median of the Sale Price.
2. Which month of the year has the highest Sale Prices? I would create a line plot showing the median sale price across the Sale Month of Year.
3. Are the sale prices near the airport lower? I would calculate the median sale price for sales subject to O'Hare Noise and for sales not subject to it.
4. What is the history of home construction across Cook County? I would make a map of Cook County where each neighborhood is shaded according to the average age of construction. This would require a separate data set.

4 Part 2: Exploratory Data Analysis

This dataset was collected by the Cook County Assessor’s Office in order to build a model to predict the monetary value of a home (if you didn’t put this for your answer to Question 1b, please don’t go back and change it - we wanted speculation!). You can read more about data collection in the CCAO’s Residential Data Integrity Preliminary Report.

In part 2 of this project, you will be building a linear regression model that predicts sales prices using training data, but it’s important to first understand how the structure of the data informs such a model.
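To make the idea of exploring structure before modeling concrete, here is a minimal sketch of one grouped summary of the kind proposed in Question 1d: the median sale price for sales with and without O’Hare noise. The column names come from the column listing above. The tiny DataFrame and its values are made up for illustration, and the 0/1 coding of O'Hare Noise is an assumption (check codebook.txt for the actual coding).

```python
import pandas as pd

# Made-up stand-in for training_data; the real frame is loaded from
# cook_county_train.csv. The 0/1 coding of "O'Hare Noise" is an assumption.
toy = pd.DataFrame({
    "O'Hare Noise": [0.0, 0.0, 0.0, 1.0, 1.0],
    "Sale Price": [250_000, 310_000, 270_000, 180_000, 220_000],
})

# Group the sales by the noise indicator and take the median price per group.
# The median is less sensitive than the mean to a few extreme sale prices.
median_by_noise = toy.groupby("O'Hare Noise")["Sale Price"].median()
print(median_by_noise)
```

On the real data, the same pattern would be applied to training_data in place of the toy frame; any difference between groups would be descriptive only, since nearby properties differ in many other ways.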
In this section, we will make a series of exploratory visualizations and do some feature engineering in preparation for that prediction task.

Note that we will perform EDA on the training data.

4.0.1 Sale Price

We begin by examining the distribution of our target variable, Sale Price. At the same time, we also take a look at some descriptive statistics of this variable. We have provided the following helper method plot_distribution that you can use to visualize the distribution of Sale Price using both a histogram and a box plot at the same time. Run the following 2 cells and describe what you think is wrong with the visualization.

[8]: def plot_distribution(data, label):
         fig, axs = plt.subplots(nrows=2)

         sns.distplot(
             data[label],
             ax=axs[0]
         )
         sns.boxplot(
             x=data[label],  # keyword x= keeps the box horizontal across seaborn versions
             width=0.3,
             ax=axs[1],
             showfliers=False,
         )

         # Align axes
         spacer = np.max(data[label]) * 0.05
         xmin = np.min(data[label]) - spacer
         xmax = np.max(data[label]) + spacer
         axs[0].set_xlim((xmin, xmax))
         axs[1].set_xlim((xmin, xmax))

         # Remove some axis text
         axs[0].xaxis.set_visible(False)
         axs[0].yaxis.set_visible(False)
         axs[1].yaxis.set_visible(False)

         # Put the two plots together
         plt.subplots_adjust(hspace=0)
         fig.suptitle("Distribution of " + label)

[9]: plot_distribution(training_data, label='Sale Price')
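The text above mentions descriptive statistics alongside the plot. A minimal sketch of how one might compute them with pandas follows. The Series here is a small made-up sample standing in for training_data['Sale Price'], so the numbers say nothing about the real data.

```python
import pandas as pd

# Made-up prices standing in for training_data['Sale Price'] (illustrative only).
sale_price = pd.Series(
    [60_000, 95_000, 150_000, 210_000, 320_000, 4_500_000],
    name="Sale Price",
)

# describe() reports count, mean, std, min, quartiles, and max in one call.
summary = sale_price.describe()
print(summary)

# Comparing the mean with the median (the "50%" row) is a quick numeric
# companion to the histogram and box plot.
print("mean - median:", summary["mean"] - summary["50%"])
```

On the notebook's data the equivalent call would be training_data['Sale Price'].describe(). Separately, sns.distplot is deprecated in recent seaborn releases; if it is unavailable in your environment, sns.histplot(data[label], kde=True, stat="density", ax=axs[0]) is likely the closest substitute, though the resulting figure may not match the original exactly.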