

Solve in Python.

Dataset can be accessed at this link: https://file.io/Gb8ACUwVODtg

Problem 3
You are working as a data scientist and have received data on house prices in the Boston region.
The data set contains the following variables:
• crim: per capita crime rate by town
• zn: proportion of residential land zoned for lots over 25,000 sq.ft.
• indus: proportion of non-retail business acres per town
• chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
• nox: nitric oxides concentration
• rm: average number of rooms per dwelling
• age: proportion of owner-occupied units built prior to 1940
• dis: weighted distances to five Boston employment centers
• rad: index of accessibility to radial highways
• tax: full-value property-tax rate per $10,000
• ptratio: pupil-teacher ratio by town
• b: 1000(Bk - 0.63)² where Bk is the proportion of blacks by town
• lstat: % lower status of the population
• medv: median value of owner-occupied homes in $1000s
Given this information:
1. Download the dataset boston.csv and open it as a pandas DataFrame.
2. Using medv as the response variable and crim (per capita crime rate by town), age (proportion of owner-occupied units built prior to 1940), and nox (nitric oxides concentration) as predictors, fit a linear model (OLS) and a k-nearest-neighbour model (using the 5 nearest neighbours). Which one has better predictive properties under k-fold cross-validation (k = 5)? Explain why.
3. Fit models to predict house prices using crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, and lstat, with OLS, Ridge, and Lasso. Show the coefficients. Use lambda = 0.1 for both Ridge and Lasso. Which variable(s) can be eliminated from the analysis based on the Lasso results?
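
One possible way to approach the three tasks is sketched below with pandas and scikit-learn. It is a sketch under stated assumptions, not a reference solution. It assumes that boston.csv sits in the working directory and that its column names match the variable list above after lower-casing (some copies of the data name the last two predictors differently). It also assumes that the problem's lambda maps to scikit-learn's alpha parameter, that the folds are shuffled with a fixed seed, and that the predictors are standardised before the k-nearest-neighbour fit. The function names are hypothetical.

```python
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumed column names, taken from the variable list in the problem.
ALL_PREDICTORS = ["crim", "zn", "indus", "chas", "nox", "rm", "age",
                  "dis", "rad", "tax", "ptratio", "b", "lstat"]


def load_boston(path="boston.csv"):
    """Task 1: read the CSV into a DataFrame and lower-case the column names."""
    df = pd.read_csv(path)
    df.columns = [str(c).strip().lower() for c in df.columns]
    return df


def compare_ols_knn(df, predictors=("crim", "age", "nox"), seed=0):
    """Task 2: mean 5-fold cross-validated MSE for OLS and 5-nearest neighbours."""
    X, y = df[list(predictors)], df["medv"]
    # Assumption: shuffle the rows before splitting, with a fixed seed.
    folds = KFold(n_splits=5, shuffle=True, random_state=seed)
    models = {
        "ols": LinearRegression(),
        # Assumption: scale inside the pipeline so that no single predictor
        # dominates the Euclidean distances used by KNN.
        "knn5": make_pipeline(StandardScaler(),
                              KNeighborsRegressor(n_neighbors=5)),
    }
    return {
        name: -cross_val_score(model, X, y, cv=folds,
                               scoring="neg_mean_squared_error").mean()
        for name, model in models.items()
    }


def fit_linear_models(df, alpha=0.1):
    """Task 3: coefficients of OLS, Ridge and Lasso on all 13 predictors."""
    X, y = df[ALL_PREDICTORS], df["medv"]
    models = {
        "ols": LinearRegression(),
        "ridge": Ridge(alpha=alpha),
        "lasso": Lasso(alpha=alpha, max_iter=10_000),
    }
    coefs = pd.DataFrame(
        {name: model.fit(X, y).coef_ for name, model in models.items()},
        index=ALL_PREDICTORS,
    )
    # Predictors whose Lasso coefficient is exactly zero are candidates to drop.
    dropped = coefs.index[coefs["lasso"] == 0].tolist()
    return coefs, dropped


if __name__ == "__main__" and Path("boston.csv").exists():
    boston = load_boston()
    print(compare_ols_knn(boston))            # lower MSE = better prediction
    coefficients, eliminated = fit_linear_models(boston)
    print(coefficients.round(4))
    print("Lasso coefficients equal to zero:", eliminated)
```

The model with the lower cross-validated mean squared error has the better out-of-sample predictive properties on this data. For task 3, any predictor listed as having a zero Lasso coefficient is one the Lasso fit suggests can be eliminated. Note that scikit-learn's Lasso divides the squared-error term by twice the sample size, so a lambda of 0.1 under a different textbook convention may correspond to a different alpha.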