CS484_IML_Assignment_5.docx
Illinois Institute of Technology, CS 484 (Computer Science), Dec 6, 2023, 2 pages
CS 484: Introduction to Machine Learning
Fall Semester 2023, Assignment 5

Question 1 (100 points)
The Center for Machine Learning and Intelligent Systems at the University of California, Irvine manages the Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php). We will use two of the datasets in the repository for our analyses, namely, WineQuality_Train.csv for training and WineQuality_Test.csv for testing.

The categorical target variable is quality_grp. It has two categories, namely, 0 and 1. The Event category is 1. The input features are alcohol, citric_acid, free_sulfur_dioxide, residual_sugar, and sulphates. These five input features are considered interval variables.
We will train two models: a classification tree and a binary logistic regression.

The classification tree has the following specifications. The Splitting Criterion is Entropy, the maximum tree depth is five, and the initial random state value is 20230101.

The binary logistic regression has the following specifications. The model must include the Intercept term. Use the All-Possible Subsets method to determine the model with the lowest Akaike Information Criterion (AIC).

After we train these two models, we will compare them using a suite of model performance metrics and charts.
(a) (20 points) What are the Root Average Squared Error values of both models for both the training and testing partitions?
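One common definition of Root Average Squared Error for a binary target (an assumption here, since the assignment does not spell out the formula) squares the gap between the observed 0/1 outcome and the predicted Event probability:

```python
import numpy as np

def root_average_squared_error(y, p):
    """Square the difference between the observed 0/1 outcome and the
    predicted Event probability, average, then take the square root."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.sqrt(np.mean((y - p) ** 2)))

# Perfect probabilities give 0; a constant 0.5 gives 0.5 on balanced data
print(root_average_squared_error([0, 1, 1], [0.0, 1.0, 1.0]))   # 0.0
print(root_average_squared_error([0, 1], [0.5, 0.5]))           # 0.5
```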
(b) (20 points) What are the Area Under Curve values of both models for both the training and testing partitions?
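Area Under Curve here refers to the area under the ROC curve; with scikit-learn (an assumed toolkit) it is a single call on the predicted Event probabilities. The toy vectors below are placeholders, not the wine data:

```python
from sklearn.metrics import roc_auc_score

# Toy labels and predicted Event probabilities
y_true = [0, 0, 1, 1]
p_event = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc_score(y_true, p_event)
print(auc)  # 0.75
```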
(c) (10 points) Generate the Receiver Operating Characteristic curve for both models on the training partition. Please put the two curves in the same chart frame. Don’t forget to add the diagonal reference line.
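A sketch of the two-curve ROC chart with matplotlib (an assumed plotting library); the probability vectors are made-up placeholders standing in for the two models' training-partition scores:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import roc_curve

y = np.array([0, 0, 0, 1, 1, 1])
p_tree = np.array([0.2, 0.3, 0.6, 0.4, 0.7, 0.9])    # placeholder tree probs
p_logit = np.array([0.1, 0.4, 0.3, 0.6, 0.8, 0.7])   # placeholder logit probs

fig, ax = plt.subplots()
for label, p in [("Classification Tree", p_tree),
                 ("Logistic Regression", p_logit)]:
    fpr, tpr, _ = roc_curve(y, p)
    ax.plot(fpr, tpr, marker="o", label=label)
ax.plot([0, 1], [0, 1], linestyle="--", color="grey", label="Reference")
ax.set_xlabel("False Positive Rate")
ax.set_ylabel("True Positive Rate")
ax.legend()
fig.savefig("roc_both_models.png")
```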
(d) (10 points) Generate the Precision and Recall chart for both models on the training partition. Please put the two curves in the same chart frame. Don’t forget to add the No-Skills line to the chart.
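For a Precision and Recall chart, the No-Skills line is the horizontal line at the Event rate of the partition, because a no-skill classifier's precision equals that rate at every recall. A sketch with scikit-learn and toy scores (not the wine data):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y = np.array([0, 0, 1, 1, 1, 0, 1, 0])                    # toy labels
p = np.array([0.2, 0.4, 0.35, 0.8, 0.7, 0.3, 0.9, 0.6])   # toy Event probs

precision, recall, thresholds = precision_recall_curve(y, p)
no_skill = y.mean()  # height of the No-Skills line
print("No-Skills line at precision =", no_skill)
```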
(e) (10 points) What is the threshold for the Event probability based on the F1 Score from the training partition? Please calculate the thresholds of both models.
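One way to find the F1-maximizing threshold (again assuming scikit-learn, with toy scores) is to evaluate F1 at every candidate threshold returned by precision_recall_curve:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y = np.array([0, 0, 1, 1, 1, 0, 1, 0])                    # toy labels
p = np.array([0.2, 0.4, 0.35, 0.8, 0.7, 0.3, 0.9, 0.6])   # toy Event probs

precision, recall, thresholds = precision_recall_curve(y, p)
# precision and recall have one more entry than thresholds; drop the last point
f1 = (2 * precision[:-1] * recall[:-1]
      / np.clip(precision[:-1] + recall[:-1], 1e-12, None))
best_threshold = float(thresholds[np.argmax(f1)])
print("F1-maximizing threshold:", best_threshold)
```

Running this once per model, on the training partition's predicted Event probabilities, yields the two thresholds the question asks for.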
(f) (10 points) Using the F1 Score threshold, what are the Misclassification Rates of both models when evaluated only on the testing partition?
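With a threshold in hand, the Misclassification Rate is the fraction of observations whose thresholded prediction disagrees with the observed class. A minimal sketch (the 0.5 cutoff below is only illustrative; part (f) uses each model's F1 threshold):

```python
import numpy as np

def misclassification_rate(y, p, threshold):
    """Predict Event (1) when the probability meets or exceeds the
    threshold, then count the fraction of disagreements."""
    pred = (np.asarray(p) >= threshold).astype(int)
    return float(np.mean(pred != np.asarray(y)))

rate = misclassification_rate([0, 1, 1, 0], [0.2, 0.9, 0.4, 0.7], 0.5)
print(rate)  # 0.5
```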
(g) (10 points) Generate the Cumulative Gain and Lift table for both models using the predicted Event probabilities from the testing partition. Which model has the highest Lift value in Decile 1?
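A sketch of the Cumulative Gain and Lift table with pandas: sort by predicted Event probability, cut into ten deciles, and compare each decile's Event rate to the overall rate. The scores here are a deterministic toy, not the wine models' output:

```python
import numpy as np
import pandas as pd

# Deterministic toy scores: 20 observations, 10 Events among the top scores
p = np.linspace(0.99, 0.0, 20)       # predicted Event probabilities
y = (p > 0.5).astype(int)            # observed outcomes

df = pd.DataFrame({"y": y, "p": p}).sort_values("p", ascending=False)
df["decile"] = np.repeat(np.arange(1, 11), len(df) // 10)

overall_rate = df["y"].mean()
table = df.groupby("decile").agg(events=("y", "sum"), rate=("y", "mean"))
table["lift"] = table["rate"] / overall_rate
table["cum_gain"] = table["events"].cumsum() / table["events"].sum()
print(table)
```

In this toy every Event lands in the top half, so Decile 1's lift is 2.0 (its Event rate of 1.0 over the overall rate of 0.5) and cumulative gain reaches 1.0 by Decile 5.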
(h) (10 points) Based on all the above model performance metrics and charts, which model will you pick as the Champion model?
Related Questions
16-7. Automated Title: In eq_world_map_3.py, we specified the title manually when defining my_layout, which means we have to remember to update the title every time the source file changes. Instead, you can use the title for the data set in the metadata part of the JSON file. Pull this value, assign it to a variable, and use this for the title of the map when you’re defining my_layout.
arrow_forward
V5. min jee wants to combine the sales data from each of the art fairs. switch to the combined sales worksheet and then update the worksheet as follows. in cell A5 enter a formula without using a function that references cell a5 in the Madison worksheet. copy the formula from cell a5 to the range a6:a8 without copying the formatting. in cell b5, enter a formula using the sum function, 3d references, grouped worksheets that totals the values from cell b:5 in the chicago: madison worksheets.
arrow_forward
nycflights13::flights
Q4<- flights %>%filter(carrier == "JFK") %>%summarise(average_dist = mean(distance)%>%summarise(max_dist = max(average_dist))%>%group_by(Q4, month, day)%>%head(Q4[order(dat$month),(dat$day),(dat$max_dist)],n=5)
*not pictured, I did upload, tidy verse, dplyr, and the nycflights13 libraries.
Error: Incomplete expression: Q4<- flights %>% filter(carrier == "JFK") %>% summarise(average_dist = mean(distance)%>% summarise(max_dist = max(average_dist))%>% group_by(Q4, month, day)%>% head(Q4[order(dat$month),(dat$day),(dat$max_dist)],n=5)
RStudio Question: I can't figure out what is wrong with my code. Could someone take a look at it? See below for what I am trying to do, my code and my error output.
I am trying to find what 5 days of the year had the highest mean distance from JFK airport, using the nycflights13 library. I want to format it in month, day, andvmean distance.
arrow_forward
dataframe = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/regression_sprint/titanic_train_raw.csv')
df = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/regression_sprint/titanic_test_raw.csv')
Write a function that takes in as input a dataframe and a column name, and returns the mean for numerical columns and the mode for non-numerical columns. Function Specifications: The function should take two inputs: (df, column_name), where df is a pandas DataFrame, column_name is a str. If the column_name does not exist in df, raise a ValueError. Should return as output the mean if the specified column is numerical and return a list of the mode(s) otherwise. The mean should be rounded to 2 decimal places. If there is more than one mode for a given non-numerical column, the fuction should return a list of all modes.
def calc_mean_mode(df, column_name): # your code here return
calc_mean_mode(df,'Age')
Expected Outputs:…
arrow_forward
3. Given the XML document NKU_Programs.xml file, a function that prints the contents ofthe Program elements in a table as shown below.Program Name College Name Student Count Year Started Type*************************************************************************************Data Science Informatics 100 2015 UComputer Science Informatics 400 1990 UBiology Science 300 2000 G
HERE IS THE NKU_Programs.xml FILE
<NKUPRograms><Program studCount="100" type="Undergraduate"><Name>Data Science</Name><YearStarted>2015</YearStarted><CollegeName>Informatics</CollegeName></Program><Program studCount="400" type="Undergraduate"><Name>Computer Science</Name><YearStarted>1990</YearStarted><CollegeName>Informatics</CollegeName></Program><Program studCount="300"…
arrow_forward
1)Consider the following DTD. All unspecified elements are #PCDATA.<!DOCTYPE bibliography [<!ELEMENT book (title,author+,year,publisher,place?)><!ELEMENT article (title,author+,journal,year,number,volume,pages?)><!ELEMENT author (last_name, first_name)>...]>Write the following queries in XQuery. Assume Last names of authors are unique.(a) List all books authored by Ullman.(b) List articles published in Vol. 32 No. 2 of ACM Transactions on Database Systems.(c) List journals that published an article on XML in 2001 (that is, “XML” appears in the title of thearticle).(d) List authors that have published a book in 2001 and another book in 2003.(e) List publishers, and for each publisher list the books they have published as subelements. Chooseappropriate tags for your output.(f) List every author, and for each author list the number of articles he/she published in 2001. Chooseappropriate tags for your output.
arrow_forward
1)Consider the following DTD. All unspecified elements are #PCDATA.<!DOCTYPE bibliography [<!ELEMENT book (title,author+,year,publisher,place?)><!ELEMENT article (title,author+,journal,year,number,volume,pages?)><!ELEMENT author (last_name, first_name)>...]>Write the following queries in XQuery. Assume Last names of authors are unique.(a) List all books authored by Ullman.(b) List articles published in Vol. 32 No. 2 of ACM Transactions on Database Systems.(c) List journals that published an article on XML in 2001 (that is, “XML” appears in the title of thearticle).(d) List authors that have published a book in 2001 and another book in 2003.(e) List publishers, and for each publisher list the books they have published as subelements. Chooseappropriate tags for your output.(f) List every author, and for each author list the number of articles he/she published in 2001. Chooseappropriate tags for your output.
answer d,e,f questions
arrow_forward
Exercise 6 (Operations on DataFrame)
Step a: Create two dataframes df1 and df2 as follows:import numpy as npimport pandas as pdrng = np.random.RandomState(100)df1 = pd.DataFrame(rng.randint(0, 100, (4, 3)), columns=['A', 'B', 'C'])df2 = pd.DataFrame(rng.randint(0, 100, (3, 4)), columns=['A', 'B', 'C', 'D'])
Step b: Create a new dataframe df which is the summation of df1 and df2;
Step c: Subtract all columns of df by the half of column 'C' in df1; (Remark: the values in df should be updated)
Step d: Replace the NaN in df by 10; (Remark: the values in df should be updated)
Step e: Use df.apply() to calculate the summation of the numbers in each row of df, and show the result. (Remark: the result should be a vector of four values)
arrow_forward
get_total_cases() takes the a 2D-list (similar to database) and an integer x from this set {0, 1, 2} as input parameters. Here, 0 represents Case_Reported_Date, 1 represents Age_Group and 2 represents Client_Gender (these are the fields on the header row, the integer value represents the index of each of these fields on that row). This function computes the total number of reported cases for each instance of x in the text file, and it stores this information in a dictionary in this form {an_instance_of_x : total_case}. Finally, it returns the dictionary and the total number of all reported cases saved in this dictionary.
arrow_forward
Consider the below given iris data set. This dataset contains 3 classes of 15 instances each and each class refers to a type of iris plant. The dataset has two features: sepal length, sepal width. The third column is for species, which holds the value for these types of plants. A new plant is identified. You have to classify the Species class of new identified plant with the help of KNN algorithm.
Note: Before applying KNN modify the given data by adding last two digits of your registration number. Such as if the last two digits of your registration number is 23 then first row in the given table will be 28.3and 26.3.
arrow_forward
#Question 4 use sku.csv and WarehouseLocations.csv##############################################################def warehouse_stats(sku):"""Question 4- Read sku.csv with CSV and create a dictionary of the New SKU Statistics.- The New Sku should be the key, with the corresponding value being an innerdictionary containing the following statistics:- 350 Loc: True if not 0- Warehouse Qty- Forcasted Qty- Items/Day: can be calculated using CuFt/Day divided by Item Cube.This result should be an float rounded to5 decimals places.- In your warehouse dictionary, add an inner dictionary with key Totals whichcontains:- Total Qty in Warehouse as key "Qty": Do Not add to Totals if '350 Loc' is not a valid location.- Number of Valid 350 Loc as key "Valid"Data Cleaning Steps:- In some variates of New Sku #, the Item Cube & CuFt/Day are faulty.Fix the manufacturers mistake. If either is **less than or equal to 0**,Item Cube can be assumed to be 5.0 and CuFt/Day is 10% of theForcasted Qty of the New…
arrow_forward
In R please provide the code and explanation for the following
One Way ANOVA with the coagulation data seta. Load the coagulation data set (it is from the faraway library)b. Data Summaries & Assumption Check i. Use the names() function to identify the column names ii. How many rows of data are there? iii. Create a single graph with 4 boxplots on the same scale, one for the coagulation for each of the diets. Each boxplot should be a different color. Use the plot function for this task.colNameData is the name of the column in the data set that you want to create a boxplot forcolNameCategory is the name of the column that you want to use to split the data into groups (so you want one boxplot for each category/value in this column)dataset is the name of the full dataset col=2:4 will give you 3 different colors (color indices are 1-8, then they repeat) iv. Create 4 different data frames, one for the data corresponding to each of the 4factors. How many observations are there for…
arrow_forward
Q1-
Suppose a group of 9 sales price records has been stored as follows:
9, 25, 28, 30, 48, 72, 78, 195, 213
Partition them into three bins then smoothing them by the following methods:
1.Partition into equal-width then smoothing by bin means.
2. Partition into equal-depth then smoothing by bin boundaries.
arrow_forward
Python code using pandas to find the users with the highest aggregate scores (over all their posts) for the whole dataset. You should restrict your results to only those whose aggregated score is above 10,000 points, in descending order.
Code should generate a dictionary of the form {author:aggregated_scores ... }.
arrow_forward
Use python machine learning.
A group of data scientists want to analyze some data. They already cleaned up the data, with the result being a Dataframe called X_train. The Dataframe X_train has 1058 rows and 13 columns. It has the below columns (picture attached on how it looks like):
'LotFrontage','LotArea','BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF','TotalBsmtSF','1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea','TotRmsAbvGrd','GarageArea','OpenPorchSF'
Answer the following questions:
1. Transform X_train using PCA. Assign the output to a variable X_train_pca.
2. What is wrong with above approach? Scale the data, then repeat the above.
arrow_forward
Load & check the data:1. Load the data into a pandas dataframe named data_firstname where first name is you name.6. Using Pandas, Matplotlib, seaborn (you can use any or a mix) generate 3-5 plots and add themto your written response explaining what are the key insights and findings from the plots.7. Separate the features from the class.8. Split your data into train 80% train and 20% test, use the last two digits of your student numberfor the seed.Build Classification ModelsSupport vector machine classifier with linear kernel
breast cancer problem : I have already answered 1 to 3. Please provide solution from 4,5,6,7.
Programming language python
arrow_forward
Using Pandas library in python - Calculate student grades project
Pandas is a data analysis library built in Python. Pandas can be used in a Python script, a Jupyter Notebook, or even as part of a web application. In this pandas project, you’re going to create a Python script that loads grade data of 5 to 10 students (a .csv file) and calculates their letter grades in the course. The CSV file contains 5 column students' names, score in-class participation (5% of final grade), score in assignments (20% of final grade), score in discussions (20% of final grade), score in the mid term (20% of final grade), score in final (25% of final grade). Create the .csv file as part of the project submission
Program Output
This will happen when the program runs
Enter the CSV file
Student 1 named Brian Tomas has a letter grade of B+
Student 2 named Tom Blank has a letter grade of C
Student 3 named Margo True has a letter grade of A
Student 4 named David Atkin has a letter grade of B+
Student 5 named…
arrow_forward
You will use the wine for this task (dataset: sklearn.datasets.load_wine()), perform NN and NearestCentroid on the dataset to perform classification. Select different parameter values (i.e., parameter tuning) and discuss the influence. Finally, report result of comparisons. The dataset can be found here https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html
arrow_forward
google colab [jupyter notebook]
Amazon Musical Instrument Reviews
General Readme on Projects
Web commerce sites get a substantial amount of feedback from reviews users post on various websites. It is not practical to go through all this information by hand to determine whether a user liked a particular product or not.
For our project we are going to use a dataset of Amazon Musical Instrument Reviews. The main reason I selected this dataset is that it is significantly smaller than the Amazon review datasets for movies, music, and books. This dataset has a bit over 221,000 reviews. The columns in the dataset are
name
description
verified
whether the reviewer bought the product from Amazon or not
reviewTime
time of the review
reviewerID
ID of the reviewer, e.g. A2SUAM1J3GNN3B
asin
ID of the product
reviewerName
name of the reviewer
reviewText
the text of the body of the review
summary
the test of the heading of the review
unixReviewTime
time of the review (Unix time)…
arrow_forward
Question 22Explain the procedure you will take to automatically choose a good value for the regularization parameter using different splits on a given dataset.
arrow_forward
Sort results by reverse domain: Create a data category. Domain that depicts domain names, including a suitable compareTo() function where the natural order is in reverse domain name order. For example, cs.princeton.edu's mirror name is edu.princeton.cs. This is helpful for analysing site logs. Use s.split(".") to divide the string s into pieces separated by underscores. Create a client that takes normal input and displays the reverse domains in sorted order.
arrow_forward
Representation (Metadata) Assignment
- Foundations of Data and Information
§ This is a 2-part assignment.
Part 1: Dublin Core Metadata Record
Working the DCMI Metadata Terms http://www.dublincore.org/documents/dcmi-terms/.§ Create one “full” metadata record for an article, a webpage, a photograph, a book, OR data set
representing a topic about our natural environment. For example, the object you select may
represent ecology, botany, a natural disaster, weather data, etc.
§ Please work with the following encoding scheme:
Guidelines for implementing Dublin Core inTM XML https://www.dublincore.org/specifications/dublin-core/dc-xml-guidelines/
For your metadata record (a representation):
§ Use as many of the 15 DC (Dublin Core) elements (properties) as you can from the DCMES,
version 1.1 when creating you’re your representation
§ You may also include additional elements (full level properties) from the DCTERMS namespace.
§ Show that you understand element repeatability.
§…
arrow_forward
Reverse address sort: Develop a data structure. A domain that depicts domain names and has a suitable compareTo() function with the reverse domain name's natural order as the comparison criteria. For instance, cs.princeton.edu's mirror name is edu.princeton.cs. Analysis of online logs can benefit from this. Use s.split(".") to divide the string s into pieces that are separated by spaces. Create a client that displays the reverse domains in ordered order after reading domain names from standard input.
arrow_forward
Computer Science
NODEJS
How do I read specific data from csv file (for example extract data only from Canada and United States) and put that into a variable and then make a txt file and use that variable to put that data into txt file? Here is what I have so far below, I'm struggling with the ////grab data for canada section. Thanks for helping.
const csv = require('csv-parser');
const fs = require('fs');
const inputs = [];
//use csv parser
fs.createReadStream('input_countries.csv')
.pipe(csv())
.on('data', (row) => {
inputs.push(row);
})
.on('end', () => {
console.log('CSV file successfully processed');
});
console.log("Deleting canada.txt file if it exists");
fs.unlink('canada.txt', function (err) {
if (err) {
return console.error(err);
}
console.log("canada.txt deleted sucessfully")
});
console.log("Deleting usa.txt file if it exists");
fs.unlink('usa.txt', function (err) {
if (err) {
return console.error(err);
}
console.log("usa.txt deleted sucessfully")
});
const header…
arrow_forward
Q1
Consider X to be a 100-by-100 matrix. Which of the following commands will extract elements common to every 2nd row (starting from the 2nd row) and every 3rd column (starting from the 1stcolumn)?
Select one:
a.
X(2:2:end, 1:3:end)
b.
X[1:2:end, 1:3:end]
c.
X([2:, 3:])
d.
X([2:2:end, 1:3:end])
e.
X(1:2:99, 1:3:96)
Q2
Which of the following syntax performs the forward elimination process in Gaussian elimination? Here r represents the row index, c is the column index and n represents the number of rows in the augmented matrix.
Select one:
a.
Aug(r,:) = Aug(r,:) - factor*Aug(c,:)
b.
Aug(r,c) = Aug(r,c) - factor*Aug(c,:)
c.
Aug(r,[c n+1]) = Aug(r,[c n+1]) - factor*Aug(r,[c n+1])
d.
Aug(r,n+1) = Aug(r,n+1) - factor*Aug(c,:)
e.
Aug(r,c) = Aug(r,:) - factor*Aug(c,c)
Clear my choice
Q3
Which syntax will solve for the differential equation using the in-built function ode45() with a time interval of 0 to 10 and an initial condition of 0?…
arrow_forward
Exactly what does it mean when someone refers to the Dataset object?
arrow_forward
Design a direct file organization using a hash function, to store an item file with item number as its primary key. The primary keys of a sample set of records of the item file are listed below. Assume that the buckets can hold two records each and the blocks in the primary storage area can accommodate a maximum of four records each. Make use of the hash function h(k) = k mod 8, where k represents the numerical value of the primary key (item number).369 760 692 871 659 975 981 115 620 208 821 111 554 781 181 965
don't copy bartleby old answer its wrong
arrow_forward
Design a direct file organization using a hash function, to store an item file with item number as its primary key. The primary keys of a sample set of records of the item file are listed below. Assume that the buckets can hold two records each and the blocks in the primary storage area can accommodate a maximum of four records each. Make use of the hash function h(k) = k mod 8, where k represents the numerical value of the primary key (item number).369 760 692 871 659 975 981 115 620 208 821 111 554 781 181 965
arrow_forward
The data in flat files has been provided:
INVOICE TABLE
INVOICE_NUM
CUSTOMER_ID
INVOICE_DATE
EMPLOYEE_ID
COIN_ID
DELIVERY_ID
8111
11011
15 May 2021
emp103
7111
511
8112
11013
15 May 2021
emp101
7116
512
8113
11012
17 May 2021
emp101
7112
513
8114
11015
17 May 2021
emp102
7111
514
8115
11011
17 May 2021
emp102
7115
515
8116
11015
18 May 2021
emp103
7115
516
8117
11012
19 May 2021
emp105
7112
517
8118
11013
19 May 2021
emp105
7112
517
COIN_RETURNS TABLE
RETURN_ID
RETURN_DATE
REASON
CUSTOMER_ID
COIN_ID
EMPLOYEE_ID
ret001
25 May 2021
Customer not satisfied with product
11011
7116
emp101
ret002
25 May 2021
Product missing part
11013
7114
emp103
COIN TABLE
COIN_ID
PRODUCT
PRICE
QTY
7111
1oz Gold Kruger Rand
R 5 999
10
7112
1oz Silver Kruger Rand
R 12 999
8
7113
Gold Big 5 Uncirculated
R 15 999
8
7114
Silver Big 5 Pack
R 7 999
5
7115
1oz Gold Palaeontology
R 11 999
15
7116
1oz Silver Palaeontology
R 7 999
12
COIN_DELIVERY TABLE…
arrow_forward
Cource : Data structure (please read the following bolded words !!!!!!)
Write the code in C++ and provide a link for CPP File Please don't send a picture of the code, send the code as a text !!
Write the code in C++ not C !!
Problem: Create an employee Record Management system using linked listthat can perform the following operations:• Insert employee record• Delete employee record• Update employee record• Show employee• Search employee• Update salary
The employee record should contain the following items• Name of Employee• ID of Employee• First day of work• Phone number of the employee• Address of the employee• Work hours• Salary
Approach:With the basic knowledge of operations on Linked Lists like insertion, deletion of elementsin the Linked list, the employee record management system can be created. Below are thefunctionalities explained that are to be implemented:Check Record: It is a utility function of creating a record it checks before insertionthat the Record Already Exist or…
arrow_forward
convert UNF for 1NF to 2NF to 3NF, progressively.
UNF:Order(OrderID, OrderDate, CustID, CustName, CustPhone, CCNum, CCExpDate, CCBank, BnkContName, BnkContPhone, CustEmail, OrderIP, SiteRefFrom, ShipStreet, ShipCity, ShipSt, ShipZip, OrderLineNum, ItemID, ItemName, ItemDesc, ItemQtyOrdered,ItemListPrice, ItemSalePrice, ItemQtyShip, ShipCharge, Tax, TotalDue)
arrow_forward
Display a function that can rank the dataset based on subjectivity on Twitter scraping
arrow_forward
TODO 11
Using the clean_df split our columns into features and labels.
Index/slice our label 'area' and store the output into the variable y.
Index/slice all other features EXCEPT 'area' into the variable X. To do so you can use the Pandas DataFrame drop() method or slicing with iloc, loc or [ ].
# TODO 11.1y = display(y)
todo_check([ (y.shape == (517,), 'y does not have the correct shape of (517,)'), (np.all(np.isclose(y.values[-5:], np.array([2.00687085, 4.01259206, 2.49815188, 0. , 0. ]),rtol=.01)),'y has the incorrect values'),])
# TODO 11.2X = display(X)
todo_check([ (X.shape == (517, 29), 'X does not have the correct shape of (517, 29)! Make sure the `area` column is not included!'), (np.all(np.isclose(X.values[-5:, -4], np.array([27.8, 21.9, 21.2, 25.6, 11.8]),rtol=.01)),'X has the incorrect values'),])
arrow_forward
SEE MORE QUESTIONS
Recommended textbooks for you
Database System Concepts
Computer Science
ISBN:9780078022159
Author:Abraham Silberschatz Professor, Henry F. Korth, S. Sudarshan
Publisher:McGraw-Hill Education
Starting Out with Python (4th Edition)
Computer Science
ISBN:9780134444321
Author:Tony Gaddis
Publisher:PEARSON
Digital Fundamentals (11th Edition)
Computer Science
ISBN:9780132737968
Author:Thomas L. Floyd
Publisher:PEARSON
C How to Program (8th Edition)
Computer Science
ISBN:9780133976892
Author:Paul J. Deitel, Harvey Deitel
Publisher:PEARSON
Database Systems: Design, Implementation, & Manag...
Computer Science
ISBN:9781337627900
Author:Carlos Coronel, Steven Morris
Publisher:Cengage Learning
Programmable Logic Controllers
Computer Science
ISBN:9780073373843
Author:Frank D. Petruzella
Publisher:McGraw-Hill Education
Related Questions
- 16-7. Automated Title: In eq_world_map_3.py, we specified the title manually when defining my_layout, which means we have to remember to update the title every time the source file changes. Instead, you can use the title for the data set in the metadata part of the JSON file. Pull this value, assign it to a variable, and use this for the title of the map when you’re defining my_layout.arrow_forwardV5. min jee wants to combine the sales data from each of the art fairs. switch to the combined sales worksheet and then update the worksheet as follows. in cell A5 enter a formula without using a function that references cell a5 in the Madison worksheet. copy the formula from cell a5 to the range a6:a8 without copying the formatting. in cell b5, enter a formula using the sum function, 3d references, grouped worksheets that totals the values from cell b:5 in the chicago: madison worksheets.arrow_forwardnycflights13::flights Q4<- flights %>%filter(carrier == "JFK") %>%summarise(average_dist = mean(distance)%>%summarise(max_dist = max(average_dist))%>%group_by(Q4, month, day)%>%head(Q4[order(dat$month),(dat$day),(dat$max_dist)],n=5) *not pictured, I did upload, tidy verse, dplyr, and the nycflights13 libraries. Error: Incomplete expression: Q4<- flights %>% filter(carrier == "JFK") %>% summarise(average_dist = mean(distance)%>% summarise(max_dist = max(average_dist))%>% group_by(Q4, month, day)%>% head(Q4[order(dat$month),(dat$day),(dat$max_dist)],n=5) RStudio Question: I can't figure out what is wrong with my code. Could someone take a look at it? See below for what I am trying to do, my code and my error output. I am trying to find what 5 days of the year had the highest mean distance from JFK airport, using the nycflights13 library. I want to format it in month, day, andvmean distance.arrow_forward
- dataframe = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/regression_sprint/titanic_train_raw.csv') df = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/regression_sprint/titanic_test_raw.csv') Write a function that takes in as input a dataframe and a column name, and returns the mean for numerical columns and the mode for non-numerical columns. Function Specifications: The function should take two inputs: (df, column_name), where df is a pandas DataFrame, column_name is a str. If the column_name does not exist in df, raise a ValueError. Should return as output the mean if the specified column is numerical and return a list of the mode(s) otherwise. The mean should be rounded to 2 decimal places. If there is more than one mode for a given non-numerical column, the fuction should return a list of all modes. def calc_mean_mode(df, column_name): # your code here return calc_mean_mode(df,'Age') Expected Outputs:…arrow_forward3. Given the XML document NKU_Programs.xml file, a function that prints the contents ofthe Program elements in a table as shown below.Program Name College Name Student Count Year Started Type*************************************************************************************Data Science Informatics 100 2015 UComputer Science Informatics 400 1990 UBiology Science 300 2000 G HERE IS THE NKU_Programs.xml FILE <NKUPRograms><Program studCount="100" type="Undergraduate"><Name>Data Science</Name><YearStarted>2015</YearStarted><CollegeName>Informatics</CollegeName></Program><Program studCount="400" type="Undergraduate"><Name>Computer Science</Name><YearStarted>1990</YearStarted><CollegeName>Informatics</CollegeName></Program><Program studCount="300"…arrow_forward1)Consider the following DTD. 
All unspecified elements are #PCDATA.<!DOCTYPE bibliography [<!ELEMENT book (title,author+,year,publisher,place?)><!ELEMENT article (title,author+,journal,year,number,volume,pages?)><!ELEMENT author (last_name, first_name)>...]>Write the following queries in XQuery. Assume Last names of authors are unique.(a) List all books authored by Ullman.(b) List articles published in Vol. 32 No. 2 of ACM Transactions on Database Systems.(c) List journals that published an article on XML in 2001 (that is, “XML” appears in the title of thearticle).(d) List authors that have published a book in 2001 and another book in 2003.(e) List publishers, and for each publisher list the books they have published as subelements. Chooseappropriate tags for your output.(f) List every author, and for each author list the number of articles he/she published in 2001. Chooseappropriate tags for your output.arrow_forward
1) Consider the following DTD. All unspecified elements are #PCDATA.

<!DOCTYPE bibliography [
<!ELEMENT book (title, author+, year, publisher, place?)>
<!ELEMENT article (title, author+, journal, year, number, volume, pages?)>
<!ELEMENT author (last_name, first_name)>
...
]>

Write the following queries in XQuery. Assume last names of authors are unique.
(a) List all books authored by Ullman.
(b) List articles published in Vol. 32, No. 2 of ACM Transactions on Database Systems.
(c) List journals that published an article on XML in 2001 (that is, "XML" appears in the title of the article).
(d) List authors that have published a book in 2001 and another book in 2003.
(e) List publishers, and for each publisher list the books they have published as subelements. Choose appropriate tags for your output.
(f) List every author, and for each author list the number of articles he/she published in 2001. Choose appropriate tags for your output.
Answer parts (d), (e), and (f).

Exercise 6 (Operations on DataFrame)
Step a: Create two dataframes df1 and df2 as follows:
import numpy as np
import pandas as pd
rng = np.random.RandomState(100)
df1 = pd.DataFrame(rng.randint(0, 100, (4, 3)), columns=['A', 'B', 'C'])
df2 = pd.DataFrame(rng.randint(0, 100, (3, 4)), columns=['A', 'B', 'C', 'D'])
Step b: Create a new dataframe df which is the summation of df1 and df2.
Step c: Subtract half of column 'C' in df1 from all columns of df. (Remark: the values in df should be updated.)
Step d: Replace the NaN values in df with 10. (Remark: the values in df should be updated.)
Step e: Use df.apply() to calculate the summation of the numbers in each row of df, and show the result. (Remark: the result should be a vector of four values.)

get_total_cases() takes a 2D list (similar to a database) and an integer x from the set {0, 1, 2} as input parameters. Here, 0 represents Case_Reported_Date, 1 represents Age_Group, and 2 represents Client_Gender (these are the fields on the header row; the integer value is the index of each field on that row). The function computes the total number of reported cases for each instance of x in the text file and stores this information in a dictionary of the form {an_instance_of_x : total_case}. Finally, it returns the dictionary and the total number of all reported cases saved in this dictionary.
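Steps b through e of the DataFrame exercise above could be sketched as follows; this is one possible solution, not a definitive one (note that df1 + df2 aligns on both labels, so row 3 and column 'D' of the sum start out as NaN).

```python
import numpy as np
import pandas as pd

# Step a: the two dataframes exactly as specified in the exercise
rng = np.random.RandomState(100)
df1 = pd.DataFrame(rng.randint(0, 100, (4, 3)), columns=['A', 'B', 'C'])
df2 = pd.DataFrame(rng.randint(0, 100, (3, 4)), columns=['A', 'B', 'C', 'D'])

# Step b: element-wise sum; unmatched rows/columns become NaN
df = df1 + df2

# Step c: subtract half of df1's column 'C' from every column of df, row-wise
df = df.sub(df1['C'] / 2, axis=0)

# Step d: replace the remaining NaN values with 10
df = df.fillna(10)

# Step e: apply a sum over each row -> a Series of four values
row_sums = df.apply(np.sum, axis=1)
```

Using `axis=0` in `df.sub()` aligns the subtracted Series on the row index, which is what "subtract from all columns" requires here.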
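A minimal sketch of get_total_cases() matching the description above, assuming the 2D list's first row is the header (the field values themselves are illustrative):

```python
def get_total_cases(data, x):
    """Count reported cases per distinct value of column x (0, 1, or 2).

    data is a 2D list whose first row is the header:
    [Case_Reported_Date, Age_Group, Client_Gender].
    Returns (counts, total): counts maps each instance of column x to its
    number of reported cases, total is the sum of all counted cases.
    """
    counts = {}
    for row in data[1:]:                      # skip the header row
        key = row[x]
        counts[key] = counts.get(key, 0) + 1  # one reported case per row
    return counts, sum(counts.values())
```

For example, with x = 1 (Age_Group) each dictionary key is an age group and each value is how many case rows carry that group.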
Consider the given iris data set. This dataset contains 3 classes of 15 instances each, and each class refers to a type of iris plant. The dataset has two features: sepal length and sepal width. The third column, Species, holds the type of plant. A new plant has been identified; you have to classify the Species of the newly identified plant with the help of the KNN algorithm. Note: before applying KNN, modify the given data by adding the last two digits of your registration number. For example, if the last two digits of your registration number are 23, the first row in the given table becomes 28.3 and 26.3.

Question 4 (use sku.csv and WarehouseLocations.csv)
def warehouse_stats(sku):
- Read sku.csv with CSV and create a dictionary of the New SKU statistics.
- The New SKU should be the key, with the corresponding value being an inner dictionary containing the following statistics:
  - 350 Loc: True if not 0
  - Warehouse Qty
  - Forcasted Qty
  - Items/Day: CuFt/Day divided by Item Cube, as a float rounded to 5 decimal places.
- In your warehouse dictionary, add an inner dictionary with key Totals which contains:
  - Total Qty in Warehouse as key "Qty": do not add to Totals if '350 Loc' is not a valid location.
  - Number of valid 350 Loc as key "Valid"
Data cleaning steps:
- In some variants of New Sku #, the Item Cube and CuFt/Day are faulty. Fix the manufacturer's mistake: if either is less than or equal to 0, Item Cube can be assumed to be 5.0 and CuFt/Day is 10% of the Forcasted Qty of the New…

In R, please provide the code and an explanation for the following one-way ANOVA with the coagulation data set.
a. Load the coagulation data set (it is from the faraway library).
b. Data summaries & assumption check:
   i. Use the names() function to identify the column names.
   ii. How many rows of data are there?
   iii. Create a single graph with 4 boxplots on the same scale, one for the coagulation values of each diet. Each boxplot should be a different color. Use the plot function for this task. colNameData is the name of the column in the data set that you want to create a boxplot for; colNameCategory is the name of the column used to split the data into groups (one boxplot for each category/value in that column); dataset is the name of the full dataset; col=2:4 gives 3 different colors (color indices are 1–8, then they repeat).
   iv. Create 4 different data frames, one for the data corresponding to each of the 4 factors. How many observations are there for…
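The KNN classification asked for above can be sketched in plain Python. The iris table itself is not reproduced here, so the training rows below are hypothetical stand-in values; the distance-then-majority-vote logic is the actual KNN technique.

```python
import math

def knn_predict(train, query, k=3):
    """Classify query = (sepal_length, sepal_width) by majority vote
    among the k nearest rows of train = [(sepal_length, sepal_width, species), ...]."""
    # Euclidean distance from the query to every training row
    neighbours = sorted((math.dist(row[:2], query), row[2]) for row in train)
    votes = [species for _, species in neighbours[:k]]
    return max(set(votes), key=votes.count)   # majority vote among the k nearest

# Hypothetical sample rows standing in for the (not shown) iris table:
train = [
    (5.1, 3.5, "setosa"), (4.9, 3.0, "setosa"),
    (7.0, 3.2, "versicolor"), (6.4, 3.2, "versicolor"),
    (6.3, 3.3, "virginica"),
]
label = knn_predict(train, (5.0, 3.4), k=3)   # -> "setosa"
```

Per the question's note, you would first add your registration-number digits to every feature value before computing distances; that shift applies equally to the training rows and the query.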