
This project aims to predict the sale prices of houses based on various features. The training dataset used for this project contains 81 columns: an Id column, the target variable (SalePrice), and 79 descriptive features such as building class, zoning classification, lot size, neighborhood, overall quality, number of rooms, and many more.
The main objective of this project is to develop a predictive model that can accurately estimate the sale price of a house given its characteristics. By analyzing the provided dataset and applying machine learning techniques, we will train a model to make predictions on unseen data.
Data Loading and Exploration: The dataset will be loaded into a pandas DataFrame, and we will perform initial exploratory data analysis (EDA) to understand the structure, content, and statistical properties of the data.
Data Preprocessing: This step involves handling missing values, encoding categorical variables, and performing any necessary data transformations to prepare the dataset for model training.
Feature Selection and Engineering: We will analyze the relevance and importance of the available features and perform feature selection or engineering if required to improve model performance.
Model Training and Evaluation: Several regression algorithms, such as Linear Regression, Random Forest Regression, or others, will be trained on the training set and evaluated using appropriate metrics such as mean squared error (MSE).
Model Optimization: We will explore techniques for hyperparameter tuning and model optimization to enhance the predictive performance of our chosen algorithm.
Final Prediction and Interpretation: Once the model is optimized, we will make predictions on the test set and evaluate its performance. We will also interpret the results and gain insights into which features contribute most to the predicted house prices.
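As a preview of the training-and-evaluation step, here is a minimal sketch, assuming a numeric feature matrix X and target vector y have already been prepared (those variable names are placeholders, not code from the project):
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# X (numeric features) and y (SalePrice) are assumed to be prepared already
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit two baseline regressors and compare their validation MSE
for model in [LinearRegression(), RandomForestRegressor(random_state=42)]:
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_val, model.predict(X_val))
    print(f"{type(model).__name__}: MSE = {mse:,.0f}")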
The project will be implemented using the Python programming language and various libraries commonly used for data analysis and machine learning tasks, including NumPy, pandas, Matplotlib, Seaborn, Plotly, ipywidgets, scikit-learn, and XGBoost.
The dataset for this project is sourced from Kaggle. The training set consists of 1460 rows and 81 columns, while the test set has 1459 rows and 80 columns (it omits SalePrice). The provided features cover various aspects of the houses, including physical attributes, location, quality, and additional features. The target variable is the sale price of each house.
By the end of this project, we aim to develop a reliable model that accurately predicts house prices based on the given features. This prediction model can be useful for real estate agents, buyers, and sellers to estimate the fair market value of houses and make informed decisions.
Let's get started with the project and predict house prices!
Here are the names of the columns in the dataset along with their corresponding meanings:
MSSubClass: Identifies the type of dwelling involved in the sale.
- 20: 1-STORY 1946 & NEWER ALL STYLES
- 30: 1-STORY 1945 & OLDER
- 40: 1-STORY W/FINISHED ATTIC ALL AGES
- 45: 1-1/2 STORY - UNFINISHED ALL AGES
- 50: 1-1/2 STORY FINISHED ALL AGES
- 60: 2-STORY 1946 & NEWER
- 70: 2-STORY 1945 & OLDER
- 75: 2-1/2 STORY ALL AGES
- 80: SPLIT OR MULTI-LEVEL
- 85: SPLIT FOYER
- 90: DUPLEX - ALL STYLES AND AGES
- 120: 1-STORY PUD (Planned Unit Development) - 1946 & NEWER
- 150: 1-1/2 STORY PUD - ALL AGES
- 160: 2-STORY PUD - 1946 & NEWER
- 180: PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
- 190: 2 FAMILY CONVERSION - ALL STYLES AND AGES
MSZoning: Identifies the general zoning classification of the sale.
- A: Agriculture
- C: Commercial
- FV: Floating Village Residential
- I: Industrial
- RH: Residential High Density
- RL: Residential Low Density
- RP: Residential Low Density Park
- RM: Residential Medium Density
LotFrontage: Linear feet of street connected to the property.
LotArea: Lot size in square feet.
Street: Type of road access to the property.
- Grvl: Gravel
- Pave: Paved
Alley: Type of alley access to the property.
- Grvl: Gravel
- Pave: Paved
- NA: No alley access
LotShape: General shape of the property.
- Reg: Regular
- IR1: Slightly irregular
- IR2: Moderately Irregular
- IR3: Irregular
LandContour: Flatness of the property.
- Lvl: Near Flat/Level
- Bnk: Banked - Quick and significant rise from street grade to building
- HLS: Hillside - Significant slope from side to side
- Low: Depression
Utilities: Type of utilities available.
- AllPub: All public Utilities (E,G,W,& S)
- NoSewr: Electricity, Gas, and Water (Septic Tank)
- NoSeWa: Electricity and Gas Only
- ELO: Electricity only
LotConfig: Lot configuration.
- Inside: Inside lot
- Corner: Corner lot
- CulDSac: Cul-de-sac
- FR2: Frontage on 2 sides of property
- FR3: Frontage on 3 sides of property
LandSlope: Slope of the property.
- Gtl: Gentle slope
- Mod: Moderate Slope
- Sev: Severe Slope
Neighborhood: Physical locations within Ames city limits.
- Blmngtn: Bloomington Heights
- Blueste: Bluestem
- BrDale: Briardale
- BrkSide: Brookside
- ClearCr: Clear Creek
- CollgCr: College Creek
- Crawfor: Crawford
- Edwards: Edwards
- Gilbert: Gilbert
- IDOTRR: Iowa DOT and Rail Road
- MeadowV: Meadow Village
- Mitchel: Mitchell
- NAmes: North Ames
- NoRidge: Northridge
- NPkVill: Northpark Villa
- NridgHt: Northridge Heights
- NWAmes: Northwest Ames
- OldTown: Old Town
- SWISU: South & West of Iowa State University
- Sawyer: Sawyer
- SawyerW: Sawyer West
- Somerst: Somerset
- StoneBr: Stone Brook
- Timber: Timberland
- Veenker: Veenker
Condition1: Proximity to various conditions.
- Artery: Adjacent to arterial street
- Feedr: Adjacent to feeder street
- Norm: Normal
- RRNn: Within 200' of North-South Railroad
- RRAn: Adjacent to North-South Railroad
- PosN: Near positive off-site feature--park, greenbelt, etc.
- PosA: Adjacent to positive off-site feature
- RRNe: Within 200' of East-West Railroad
- RRAe: Adjacent to East-West Railroad
Condition2: Proximity to various conditions (if more than one is present).
- Artery: Adjacent to arterial street
- Feedr: Adjacent to feeder street
- Norm: Normal
- RRNn: Within 200' of North-South Railroad
- RRAn: Adjacent to North-South Railroad
- PosN: Near positive off-site feature--park, greenbelt, etc.
- PosA: Adjacent to positive off-site feature
- RRNe: Within 200' of East-West Railroad
- RRAe: Adjacent to East-West Railroad
BldgType: Type of dwelling.
- 1Fam: Single-family Detached
- 2FmCon: Two-family Conversion; originally built as one-family dwelling
- Duplx: Duplex
- TwnhsE: Townhouse End Unit
- TwnhsI: Townhouse Inside Unit
HouseStyle: Style of dwelling.
- 1Story: One story
- 1.5Fin: One and one-half story: 2nd level finished
- 1.5Unf: One and one-half story: 2nd level unfinished
- 2Story: Two story
- 2.5Fin: Two and one-half story: 2nd level finished
- 2.5Unf: Two and one-half story: 2nd level unfinished
- SFoyer: Split Foyer
- SLvl: Split Level
OverallQual: Rates the overall material and finish of the house.
- 10: Very Excellent
- 9: Excellent
- 8: Very Good
- 7: Good
- 6: Above Average
- 5: Average
- 4: Below Average
- 3: Fair
- 2: Poor
- 1: Very Poor
OverallCond: Rates the overall condition of the house.
- 10: Very Excellent
- 9: Excellent
- 8: Very Good
- 7: Good
- 6: Above Average
- 5: Average
- 4: Below Average
- 3: Fair
- 2: Poor
- 1: Very Poor
YearBuilt: Original construction date.
YearRemodAdd: Remodel date (same as construction date if no remodeling or additions).
RoofStyle: Type of roof.
- Flat: Flat
- Gable: Gable
- Gambrel: Gambrel (Barn)
- Hip: Hip
- Mansard: Mansard
- Shed: Shed
RoofMatl: Roof material.
- ClyTile: Clay or Tile
- CompShg: Standard (Composite) Shingle
- Membran: Membrane
- Metal: Metal
- Roll: Roll
- Tar&Grv: Gravel & Tar
- WdShake: Wood Shakes
- WdShngl: Wood Shingles
Exterior1st: Exterior covering on the house.
- AsbShng: Asbestos Shingles
- AsphShn: Asphalt Shingles
- BrkComm: Brick Common
- BrkFace: Brick Face
- CBlock: Cinder Block
- CemntBd: Cement Board
- HdBoard: Hard Board
- ImStucc: Imitation Stucco
- MetalSd: Metal Siding
- Other: Other
- Plywood: Plywood
- PreCast: PreCast
- Stone: Stone
- Stucco: Stucco
- VinylSd: Vinyl Siding
- Wd Sdng: Wood Siding
- WdShing: Wood Shingles
Exterior2nd: Exterior covering on the house (if more than one material).
- AsbShng: Asbestos Shingles
- AsphShn: Asphalt Shingles
- BrkComm: Brick Common
- BrkFace: Brick Face
- CBlock: Cinder Block
- CemntBd: Cement Board
- HdBoard: Hard Board
- ImStucc: Imitation Stucco
- MetalSd: Metal Siding
- Other: Other
- Plywood: Plywood
- PreCast: PreCast
- Stone: Stone
- Stucco: Stucco
- VinylSd: Vinyl Siding
- Wd Sdng: Wood Siding
- WdShing: Wood Shingles
MasVnrType: Masonry veneer type.
- BrkCmn: Brick Common
- BrkFace: Brick Face
- CBlock: Cinder Block
- None: None
- Stone: Stone
MasVnrArea: Masonry veneer area in square feet.
ExterQual: Evaluates the quality of the material on the exterior.
- Ex: Excellent
- Gd: Good
- TA: Average/Typical
- Fa: Fair
- Po: Poor
ExterCond: Evaluates the present condition of the material on the exterior.
- Ex: Excellent
- Gd: Good
- TA: Average/Typical
- Fa: Fair
- Po: Poor
Foundation: Type of foundation.
- BrkTil: Brick & Tile
- CBlock: Cinder Block
- PConc: Poured Concrete
- Slab: Slab
- Stone: Stone
- Wood: Wood
BsmtQual: Evaluates the height of the basement.
- Ex: Excellent (100+ inches)
- Gd: Good (90-99 inches)
- TA: Typical (80-89 inches)
- Fa: Fair (70-79 inches)
- Po: Poor (<70 inches)
- NA: No Basement
BsmtCond: Evaluates the general condition of the basement.
- Ex: Excellent
- Gd: Good
- TA: Typical - slight dampness allowed
- Fa: Fair - dampness or some cracking or settling
- Po: Poor - Severe cracking, settling, or wetness
- NA: No Basement
BsmtExposure: Refers to walkout or garden level walls.
- Gd: Good Exposure
- Av: Average Exposure (split levels or foyers typically score average or above)
- Mn: Minimum Exposure
- No: No Exposure
- NA: No Basement
BsmtFinType1: Rating of basement finished area.
- GLQ: Good Living Quarters
- ALQ: Average Living Quarters
- BLQ: Below Average Living Quarters
- Rec: Average Rec Room
- LwQ: Low Quality
- Unf: Unfinished
- NA: No Basement
BsmtFinSF1: Type 1 finished square feet.
BsmtFinType2: Rating of basement finished area (if multiple types).
- GLQ: Good Living Quarters
- ALQ: Average Living Quarters
- BLQ: Below Average Living Quarters
- Rec: Average Rec Room
- LwQ: Low Quality
- Unf: Unfinished
- NA: No Basement
BsmtFinSF2: Type 2 finished square feet.
BsmtUnfSF: Unfinished square feet of basement area.
TotalBsmtSF: Total square feet of basement area.
Heating: Type of heating.
- Floor: Floor Furnace
- GasA: Gas forced warm air furnace
- GasW: Gas hot water or steam heat
- Grav: Gravity furnace
- OthW: Hot water or steam heat other than gas
- Wall: Wall furnace
HeatingQC: Heating quality and condition.
- Ex: Excellent
- Gd: Good
- TA: Average/Typical
- Fa: Fair
- Po: Poor
CentralAir: Central air conditioning.
- N: No
- Y: Yes
Electrical: Electrical system.
- SBrkr: Standard Circuit Breakers & Romex
- FuseA: Fuse Box over 60 AMP and all Romex wiring (Average)
- FuseF: 60 AMP Fuse Box and mostly Romex wiring (Fair)
- FuseP: 60 AMP Fuse Box and mostly knob & tube wiring (poor)
- Mix: Mixed
1stFlrSF: First floor square feet.
2ndFlrSF: Second floor square feet.
LowQualFinSF: Low quality finished square feet (all floors).
GrLivArea: Above grade (ground) living area square feet.
BsmtFullBath: Basement full bathrooms.
BsmtHalfBath: Basement half bathrooms.
FullBath: Full bathrooms above grade.
HalfBath: Half baths above grade.
BedroomAbvGr: Bedrooms above grade (does NOT include basement bedrooms).
KitchenAbvGr: Kitchens above grade.
KitchenQual: Kitchen quality.
- Ex: Excellent
- Gd: Good
- TA: Typical/Average
- Fa: Fair
- Po: Poor
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms).
Functional: Home functionality (Assume typical unless deductions are warranted).
- Typ: Typical Functionality
- Min1: Minor Deductions 1
- Min2: Minor Deductions 2
- Mod: Moderate Deductions
- Maj1: Major Deductions 1
- Maj2: Major Deductions 2
- Sev: Severely Damaged
- Sal: Salvage only
Fireplaces: Number of fireplaces.
FireplaceQu: Fireplace quality.
- Ex: Excellent - Exceptional Masonry Fireplace
- Gd: Good - Masonry Fireplace in the main level
- TA: Average - Prefabricated Fireplace in the main living area or Masonry Fireplace in the basement
- Fa: Fair - Prefabricated Fireplace in the basement
- Po: Poor - Ben Franklin Stove
- NA: No Fireplace
GarageType: Garage location.
- 2Types: More than one type of garage
- Attchd: Attached to home
- Basment: Basement Garage
- BuiltIn: Built-In (Garage part of the house - typically has a room above the garage)
- CarPort: Car Port
- Detchd: Detached from home
- NA: No Garage
GarageYrBlt: Year the garage was built.
GarageFinish: Interior finish of the garage.
- Fin: Finished
- RFn: Rough Finished
- Unf: Unfinished
- NA: No Garage
GarageCars: Size of the garage in car capacity.
GarageArea: Size of the garage in square feet.
GarageQual: Garage quality.
- Ex: Excellent
- Gd: Good
- TA: Typical/Average
- Fa: Fair
- Po: Poor
- NA: No Garage
GarageCond: Garage condition.
- Ex: Excellent
- Gd: Good
- TA: Typical/Average
- Fa: Fair
- Po: Poor
- NA: No Garage
PavedDrive: Paved driveway.
- Y: Paved
- P: Partial Pavement
- N: Dirt/Gravel
WoodDeckSF: Wood deck area in square feet.
OpenPorchSF: Open porch area in square feet.
EnclosedPorch: Enclosed porch area in square feet.
3SsnPorch: Three-season porch area in square feet.
ScreenPorch: Screen porch area in square feet.
PoolArea: Pool area in square feet.
PoolQC: Pool quality.
- Ex: Excellent
- Gd: Good
- TA: Average/Typical
- Fa: Fair
- NA: No Pool
Fence: Fence quality.
- GdPrv: Good Privacy
- MnPrv: Minimum Privacy
- GdWo: Good Wood
- MnWw: Minimum Wood/Wire
- NA: No Fence
MiscFeature: Miscellaneous feature not covered in other categories.
- Elev: Elevator
- Gar2: 2nd Garage (if not described in garage section)
- Othr: Other
- Shed: Shed (over 100 SF)
- TenC: Tennis Court
- NA: No Miscellaneous Feature
MiscVal: Dollar value of miscellaneous feature.
MoSold: Month Sold (MM).
YrSold: Year Sold (YYYY).
SaleType: Type of sale.
- WD: Warranty Deed - Conventional
- CWD: Warranty Deed - Cash
- VWD: Warranty Deed - VA Loan
- New: Home just constructed and sold
- COD: Court Officer Deed/Estate
- Con: Contract 15% Down payment regular terms
- ConLw: Contract Low Down payment and low interest
- ConLI: Contract Low Interest
- ConLD: Contract Low Down
- Oth: Other
SaleCondition: Condition of sale.
- Normal: Normal Sale
- Abnorml: Abnormal Sale - trade, foreclosure, short sale
- AdjLand: Adjoining Land Purchase
- Alloca: Allocation - two linked properties with separate deeds, typically condo with a garage unit
- Family: Sale between family members
- Partial: Home was not completed when last assessed (associated with New Homes)
Install the Kaggle API package by running the command !pip install kaggle in your Python environment.
Go to the Kaggle website (https://www.kaggle.com/) and create an account if you don't have one already.
On Kaggle, go to your Account page and scroll down to the section "API" to download your Kaggle API credentials (a JSON file). This file contains your username and API key.
Upload the JSON file containing your API credentials to your working directory, or any other directory you prefer, on your local machine.
Use the code below to download a specific dataset from Kaggle. First, make sure the package is installed:
!pip install kaggle
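Alternatively, with the credentials file placed at ~/.kaggle/kaggle.json, the official Kaggle CLI can fetch the competition data in one line. A hedged example, assuming the standard house-prices competition slug:
# Download the competition files as a zip archive into the current directory
!kaggle competitions download -c house-prices-advanced-regression-techniques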
To open the Kaggle website in your browser from the notebook:
import webbrowser
import ipywidgets as widgets
from IPython.display import display

def open_kaggle_website(url):
    kaggle_url = "https://www.kaggle.com/"
    full_url = kaggle_url + url.strip("/")
    webbrowser.open_new_tab(full_url)

# Create an input widget for entering the Kaggle URL
url_input = widgets.Text(
    placeholder="",
    description="Kaggle URL:",
    layout=widgets.Layout(width="500px")
)

# Create a button widget for triggering the website opening
open_button = widgets.Button(description="Open Kaggle Website")

def on_button_clicked(b):
    open_kaggle_website(url_input.value)

open_button.on_click(on_button_clicked)

# Display the widgets
display(url_input, open_button)
To display Kaggle inside the notebook:
from IPython.display import IFrame
kaggle = IFrame(src="https://www.kaggle.com", width=1000, height=500)
display(kaggle)
Once your credentials JSON file is in place, you can download the dataset with the requests library:
import json
from pathlib import Path
import requests

# Load your Kaggle API credentials from the JSON file
api_credentials_path = Path("path_to_your_api_credentials.json")
with open(api_credentials_path, 'r') as f:
    api_credentials = json.load(f)

# Set your Kaggle API credentials
username = api_credentials["username"]
api_key = api_credentials["key"]

# Specify the dataset you want to download (replace with the dataset slug)
dataset_slug = "your_dataset_slug"

# Specify the directory where you want to save the downloaded dataset
download_dir = Path("path_to_download_directory")

# Construct the URL to download the dataset
download_url = f"https://www.kaggle.com/{username}/{dataset_slug}/download"

# Set the Kaggle API credentials
kaggle_auth = (username, api_key)

# Download the dataset using requests
response = requests.get(download_url, auth=kaggle_auth)

# Save the downloaded dataset to the specified directory
downloaded_file_path = download_dir / f"{dataset_slug}.zip"
with open(downloaded_file_path, 'wb') as f:
    f.write(response.content)

print("Dataset downloaded and saved:", downloaded_file_path)
The dataset will be loaded into a pandas DataFrame, and we will perform initial exploratory data analysis (EDA) to understand the structure, content, and statistical properties of the data.
import numpy as np # Library for numerical computations
import pandas as pd # Library for data manipulation and analysis
import seaborn as sns # Library for statistical data visualization
import matplotlib.pyplot as plt # Library for creating plots and visualizations
# Import the necessary Plotly libraries
import plotly.graph_objects as go # Low-level interface for creating Plotly plots
import plotly.express as px # Higher-level interface for creating interactive plots
# Import the necessary ipywidgets libraries
import ipywidgets as widgets # Library for creating interactive widgets
from ipywidgets import interact, interact_manual # Functions for creating interactive controls
# to delete warnings
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split, cross_val_score # Functions for splitting data into train and test sets, and performing cross-validation
from sklearn.preprocessing import StandardScaler # Class for standardizing features by removing the mean and scaling to unit variance
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error # Evaluation metrics for regression models
from sklearn.linear_model import LinearRegression # Linear regression model
from sklearn.linear_model import Ridge # Ridge regression model
from sklearn.linear_model import Lasso # Lasso regression model
from sklearn.linear_model import ElasticNet # ElasticNet regression model
from sklearn.ensemble import RandomForestRegressor # Random Forest regression model
from sklearn.svm import SVR # Support Vector Regression model
from xgboost import XGBRegressor # XGBoost regression model
from sklearn.preprocessing import PolynomialFeatures # Class for generating polynomial features
# Import the train and test datasets
df_train = pd.read_csv(r'C:\Users\User\Desktop\GitHub-projects\projects\Data-Dives-Projects-Unleashed\dataSets\course 1\week 1\datasets\House_Price_Prediction\train _price_house.csv')
df_test = pd.read_csv(r'C:\Users\User\Desktop\GitHub-projects\projects\Data-Dives-Projects-Unleashed\dataSets\course 1\week 1\datasets\House_Price_Prediction\test.csv')
df_train.shape
(1460, 81)
df_train.head(10)
| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
| 5 | 6 | 50 | RL | 85.0 | 14115 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | MnPrv | Shed | 700 | 10 | 2009 | WD | Normal | 143000 |
| 6 | 7 | 20 | RL | 75.0 | 10084 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 8 | 2007 | WD | Normal | 307000 |
| 7 | 8 | 60 | RL | NaN | 10382 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | Shed | 350 | 11 | 2009 | WD | Normal | 200000 |
| 8 | 9 | 50 | RM | 51.0 | 6120 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2008 | WD | Abnorml | 129900 |
| 9 | 10 | 190 | RL | 50.0 | 7420 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 1 | 2008 | WD | Normal | 118000 |
10 rows × 81 columns
dtypes_train = df_train.dtypes
dtypes_train.to_csv("dtypes_train.csv")
dtypes_test = df_test.dtypes
dtypes_test.to_csv("dtypes_test.csv")
dtypes_train
dtypes_test
Id int64
MSSubClass int64
MSZoning object
LotFrontage float64
LotArea int64
...
MiscVal int64
MoSold int64
YrSold int64
SaleType object
SaleCondition object
Length: 80, dtype: object
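As a quick sanity check on the column types (the counts should match the info() output below), the dtypes can be tallied in one line:
# Tally the number of columns per dtype
print(df_train.dtypes.value_counts())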
# Check for null values in the train dataset
null_values_train = df_train.isnull()
# Print the number of rows that have NaN values
num_rows_with_nan_train = len(df_train) - len(df_train.dropna())
num_rows_with_nan_train_with_thresh = len(df_train) - len(df_train.dropna(thresh=66))
print(f"The number of rows that have NaN values: {num_rows_with_nan_train}, with thresh = 66: {num_rows_with_nan_train_with_thresh}")
The number of rows that have NaN values: 1460, with thresh = 66: 0
# Check for null values in the test dataset
null_values_test = df_test.isnull()
# Print the number of rows that have NaN values
num_rows_with_nan_test = len(df_test) - len(df_test.dropna())
num_rows_with_nan_test_with_thresh = len(df_test) - len(df_test.dropna(thresh=65))
print(f"The number of rows that have NaN values: {num_rows_with_nan_test}, with thresh = 65 : {num_rows_with_nan_test_with_thresh }")
The number of rows that have NaN values: 1459, with thresh = 65 : 1
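Row counts alone only tell us that nearly every row has at least one NaN; a per-column missing ratio is usually more actionable. A small sketch:
# Fraction of missing values per column, highest first
missing_ratio = df_train.isnull().mean().sort_values(ascending=False)
print(missing_ratio[missing_ratio > 0])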
# Display information about the DataFrame
df_train.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1460 entries, 0 to 1459 Data columns (total 81 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Id 1460 non-null int64 1 MSSubClass 1460 non-null int64 2 MSZoning 1460 non-null object 3 LotFrontage 1201 non-null float64 4 LotArea 1460 non-null int64 5 Street 1460 non-null object 6 Alley 91 non-null object 7 LotShape 1460 non-null object 8 LandContour 1460 non-null object 9 Utilities 1460 non-null object 10 LotConfig 1460 non-null object 11 LandSlope 1460 non-null object 12 Neighborhood 1460 non-null object 13 Condition1 1460 non-null object 14 Condition2 1460 non-null object 15 BldgType 1460 non-null object 16 HouseStyle 1460 non-null object 17 OverallQual 1460 non-null int64 18 OverallCond 1460 non-null int64 19 YearBuilt 1460 non-null int64 20 YearRemodAdd 1460 non-null int64 21 RoofStyle 1460 non-null object 22 RoofMatl 1460 non-null object 23 Exterior1st 1460 non-null object 24 Exterior2nd 1460 non-null object 25 MasVnrType 1452 non-null object 26 MasVnrArea 1452 non-null float64 27 ExterQual 1460 non-null object 28 ExterCond 1460 non-null object 29 Foundation 1460 non-null object 30 BsmtQual 1423 non-null object 31 BsmtCond 1423 non-null object 32 BsmtExposure 1422 non-null object 33 BsmtFinType1 1423 non-null object 34 BsmtFinSF1 1460 non-null int64 35 BsmtFinType2 1422 non-null object 36 BsmtFinSF2 1460 non-null int64 37 BsmtUnfSF 1460 non-null int64 38 TotalBsmtSF 1460 non-null int64 39 Heating 1460 non-null object 40 HeatingQC 1460 non-null object 41 CentralAir 1460 non-null object 42 Electrical 1459 non-null object 43 1stFlrSF 1460 non-null int64 44 2ndFlrSF 1460 non-null int64 45 LowQualFinSF 1460 non-null int64 46 GrLivArea 1460 non-null int64 47 BsmtFullBath 1460 non-null int64 48 BsmtHalfBath 1460 non-null int64 49 FullBath 1460 non-null int64 50 HalfBath 1460 non-null int64 51 BedroomAbvGr 1460 non-null int64 52 KitchenAbvGr 1460 non-null int64 53 KitchenQual 1460 non-null object 54 TotRmsAbvGrd 1460 non-null int64 55 Functional 1460 non-null object 56 Fireplaces 1460 non-null int64 57 FireplaceQu 770 non-null object 58 GarageType 1379 non-null object 59 GarageYrBlt 1379 non-null float64 60 GarageFinish 1379 non-null object 61 GarageCars 1460 non-null int64 62 GarageArea 1460 non-null int64 63 GarageQual 1379 non-null object 64 GarageCond 1379 non-null object 65 PavedDrive 1460 non-null object 66 WoodDeckSF 1460 non-null int64 67 OpenPorchSF 1460 non-null int64 68 EnclosedPorch 1460 non-null int64 69 3SsnPorch 1460 non-null int64 70 ScreenPorch 1460 non-null int64 71 PoolArea 1460 non-null int64 72 PoolQC 7 non-null object 73 Fence 281 non-null object 74 MiscFeature 54 non-null object 75 MiscVal 1460 non-null int64 76 MoSold 1460 non-null int64 77 YrSold 1460 non-null int64 78 SaleType 1460 non-null object 79 SaleCondition 1460 non-null object 80 SalePrice 1460 non-null int64 dtypes: float64(3), int64(35), object(43) memory usage: 924.0+ KB
# Get the columns with non-null values
non_null_columns = df_train.columns[df_train.notnull().any()].tolist()
# Print the columns with non-null values
print("Columns with non-null values:")
for column in non_null_columns:
    print(column)
Columns with non-null values: Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical 1stFlrSF 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
# Get the columns with null values in the train data
null_columns = df_train.columns[df_train.isnull().any()].tolist()
# Print the columns with null values and their dtypes
print("Columns with null values in train data:")
for column in null_columns:
    print(column, "------>", df_train[column].dtype)
Columns with null values in train data:
LotFrontage ------> float64
Alley ------> object
MasVnrType ------> object
MasVnrArea ------> float64
BsmtQual ------> object
BsmtCond ------> object
BsmtExposure ------> object
BsmtFinType1 ------> object
BsmtFinType2 ------> object
Electrical ------> object
FireplaceQu ------> object
GarageType ------> object
GarageYrBlt ------> float64
GarageFinish ------> object
GarageQual ------> object
GarageCond ------> object
PoolQC ------> object
Fence ------> object
MiscFeature ------> object
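According to the data dictionary above, NA in columns such as Alley, FireplaceQu, or PoolQC means the feature is absent rather than unrecorded. A hedged imputation sketch that follows that convention, worked on a copy (an illustration, not the project's final strategy):
# Illustrative imputation on a copy of the training data
df_imputed = df_train.copy()
for col in df_imputed.columns[df_imputed.isnull().any()]:
    if df_imputed[col].dtype == 'object':
        # For most of these columns the data dictionary says NA means "feature absent"
        df_imputed[col] = df_imputed[col].fillna('NA')
    else:
        # Numeric gaps (LotFrontage, MasVnrArea, GarageYrBlt) filled with the median
        df_imputed[col] = df_imputed[col].fillna(df_imputed[col].median())
print("Remaining NaNs:", df_imputed.isnull().sum().sum())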
# Get the columns with null values in the test data
null_columns = df_test.columns[df_test.isnull().any()].tolist()
# Print the columns with null values
print("Columns with null values in test data:")
for column in null_columns:
    print(column)
Columns with null values in test data: MSZoning LotFrontage Alley Utilities Exterior1st Exterior2nd MasVnrType MasVnrArea BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath KitchenQual Functional FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PoolQC Fence MiscFeature SaleType
Next, we compute basic descriptive statistics for each numeric column, such as mean, median, minimum, maximum, and standard deviation.
# Calculate basic statistics for each column using the describe() method
statistics = df_train.describe()
# Transpose the statistics DataFrame to have columns as variable names
statistics_transposed = statistics.T
# Print the transposed statistics DataFrame
print("Descriptive statistics for each column in train data:")
statistics_transposed
Descriptive statistics for each column in train data:
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Id | 1460.0 | 730.500000 | 421.610009 | 1.0 | 365.75 | 730.5 | 1095.25 | 1460.0 |
| MSSubClass | 1460.0 | 56.897260 | 42.300571 | 20.0 | 20.00 | 50.0 | 70.00 | 190.0 |
| LotFrontage | 1201.0 | 70.049958 | 24.284752 | 21.0 | 59.00 | 69.0 | 80.00 | 313.0 |
| LotArea | 1460.0 | 10516.828082 | 9981.264932 | 1300.0 | 7553.50 | 9478.5 | 11601.50 | 215245.0 |
| OverallQual | 1460.0 | 6.099315 | 1.382997 | 1.0 | 5.00 | 6.0 | 7.00 | 10.0 |
| OverallCond | 1460.0 | 5.575342 | 1.112799 | 1.0 | 5.00 | 5.0 | 6.00 | 9.0 |
| YearBuilt | 1460.0 | 1971.267808 | 30.202904 | 1872.0 | 1954.00 | 1973.0 | 2000.00 | 2010.0 |
| YearRemodAdd | 1460.0 | 1984.865753 | 20.645407 | 1950.0 | 1967.00 | 1994.0 | 2004.00 | 2010.0 |
| MasVnrArea | 1452.0 | 103.685262 | 181.066207 | 0.0 | 0.00 | 0.0 | 166.00 | 1600.0 |
| BsmtFinSF1 | 1460.0 | 443.639726 | 456.098091 | 0.0 | 0.00 | 383.5 | 712.25 | 5644.0 |
| BsmtFinSF2 | 1460.0 | 46.549315 | 161.319273 | 0.0 | 0.00 | 0.0 | 0.00 | 1474.0 |
| BsmtUnfSF | 1460.0 | 567.240411 | 441.866955 | 0.0 | 223.00 | 477.5 | 808.00 | 2336.0 |
| TotalBsmtSF | 1460.0 | 1057.429452 | 438.705324 | 0.0 | 795.75 | 991.5 | 1298.25 | 6110.0 |
| 1stFlrSF | 1460.0 | 1162.626712 | 386.587738 | 334.0 | 882.00 | 1087.0 | 1391.25 | 4692.0 |
| 2ndFlrSF | 1460.0 | 346.992466 | 436.528436 | 0.0 | 0.00 | 0.0 | 728.00 | 2065.0 |
| LowQualFinSF | 1460.0 | 5.844521 | 48.623081 | 0.0 | 0.00 | 0.0 | 0.00 | 572.0 |
| GrLivArea | 1460.0 | 1515.463699 | 525.480383 | 334.0 | 1129.50 | 1464.0 | 1776.75 | 5642.0 |
| BsmtFullBath | 1460.0 | 0.425342 | 0.518911 | 0.0 | 0.00 | 0.0 | 1.00 | 3.0 |
| BsmtHalfBath | 1460.0 | 0.057534 | 0.238753 | 0.0 | 0.00 | 0.0 | 0.00 | 2.0 |
| FullBath | 1460.0 | 1.565068 | 0.550916 | 0.0 | 1.00 | 2.0 | 2.00 | 3.0 |
| HalfBath | 1460.0 | 0.382877 | 0.502885 | 0.0 | 0.00 | 0.0 | 1.00 | 2.0 |
| BedroomAbvGr | 1460.0 | 2.866438 | 0.815778 | 0.0 | 2.00 | 3.0 | 3.00 | 8.0 |
| KitchenAbvGr | 1460.0 | 1.046575 | 0.220338 | 0.0 | 1.00 | 1.0 | 1.00 | 3.0 |
| TotRmsAbvGrd | 1460.0 | 6.517808 | 1.625393 | 2.0 | 5.00 | 6.0 | 7.00 | 14.0 |
| Fireplaces | 1460.0 | 0.613014 | 0.644666 | 0.0 | 0.00 | 1.0 | 1.00 | 3.0 |
| GarageYrBlt | 1379.0 | 1978.506164 | 24.689725 | 1900.0 | 1961.00 | 1980.0 | 2002.00 | 2010.0 |
| GarageCars | 1460.0 | 1.767123 | 0.747315 | 0.0 | 1.00 | 2.0 | 2.00 | 4.0 |
| GarageArea | 1460.0 | 472.980137 | 213.804841 | 0.0 | 334.50 | 480.0 | 576.00 | 1418.0 |
| WoodDeckSF | 1460.0 | 94.244521 | 125.338794 | 0.0 | 0.00 | 0.0 | 168.00 | 857.0 |
| OpenPorchSF | 1460.0 | 46.660274 | 66.256028 | 0.0 | 0.00 | 25.0 | 68.00 | 547.0 |
| EnclosedPorch | 1460.0 | 21.954110 | 61.119149 | 0.0 | 0.00 | 0.0 | 0.00 | 552.0 |
| 3SsnPorch | 1460.0 | 3.409589 | 29.317331 | 0.0 | 0.00 | 0.0 | 0.00 | 508.0 |
| ScreenPorch | 1460.0 | 15.060959 | 55.757415 | 0.0 | 0.00 | 0.0 | 0.00 | 480.0 |
| PoolArea | 1460.0 | 2.758904 | 40.177307 | 0.0 | 0.00 | 0.0 | 0.00 | 738.0 |
| MiscVal | 1460.0 | 43.489041 | 496.123024 | 0.0 | 0.00 | 0.0 | 0.00 | 15500.0 |
| MoSold | 1460.0 | 6.321918 | 2.703626 | 1.0 | 5.00 | 6.0 | 8.00 | 12.0 |
| YrSold | 1460.0 | 2007.815753 | 1.328095 | 2006.0 | 2007.00 | 2008.0 | 2009.00 | 2010.0 |
| SalePrice | 1460.0 | 180921.195890 | 79442.502883 | 34900.0 | 129975.00 | 163000.0 | 214000.00 | 755000.0 |
# Calculate basic statistics for each column in test data using the describe() method
statistics = df_test.describe()
# Transpose the statistics DataFrame to have columns as variable names
statistics_transposed = statistics.T
# Print the transposed statistics DataFrame
print("Descriptive statistics for each column in test data:")
statistics_transposed
Descriptive statistics for each column in test data:
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Id | 1459.0 | 2190.000000 | 421.321334 | 1461.0 | 1825.50 | 2190.0 | 2554.50 | 2919.0 |
| MSSubClass | 1459.0 | 57.378341 | 42.746880 | 20.0 | 20.00 | 50.0 | 70.00 | 190.0 |
| LotFrontage | 1232.0 | 68.580357 | 22.376841 | 21.0 | 58.00 | 67.0 | 80.00 | 200.0 |
| LotArea | 1459.0 | 9819.161069 | 4955.517327 | 1470.0 | 7391.00 | 9399.0 | 11517.50 | 56600.0 |
| OverallQual | 1459.0 | 6.078821 | 1.436812 | 1.0 | 5.00 | 6.0 | 7.00 | 10.0 |
| OverallCond | 1459.0 | 5.553804 | 1.113740 | 1.0 | 5.00 | 5.0 | 6.00 | 9.0 |
| YearBuilt | 1459.0 | 1971.357779 | 30.390071 | 1879.0 | 1953.00 | 1973.0 | 2001.00 | 2010.0 |
| YearRemodAdd | 1459.0 | 1983.662783 | 21.130467 | 1950.0 | 1963.00 | 1992.0 | 2004.00 | 2010.0 |
| MasVnrArea | 1444.0 | 100.709141 | 177.625900 | 0.0 | 0.00 | 0.0 | 164.00 | 1290.0 |
| BsmtFinSF1 | 1458.0 | 439.203704 | 455.268042 | 0.0 | 0.00 | 350.5 | 753.50 | 4010.0 |
| BsmtFinSF2 | 1458.0 | 52.619342 | 176.753926 | 0.0 | 0.00 | 0.0 | 0.00 | 1526.0 |
| BsmtUnfSF | 1458.0 | 554.294925 | 437.260486 | 0.0 | 219.25 | 460.0 | 797.75 | 2140.0 |
| TotalBsmtSF | 1458.0 | 1046.117970 | 442.898624 | 0.0 | 784.00 | 988.0 | 1305.00 | 5095.0 |
| 1stFlrSF | 1459.0 | 1156.534613 | 398.165820 | 407.0 | 873.50 | 1079.0 | 1382.50 | 5095.0 |
| 2ndFlrSF | 1459.0 | 325.967786 | 420.610226 | 0.0 | 0.00 | 0.0 | 676.00 | 1862.0 |
| LowQualFinSF | 1459.0 | 3.543523 | 44.043251 | 0.0 | 0.00 | 0.0 | 0.00 | 1064.0 |
| GrLivArea | 1459.0 | 1486.045922 | 485.566099 | 407.0 | 1117.50 | 1432.0 | 1721.00 | 5095.0 |
| BsmtFullBath | 1457.0 | 0.434454 | 0.530648 | 0.0 | 0.00 | 0.0 | 1.00 | 3.0 |
| BsmtHalfBath | 1457.0 | 0.065202 | 0.252468 | 0.0 | 0.00 | 0.0 | 0.00 | 2.0 |
| FullBath | 1459.0 | 1.570939 | 0.555190 | 0.0 | 1.00 | 2.0 | 2.00 | 4.0 |
| HalfBath | 1459.0 | 0.377656 | 0.503017 | 0.0 | 0.00 | 0.0 | 1.00 | 2.0 |
| BedroomAbvGr | 1459.0 | 2.854010 | 0.829788 | 0.0 | 2.00 | 3.0 | 3.00 | 6.0 |
| KitchenAbvGr | 1459.0 | 1.042495 | 0.208472 | 0.0 | 1.00 | 1.0 | 1.00 | 2.0 |
| TotRmsAbvGrd | 1459.0 | 6.385195 | 1.508895 | 3.0 | 5.00 | 6.0 | 7.00 | 15.0 |
| Fireplaces | 1459.0 | 0.581220 | 0.647420 | 0.0 | 0.00 | 0.0 | 1.00 | 4.0 |
| GarageYrBlt | 1381.0 | 1977.721217 | 26.431175 | 1895.0 | 1959.00 | 1979.0 | 2002.00 | 2207.0 |
| GarageCars | 1458.0 | 1.766118 | 0.775945 | 0.0 | 1.00 | 2.0 | 2.00 | 5.0 |
| GarageArea | 1458.0 | 472.768861 | 217.048611 | 0.0 | 318.00 | 480.0 | 576.00 | 1488.0 |
| WoodDeckSF | 1459.0 | 93.174777 | 127.744882 | 0.0 | 0.00 | 0.0 | 168.00 | 1424.0 |
| OpenPorchSF | 1459.0 | 48.313914 | 68.883364 | 0.0 | 0.00 | 28.0 | 72.00 | 742.0 |
| EnclosedPorch | 1459.0 | 24.243317 | 67.227765 | 0.0 | 0.00 | 0.0 | 0.00 | 1012.0 |
| 3SsnPorch | 1459.0 | 1.794380 | 20.207842 | 0.0 | 0.00 | 0.0 | 0.00 | 360.0 |
| ScreenPorch | 1459.0 | 17.064428 | 56.609763 | 0.0 | 0.00 | 0.0 | 0.00 | 576.0 |
| PoolArea | 1459.0 | 1.744345 | 30.491646 | 0.0 | 0.00 | 0.0 | 0.00 | 800.0 |
| MiscVal | 1459.0 | 58.167923 | 630.806978 | 0.0 | 0.00 | 0.0 | 0.00 | 17000.0 |
| MoSold | 1459.0 | 6.104181 | 2.722432 | 1.0 | 4.00 | 6.0 | 8.00 | 12.0 |
| YrSold | 1459.0 | 2007.769705 | 1.301740 | 2006.0 | 2007.00 | 2008.0 | 2009.00 | 2010.0 |
# Create a histogram of the target variable 'SalePrice'
plt.figure(figsize=(8, 6))
plt.hist(df_train['SalePrice'], bins=30, edgecolor='black')
plt.xlabel('Sale Price')
plt.ylabel('Frequency')
plt.title('Distribution of Sale Price')
plt.show()
# Create a boxplot of the target variable 'SalePrice'
plt.figure(figsize=(8, 6))
plt.boxplot(df_train['SalePrice'])
plt.xlabel('Sale Price')
plt.title('Distribution of Sale Price')
plt.show()
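In the descriptive statistics above, the mean of SalePrice (about 180,921) sits well above its median (163,000), which points to a right-skewed distribution. A common remedy is a log transform; a quick sketch to compare the raw and transformed distributions:
# Compare the raw and log-transformed SalePrice distributions
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].hist(df_train['SalePrice'], bins=30, edgecolor='black')
axes[0].set_title('SalePrice')
axes[1].hist(np.log1p(df_train['SalePrice']), bins=30, edgecolor='black')
axes[1].set_title('log1p(SalePrice)')
plt.show()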
The code snippets provided below demonstrate how to create an interactive histogram using Plotly in Python. The histogram allows you to explore the distribution of various columns in the dataset.
By running the code, you can interactively select a column from the predefined list and observe the distribution of the data through the histogram. The colors of the bars will change based on the frequency of data points.
This interactive visualization provides a powerful way to explore and understand the distributions of various numeric variables in the dataset.
# Define the columns suitable for histogram visualization
histogram_columns = ['LotFrontage', 'LotArea', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF',
'1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF',
'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'SalePrice']
# Create an interactive function to plot the histogram
@interact
def plot_histogram(column=histogram_columns):
    # Create a Figure object
    fig = go.Figure()
    # Plot the histogram using Plotly
    fig.add_trace(go.Histogram(x=df_train[column], nbinsx=30, marker=dict(color='steelblue')))
    # Update layout
    fig.update_layout(
        title=f"Distribution of {column} in Train Data",
        xaxis_title=column,
        yaxis_title="Frequency",
        bargap=0.1,
        bargroupgap=0.2,
        template="plotly_white"
    )
    # Show the figure
    fig.show()
    # Plot the histogram using Matplotlib and Seaborn
    plt.figure(figsize=(8, 6))
    sns.histplot(data=df_train, x=column, kde=True, color='steelblue')
    plt.title(f"Distribution of {column} in Train Data")
    plt.xlabel(column)
    plt.ylabel("Frequency")
    plt.grid(True)
    plt.show()
Note: These images showcase the interactive plot's visual representation. I haven't included the actual interactive plot itself at this time, as I'm exploring options for hosting it on a server. Hosting all the plots across my various projects could potentially incur significant costs, which I'm currently considering.
# Define the columns suitable for histogram visualization
histogram_columns = ['LotFrontage', 'LotArea', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF',
'1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF',
'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'SalePrice']
# Create an interactive function to plot the histogram
@interact
def plot_histogram(column=histogram_columns):
    # Create a histogram using Plotly Express
    fig = px.histogram(df_train, x=column, nbins=30, marginal="rug", color_discrete_sequence=["steelblue"])
    # Update layout
    fig.update_layout(
        title=f"Distribution of {column}",
        xaxis_title=column,
        yaxis_title="Frequency",
        bargap=0.1,
        bargroupgap=0.2,
        template="plotly_white"
    )
    # Show the histogram
    fig.show()
Note: These images showcase the interactive plot's visual representation. I haven't included the actual interactive plot itself at this time, as I'm exploring options for hosting it on a server. Hosting all the plots across my various projects could potentially incur significant costs, which I'm currently considering.
# Create an interactive function to plot the histograms
@interact(column=histogram_columns)
def plot_histogram(column):
    # Check if the column is in df_test
    if column not in df_test.columns:
        print(f"{column} is not present in df_test.")
        return
    # Create a figure with two subplots
    fig, axes = plt.subplots(2, 1, figsize=(8, 10))
    # Plot histogram for df_train
    sns.histplot(data=df_train, x=column, kde=True, color='steelblue', ax=axes[0])
    axes[0].set_title(f"Train Data - Distribution of {column}")
    axes[0].set_xlabel(column)
    axes[0].set_ylabel("Frequency")
    axes[0].grid(True)
    # Plot histogram for df_test
    sns.histplot(data=df_test, x=column, kde=True, color='lightseagreen', ax=axes[1])
    axes[1].set_title(f"Test Data - Distribution of {column}")
    axes[1].set_xlabel(column)
    axes[1].set_ylabel("Frequency")
    axes[1].grid(True)
    # Adjust spacing between subplots
    fig.tight_layout()
    # Show the plots
    plt.show()
Note: These images showcase the interactive plot's visual representation. I haven't included the actual interactive plot itself at this time, as I'm exploring options for hosting it on a server. Hosting all the plots across my various projects could potentially incur significant costs, which I'm currently considering.
# Create an interactive function to plot the histogram
@interact
def plot_histogram(column=histogram_columns):
    # Remove missing values
    data = df_train[column].dropna()
    # Calculate the frequency of each bin
    hist_data, bin_edges = np.histogram(data, bins=30)
    # Use the bin centers as x positions for the bars
    bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2
    # Calculate the color scale based on the frequency
    colorscale = ['#f0f0f0', '#0074D9']
    # Create a bar trace for the histogram, colored by bin frequency
    trace = go.Bar(
        x=bin_centers,
        y=hist_data,
        marker=dict(
            color=hist_data,
            colorscale=colorscale,
            line=dict(color='black', width=0.5)
        ),
        opacity=0.8
    )
    # Create a layout for the histogram
    layout = go.Layout(
        title=f"Distribution of {column}",
        xaxis=dict(
            title=column,
            gridcolor='#eee'
        ),
        yaxis=dict(
            title='Frequency',
            gridcolor='#eee'
        ),
        plot_bgcolor='rgba(0, 0, 0, 0)',
        paper_bgcolor='rgba(0, 0, 0, 0)'
    )
    # Create a figure and add the trace
    fig = go.Figure(data=[trace], layout=layout)
    # Show the histogram
    fig.show()
Note: These images showcase the interactive plot's visual representation. I haven't included the actual interactive plot itself at this time, as I'm exploring options for hosting it on a server. Hosting all the plots across my various projects could potentially incur significant costs, which I'm currently considering.
Next, we explore the relationships between the numerical features by plotting scatter plots or correlation matrices.
import seaborn as sns
# Select the numerical columns
numerical_columns = df_train.select_dtypes(include=np.number).columns.tolist()
# Convert the columns to numeric
df_train[numerical_columns] = df_train[numerical_columns].apply(pd.to_numeric, errors='coerce')
# Create a scatter matrix plot
sns.set(style='ticks')
sns.pairplot(df_train[numerical_columns].dropna(), height=2.5)
plt.suptitle("Scatter Matrix Plot", y=1.02)
plt.show()
import seaborn as sns
# Select the numerical columns
numerical_columns = ['LotFrontage', 'LotArea', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF',
'1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF',
'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'SalePrice']
# Convert the columns to numeric
df_train[numerical_columns] = df_train[numerical_columns].apply(pd.to_numeric, errors='coerce')
# Create a scatter matrix plot
sns.set(style='ticks')
sns.pairplot(df_train[numerical_columns].dropna(), height=2.5)
plt.suptitle("Scatter Matrix Plot", y=1.02)
plt.show()
Next, we turn to the categorical features and examine their unique values and frequency distributions.
The code below prints the column name, the unique values, and the frequency distribution (value counts) for each categorical column.
# Select the categorical columns
categorical_columns = df_train.select_dtypes(include='object').columns.tolist()
# Loop through each categorical column
for column in categorical_columns:
    unique_values = df_train[column].unique()
    value_counts = df_train[column].value_counts()
    print(f"Column: {column}")
    print(f"Unique Values: {unique_values}")
    print(f"Value Counts:\n{value_counts}")
    print("-" * 50)
Column: MSZoning Unique Values: ['RL' 'RM' 'C (all)' 'FV' 'RH'] Value Counts: RL 1151 RM 218 FV 65 RH 16 C (all) 10 Name: MSZoning, dtype: int64 -------------------------------------------------- Column: Street Unique Values: ['Pave' 'Grvl'] Value Counts: Pave 1454 Grvl 6 Name: Street, dtype: int64 -------------------------------------------------- Column: Alley Unique Values: [nan 'Grvl' 'Pave'] Value Counts: Grvl 50 Pave 41 Name: Alley, dtype: int64 -------------------------------------------------- Column: LotShape Unique Values: ['Reg' 'IR1' 'IR2' 'IR3'] Value Counts: Reg 925 IR1 484 IR2 41 IR3 10 Name: LotShape, dtype: int64 -------------------------------------------------- Column: LandContour Unique Values: ['Lvl' 'Bnk' 'Low' 'HLS'] Value Counts: Lvl 1311 Bnk 63 HLS 50 Low 36 Name: LandContour, dtype: int64 -------------------------------------------------- Column: Utilities Unique Values: ['AllPub' 'NoSeWa'] Value Counts: AllPub 1459 NoSeWa 1 Name: Utilities, dtype: int64 -------------------------------------------------- Column: LotConfig Unique Values: ['Inside' 'FR2' 'Corner' 'CulDSac' 'FR3'] Value Counts: Inside 1052 Corner 263 CulDSac 94 FR2 47 FR3 4 Name: LotConfig, dtype: int64 -------------------------------------------------- Column: LandSlope Unique Values: ['Gtl' 'Mod' 'Sev'] Value Counts: Gtl 1382 Mod 65 Sev 13 Name: LandSlope, dtype: int64 -------------------------------------------------- Column: Neighborhood Unique Values: ['CollgCr' 'Veenker' 'Crawfor' 'NoRidge' 'Mitchel' 'Somerst' 'NWAmes' 'OldTown' 'BrkSide' 'Sawyer' 'NridgHt' 'NAmes' 'SawyerW' 'IDOTRR' 'MeadowV' 'Edwards' 'Timber' 'Gilbert' 'StoneBr' 'ClearCr' 'NPkVill' 'Blmngtn' 'BrDale' 'SWISU' 'Blueste'] Value Counts: NAmes 225 CollgCr 150 OldTown 113 Edwards 100 Somerst 86 Gilbert 79 NridgHt 77 Sawyer 74 NWAmes 73 SawyerW 59 BrkSide 58 Crawfor 51 Mitchel 49 NoRidge 41 Timber 38 IDOTRR 37 ClearCr 28 StoneBr 25 SWISU 25 MeadowV 17 Blmngtn 17 BrDale 16 Veenker 11 NPkVill 9 Blueste 2 Name: Neighborhood, dtype: int64 -------------------------------------------------- Column: Condition1 Unique Values: ['Norm' 'Feedr' 'PosN' 'Artery' 'RRAe' 'RRNn' 'RRAn' 'PosA' 'RRNe'] Value Counts: Norm 1260 Feedr 81 Artery 48 RRAn 26 PosN 19 RRAe 11 PosA 8 RRNn 5 RRNe 2 Name: Condition1, dtype: int64 -------------------------------------------------- Column: Condition2 Unique Values: ['Norm' 'Artery' 'RRNn' 'Feedr' 'PosN' 'PosA' 'RRAn' 'RRAe'] Value Counts: Norm 1445 Feedr 6 Artery 2 RRNn 2 PosN 2 PosA 1 RRAn 1 RRAe 1 Name: Condition2, dtype: int64 -------------------------------------------------- Column: BldgType Unique Values: ['1Fam' '2fmCon' 'Duplex' 'TwnhsE' 'Twnhs'] Value Counts: 1Fam 1220 TwnhsE 114 Duplex 52 Twnhs 43 2fmCon 31 Name: BldgType, dtype: int64 -------------------------------------------------- Column: HouseStyle Unique Values: ['2Story' '1Story' '1.5Fin' '1.5Unf' 'SFoyer' 'SLvl' '2.5Unf' '2.5Fin'] Value Counts: 1Story 726 2Story 445 1.5Fin 154 SLvl 65 SFoyer 37 1.5Unf 14 2.5Unf 11 2.5Fin 8 Name: HouseStyle, dtype: int64 -------------------------------------------------- Column: RoofStyle Unique Values: ['Gable' 'Hip' 'Gambrel' 'Mansard' 'Flat' 'Shed'] Value Counts: Gable 1141 Hip 286 Flat 13 Gambrel 11 Mansard 7 Shed 2 Name: RoofStyle, dtype: int64 -------------------------------------------------- Column: RoofMatl Unique Values: ['CompShg' 'WdShngl' 'Metal' 'WdShake' 'Membran' 'Tar&Grv' 'Roll' 'ClyTile'] Value Counts: CompShg 1434 Tar&Grv 11 WdShngl 6 WdShake 5 Metal 1 Membran 1 Roll 1 ClyTile 1 Name: 
RoofMatl, dtype: int64 -------------------------------------------------- Column: Exterior1st Unique Values: ['VinylSd' 'MetalSd' 'Wd Sdng' 'HdBoard' 'BrkFace' 'WdShing' 'CemntBd' 'Plywood' 'AsbShng' 'Stucco' 'BrkComm' 'AsphShn' 'Stone' 'ImStucc' 'CBlock'] Value Counts: VinylSd 515 HdBoard 222 MetalSd 220 Wd Sdng 206 Plywood 108 CemntBd 61 BrkFace 50 WdShing 26 Stucco 25 AsbShng 20 BrkComm 2 Stone 2 AsphShn 1 ImStucc 1 CBlock 1 Name: Exterior1st, dtype: int64 -------------------------------------------------- Column: Exterior2nd Unique Values: ['VinylSd' 'MetalSd' 'Wd Shng' 'HdBoard' 'Plywood' 'Wd Sdng' 'CmentBd' 'BrkFace' 'Stucco' 'AsbShng' 'Brk Cmn' 'ImStucc' 'AsphShn' 'Stone' 'Other' 'CBlock'] Value Counts: VinylSd 504 MetalSd 214 HdBoard 207 Wd Sdng 197 Plywood 142 CmentBd 60 Wd Shng 38 Stucco 26 BrkFace 25 AsbShng 20 ImStucc 10 Brk Cmn 7 Stone 5 AsphShn 3 Other 1 CBlock 1 Name: Exterior2nd, dtype: int64 -------------------------------------------------- Column: MasVnrType Unique Values: ['BrkFace' 'None' 'Stone' 'BrkCmn' nan] Value Counts: None 864 BrkFace 445 Stone 128 BrkCmn 15 Name: MasVnrType, dtype: int64 -------------------------------------------------- Column: ExterQual Unique Values: ['Gd' 'TA' 'Ex' 'Fa'] Value Counts: TA 906 Gd 488 Ex 52 Fa 14 Name: ExterQual, dtype: int64 -------------------------------------------------- Column: ExterCond Unique Values: ['TA' 'Gd' 'Fa' 'Po' 'Ex'] Value Counts: TA 1282 Gd 146 Fa 28 Ex 3 Po 1 Name: ExterCond, dtype: int64 -------------------------------------------------- Column: Foundation Unique Values: ['PConc' 'CBlock' 'BrkTil' 'Wood' 'Slab' 'Stone'] Value Counts: PConc 647 CBlock 634 BrkTil 146 Slab 24 Stone 6 Wood 3 Name: Foundation, dtype: int64 -------------------------------------------------- Column: BsmtQual Unique Values: ['Gd' 'TA' 'Ex' nan 'Fa'] Value Counts: TA 649 Gd 618 Ex 121 Fa 35 Name: BsmtQual, dtype: int64 -------------------------------------------------- Column: BsmtCond Unique Values: ['TA' 'Gd' nan 'Fa' 'Po'] Value Counts: TA 1311 Gd 65 Fa 45 Po 2 Name: BsmtCond, dtype: int64 -------------------------------------------------- Column: BsmtExposure Unique Values: ['No' 'Gd' 'Mn' 'Av' nan] Value Counts: No 953 Av 221 Gd 134 Mn 114 Name: BsmtExposure, dtype: int64 -------------------------------------------------- Column: BsmtFinType1 Unique Values: ['GLQ' 'ALQ' 'Unf' 'Rec' 'BLQ' nan 'LwQ'] Value Counts: Unf 430 GLQ 418 ALQ 220 BLQ 148 Rec 133 LwQ 74 Name: BsmtFinType1, dtype: int64 -------------------------------------------------- Column: BsmtFinType2 Unique Values: ['Unf' 'BLQ' nan 'ALQ' 'Rec' 'LwQ' 'GLQ'] Value Counts: Unf 1256 Rec 54 LwQ 46 BLQ 33 ALQ 19 GLQ 14 Name: BsmtFinType2, dtype: int64 -------------------------------------------------- Column: Heating Unique Values: ['GasA' 'GasW' 'Grav' 'Wall' 'OthW' 'Floor'] Value Counts: GasA 1428 GasW 18 Grav 7 Wall 4 OthW 2 Floor 1 Name: Heating, dtype: int64 -------------------------------------------------- Column: HeatingQC Unique Values: ['Ex' 'Gd' 'TA' 'Fa' 'Po'] Value Counts: Ex 741 TA 428 Gd 241 Fa 49 Po 1 Name: HeatingQC, dtype: int64 -------------------------------------------------- Column: CentralAir Unique Values: ['Y' 'N'] Value Counts: Y 1365 N 95 Name: CentralAir, dtype: int64 -------------------------------------------------- Column: Electrical Unique Values: ['SBrkr' 'FuseF' 'FuseA' 'FuseP' 'Mix' nan] Value Counts: SBrkr 1334 FuseA 94 FuseF 27 FuseP 3 Mix 1 Name: Electrical, dtype: int64 -------------------------------------------------- Column: 
KitchenQual Unique Values: ['Gd' 'TA' 'Ex' 'Fa'] Value Counts: TA 735 Gd 586 Ex 100 Fa 39 Name: KitchenQual, dtype: int64 -------------------------------------------------- Column: Functional Unique Values: ['Typ' 'Min1' 'Maj1' 'Min2' 'Mod' 'Maj2' 'Sev'] Value Counts: Typ 1360 Min2 34 Min1 31 Mod 15 Maj1 14 Maj2 5 Sev 1 Name: Functional, dtype: int64 -------------------------------------------------- Column: FireplaceQu Unique Values: [nan 'TA' 'Gd' 'Fa' 'Ex' 'Po'] Value Counts: Gd 380 TA 313 Fa 33 Ex 24 Po 20 Name: FireplaceQu, dtype: int64 -------------------------------------------------- Column: GarageType Unique Values: ['Attchd' 'Detchd' 'BuiltIn' 'CarPort' nan 'Basment' '2Types'] Value Counts: Attchd 870 Detchd 387 BuiltIn 88 Basment 19 CarPort 9 2Types 6 Name: GarageType, dtype: int64 -------------------------------------------------- Column: GarageFinish Unique Values: ['RFn' 'Unf' 'Fin' nan] Value Counts: Unf 605 RFn 422 Fin 352 Name: GarageFinish, dtype: int64 -------------------------------------------------- Column: GarageQual Unique Values: ['TA' 'Fa' 'Gd' nan 'Ex' 'Po'] Value Counts: TA 1311 Fa 48 Gd 14 Ex 3 Po 3 Name: GarageQual, dtype: int64 -------------------------------------------------- Column: GarageCond Unique Values: ['TA' 'Fa' nan 'Gd' 'Po' 'Ex'] Value Counts: TA 1326 Fa 35 Gd 9 Po 7 Ex 2 Name: GarageCond, dtype: int64 -------------------------------------------------- Column: PavedDrive Unique Values: ['Y' 'N' 'P'] Value Counts: Y 1340 N 90 P 30 Name: PavedDrive, dtype: int64 -------------------------------------------------- Column: PoolQC Unique Values: [nan 'Ex' 'Fa' 'Gd'] Value Counts: Gd 3 Ex 2 Fa 2 Name: PoolQC, dtype: int64 -------------------------------------------------- Column: Fence Unique Values: [nan 'MnPrv' 'GdWo' 'GdPrv' 'MnWw'] Value Counts: MnPrv 157 GdPrv 59 GdWo 54 MnWw 11 Name: Fence, dtype: int64 -------------------------------------------------- Column: MiscFeature Unique Values: [nan 'Shed' 'Gar2' 'Othr' 'TenC'] Value Counts: Shed 49 Gar2 2 Othr 2 TenC 1 Name: MiscFeature, dtype: int64 -------------------------------------------------- Column: SaleType Unique Values: ['WD' 'New' 'COD' 'ConLD' 'ConLI' 'CWD' 'ConLw' 'Con' 'Oth'] Value Counts: WD 1267 New 122 COD 43 ConLD 9 ConLI 5 ConLw 5 CWD 4 Oth 3 Con 2 Name: SaleType, dtype: int64 -------------------------------------------------- Column: SaleCondition Unique Values: ['Normal' 'Abnorml' 'Partial' 'AdjLand' 'Alloca' 'Family'] Value Counts: Normal 1198 Partial 125 Abnorml 101 Family 20 Alloca 12 AdjLand 4 Name: SaleCondition, dtype: int64 --------------------------------------------------
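Several of these columns (ExterQual, ExterCond, HeatingQC, KitchenQual) use the ordered Ex/Gd/TA/Fa/Po scale from the data dictionary, so they can be encoded as integers instead of unordered categories. A hedged sketch on a copy of the data (the column list is illustrative):
# Map the ordered quality scale to integers
quality_map = {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1}
df_encoded = df_train.copy()
for col in ['ExterQual', 'ExterCond', 'HeatingQC', 'KitchenQual']:
    df_encoded[col] = df_encoded[col].map(quality_map)
print(df_encoded[['ExterQual', 'KitchenQual']].head())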
Next, we visualize these distributions using bar plots. To visualize the unique values and frequency distributions of the categorical features in both the train and test data, you can use the following code:
from ipywidgets import interact
# Select the categorical columns
categorical_columns = df_train.select_dtypes(include='object').columns.tolist()
@interact(column=categorical_columns)
def plot_categorical(column):
    # Get the value counts for df_train
    train_value_counts = df_train[column].value_counts()
    # Plot the value counts for df_train using a bar plot
    plt.figure(figsize=(10, 6))
    sns.barplot(x=train_value_counts.index, y=train_value_counts.values, palette='viridis')
    plt.title(f"Value Counts of {column} (Train Data)")
    plt.xlabel(column)
    plt.ylabel("Frequency")
    plt.xticks(rotation=45, ha='right')
    plt.show()
    # Print the unique values for df_train
    train_unique_values = df_train[column].unique()
    print(f"Train Data - Column: {column}")
    print(f"Unique Values: {train_unique_values}")
    print("-" * 50)
    # Check if the column exists in df_test
    if column in df_test.columns:
        # Get the value counts for df_test
        test_value_counts = df_test[column].value_counts()
        # Plot the value counts for df_test using a bar plot
        plt.figure(figsize=(10, 6))
        sns.barplot(x=test_value_counts.index, y=test_value_counts.values, palette='viridis')
        plt.title(f"Value Counts of {column} (Test Data)")
        plt.xlabel(column)
        plt.ylabel("Frequency")
        plt.xticks(rotation=45, ha='right')
        plt.show()
        # Print the unique values for df_test
        test_unique_values = df_test[column].unique()
        print(f"Test Data - Column: {column}")
        print(f"Unique Values: {test_unique_values}")
        print("-" * 50)
    else:
        print(f"Test Data - Column: {column}")
        print("This column does not exist in the test data.")
        print("-" * 50)
Note: These images showcase the interactive plot's visual representation. I haven't included the actual interactive plot itself at this time, as I'm exploring options for hosting it on a server. Hosting all the plots across my various projects could potentially incur significant costs, which I'm currently considering.
The same comparison can be made using count plots.
# Select the categorical columns
categorical_columns = df_train.select_dtypes(include='object').columns.tolist()
@interact(column=categorical_columns)
def plot_categorical(column):
    # Plot the count plot for df_train
    plt.figure(figsize=(12, 6))
    sns.countplot(data=df_train, x=column, palette='viridis')
    plt.title(f"Distribution of {column} (Train Data)")
    plt.xlabel(column)
    plt.ylabel("Count")
    plt.xticks(rotation=45, ha='right')
    plt.show()
    # Print the value counts for df_train
    train_value_counts = df_train[column].value_counts()
    print(f"Train Data - Column: {column}")
    print(f"Value Counts:\n{train_value_counts}")
    print("-" * 50)
    # Check if the column exists in df_test
    if column in df_test.columns:
        # Plot the count plot for df_test
        plt.figure(figsize=(12, 6))
        sns.countplot(data=df_test, x=column, palette='viridis')
        plt.title(f"Distribution of {column} (Test Data)")
        plt.xlabel(column)
        plt.ylabel("Count")
        plt.xticks(rotation=45, ha='right')
        plt.show()
        # Print the value counts for df_test
        test_value_counts = df_test[column].value_counts()
        print(f"Test Data - Column: {column}")
        print(f"Value Counts:\n{test_value_counts}")
        print("-" * 50)
    else:
        print(f"Test Data - Column: {column}")
        print("This column does not exist in the test data.")
        print("-" * 50)
Visualizing feature relationships using a correlation matrix (heatmap).
import plotly.graph_objects as go

# Calculate the correlation matrix (numeric columns only)
correlation_matrix = df_train.corr(numeric_only=True)

# Get the column names
columns = correlation_matrix.columns

# Build a 2D hovertext grid whose shape matches the correlation matrix
hovertext = [[f"<b>Column 1:</b> {columns[i]}<br><b>Column 2:</b> {columns[j]}"
              for j in range(len(columns))]
             for i in range(len(columns))]

# Create a trace for the correlation matrix heatmap
trace = go.Heatmap(
    z=correlation_matrix.values,
    x=columns,
    y=columns,
    text=hovertext,
    hovertemplate='%{text}<br><b>Correlation:</b> %{z:.2f}<extra></extra>',
    colorscale='RdYlBu',  # You can customize the colorscale if desired
)

# Create a layout for the plot
layout = go.Layout(
    title="Correlation Matrix",
    xaxis=dict(
        title="Column",
        tickangle=-45,  # Rotate the tick labels for better readability
        tickfont=dict(size=12),  # Adjust the font size of the tick labels
    ),
    yaxis=dict(
        title="Column",
        tickangle=-45,
        tickfont=dict(size=12),
    ),
    width=1000,  # Set the width of the plot
    height=1000,  # Set the height of the plot
)

# Create a figure and add the trace
fig = go.Figure(data=[trace], layout=layout)

# Show the plot
fig.show()
Next, I'll upload this correlation matrix to a server so that the interactive version can be embedded and explored here.
from IPython.display import IFrame

corr_matrix = IFrame(src="https://plotly.com/~zeed/1/", width=1400, height=1000)
display(corr_matrix)
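If hosting turns out to be too costly, a lighter alternative is to export the figure as a self-contained HTML file with Plotly's write_html; a minimal sketch (the filename is illustrative):

# Save the interactive heatmap as a standalone HTML file;
# include_plotlyjs="cdn" keeps the file small by loading plotly.js from a CDN
fig.write_html("correlation_matrix.html", include_plotlyjs="cdn")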
Investigating outliers by plotting boxplots and scatter plots.
# Define the columns suitable for outlier investigation
outlier_columns = df_train.select_dtypes(include=np.number).columns.tolist()

@interact(column=outlier_columns)
def plot_outliers(column):
    if column == 'SalePrice':
        # Plot boxplot and scatter plot for df_train
        plt.figure(figsize=(8, 6))
        sns.boxplot(data=df_train, y=column, color='steelblue')
        plt.title(f"Boxplot: {column} (Train Data)")
        plt.ylabel(column)
        plt.show()

        plt.figure(figsize=(8, 6))
        sns.scatterplot(data=df_train, x=column, y='SalePrice', color='steelblue')
        plt.title(f"Scatter Plot: {column} vs SalePrice (Train Data)")
        plt.xlabel(column)
        plt.ylabel("SalePrice")
        plt.grid(True)
        plt.show()

        # SalePrice is the target and is absent from the test data
        print("Test Data - Column: SalePrice")
        print("This column does not exist in the test data.")
        print("-" * 50)
    else:
        # Plot boxplot for df_train
        plt.figure(figsize=(8, 6))
        sns.boxplot(data=df_train, y=column, color='steelblue')
        plt.title(f"Boxplot: {column} (Train Data)")
        plt.ylabel(column)
        plt.show()

        # Plot boxplot for df_test if the column exists
        if column in df_test.columns:
            plt.figure(figsize=(8, 6))
            sns.boxplot(data=df_test, y=column, color='steelblue')
            plt.title(f"Boxplot: {column} (Test Data)")
            plt.ylabel(column)
            plt.show()
        else:
            print(f"Test Data - Column: {column}")
            print("This column does not exist in the test data.")
            print("-" * 50)

        # Plot scatter plot against SalePrice for df_train
        plt.figure(figsize=(8, 6))
        sns.scatterplot(data=df_train, x=column, y='SalePrice', color='steelblue')
        plt.title(f"Scatter Plot: {column} vs SalePrice (Train Data)")
        plt.xlabel(column)
        plt.ylabel("SalePrice")
        plt.grid(True)
        plt.show()

        # The scatter plot needs SalePrice, which the test data lacks
        if column in df_test.columns and "SalePrice" in df_test.columns:
            plt.figure(figsize=(8, 6))
            sns.scatterplot(data=df_test, x=column, y='SalePrice', color='steelblue')
            plt.title(f"Scatter Plot: {column} vs SalePrice (Test Data)")
            plt.xlabel(column)
            plt.ylabel("SalePrice")
            plt.grid(True)
            plt.show()
        else:
            print(f"Test Data - Column: {column}")
            print("SalePrice is not available in the test data, so the scatter plot is skipped.")
            print("-" * 50)
# Define the columns suitable for outlier investigation
outlier_columns = df_train.select_dtypes(include=np.number).columns.tolist()
outlier_columns.remove("SalePrice")
outlier_columns.remove("Id")

@interact(column1=outlier_columns, column2=outlier_columns)
def plot_outliers(column1, column2):
    # Boxplot and scatter plot for df_train
    plt.figure(figsize=(16, 6))

    # Boxplot for column1 in df_train
    plt.subplot(1, 2, 1)
    sns.boxplot(data=df_train, y=column1, color='steelblue')
    plt.title(f"Boxplot: {column1} (Train Data)")
    plt.ylabel(column1)

    # Scatter plot of column2 vs column1 in df_train
    plt.subplot(1, 2, 2)
    sns.scatterplot(data=df_train, x=column2, y=column1, color='steelblue')
    plt.title(f"Scatter Plot: {column2} vs {column1} (Train Data)")
    plt.xlabel(column2)
    plt.ylabel(column1)
    plt.grid(True)
    plt.tight_layout()
    plt.show()

    # Boxplot and scatter plot for df_test if both columns exist there
    if column1 in df_test.columns and column2 in df_test.columns:
        plt.figure(figsize=(16, 6))

        # Boxplot for column1 in df_test
        plt.subplot(1, 2, 1)
        sns.boxplot(data=df_test, y=column1, color='steelblue')
        plt.title(f"Boxplot: {column1} (Test Data)")
        plt.ylabel(column1)

        # Scatter plot of column2 vs column1 in df_test
        plt.subplot(1, 2, 2)
        sns.scatterplot(data=df_test, x=column2, y=column1, color='steelblue')
        plt.title(f"Scatter Plot: {column2} vs {column1} (Test Data)")
        plt.xlabel(column2)
        plt.ylabel(column1)
        plt.grid(True)
        plt.tight_layout()
        plt.show()
    else:
        print("One or both of the selected columns do not exist in the test data.")
Assessing missing data and deciding on a strategy for handling it (e.g., imputation or dropping columns/rows).
import pandas as pd
import matplotlib.pyplot as plt
# Calculate the percentage of missing values in each column
missing_percentage = df_train.isnull().mean() * 100
# Sort the columns based on the missing percentage in descending order
sorted_missing = missing_percentage.sort_values(ascending=False)
# Plot the distribution of missing values
plt.figure(figsize=(24, 6))
sorted_missing.plot(kind='bar', color='steelblue')
plt.xlabel('Features')
plt.ylabel('Missing Percentage')
plt.title('Distribution of Missing Values')
plt.xticks(rotation=45, ha='right')
plt.show()
Grouping the data by categorical features (e.g., neighborhood, year built) and analyzing their impact on the target variable.
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm
# Group the data by a categorical feature (e.g., neighborhood)
grouped_data = df_train.groupby('Neighborhood')
# Select a numerical feature of interest
numerical_feature = 'YearBuilt'
# Create a font dictionary for bold text
font = {'fontweight': 'bold'}
# Plot the box plot for the selected feature grouped by the categorical feature
plt.figure(figsize=(24, 12))
sns.boxplot(x='Neighborhood', y=numerical_feature, data=df_train)
plt.xlabel('Neighborhood', **font)
plt.ylabel("Year Built", **font)
plt.title('Distribution of {} by Neighborhood'.format(numerical_feature), **font)
plt.xticks(rotation=45)
plt.show()
This step involves handling missing values, encoding categorical variables, and performing any necessary data transformations to prepare the dataset for model training. We will analyze the relevance and importance of the available features and perform feature selection or engineering if required to improve model performance.
Deal with missing values in the dataset. Depending on the nature and extent of missing data, you can choose to either drop missing values, fill them with appropriate values (e.g., mean, median, mode), or use more advanced techniques like k-nearest neighbors or model-based imputation.
The code below takes into account that many of the columns with missing values are categorical. It therefore uses a different approach to calculate the correlation between these categorical columns and the target column.
df_train.dtypes
Id int64
MSSubClass int64
MSZoning object
LotFrontage float64
LotArea int64
...
MoSold int64
YrSold int64
SaleType object
SaleCondition object
SalePrice int64
Length: 81, dtype: object
# Define the threshold for the percentage of missing values
missing_threshold = 0.5

# Calculate the percentage of missing values in each column
missing_percentage = df_train.isnull().mean()

# Identify columns whose missing share exceeds the threshold and whose
# correlation with SalePrice is weak
columns_to_drop = []
for column in df_train.columns:
    if missing_percentage[column] > missing_threshold:
        if df_train[column].dtype == 'object':
            # Encode the categorical column; one-hot encoding is used here
            # (the label-encoding alternative is kept for reference)
            # encoded_column = pd.factorize(df_train[column])[0]  # Label encoding
            encoded_column = pd.get_dummies(df_train[column], prefix=column)  # One-hot encoding
            # Correlation between the first dummy column and SalePrice
            correlation = encoded_column.corrwith(df_train['SalePrice']).values[0]
            # Check if the absolute correlation value is less than 0.4
            if abs(correlation) < 0.4:
                columns_to_drop.append(column)
        else:
            # Correlation between the numerical column and SalePrice
            correlation = df_train[column].corr(df_train['SalePrice'])
            # Check if the absolute correlation value is less than 0.5
            if abs(correlation) < 0.5:
                columns_to_drop.append(column)

# Dropping is deferred until the selection has been verified below
# df_train.drop(columns_to_drop, axis=1, inplace=True)

df_train.head()
| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 81 columns
We also want to identify and drop columns that do not have any missing values but have no significant impact on the target variable, SalePrice. To accomplish this, we have modified the code to calculate the correlation between each column and SalePrice. If the absolute correlation value falls below the chosen threshold (0.4 or 0.5, depending on the column type), the column has a weak correlation with SalePrice and may not provide valuable information for predicting house prices.
We consider both numerical and categorical columns in this analysis. For numerical columns, we calculate the correlation using the corr() function. For categorical columns, we apply label encoding or one-hot encoding and then calculate the correlation between the encoded column and SalePrice.
By identifying these columns and dropping them from the dataset, we can focus on the most relevant features that have a stronger impact on the target variable. This helps us streamline the dataset and improve the efficiency and accuracy of our predictive modeling process.
# Calculate the percentage of missing values in each column
missing_percentage = df_train.isnull().mean()

# Identify columns with no significant impact on SalePrice
# (columns_to_drop is intentionally not re-initialized, so these columns
# are appended to the ones identified above)
# columns_to_drop = []
for column in df_train.columns:
    if df_train[column].dtype == 'object':
        # encoded_column = pd.factorize(df_train[column])[0]  # Label encoding
        encoded_column = pd.get_dummies(df_train[column], prefix=column)  # One-hot encoding
        # Correlation between the first dummy column and SalePrice
        correlation = encoded_column.corrwith(df_train['SalePrice']).values[0]
        # Check if the absolute correlation value is less than 0.5
        if abs(correlation) < 0.5:
            columns_to_drop.append(column)
    else:
        # Correlation between the numerical column and SalePrice
        correlation = df_train[column].corr(df_train['SalePrice'])
        # Check if the absolute correlation value is less than 0.4
        if abs(correlation) < 0.4:
            columns_to_drop.append(column)

# Dropping is still deferred
# df_train.drop(columns_to_drop, axis=1, inplace=True)

# Print the columns selected for dropping
print(columns_to_drop)
df_train.head()
['Alley', 'PoolQC', 'Fence', 'MiscFeature', 'Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallCond', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', '2ndFlrSF', 'LowQualFinSF', 'BsmtFullBath', 'BsmtHalfBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType', 'SaleCondition']
| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 81 columns
len(df_train.columns) - len(columns_to_drop)
df_train.columns.tolist()
['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType', 'SaleCondition', 'SalePrice']
columns_to_keep = [x for x in df_train.columns if x not in columns_to_drop]
columns_to_keep
len(columns_to_keep)
16
import seaborn as sns

def plot_dropped_columns():
    numeric_columns_to_drop = [column for column in columns_to_drop if df_train[column].dtype != 'object']
    correlations = []
    for column in numeric_columns_to_drop:
        correlation = df_train[column].corr(df_train['SalePrice'])
        correlations.append(correlation)

    plt.figure(figsize=(20, 12))
    plt.bar(numeric_columns_to_drop, correlations)
    plt.xlabel('Columns')
    plt.ylabel('Correlation with SalePrice')
    plt.title('Correlation of Dropped Columns with SalePrice')
    plt.xticks(rotation=45)
    plt.show()

    if len(columns_to_drop) > 0:
        plt.figure(figsize=(20, 12))
        missing_data = df_train[columns_to_drop].isnull().sum()
        missing_data.plot(kind='bar')
        plt.xlabel('Columns')
        plt.ylabel('Number of Missing Values')
        plt.title('Missing Values in Dropped Columns')
        plt.xticks(rotation=90)
        plt.show()
    else:
        print("No columns to drop with missing values.")

def plot_kept_columns():
    plt.figure(figsize=(10, len(columns_to_keep) * 0.5))
    kept_columns_corr = df_train[columns_to_keep].corrwith(df_train['SalePrice']).sort_values()
    kept_columns_corr.plot(kind='barh')
    plt.xlabel('Correlation with SalePrice')
    plt.ylabel('Columns')
    plt.title('Correlation of Kept Columns with SalePrice')
    plt.show()

    plt.figure(figsize=(10, len(columns_to_keep) * 0.5))
    sns.heatmap(df_train[columns_to_keep].isnull(), cmap='viridis')
    plt.title('Missing Values in Kept Columns')
    plt.xticks(rotation=45)
    plt.show()

@interact(option=['Dropped Columns', 'Kept Columns'])
def plot_columns(option):
    if option == 'Dropped Columns':
        plot_dropped_columns()
    elif option == 'Kept Columns':
        plot_kept_columns()
Now that we have verified that the columns selected for dropping indeed deserve to be dropped, let's drop them.
df_train.drop(columns_to_drop, axis=1, inplace=True)
df_train
| OverallQual | YearBuilt | YearRemodAdd | MasVnrArea | BsmtQual | TotalBsmtSF | 1stFlrSF | GrLivArea | FullBath | KitchenQual | TotRmsAbvGrd | Fireplaces | GarageYrBlt | GarageCars | GarageArea | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7 | 2003 | 2003 | 196.0 | Gd | 856 | 856 | 1710 | 2 | Gd | 8 | 0 | 2003.0 | 2 | 548 | 208500 |
| 1 | 6 | 1976 | 1976 | 0.0 | Gd | 1262 | 1262 | 1262 | 2 | TA | 6 | 1 | 1976.0 | 2 | 460 | 181500 |
| 2 | 7 | 2001 | 2002 | 162.0 | Gd | 920 | 920 | 1786 | 2 | Gd | 6 | 1 | 2001.0 | 2 | 608 | 223500 |
| 3 | 7 | 1915 | 1970 | 0.0 | TA | 756 | 961 | 1717 | 1 | Gd | 7 | 1 | 1998.0 | 3 | 642 | 140000 |
| 4 | 8 | 2000 | 2000 | 350.0 | Gd | 1145 | 1145 | 2198 | 2 | Gd | 9 | 1 | 2000.0 | 3 | 836 | 250000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1455 | 6 | 1999 | 2000 | 0.0 | Gd | 953 | 953 | 1647 | 2 | TA | 7 | 1 | 1999.0 | 2 | 460 | 175000 |
| 1456 | 6 | 1978 | 1988 | 119.0 | Gd | 1542 | 2073 | 2073 | 2 | TA | 7 | 2 | 1978.0 | 2 | 500 | 210000 |
| 1457 | 7 | 1941 | 2006 | 0.0 | TA | 1152 | 1188 | 2340 | 2 | Gd | 9 | 2 | 1941.0 | 1 | 252 | 266500 |
| 1458 | 5 | 1950 | 1996 | 0.0 | TA | 1078 | 1078 | 1078 | 1 | Gd | 5 | 0 | 1950.0 | 1 | 240 | 142125 |
| 1459 | 5 | 1965 | 1965 | 0.0 | TA | 1256 | 1256 | 1256 | 1 | TA | 6 | 0 | 1965.0 | 1 | 276 | 147500 |
1460 rows × 16 columns
Variables in data analysis can have different scales and measurement levels. It is essential to convert them to the appropriate scale and measurement level when building machine learning models.
Here are the main scales and their importance in modeling:
Ratio Scale: Variables with a meaningful zero point and equal intervals, like age or income. These variables are used as-is in modeling.
Interval Scale: Variables with equal intervals but no meaningful zero point, such as temperature. They can be used directly but require careful interpretation.
Nominal Scale: Categorical variables without an inherent order, like gender or occupation. They need to be encoded into numerical form before modeling.
Ordinal Scale: Variables with an order or ranking but unequal intervals, like survey ratings. They also need encoding to reflect the order.
Converting variables to the right scale and measurement level is important for:
Model Performance: Ensuring accurate representation of variables improves model performance.
Statistical Assumptions: Meeting assumptions, such as linearity or normality, leads to more accurate model results.
Interpretability: Properly scaled variables enhance interpretability, enabling meaningful comparisons and accurate predictions.
Feature Engineering: Converting variables is part of feature engineering, which helps models capture relevant patterns.
By understanding scales and converting variables appropriately, machine learning models can be more accurate, interpretable, and performant.
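To make this concrete, here is a minimal standalone sketch (toy data, not the housing dataset) contrasting an ordered categorical with an unordered one in pandas:

import pandas as pd

# Ordinal: quality ratings with a meaningful order
quality = pd.Series(
    ['TA', 'Gd', 'Ex', 'Fa'],
    dtype=pd.CategoricalDtype(categories=['Fa', 'TA', 'Gd', 'Ex'], ordered=True),
)
print(quality > 'TA')     # order-aware comparisons work
print(quality.cat.codes)  # integer codes that respect the order: 1, 2, 3, 0

# Nominal: zoning labels with no inherent order, so one-hot encode instead
zoning = pd.Series(['RL', 'FV', 'RL'], dtype='category')
print(pd.get_dummies(zoning))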
import pandas as pd
# Define the correct data types for each column
correct_dtypes = {
'Id': int,
'MSSubClass': 'category',
'MSZoning': 'category',
'LotFrontage': float,
'LotArea': int,
'Street': 'category',
'Alley': 'category',
'LotShape': 'category',
'LandContour': 'category',
'Utilities': 'category',
'LotConfig': 'category',
'LandSlope': 'category',
'Neighborhood': 'category',
'Condition1': 'category',
'Condition2': 'category',
'BldgType': 'category',
'HouseStyle': 'category',
'OverallQual': pd.CategoricalDtype(categories=[1,2,3,4,5,6,7,8,9,10], ordered=True),
'OverallCond': pd.CategoricalDtype(categories=[1,2,3,4,5,6,7,8,9,10], ordered=True),
'YearBuilt': int,
'YearRemodAdd': int,
'RoofStyle': 'category',
'RoofMatl': 'category',
'Exterior1st': 'category',
'Exterior2nd': 'category',
'MasVnrType': 'category',
'MasVnrArea': float,
'ExterQual': 'category',
'ExterCond': 'category',
'Foundation': 'category',
'BsmtQual': 'category',
'BsmtCond': 'category',
'BsmtExposure': 'category',
'BsmtFinType1': 'category',
'BsmtFinSF1': int,
'BsmtFinType2': 'category',
'BsmtFinSF2': int,
'BsmtUnfSF': int,
'TotalBsmtSF': int,
'Heating': 'category',
'HeatingQC': 'category',
'CentralAir': 'category',
'Electrical': 'category',
'1stFlrSF': int,
'2ndFlrSF': int,
'LowQualFinSF': int,
'GrLivArea': int,
'BsmtFullBath': int,
'BsmtHalfBath': int,
'FullBath': int,
'HalfBath': int,
'BedroomAbvGr': int,
'KitchenAbvGr': int,
'KitchenQual': 'category',
'TotRmsAbvGrd': int,
'Functional': 'category',
'Fireplaces': int,
'FireplaceQu': 'category',
'GarageType': 'category',
'GarageYrBlt': object,
'GarageFinish': 'category',
'GarageCars': int,
'GarageArea': int,
'GarageQual': 'category',
'GarageCond': 'category',
'PavedDrive': 'category',
'WoodDeckSF': int,
'OpenPorchSF': int,
'EnclosedPorch': int,
'3SsnPorch': int,
'ScreenPorch': int,
'PoolArea': int,
'PoolQC': 'category',
'Fence': 'category',
'MiscFeature': 'category',
'MiscVal': int,
'MoSold': int,
'YrSold': int,
'SaleType': 'category',
'SaleCondition': 'category'
}
# Convert columns to their correct data types in df_train (and df_test if desired)
for column, dtype in correct_dtypes.items():
    if column in df_train.columns:
        if dtype == pd.to_datetime:
            # Currently unused: no entry in correct_dtypes maps to pd.to_datetime
            df_train[column] = pd.to_datetime(df_train[column], format='%Y')
            # df_test[column] = pd.to_datetime(df_test[column])
        else:
            df_train[column] = df_train[column].astype(dtype)
            # df_test[column] = df_test[column].astype(dtype)

# Print the updated dataframe with data types
print(df_train.dtypes)
# print(df_test.dtypes)

# Save the dtypes to a file
df_train.dtypes.to_csv("dtypes_train_clean.csv")
# df_test.dtypes.to_csv("dtypes_test_clean.csv")
Id int32
MSSubClass category
MSZoning category
LotFrontage float64
LotArea int32
...
MoSold int32
YrSold int32
SaleType category
SaleCondition category
SalePrice int64
Length: 81, dtype: object
Clean the dataset by handling outliers, incorrect or inconsistent values, and duplicate entries. This may involve techniques like removing outliers, correcting errors, and removing or merging duplicate records.
From this moment onwards, we will only be working with the subset of columns kept above. These columns have been selected based on their data types and their potential impact on the target variable, SalePrice; they are grouped into categorical, numerical, and date columns below.
Now, it would be helpful to assess the current state of the dataset and identify any specific issues that need to be addressed. Based on the column types you provided, it appears that the dataset contains a mix of numerical and categorical variables.
To determine if data cleaning is necessary, we can consider the following:
Review the data quality: Examine the columns and their values to identify any obvious inconsistencies, missing values, or outliers. Look for potential errors or inconsistencies that might impact the analysis or modeling process.
Check for missing values: Assess the presence and extent of missing data in each column. Missing values may require imputation or removal depending on the percentage of missingness and the impact on the analysis.
Evaluate outliers: Identify potential outliers in the numerical columns. Outliers can affect the distribution and statistical properties of the data and may require handling to ensure accurate analysis.
Identify duplicate entries: Check for duplicate records in the dataset. Duplicate entries can distort the analysis and modeling results, and they should be handled by either removing or merging them.
Let's start working through these steps (a quick duplicate-entry check is sketched just after the column definitions below):
categorical_columns = ['OverallQual', 'BsmtQual', 'KitchenQual']
numerical_columns = ['MasVnrArea', 'TotalBsmtSF', '1stFlrSF', 'GrLivArea', 'FullBath', 'TotRmsAbvGrd',
'Fireplaces', 'GarageCars', 'GarageArea', 'SalePrice']
date_columns = ['YearBuilt', 'YearRemodAdd','GarageYrBlt']
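The duplicate-entry check from the list above is a one-liner in pandas; a minimal sketch (Id is excluded since it is unique by construction, and errors='ignore' covers frames where it has already been dropped):

# Count fully duplicated rows, ignoring the Id column if it is present
n_duplicates = df_train.drop(columns=['Id'], errors='ignore').duplicated().sum()
print(f"Duplicate rows: {n_duplicates}")

# Any duplicates found could then be removed with:
# df_train = df_train.drop_duplicates()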
Here are some common approaches to handling missing values in a dataset:
Deletion: Remove rows or columns with missing values. This approach is suitable when the missing values are a small portion of the data and their removal does not significantly impact the analysis or modeling process. However, it can lead to loss of information if the missing values are informative.
Imputation: Fill in missing values with estimated or calculated values. Imputation can be done using various techniques:
Create an indicator variable: Create a new binary column that indicates the presence or absence of missing values in a particular feature. This allows the missingness to be captured as a separate category in the data.
Domain-specific imputation: Utilize domain knowledge or specific rules to impute missing values. For example, replacing missing values with "N/A" or "None" for categorical variables, or imputing missing dates with the average of adjacent dates.
Advanced techniques: More advanced techniques such as k-nearest neighbors imputation, probabilistic imputation, or using machine learning algorithms to predict missing values can be employed for more complex scenarios.
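As a minimal sketch of a few of these options, applied here to MasVnrArea purely for illustration (the handling this project actually uses for each column follows below):

from sklearn.impute import SimpleImputer, KNNImputer

# Median imputation for a numerical column with a few missing values
median_imputer = SimpleImputer(strategy='median')
mas_vnr_imputed = median_imputer.fit_transform(df_train[['MasVnrArea']])

# Missingness captured as a separate binary indicator feature
mas_vnr_missing = df_train['MasVnrArea'].isnull().astype(int)

# KNN imputation: estimate missing values from the most similar rows
knn_imputer = KNNImputer(n_neighbors=5)
imputed = knn_imputer.fit_transform(df_train[['MasVnrArea', 'GrLivArea']])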
# Check for missing values in each column
missing_values = df_train.isnull().sum()
print("Missing Values:")
print(missing_values)
print()
Missing Values:
Id 0
MSSubClass 0
MSZoning 0
LotFrontage 259
LotArea 0
...
MoSold 0
YrSold 0
SaleType 0
SaleCondition 0
SalePrice 0
Length: 81, dtype: int64
We identified three important columns with missing values:
- GarageYrBlt: 81 missing values.
- BsmtQual: 37 missing values.
- MasVnrArea: 8 missing values.
The "BsmtQual" column in the dataset represents the evaluation of the height of the basement in houses. The values in this column have the following meanings:
It is important to note that the missing values in the "BsmtQual" column are not really missing values in the traditional sense. Instead, they indicate that there is no basement in those houses. To retain this information and avoid confusion, you can represent these missing values as "NA" to clearly indicate the absence of a basement. This way, the reader can understand that the missing values in this column are intentional and carry a specific meaning.
Now, let's convert missing values in BsmtQual to "No Basement"
# Set categories for BsmtQual
df_train['BsmtQual'] = df_train['BsmtQual'].cat.add_categories('No Basement')
# Fill missing values in BsmtQual with "No Basement"
df_train['BsmtQual'] = df_train['BsmtQual'].fillna('No Basement')
df_train['BsmtQual'].value_counts()
TA             649
Gd             618
Ex             121
No Basement     37
Fa              35
Name: BsmtQual, dtype: int64
Now, let's address the missing values in the column GarageYrBlt. We'll impute the missing values with an appropriate strategy.
I noticed that there's another column called GarageQual (garage quality), with categories such as Ex, Gd, TA, Fa, Po, and NA (no garage).
Based on these categories, I suspect that when "GarageYrBlt" is missing, it indicates that the garage was not built at all. To confirm this hypothesis, let's examine the relationship between "GarageYrBlt" and "GarageQual" columns.
Since we have dropped the "GarageQual" column, let's reload the dataset and assign it to a different variable.
df_train_reloaded = pd.read_csv(r'C:\Users\User\Desktop\GitHub-projects\projects\Data-Dives-Projects-Unleashed\dataSets\course 1\week 1\datasets\House_Price_Prediction\train _price_house.csv')
# Filter the rows where GarageYrBlt is missing
missing_garage = df_train_reloaded[df_train_reloaded['GarageYrBlt'].isnull()]

# Inspect GarageQual alongside GarageYrBlt for those rows
garage_qual_table = missing_garage[['GarageYrBlt', 'GarageQual']]
garage_qual_table
| GarageYrBlt | GarageQual | |
|---|---|---|
| 39 | NaN | NaN |
| 48 | NaN | NaN |
| 78 | NaN | NaN |
| 88 | NaN | NaN |
| 89 | NaN | NaN |
| ... | ... | ... |
| 1349 | NaN | NaN |
| 1407 | NaN | NaN |
| 1449 | NaN | NaN |
| 1450 | NaN | NaN |
| 1453 | NaN | NaN |
81 rows × 2 columns
As shown above, the hypothesis holds: every row with a missing "GarageYrBlt" also has a missing "GarageQual", i.e., the house has no garage. We can therefore impute the missing values by setting the garage year built to the same year the house was built, on the assumption that a garage, had one existed, would not have been built in a different year from the house.
This approach may not be optimal, however, so I will create two separate data sets and compare their performance to determine which is better suited for our analysis.
Let's examine the current state of the df_train data set. Then, I will begin creating different versions of this data set to explore various cleaning and feature engineering techniques. By comparing these versions, we can determine which approaches yield the best results in terms of data quality and feature representation.
df_train
| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1455 | 1456 | 60 | RL | 62.0 | 7917 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 8 | 2007 | WD | Normal | 175000 |
| 1456 | 1457 | 20 | RL | 85.0 | 13175 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | MnPrv | NaN | 0 | 2 | 2010 | WD | Normal | 210000 |
| 1457 | 1458 | 70 | RL | 66.0 | 9042 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | GdPrv | Shed | 2500 | 5 | 2010 | WD | Normal | 266500 |
| 1458 | 1459 | 20 | RL | 68.0 | 9717 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2010 | WD | Normal | 142125 |
| 1459 | 1460 | 20 | RL | 75.0 | 9937 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 6 | 2008 | WD | Normal | 147500 |
1460 rows × 81 columns
df_train_garage_impute: Impute missing values in "GarageYrBlt" (which the "GarageQual" check showed correspond to houses without a garage) by setting the garage year built to the same year the house was built.
df_train_garage_impute = df_train.copy()
df_train_garage_impute
| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1455 | 1456 | 60 | RL | 62.0 | 7917 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 8 | 2007 | WD | Normal | 175000 |
| 1456 | 1457 | 20 | RL | 85.0 | 13175 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | MnPrv | NaN | 0 | 2 | 2010 | WD | Normal | 210000 |
| 1457 | 1458 | 70 | RL | 66.0 | 9042 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | GdPrv | Shed | 2500 | 5 | 2010 | WD | Normal | 266500 |
| 1458 | 1459 | 20 | RL | 68.0 | 9717 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2010 | WD | Normal | 142125 |
| 1459 | 1460 | 20 | RL | 75.0 | 9937 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 6 | 2008 | WD | Normal | 147500 |
1460 rows × 81 columns
# Impute missing values in GarageYrBlt based on YearBuilt
df_train_garage_impute.loc[df_train_garage_impute['GarageYrBlt'].isna(), 'GarageYrBlt'] = df_train_garage_impute['YearBuilt']
# Verify the imputation
missing_garage_yrblt = df_train_garage_impute['GarageYrBlt'].isna().sum()
print(f"Missing values in GarageYrBlt after imputation: {missing_garage_yrblt}")
Missing values in GarageYrBlt after imputation: 0
df_train_garage_impute
| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1455 | 1456 | 60 | RL | 62.0 | 7917 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 8 | 2007 | WD | Normal | 175000 |
| 1456 | 1457 | 20 | RL | 85.0 | 13175 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | MnPrv | NaN | 0 | 2 | 2010 | WD | Normal | 210000 |
| 1457 | 1458 | 70 | RL | 66.0 | 9042 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | GdPrv | Shed | 2500 | 5 | 2010 | WD | Normal | 266500 |
| 1458 | 1459 | 20 | RL | 68.0 | 9717 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2010 | WD | Normal | 142125 |
| 1459 | 1460 | 20 | RL | 75.0 | 9937 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 6 | 2008 | WD | Normal | 147500 |
1460 rows × 81 columns
df_train_garage_unknown: Replace missing values in "GarageYrBlt" with "Unknown" to indicate no available information.
df_train_garage_unknown = df_train.copy()
df_train_garage_unknown
| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1455 | 1456 | 60 | RL | 62.0 | 7917 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 8 | 2007 | WD | Normal | 175000 |
| 1456 | 1457 | 20 | RL | 85.0 | 13175 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | MnPrv | NaN | 0 | 2 | 2010 | WD | Normal | 210000 |
| 1457 | 1458 | 70 | RL | 66.0 | 9042 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | GdPrv | Shed | 2500 | 5 | 2010 | WD | Normal | 266500 |
| 1458 | 1459 | 20 | RL | 68.0 | 9717 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2010 | WD | Normal | 142125 |
| 1459 | 1460 | 20 | RL | 75.0 | 9937 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 6 | 2008 | WD | Normal | 147500 |
1460 rows × 81 columns
# Replace missing values in GarageYrBlt with "Unknown"
df_train_garage_unknown['GarageYrBlt'] = df_train_garage_unknown['GarageYrBlt'].fillna('Unknown')
# Verify the replacement
missing_garage_yrblt = df_train_garage_unknown['GarageYrBlt'].isna().sum()
print(f"Missing values in GarageYrBlt after replacement: {missing_garage_yrblt}")
Missing values in GarageYrBlt after replacement: 0
df_train_garage_unknown
| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1455 | 1456 | 60 | RL | 62.0 | 7917 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 8 | 2007 | WD | Normal | 175000 |
| 1456 | 1457 | 20 | RL | 85.0 | 13175 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | MnPrv | NaN | 0 | 2 | 2010 | WD | Normal | 210000 |
| 1457 | 1458 | 70 | RL | 66.0 | 9042 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | GdPrv | Shed | 2500 | 5 | 2010 | WD | Normal | 266500 |
| 1458 | 1459 | 20 | RL | 68.0 | 9717 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2010 | WD | Normal | 142125 |
| 1459 | 1460 | 20 | RL | 75.0 | 9937 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 6 | 2008 | WD | Normal | 147500 |
1460 rows × 81 columns
Now, let's address the missing values in the MasVnrArea column. Since only 8 rows are affected, we'll simply drop them.
# Drop the rows with missing values in MasVnrArea column
df_train = df_train.dropna(subset=['MasVnrArea'])
df_train_garage_impute = df_train_garage_impute.dropna(subset=['MasVnrArea'])
df_train_garage_unknown = df_train_garage_unknown.dropna(subset=['MasVnrArea'])
# Check for inconsistencies or errors in categorical columns
for column in categorical_columns:
    unique_values = df_train[column].unique()
    print(f"Column: {column}")
    print(f"  Unique values: {unique_values}")
    print()
Column: OverallQual
  Unique values: [7, 6, 8, 5, 9, 4, 10, 3, 1, 2]
  Categories (10, int64): [1 < 2 < 3 < 4 ... 7 < 8 < 9 < 10]

Column: BsmtQual
  Unique values: ['Gd', 'TA', 'Ex', 'No Basement', 'Fa']
  Categories (5, object): ['Ex', 'Fa', 'Gd', 'TA', 'No Basement']

Column: KitchenQual
  Unique values: ['Gd', 'TA', 'Ex', 'Fa']
  Categories (4, object): ['Ex', 'Fa', 'Gd', 'TA']
It's important to note that normalization or standardization scaling does not directly handle outliers. These techniques aim to scale the data without significantly affecting the distribution or altering the extreme values. Outliers can still have an impact on the model's performance, especially if the algorithm is sensitive to them.
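A small standalone sketch (toy data, not the housing dataset) makes this concrete: after scaling, the outlier is expressed on a new scale but remains just as extreme relative to the rest of the data:

import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

data = np.array([[10.0], [12.0], [11.0], [13.0], [500.0]])  # one extreme value

# Standard scaling: zero mean, unit variance; the outlier still dominates
print(StandardScaler().fit_transform(data).ravel())

# Robust scaling: median/IQR based; the bulk stays tight, the outlier stays extreme
print(RobustScaler().fit_transform(data).ravel())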
# Identify potential outliers in numerical columns
for column in numerical_columns:
    plt.figure(figsize=(10, 6))
    sns.boxplot(data=df_train[column])
    plt.title(f"Boxplot of {column}")
    plt.show()
As you saw, we have outliers. When dealing with outliers in numerical columns, there are several approaches you can consider (a short binning sketch follows the list):
Outlier Removal: One straightforward approach is to remove the outliers from the dataset. However, this should be done with caution, as it can result in loss of information and potential bias in the remaining data.
Robust Scaling: Instead of removing outliers, you can use robust scaling techniques that are less sensitive to extreme values. For example, you can use the RobustScaler from scikit-learn, which scales the data based on robust statistics like the median and interquartile range.
Binning: Another approach is to bin the numerical values into discrete categories. This can help handle outliers by grouping them with nearby values. However, this approach may result in information loss and may not be suitable for all types of data.
Machine Learning Models: Some machine learning algorithms, such as tree-based models (e.g., Random Forests, Gradient Boosting), are inherently robust to outliers. By using these models, you may not need to explicitly handle outliers, as the model can adapt and learn from the data effectively.
Domain Knowledge: It is crucial to leverage domain knowledge and consult with subject matter experts. They can provide valuable insights on the data and guide you in making informed decisions about how to handle outliers.
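Of these approaches, binning is the only one not demonstrated later in this section, so here is a minimal sketch using pd.cut (the bin count of 5 is arbitrary):

# Bin GrLivArea into five equal-width intervals; extreme values simply
# fall into the outermost bins instead of stretching the scale
gr_liv_binned = pd.cut(df_train['GrLivArea'], bins=5, labels=False)
print(gr_liv_binned.value_counts().sort_index())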
To gauge how much outlier handling matters, I will build two additional versions of df_train with outliers handled differently, and later compare model results across them:
Version 1 (Using Isolation Forest and RobustScaler):
Note: When using RobustScaler, you don't necessarily need to apply additional normalization or standardization to your data. RobustScaler already performs a type of normalization by subtracting the median and dividing by the interquartile range (IQR), which makes it robust to outliers.
With this approach, the dataset must be free of missing values, because Isolation Forest and RobustScaler cannot handle them.
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import RobustScaler

# Identify and remove outliers using Isolation Forest
outlier_detector = IsolationForest(contamination='auto')
outliers = outlier_detector.fit_predict(df_train_garage_impute[numerical_columns])
df_train_outliers_1 = df_train_garage_impute[outliers == 1].copy()  # Keep only non-outlier instances

# Scale the numerical columns using RobustScaler
scaler = RobustScaler()
df_train_outliers_1[numerical_columns] = scaler.fit_transform(df_train_outliers_1[numerical_columns])
df_train_outliers_1
df_train_outliers_1
| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 0.706134 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 0.320970 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 0.920114 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | -0.271041 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 1.298146 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1454 | 1455 | 20 | FV | 62.0 | 7500 | Pave | Pave | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 10 | 2009 | WD | Normal | 0.370899 |
| 1455 | 1456 | 60 | RL | 62.0 | 7917 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 8 | 2007 | WD | Normal | 0.228245 |
| 1456 | 1457 | 20 | RL | 85.0 | 13175 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | MnPrv | NaN | 0 | 2 | 2010 | WD | Normal | 0.727532 |
| 1458 | 1459 | 20 | RL | 68.0 | 9717 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2010 | WD | Normal | -0.240728 |
| 1459 | 1460 | 20 | RL | 75.0 | 9937 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 6 | 2008 | WD | Normal | -0.164051 |
1256 rows × 81 columns
Version 2 (Using Winsorization):
Note: It can be a good idea to apply normalization or standardization to df_train_outliers_2 afterwards.
Unlike the previous approach, Winsorization can tolerate missing values.
from scipy.stats import mstats

def handle_outliers(data, column):
    # Clip the column at the 5th and 95th percentiles using Winsorization
    data[column] = mstats.winsorize(data[column], limits=[0.05, 0.05])

df_train_outliers_2 = df_train_garage_impute.copy()  # Create a copy of the original data
for column in numerical_columns:
    handle_outliers(df_train_outliers_2, column)
df_train_outliers_2
| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1455 | 1456 | 60 | RL | 62.0 | 7917 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 8 | 2007 | WD | Normal | 175000 |
| 1456 | 1457 | 20 | RL | 85.0 | 13175 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | MnPrv | NaN | 0 | 2 | 2010 | WD | Normal | 210000 |
| 1457 | 1458 | 70 | RL | 66.0 | 9042 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | GdPrv | Shed | 2500 | 5 | 2010 | WD | Normal | 266500 |
| 1458 | 1459 | 20 | RL | 68.0 | 9717 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2010 | WD | Normal | 142125 |
| 1459 | 1460 | 20 | RL | 75.0 | 9937 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 6 | 2008 | WD | Normal | 147500 |
1452 rows × 81 columns
# Check for inconsistencies or errors in date columns
for column in date_columns:
    unique_dates = sorted(df_train[column].unique().tolist(), reverse=True)
    print(f"Column: {column}")
    print(f"Unique dates: {unique_dates}")
    print()
Column: YearBuilt
Unique dates: [2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999, 1998, 1997, 1996, 1995, 1994, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983, 1982, 1981, 1980, 1979, 1978, 1977, 1976, 1975, 1974, 1973, 1972, 1971, 1970, 1969, 1968, 1967, 1966, 1965, 1964, 1963, 1962, 1961, 1960, 1959, 1958, 1957, 1956, 1955, 1954, 1953, 1952, 1951, 1950, 1949, 1948, 1947, 1946, 1945, 1942, 1941, 1940, 1939, 1938, 1937, 1936, 1935, 1934, 1932, 1931, 1930, 1929, 1928, 1927, 1926, 1925, 1924, 1923, 1922, 1921, 1920, 1919, 1918, 1917, 1916, 1915, 1914, 1913, 1912, 1911, 1910, 1908, 1906, 1905, 1904, 1900, 1898, 1893, 1892, 1890, 1885, 1882, 1880, 1875, 1872]

Column: YearRemodAdd
Unique dates: [2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999, 1998, 1997, 1996, 1995, 1994, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983, 1982, 1981, 1980, 1979, 1978, 1977, 1976, 1975, 1974, 1973, 1972, 1971, 1970, 1969, 1968, 1967, 1966, 1965, 1964, 1963, 1962, 1961, 1960, 1959, 1958, 1957, 1956, 1955, 1954, 1953, 1952, 1951, 1950]

Column: GarageYrBlt
Unique dates: [2010.0, 2008.0, 2007.0, 2006.0, 2005.0, 2004.0, 2003.0, 2002.0, 2001.0, 2000.0, 1998.0, 1991.0, nan, 2009.0, 1999.0, 1997.0, 1996.0, 1995.0, 1994.0, 1993.0, 1992.0, 1990.0, 1989.0, 1988.0, 1987.0, 1986.0, 1985.0, 1984.0, 1983.0, 1982.0, 1981.0, 1980.0, 1979.0, 1978.0, 1977.0, 1976.0, 1975.0, 1974.0, 1973.0, 1972.0, 1971.0, 1970.0, 1969.0, 1968.0, 1967.0, 1966.0, 1965.0, 1964.0, 1963.0, 1962.0, 1961.0, 1960.0, 1959.0, 1958.0, 1957.0, 1956.0, 1955.0, 1954.0, 1953.0, 1952.0, 1951.0, 1950.0, 1949.0, 1948.0, 1947.0, 1946.0, 1945.0, 1942.0, 1941.0, 1940.0, 1939.0, 1938.0, 1937.0, 1936.0, 1935.0, 1934.0, 1933.0, 1932.0, 1931.0, 1930.0, 1929.0, 1928.0, 1927.0, 1926.0, 1925.0, 1924.0, 1923.0, 1922.0, 1921.0, 1920.0, 1918.0, 1916.0, 1915.0, 1914.0, 1910.0, 1908.0, 1906.0, 1900.0]
# Check for missing values in each date column
missing_dates = df_train_garage_unknown[date_columns].isnull().sum()
print("Missing Dates:")
print(missing_dates)
print()
Missing Dates:
YearBuilt       0
YearRemodAdd    0
GarageYrBlt     0
dtype: int64
from datetime import datetime

# Check whether any values in the date columns are stored as datetime objects
# rather than plain year numbers
for column in date_columns:
    invalid_dates = df_train[column].apply(lambda x: isinstance(x, datetime)).sum()
    print(f"Column: {column}")
    print(f"Invalid dates: {invalid_dates}")
    print()
Checking the data type of the date columns.
df_train[date_columns].dtypes
YearBuilt        int32
YearRemodAdd     int32
GarageYrBlt     object
dtype: object
We have identified a problem: df_train is the only data frame that still contains missing values in the GarageYrBlt column, which prevents converting it to an integer data type. The imputed and outlier-handled frames can be converted; df_train_garage_unknown now mixes years with the string "Unknown", so it stays as an object column.
df_train_garage_impute['GarageYrBlt'] = df_train_garage_impute['GarageYrBlt'].astype(int)
# df_train_garage_unknown['GarageYrBlt'] contains the string "Unknown", so it remains object
df_train_outliers_1['GarageYrBlt'] = df_train_outliers_1['GarageYrBlt'].astype(int)
df_train_outliers_2['GarageYrBlt'] = df_train_outliers_2['GarageYrBlt'].astype(int)
After data cleaning, we have created five different data frames for various purposes:
df_train: This is the main data frame that has undergone the basic cleaning steps above, such as handling missing values and converting data types. It will be used to build and train machine learning models.
df_train_outliers_1: This data frame is derived from df_train and specifically focuses on handling outliers using the Isolation Forest algorithm. The purpose of this data frame is to evaluate the impact of outlier removal on model performance.
df_train_outliers_2: This data frame is also derived from df_train and utilizes Winsorization to handle outliers. Its purpose is to compare the performance of models trained on data with Winsorized outliers against models trained on other data frames.
df_train_garage_impute: This data frame is created to handle missing values in the "GarageYrBlt" column by imputing the missing values based on the "GarageQual" column. The goal is to evaluate the impact of different strategies for handling missing values on model performance.
df_train_garage_unknown: This data frame is a modified version of df_train where missing values in the "GarageYrBlt" column are replaced with the value "Unknown" to indicate that no information is available. It serves to examine the impact of representing missing values differently on model performance.
Each of these data frames serves a specific purpose in evaluating different data cleaning techniques and their impact on machine learning model building. They will be used to train and evaluate various models to identify the best approach for our specific problem.
Encode categorical features into numerical representations that machine learning algorithms can process. This can be done using techniques like one-hot encoding or label encoding
Next, we convert the categorical variables into a numerical format, since most machine learning algorithms only accept numerical input.
Several techniques are available for this: creating a binary column per category (one-hot encoding), assigning an arbitrary integer to each category (label encoding), or assigning integers that respect a meaningful category order (ordinal encoding).
print(f"The categorical columns are: {categorical_columns}")
The categorical columns are: ['OverallQual', 'BsmtQual', 'KitchenQual']
To make sure that all the data frames have the same categorical columns, we collect them for each frame:
def get_categorical_columns(dataframes):
    categorical_columns_list = []
    for data in dataframes:
        categorical_columns = data.select_dtypes(include=[object, 'category']).columns
        categorical_columns_list.append(categorical_columns)
    return categorical_columns_list
dataframes = [df_train, df_train_garage_impute, df_train_garage_unknown, df_train_outliers_1, df_train_outliers_2]
categorical_columns_list = get_categorical_columns(dataframes)
print(f"The categorical columns in all the data frames are:")
categorical_columns_list
The categorical columns in all the data frames are identical; the same Index is returned for all five, so only one is shown:
Index(['MSSubClass', 'MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour',
       'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1',
       'Condition2', 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
       'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC',
       'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu',
       'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageQual', 'GarageCond',
       'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature', 'SaleType',
       'SaleCondition'],
      dtype='object')
df_train[categorical_columns[2]]
0 Gd
1 TA
2 Gd
3 Gd
4 Gd
..
1455 TA
1456 TA
1457 Gd
1458 Gd
1459 TA
Name: KitchenQual, Length: 1452, dtype: category
Categories (4, object): ['Ex', 'Fa', 'Gd', 'TA']
There are several encoding techniques available for handling categorical variables. Here are some commonly used encoding techniques:
One-Hot Encoding: This technique converts each category in a categorical variable into a new binary column, indicating the presence or absence of that category (a short sketch follows this list).
Label Encoding: Label encoding assigns a unique numerical label to each category in the variable. Because the labels are assigned arbitrarily (typically alphabetically), it is best suited to variables where the numeric order carries no meaning, or to tree-based models that are insensitive to it.
Ordinal Encoding: Similar to label encoding, ordinal encoding assigns numerical labels to each category, but the labels follow an explicitly specified order of the categories. It is the appropriate choice for ordinal variables, where the order is meaningful.
Binary Encoding: Binary encoding converts each category into binary code, representing the category's position. It reduces the number of columns compared to one-hot encoding.
Count Encoding: Count encoding replaces each category with the count of occurrences of that category in the dataset. It can be useful when the frequency of categories is informative.
Target Encoding: Target encoding replaces each category with the mean target value of that category. It works for both regression and classification tasks, but it can lead to overfitting and target leakage if not applied carefully (e.g., without cross-validation).
Frequency Encoding: Frequency encoding replaces each category with the frequency of that category in the dataset. It can be useful when the frequency of categories is informative.
Hashing Encoding: Hashing encoding uses a hash function to convert categories into a fixed number of bins. It can be useful when dealing with high-cardinality categorical variables.
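Below we use label and ordinal encoding for the quality columns, since their categories have a natural order. For nominal columns, one-hot encoding would be the usual choice; a minimal sketch with pandas (the column choice here is purely illustrative):
import pandas as pd

# Hedged sketch: one-hot encode a nominal column with pd.get_dummies.
# 'Neighborhood' is used only as an example of a column with no natural
# order; drop_first=True avoids a redundant (collinear) column.
dummies = pd.get_dummies(df_train['Neighborhood'], prefix='Neighborhood', drop_first=True)
df_train_ohe = pd.concat([df_train.drop(columns=['Neighborhood']), dummies], axis=1)
print(df_train_ohe.filter(like='Neighborhood').columns[:5])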
from sklearn.preprocessing import LabelEncoder

def label_encode_dataframes(dataframes, column_name):
    # Create an instance of LabelEncoder
    # (note: scikit-learn intends LabelEncoder for target labels; for
    # features, OrdinalEncoder is the recommended equivalent)
    label_encoder = LabelEncoder()
    for df in dataframes:
        # Fit and transform the specified column; re-fitting per data frame
        # means the integer mapping can differ between frames whose category
        # sets differ (as happens for df_train_outliers_1 below)
        df[column_name] = label_encoder.fit_transform(df[column_name])
    return dataframes
# Define the column to be label encoded
column_name = 'OverallQual'
# Apply label encoding to the original dataframes
encoded_dataframes = label_encode_dataframes(dataframes, column_name)
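As an aside, the same kind of integer codes can be produced with plain pandas, which makes it easy to pin one mapping across all frames; a hedged sketch with a hypothetical column `some_col` (illustration only, not applied in this notebook):
import pandas as pd

# Hedged sketch: fix one category list up front so every data frame
# gets the identical code for each category ("some_col" is hypothetical).
shared_categories = sorted(dataframes[0]['some_col'].dropna().unique())
for df in dataframes:
    df['some_col'] = pd.Categorical(df['some_col'], categories=shared_categories).codes
    # categories missing from shared_categories would become -1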
from sklearn.preprocessing import OrdinalEncoder

# Define the order of the categories
categories = [['No Basement', 'Po', 'Fa', 'TA', 'Gd', 'Ex']]
# Create an instance of the OrdinalEncoder
ordinal_encoder = OrdinalEncoder(categories=categories)
# Perform ordinal encoding on the 'BsmtQual' column
for df in encoded_dataframes:
    df['BsmtQual'] = ordinal_encoder.fit_transform(df[['BsmtQual']])
df_train["BsmtQual"]
0 4.0
1 4.0
2 4.0
3 3.0
4 4.0
...
1455 4.0
1456 4.0
1457 3.0
1458 3.0
1459 3.0
Name: BsmtQual, Length: 1452, dtype: float64
df_train_garage_impute["BsmtQual"].value_counts()
3.0    648
4.0    612
5.0    120
0.0     37
2.0     35
Name: BsmtQual, dtype: int64
from sklearn.preprocessing import OrdinalEncoder

# Define the order of the categories for KitchenQual
kitchen_qual_categories = [['Po', 'Fa', 'TA', 'Gd', 'Ex']]
# Create an instance of the OrdinalEncoder for KitchenQual
kitchen_qual_encoder = OrdinalEncoder(categories=kitchen_qual_categories)
# Apply ordinal encoding to the 'KitchenQual' column in each data frame
for df in encoded_dataframes:
    df['KitchenQual'] = kitchen_qual_encoder.fit_transform(df[['KitchenQual']])
# Define the list of data frame names
dataframe_names = ['df_train', 'df_train_garage_impute', 'df_train_garage_unknown', 'df_train_outliers_1', 'df_train_outliers_2']
# Iterate through each data frame and evaluate the encoded features
for df, df_name in zip(encoded_dataframes, dataframe_names):
    print(f"Data Frame: {df_name}")
    print("------------------------" * 3)
    # Print the unique values and their counts for each encoded feature
    for column in categorical_columns:
        print(f"Encoded Feature: {column}")
        print(df[column].value_counts())
        print()
    print("------------------------")
Data Frame: df_train
------------------------------------------------------------------------
Encoded Feature: OverallQual
4    397
5    372
6    315
7    167
3    116
8     43
2     20
9     17
1      3
0      2
Name: OverallQual, dtype: int64

Encoded Feature: BsmtQual
3.0    648
4.0    612
5.0    120
0.0     37
2.0     35
Name: BsmtQual, dtype: int64

Encoded Feature: KitchenQual
2.0    734
3.0    580
4.0     99
1.0     39
Name: KitchenQual, dtype: int64
------------------------
Data Frame: df_train_outliers_1
------------------------------------------------------------------------
Encoded Feature: OverallQual
2    364
3    351
4    288
5    123
1     95
6     17
0     16
7      2
Name: OverallQual, dtype: int64

Encoded Feature: BsmtQual
3.0    579
4.0    554
5.0     65
2.0     34
0.0     24
Name: BsmtQual, dtype: int64

Encoded Feature: KitchenQual
2.0    663
3.0    517
4.0     50
1.0     26
Name: KitchenQual, dtype: int64
------------------------
(df_train_garage_impute, df_train_garage_unknown, and df_train_outliers_2 print the same counts as df_train, so their output is omitted here.)
Scale numerical features to a similar range to avoid biasing the model. Common techniques include standardization (z-score scaling) and normalization (min-max scaling).
numerical_columns
len(numerical_columns)
10
df_train_garage_impute.select_dtypes(include=['int64', 'float64', 'int32']).columns
Index(['Id', 'LotFrontage', 'LotArea', 'OverallQual', 'YearBuilt',
'YearRemodAdd', 'MasVnrArea', 'BsmtQual', 'BsmtFinSF1', 'BsmtFinSF2',
'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd',
'Fireplaces', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF',
'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal',
'MoSold', 'YrSold', 'SalePrice'],
dtype='object')
Standardization (z-score scaling): Rescale the features to have zero mean and unit variance. This is a good default when the distribution of the features is approximately Gaussian; because it does not force the data into a fixed interval, it is less distorted by extreme values than min-max scaling. A short sketch follows.
Normalization (min-max scaling): Rescale the features to a specific range, typically between 0 and 1. This suits features whose distribution is not necessarily Gaussian and models for which a bounded range matters, though the scale is then determined by the minimum and maximum, and hence by any outliers.
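We apply min-max scaling below; for completeness, a minimal sketch of the standardization alternative (not used in this notebook):
from sklearn.preprocessing import StandardScaler

# Hedged sketch: z-score standardization as an alternative to the
# min-max scaling actually applied below.
std_scaler = StandardScaler()
scaled = std_scaler.fit_transform(df_train[numerical_columns])
# 'scaled' now has (approximately) zero mean and unit variance per column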
from sklearn.preprocessing import MinMaxScaler

# Create an instance of MinMaxScaler
scaler = MinMaxScaler()
# Loop through each dataframe and apply normalization to numerical columns
for df in encoded_dataframes:
    # Apply normalization to the numerical columns
    # (note: fitting on the full frame before the later train/test split
    # leaks the test range into the scaler; fitting on the training split
    # only would be the stricter approach)
    df[numerical_columns] = scaler.fit_transform(df[numerical_columns])
Create new features from existing ones that may enhance the predictive power of the model. This can involve techniques like creating interaction terms, polynomial features, or deriving new features based on domain knowledge.
When you already have columns that are correlated with the target variable, feature engineering can further enhance the predictive power of the model.
Here are some ideas for feature engineering in such scenarios:
Interaction Features: Create interaction features by performing mathematical operations (such as multiplication, division, or addition) on correlated columns. This can capture complex relationships between variables and provide additional information to the model.
Polynomial Features: Generate polynomial features by raising correlated columns to different powers. This can help capture non-linear relationships between variables.
Logarithmic Transformations: Apply logarithmic transformations to correlated columns to compress the range of values and make the distribution more symmetrical. This can help handle skewness in the data and improve model performance.
Binning and Bucketing: Group values of correlated columns into bins or buckets based on specific ranges. This can help capture non-linear relationships and reduce the impact of outliers.
Time-based Features: If you have time-related data, extract features such as day of the week, month, or year from date columns. These features can capture seasonal patterns or trends that might be relevant for the model.
Domain-specific Features: Incorporate domain knowledge to create new features. For example, if you have sales data, you could calculate metrics like sales growth rate or average sales per customer.
Based on the given dataset df_train, here are some feature engineering techniques that could make sense (a short sketch of several of these follows the list):
Interaction Features: Create interaction features by multiplying or dividing correlated numerical columns. For example, you can create a new feature by multiplying OverallQual with GrLivArea, indicating the overall quality of the house in relation to its living area.
Polynomial Features: Generate polynomial features by raising correlated numerical columns to different powers. For example, you can create squared or cubed features of columns like TotalBsmtSF to capture potential non-linear relationships.
Logarithmic Transformations: Apply logarithmic transformations to skewed numerical columns. For instance, you can take the logarithm of the SalePrice column to compress its range and make it more symmetrical.
Binning and Bucketing: Group values of numerical columns into bins or buckets based on specific ranges. This can be useful for variables like YearBuilt, where you can create bins for different decades or time periods.
Time-based Features: If there are date-related columns like YearBuilt or YearRemodAdd, extract features such as the age of the house or the time since the last remodel. These features can capture the impact of time on the target variable.
Feature Interactions: Create interaction features between categorical variables. For example, you can combine Neighborhood and Condition1 columns to create a new feature representing the combined neighborhood and condition of the property.
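A minimal sketch of the first few ideas, assuming the encoded data frames from above (the new column names are illustrative):
import numpy as np

# Hedged sketch of a few of the ideas above; the new column names are
# illustrative. Note: the numerical columns were min-max scaled earlier,
# so in practice these features are best derived before scaling.
for df in encoded_dataframes:
    df['Qual_x_LivArea'] = df['OverallQual'] * df['GrLivArea']   # interaction
    df['TotalBsmtSF_sq'] = df['TotalBsmtSF'] ** 2                # polynomial
    df['LogLotArea'] = np.log1p(df['LotArea'])                   # log transform
    df['HouseAge'] = df['YrSold'] - df['YearBuilt']              # time-based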
The provided code allows you to create scatter plots between two different columns in the df_train dataset. This feature is useful for exploring interaction effects between variables, which can be important in capturing complex relationships and improving the predictive power of a machine learning model.
import matplotlib.pyplot as plt
import seaborn as sns
from ipywidgets import interact, Dropdown

# Create a scatter plot function
def scatter_plot(column1, column2):
    plt.figure(figsize=(8, 6))
    sns.scatterplot(x=column1, y=column2, data=df_train)
    plt.xlabel(column1)
    plt.ylabel(column2)
    plt.title("Scatter Plot: {} vs {}".format(column1, column2))
    plt.show()

# Get the list of column names from df_train (the frame being plotted)
column_names = df_train.columns.tolist()

# Create dropdown widgets for column selection
dropdown_column1 = Dropdown(options=column_names, description="Column 1:")
dropdown_column2 = Dropdown(options=column_names, description="Column 2:")

# Define the interaction function
@interact(column1=dropdown_column1, column2=dropdown_column2)
def plot_scatter(column1, column2):
    scatter_plot(column1, column2)
Note: These images showcase the interactive plot's visual representation. I haven't included the actual interactive plot itself at this time, as I'm exploring options for hosting it on a server. Hosting all the plots across my various projects could potentially incur significant costs, which I'm currently considering.
from sklearn.preprocessing import OrdinalEncoder

# Define the bin ranges
bins = [1870, 1900, 1920, 1940, 1960, 1980, 2000, 2020]
# Create the corresponding labels for each bin
labels = ['1870s-1890s', '1900s-1910s', '1920s-1930s', '1940s-1950s', '1960s-1970s', '1980s-1990s', '2000s-2020s']
# Define the order of the categories
category_order = ['1870s-1890s', '1900s-1910s', '1920s-1930s', '1940s-1950s', '1960s-1970s', '1980s-1990s', '2000s-2020s']
# List of dataframes
dataframes = [df_train, df_train_garage_impute, df_train_garage_unknown, df_train_outliers_1, df_train_outliers_2]

# Iterate over each dataframe
for df in dataframes:
    # Apply binning to each frame's own YearBuilt column (binning
    # df_train_garage_impute's column here would give every frame
    # the same bins regardless of its own data)
    df['YearBuilt_Binned'] = pd.cut(df['YearBuilt'], bins=bins, labels=labels, right=False)
    # Set the categories for the categorical column
    df['YearBuilt_Binned'] = df['YearBuilt_Binned'].cat.set_categories(category_order)
    # With right=False these bins cover every year in the data (1872-2010),
    # so no missing bins are expected and no fill value is needed
    # Create an instance of OrdinalEncoder with the specified category order
    encoder = OrdinalEncoder(categories=[category_order])
    # Fit-transform the column to perform ordinal encoding
    df['YearBuilt_Binned_encoded'] = encoder.fit_transform(df[['YearBuilt_Binned']])
    # Print the modified dataframe
    print(df)
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape \
0 1 60 RL 65.0 8450 Pave NaN Reg
1 2 20 RL 80.0 9600 Pave NaN Reg
2 3 60 RL 68.0 11250 Pave NaN IR1
3 4 70 RL 60.0 9550 Pave NaN IR1
4 5 60 RL 84.0 14260 Pave NaN IR1
... ... ... ... ... ... ... ... ...
1455 1456 60 RL 62.0 7917 Pave NaN Reg
1456 1457 20 RL 85.0 13175 Pave NaN Reg
1457 1458 70 RL 66.0 9042 Pave NaN Reg
1458 1459 20 RL 68.0 9717 Pave NaN Reg
1459 1460 20 RL 75.0 9937 Pave NaN Reg
LandContour Utilities ... Fence MiscFeature MiscVal MoSold YrSold \
0 Lvl AllPub ... NaN NaN 0 2 2008
1 Lvl AllPub ... NaN NaN 0 5 2007
2 Lvl AllPub ... NaN NaN 0 9 2008
3 Lvl AllPub ... NaN NaN 0 2 2006
4 Lvl AllPub ... NaN NaN 0 12 2008
... ... ... ... ... ... ... ... ...
1455 Lvl AllPub ... NaN NaN 0 8 2007
1456 Lvl AllPub ... MnPrv NaN 0 2 2010
1457 Lvl AllPub ... GdPrv Shed 2500 5 2010
1458 Lvl AllPub ... NaN NaN 0 4 2010
1459 Lvl AllPub ... NaN NaN 0 6 2008
SaleType SaleCondition SalePrice YearBuilt_Binned \
0 WD Normal 0.241078 2000s-2020s
1 WD Normal 0.203583 1960s-1970s
2 WD Normal 0.261908 2000s-2020s
3 WD Abnorml 0.145952 1900s-1910s
4 WD Normal 0.298709 2000s-2020s
... ... ... ... ...
1455 WD Normal 0.194556 1980s-1990s
1456 WD Normal 0.243161 1960s-1970s
1457 WD Normal 0.321622 1940s-1950s
1458 WD Normal 0.148903 1940s-1950s
1459 WD Normal 0.156367 1960s-1970s
YearBuilt_Binned_encoded
0 6.0
1 4.0
2 6.0
3 1.0
4 6.0
... ...
1455 5.0
1456 4.0
1457 3.0
1458 3.0
1459 4.0
[1452 rows x 83 columns]
(df_train_garage_impute and df_train_garage_unknown print the same output as df_train above, so it is not repeated here.)
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape \
0 1 60 RL 65.0 8450 Pave NaN Reg
1 2 20 RL 80.0 9600 Pave NaN Reg
2 3 60 RL 68.0 11250 Pave NaN IR1
3 4 70 RL 60.0 9550 Pave NaN IR1
4 5 60 RL 84.0 14260 Pave NaN IR1
... ... ... ... ... ... ... ... ...
1454 1455 20 FV 62.0 7500 Pave Pave Reg
1455 1456 60 RL 62.0 7917 Pave NaN Reg
1456 1457 20 RL 85.0 13175 Pave NaN Reg
1458 1459 20 RL 68.0 9717 Pave NaN Reg
1459 1460 20 RL 75.0 9937 Pave NaN Reg
LandContour Utilities ... Fence MiscFeature MiscVal MoSold YrSold \
0 Lvl AllPub ... NaN NaN 0 2 2008
1 Lvl AllPub ... NaN NaN 0 5 2007
2 Lvl AllPub ... NaN NaN 0 9 2008
3 Lvl AllPub ... NaN NaN 0 2 2006
4 Lvl AllPub ... NaN NaN 0 12 2008
... ... ... ... ... ... ... ... ...
1454 Lvl AllPub ... NaN NaN 0 10 2009
1455 Lvl AllPub ... NaN NaN 0 8 2007
1456 Lvl AllPub ... MnPrv NaN 0 2 2010
1458 Lvl AllPub ... NaN NaN 0 4 2010
1459 Lvl AllPub ... NaN NaN 0 6 2008
SaleType SaleCondition SalePrice YearBuilt_Binned \
0 WD Normal 0.478014 2000s-2020s
1 WD Normal 0.401418 1960s-1970s
2 WD Normal 0.520567 2000s-2020s
3 WD Abnorml 0.283688 1900s-1910s
4 WD Normal 0.595745 2000s-2020s
... ... ... ... ...
1454 WD Normal 0.411348 2000s-2020s
1455 WD Normal 0.382979 1980s-1990s
1456 WD Normal 0.482270 1960s-1970s
1458 WD Normal 0.289716 1940s-1950s
1459 WD Normal 0.304965 1960s-1970s
YearBuilt_Binned_encoded
0 6.0
1 4.0
2 6.0
3 1.0
4 6.0
... ...
1454 6.0
1455 5.0
1456 4.0
1458 3.0
1459 4.0
[1256 rows x 83 columns]
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape \
0 1 60 RL 65.0 8450 Pave NaN Reg
1 2 20 RL 80.0 9600 Pave NaN Reg
2 3 60 RL 68.0 11250 Pave NaN IR1
3 4 70 RL 60.0 9550 Pave NaN IR1
4 5 60 RL 84.0 14260 Pave NaN IR1
... ... ... ... ... ... ... ... ...
1455 1456 60 RL 62.0 7917 Pave NaN Reg
1456 1457 20 RL 85.0 13175 Pave NaN Reg
1457 1458 70 RL 66.0 9042 Pave NaN Reg
1458 1459 20 RL 68.0 9717 Pave NaN Reg
1459 1460 20 RL 75.0 9937 Pave NaN Reg
LandContour Utilities ... Fence MiscFeature MiscVal MoSold YrSold \
0 Lvl AllPub ... NaN NaN 0 2 2008
1 Lvl AllPub ... NaN NaN 0 5 2007
2 Lvl AllPub ... NaN NaN 0 9 2008
3 Lvl AllPub ... NaN NaN 0 2 2006
4 Lvl AllPub ... NaN NaN 0 12 2008
... ... ... ... ... ... ... ... ...
1455 Lvl AllPub ... NaN NaN 0 8 2007
1456 Lvl AllPub ... MnPrv NaN 0 2 2010
1457 Lvl AllPub ... GdPrv Shed 2500 5 2010
1458 Lvl AllPub ... NaN NaN 0 4 2010
1459 Lvl AllPub ... NaN NaN 0 6 2008
SaleType SaleCondition SalePrice YearBuilt_Binned \
0 WD Normal 0.506303 2000s-2020s
1 WD Normal 0.392857 1960s-1970s
2 WD Normal 0.569328 2000s-2020s
3 WD Abnorml 0.218487 1900s-1910s
4 WD Normal 0.680672 2000s-2020s
... ... ... ... ...
1455 WD Normal 0.365546 1980s-1990s
1456 WD Normal 0.512605 1960s-1970s
1457 WD Normal 0.750000 1940s-1950s
1458 WD Normal 0.227416 1940s-1950s
1459 WD Normal 0.250000 1960s-1970s
YearBuilt_Binned_encoded
0 6.0
1 4.0
2 6.0
3 1.0
4 6.0
... ...
1455 5.0
1456 4.0
1457 3.0
1458 3.0
1459 4.0
[1452 rows x 83 columns]
Check the data frames again:
for df, df_name in zip(dataframes, dataframe_names):
    print(f'DataFrame Name: {df_name}')
    print(df['YearBuilt_Binned_encoded'].value_counts())
    print("_" * 50)
DataFrame Name: df_train
6.0    381
4.0    363
3.0    245
5.0    224
2.0    151
1.0     73
0.0     15
Name: YearBuilt_Binned_encoded, dtype: int64
__________________________________________________
DataFrame Name: df_train_outliers_1
6.0    320
4.0    320
3.0    222
5.0    195
2.0    130
1.0     60
0.0      9
Name: YearBuilt_Binned_encoded, dtype: int64
__________________________________________________
(df_train_garage_impute, df_train_garage_unknown, and df_train_outliers_2 report the same counts as df_train.)
Perform any additional necessary data transformations or normalization required by the specific machine learning algorithms you plan to use.
Here are some common techniques for data transformation and normalization:
Scaling/Normalization: Scale the numerical features to a specific range, typically between 0 and 1, using min-max scaling, or to zero mean and unit variance using standardization (z-score scaling). This ensures that all features have a similar scale and prevents some features from dominating others.
Log Transform: If the data is skewed or contains outliers, applying a log transformation can help normalize the distribution and reduce the impact of extreme values (a short sketch follows this list).
One-Hot Encoding: For categorical features, one-hot encoding can be used to convert them into binary vectors. Each category becomes a separate binary feature, indicating the presence or absence of that category.
Feature Engineering: Create new features from existing ones that capture meaningful information. This can involve combining, transforming, or interacting existing features to provide additional insights to the model.
Handling Missing Values: Deal with missing values by either imputing them with appropriate values (e.g., mean, median, or mode) or using advanced imputation techniques such as regression-based imputation or k-nearest neighbors imputation.
Feature Selection: Select the most relevant features to improve the model's performance and reduce overfitting. This can be done using techniques like correlation analysis, feature importance from ensemble methods, or using domain knowledge.
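As a small illustration of the log-transform point above, a hedged sketch using numpy (np.log1p handles zeros safely, and np.expm1 inverts it):
import numpy as np

# Hedged sketch: log-transform a skewed column and invert the transform.
# log1p(x) = log(1 + x) is safe for zero values; expm1 is its inverse.
log_prices = np.log1p(df_train['SalePrice'])
recovered = np.expm1(log_prices)  # round-trips back to the original values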
Split the dataset into training and testing subsets. The training set is used to train the machine learning model, while the testing set is used to evaluate its performance.
The competition on Kaggle provides two datasets: a Train dataset and a Test dataset. Because the competition's Test dataset does not include the SalePrice target, we further split the Train dataset into our own training and testing subsets. This is standard practice: it gives us a held-out set with known labels so we can evaluate our models locally before making any final predictions.
from sklearn.model_selection import train_test_split

# Define the list of data frames
dataframes = [df_train, df_train_garage_impute, df_train_garage_unknown, df_train_outliers_1, df_train_outliers_2]
# Define the names of the data frames
dataframes_names = ['df_train', 'df_train_garage_impute', 'df_train_garage_unknown', 'df_train_outliers_1', 'df_train_outliers_2']

# Perform data splitting on each data frame
for df, df_name in zip(dataframes, dataframes_names):
    # Separate the features (X) and the target variable (y)
    X = df.drop(['SalePrice', 'YearBuilt_Binned'], axis=1)
    y = df['SalePrice']
    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    # Assign custom variable names based on the data frame name
    train_name = f"X_train_{df_name}"
    test_name = f"X_test_{df_name}"
    target_train_name = f"y_train_{df_name}"
    target_test_name = f"y_test_{df_name}"
    # Assign the data splits to the corresponding variables
    globals()[train_name] = X_train
    globals()[test_name] = X_test
    globals()[target_train_name] = y_train
    globals()[target_test_name] = y_test
    # Print the shapes of the resulting datasets
    print(f"Training set shape ({train_name}):", X_train.shape, y_train.shape)
    print(f"Testing set shape ({test_name}):", X_test.shape, y_test.shape)
    print("-----------------------------------")
Training set shape (X_train_df_train): (1161, 81) (1161,)
Testing set shape (X_test_df_train): (291, 81) (291,)
-----------------------------------
Training set shape (X_train_df_train_garage_impute): (1161, 81) (1161,)
Testing set shape (X_test_df_train_garage_impute): (291, 81) (291,)
-----------------------------------
Training set shape (X_train_df_train_garage_unknown): (1161, 81) (1161,)
Testing set shape (X_test_df_train_garage_unknown): (291, 81) (291,)
-----------------------------------
Training set shape (X_train_df_train_outliers_1): (1004, 81) (1004,)
Testing set shape (X_test_df_train_outliers_1): (252, 81) (252,)
-----------------------------------
Training set shape (X_train_df_train_outliers_2): (1161, 81) (1161,)
Testing set shape (X_test_df_train_outliers_2): (291, 81) (291,)
-----------------------------------
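Writing into globals() works in a notebook, but a dictionary keyed by data frame name is a more robust pattern; a hedged sketch of the same splitting logic:
from sklearn.model_selection import train_test_split

# Hedged sketch: store each split under its data frame name instead of
# creating variables dynamically via globals().
splits = {}
for df, df_name in zip(dataframes, dataframes_names):
    X = df.drop(['SalePrice', 'YearBuilt_Binned'], axis=1)
    y = df['SalePrice']
    splits[df_name] = train_test_split(X, y, test_size=0.2, random_state=42)

# Usage: X_train, X_test, y_train, y_test = splits['df_train']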
X_test_df_train
| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | YearBuilt_Binned_encoded |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1042 | 1043 | 120 | RL | 34.0 | 5381 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 8 | 2009 | WD | Normal | 6.0 |
| 1130 | 1131 | 50 | RL | 65.0 | 7804 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | MnPrv | NaN | 0 | 12 | 2009 | WD | Normal | 2.0 |
| 1003 | 1004 | 90 | RL | NaN | 11500 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 6 | 2007 | WD | Normal | 4.0 |
| 1324 | 1325 | 20 | RL | 75.0 | 9986 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2007 | New | Partial | 6.0 |
| 531 | 532 | 70 | RM | 60.0 | 6155 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 6 | 2008 | WD | Normal | 2.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 897 | 898 | 90 | RL | 64.0 | 7018 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 6 | 2009 | WD | Alloca | 4.0 |
| 1334 | 1335 | 160 | RM | 24.0 | 2368 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2009 | WD | Normal | 4.0 |
| 1416 | 1417 | 190 | RM | 60.0 | 11340 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2010 | WD | Normal | 0.0 |
| 718 | 719 | 60 | RL | 96.0 | 10542 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 8 | 2008 | WD | Normal | 5.0 |
| 255 | 256 | 60 | RL | 66.0 | 8738 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Normal | 5.0 |
291 rows × 81 columns
The next section of this project is planned for the future. The notebook has already grown to over 200 cells, and completing the remaining sections (4. Model Training and Evaluation, 5. Model Optimization, and 6. Final Prediction and Interpretation) here would hurt its readability. If you are interested in how I build, evaluate, and optimize machine learning models, I recommend checking out my other projects, such as the wine quality prediction project.
Feel free to reach out to me through the following platforms:
Connect with me on these platforms to stay updated with my latest projects and articles. I look forward to connecting with you!