In this article, I go deeper into the step “The data” from the Machine Learning Workflow. In the previous article, “How do I work with data using a Machine Learning Model?”, I described three of the six steps.
All the data for the Machine Learning Model needs to be numerical. Preparing the data involves filling in missing values and changing all non-numerical values into numbers, e.g. text into categories or integers, string dates split into days, months, and years as integers, and boolean yes/no into 0 and 1.
In the previous articles, I worked with perfect data where no preparation was needed. In the real world, perfect data doesn’t exist; you will always have to prepare it first. This step, named Exploratory Data Analysis (EDA), is the most important task to conduct at the beginning of every data science project.
Loading the data
Dealing with missing values
Identify missing values
Filling missing values
Numeric values
Non-numeric values
Convert non-numeric data into numeric
Text into numbers
Dates into numbers
Categories into numbers
The source code
Loading the data
We need to load the data for Exploratory Data Analysis (EDA). The data set used for this article is “Apartment Prices in Poland” – https://www.kaggle.com/datasets/krzysztofjamroz/apartment-prices-in-poland/data.
```python
# Importing the tools
import pandas as pd

# Load the data into pandas DataFrame
data_frame = pd.read_csv("apartments_rent_pl_2024_01.csv")

# Let's see what data we have
data_frame.head()
```
As we can see even in the first five rows we have missing values, in the form of empty cells.
Dealing with missing values
First, we need to identify the data type of each column in the loaded data set. We need numeric values, so any data type other than object is good.
```python
# Checking columns data types to know how to handle missing values
data_frame.dtypes
```
Below we have a list of column names and their data types, e.g. column id is of type object.
```text
id                       object
city                     object
type                     object
squareMeters            float64
rooms                   float64
floor                   float64
floorCount              float64
buildYear               float64
latitude                float64
longitude               float64
centreDistance          float64
poiCount                float64
schoolDistance          float64
clinicDistance          float64
postOfficeDistance      float64
kindergartenDistance    float64
restaurantDistance      float64
collegeDistance         float64
pharmacyDistance        float64
ownership                object
buildingMaterial         object
condition                object
hasParkingSpace           int64
hasBalcony                int64
hasElevator               int64
hasSecurity               int64
hasStorageRoom            int64
price                     int64
dtype: object
```
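To see at a glance which columns still need converting, the non-numeric columns can be listed directly instead of being read off the output. This is a minimal sketch on a made-up miniature DataFrame (the values are illustrative, not taken from the data set), using pandas' `select_dtypes()`:

```python
import pandas as pd

# Hypothetical miniature of the apartments data, for illustration only
sample = pd.DataFrame({
    "city": ["krakow", "warszawa"],
    "rooms": [2.0, 3.0],
    "price": [2500, 3100],
})

# Columns that are not numeric cannot be fed to a model as they are
non_numeric_columns = sample.select_dtypes(exclude="number").columns.tolist()
# Numeric columns only need their missing values handled
numeric_columns = sample.select_dtypes(include="number").columns.tolist()

print(non_numeric_columns)
print(numeric_columns)
```

The same two calls should work on `data_frame` once the CSV file is loaded.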
Identify missing values
Before we even start filling missing values, we need to know which columns have missing values and how many.
```python
# Checking data types vs NaN values - before, after filling missing data
info_df = pd.DataFrame({
    'Data Type': data_frame.dtypes,
    'Missing Values': data_frame.isna().sum()
})
print(info_df)
```
Below we have a list of columns with the Data Type and the number of Missing Values for each column.
```text
                     Data Type  Missing Values
id                      object               0
city                    object               0
type                    object            2203
squareMeters           float64               0
rooms                  float64               0
floor                  float64            1030
floorCount             float64             171
buildYear              float64            2492
latitude               float64               0
longitude              float64               0
centreDistance         float64               0
poiCount               float64               0
schoolDistance         float64               2
clinicDistance         float64               5
postOfficeDistance     float64               5
kindergartenDistance   float64               7
restaurantDistance     float64              24
collegeDistance        float64             104
pharmacyDistance       float64              13
ownership               object               0
buildingMaterial        object            3459
condition               object            6223
hasParkingSpace         object               0
hasBalcony              object               0
hasElevator             object             454
hasSecurity             object               0
hasStorageRoom          object               0
price                    int64               0
```
As we can see, we have many missing values, e.g. column buildYear has 2492 missing values.
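Raw counts are easier to judge next to the share of rows they represent. The sketch below uses a small made-up DataFrame (not the real data set) and relies on `isna().mean()`, which gives the fraction of missing cells per column:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps, for illustration only
sample = pd.DataFrame({
    "buildYear": [1990.0, np.nan, 2005.0, np.nan],
    "condition": ["premium", None, None, None],
    "price": [2500, 3100, 1800, 2200],
})

# Fraction of missing cells per column, worst column first
missing_share = sample.isna().mean().sort_values(ascending=False)
print(missing_share)
```

A column that is mostly empty, like condition in this sample, may deserve a different treatment than one with only a handful of gaps.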
When we try to create a Machine Learning Model based on the DataFrame …
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Set the random seed before splitting so the split is reproducible
np.random.seed(42)

X = data_frame.drop("price", axis=1)
y = data_frame["price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = RandomForestClassifier()
model.fit(X_train, y_train)
```
… we will get an exception.
```text
ValueError                                Traceback (most recent call last)
<ipython-input-15-345ee3a9038d> in <cell line: 17>()
     15 model = RandomForestClassifier()
     16 # 'fit()' - Build a forest of trees from the training set (X, y).
---> 17 model.fit(X_train, y_train)
     18 # 'predict()' - Predict class for X.
     19 y_preds = model.predict(X_test)

/usr/local/lib/python3.10/dist-packages/pandas/core/generic.py in __array__(self, dtype)
   1996     def __array__(self, dtype: npt.DTypeLike | None = None) -> np.ndarray:
   1997         values = self._values
-> 1998         arr = np.asarray(values, dtype=dtype)
   1999         if (
   2000             astype_is_view(values.dtype, arr.dtype)

ValueError: could not convert string to float: '1e1ec12d582075085f740f5c7bdf4091'
```
Filling missing values
Before we create a Machine Learning Model, we need to fill in missing values, even in columns that are already numeric.
Numeric values
Filling missing values in numeric columns with the column mean isn’t the best idea, but as a starting point it is good enough.
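For this task, I’ve used two methods, fillna() and mean(), on specific columns, e.g. the column floor, data_frame["floor"], and used the parameter inplace=True to avoid reassigning the value to the column. Note that recent pandas versions emit a warning when fillna(..., inplace=True) is called on a selected column like this, so reassigning the result is the safer form.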
```python
# Dealing with missing values
# Filling NaN values
data_frame["floor"].fillna(data_frame["floor"].mean(), inplace=True)
data_frame["floorCount"].fillna(data_frame["floorCount"].mean(), inplace=True)
data_frame["buildYear"].fillna(data_frame["buildYear"].mean(), inplace=True)

# Without parameter inplace=True
# data_frame["buildYear"] = data_frame["buildYear"].fillna(data_frame["buildYear"].mean())
```
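One reason the mean is only a starting point is that a few extreme values can pull it far from a typical one. The sketch below compares it with the median on a made-up series of floor numbers. The median is an alternative swapped in here for illustration, not the method used in the rest of this article:

```python
import numpy as np
import pandas as pd

# Hypothetical floor numbers with one outlier (30) and one missing value
floors = pd.Series([1.0, 2.0, 3.0, np.nan, 30.0])

# The mean of 1, 2, 3, 30 is 9.0, pulled up by the outlier
filled_with_mean = floors.fillna(floors.mean())
# The median of 1, 2, 3, 30 is 2.5, closer to a typical value
filled_with_median = floors.fillna(floors.median())

print(filled_with_mean[3], filled_with_median[3])
```

Which of the two fits better depends on the column, so it is worth checking the distribution before choosing.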
Non-numeric values
When we deal with non-numeric values the worst thing we can do is to fill in missing values with the same value. What do I mean by that?
First, I check the unique values for a specific column.
```python
# Checking non-numeric columns unique data to fill NaN
print(f"Condition: {data_frame['condition'].unique()}")
```
Condition: ['premium' 'low']
We don’t want all our apartments to be only premium or low. Filling missing values with a single value is a very bad idea.
That’s why I use the code below to find the unique values of a specific column and then randomly assign one of them to each missing cell in that column.
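```python
import numpy as np

# Collect the values that actually occur in the column
unique_conditions = data_frame["condition"].dropna().unique()
# Replace each missing cell with a randomly chosen existing value
data_frame["condition"] = data_frame["condition"].apply(
    lambda x: np.random.choice(unique_conditions) if pd.isna(x) else x
)
```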
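The same idea can be tried on a small made-up column before running it on the whole data set. The sketch below is one possible variant, not the exact code used in this article: it assumes a seeded NumPy generator (`np.random.default_rng(42)`) so the fill is repeatable, and it fills all missing cells in one step:

```python
import numpy as np
import pandas as pd

# Seeded generator so that the random fill gives the same result on every run
rng = np.random.default_rng(42)

# Hypothetical column with gaps, for illustration only
sample = pd.DataFrame({"condition": ["premium", None, "low", None, None]})

# Values that actually occur in the column
unique_conditions = sample["condition"].dropna().unique().tolist()
# Boolean mask of the cells that need a value
missing = sample["condition"].isna()

# Draw one existing value per missing cell and write them all at once
sample.loc[missing, "condition"] = rng.choice(unique_conditions, size=int(missing.sum()))

print(sample)
```

Cells that already had a value are left untouched. Only the gaps receive a random draw.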
Convert non-numeric data into numeric
Text into numbers
```python
# Convert non-numeric data into numeric
# id column type 'str' (hexadecimal text) into 'int'
data_frame["id"] = data_frame["id"].apply(
    lambda x: int(x, 16) if isinstance(x, str) else x
)

# columns with 'str' yes/no into 1/0
data_frame['hasParkingSpace'] = data_frame['hasParkingSpace'].map({'yes': 1, "no": 0})
```
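One thing to watch with `map()` is that any value missing from the dictionary silently becomes NaN, which would reintroduce missing values. A minimal sketch of a check, on a made-up column where the spelling "Yes" is a deliberately unexpected value:

```python
import pandas as pd

# Hypothetical column; "Yes" with a capital letter is not in the mapping
sample = pd.DataFrame({"hasBalcony": ["yes", "no", "Yes"]})

mapped = sample["hasBalcony"].map({"yes": 1, "no": 0})

# Original values that the mapping did not cover
unmapped_values = sample.loc[mapped.isna(), "hasBalcony"].unique().tolist()

print(unmapped_values)
```

If the list is not empty, the mapping dictionary or the source values need another look before training.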
Dates into numbers
Dates are usually stored in text form, e.g. 2024-06-10, so we need to split each part (the year, the month, and the day) into separate variables/columns. The date_listed column used below comes from a different data set, city_rentals_wro_2007_2023.csv, from the same “Apartment Prices in Poland” – https://www.kaggle.com/datasets/krzysztofjamroz/apartment-prices-in-poland/data.
```python
# Convert non-numeric data into numeric
# changing column 'date_listed' of type 'str' into separate numbers
data_frame['date_listed'] = pd.to_datetime(data_frame['date_listed'])

# create new columns for year, month, and day
data_frame['year'] = data_frame['date_listed'].dt.year
data_frame['month'] = data_frame['date_listed'].dt.month
data_frame['day'] = data_frame['date_listed'].dt.day

# drop the original 'date_listed' column if you wish
data_frame = data_frame.drop('date_listed', axis=1)
```
[Image: screenshot of the DataFrame]
The table above shows the columns added after converting the column date_listed.
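The conversion can be checked on two made-up dates without downloading the second data set. This is a minimal sketch that assumes the dates use the YYYY-MM-DD form from the example above:

```python
import pandas as pd

# Hypothetical listing dates, for illustration only
sample = pd.DataFrame({"date_listed": ["2024-06-10", "2023-12-01"]})

# Parse the text into datetime values
sample["date_listed"] = pd.to_datetime(sample["date_listed"])

# Split each date into three integer columns
sample["year"] = sample["date_listed"].dt.year
sample["month"] = sample["date_listed"].dt.month
sample["day"] = sample["date_listed"].dt.day

# The original text column is no longer needed
sample = sample.drop("date_listed", axis=1)

print(sample)
```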
Categories into numbers
Text data can be changed into categories and then into numbers, and the code below does it very well. I’m not going into many details here. I use two existing classes, OneHotEncoder and ColumnTransformer, both available in scikit-learn.
How did I figure out which columns may be treated as categories? It follows from the process described earlier in “Filling missing values – Non-numeric values”: checking the unique values of a column, which is part of the Exploratory Data Analysis (EDA).
```python
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Turn the categories into numbers
categorical_features = ["city", "type", "ownership", "buildingMaterial", "condition"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer(
    [("one_hot", one_hot, categorical_features)],
    remainder="passthrough"
)
transformed_X = transformer.fit_transform(X)
transformed_df = pd.DataFrame(transformed_X)
```
Before transforming the DataFrame we had 28 columns. Now we have 44 columns, but without human-readable column names: the columns are labelled only with numbers.
All the data is now numerical. We have completed the EDA and ended up with a DataFrame that is ready to be used in the Machine Learning Model.
The source code
Below we can find all the source code necessary for preparing the data for use with a Machine Learning Model.
Steps covered:
Loading the data
Dealing with missing values
Identify missing values
Filling missing values
Numeric values
Non-numeric values
Convert non-numeric data into numeric
Text into numbers
Dates into numbers
Categories into numbers
```python
# Importing the tools
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# The file loaded earlier in this article
csv_file_name = "apartments_rent_pl_2024_01.csv"
data_frame = pd.read_csv(csv_file_name)

# Dealing with missing values
# Filling NaN values in numeric columns with the column mean
numeric_columns = [
    "floor", "floorCount", "buildYear", "schoolDistance", "clinicDistance",
    "postOfficeDistance", "kindergartenDistance", "restaurantDistance",
    "collegeDistance", "pharmacyDistance",
]
for column in numeric_columns:
    data_frame[column] = data_frame[column].fillna(data_frame[column].mean())

# Filling NaN values in 'ownership' with a fixed value
data_frame["ownership"] = data_frame["ownership"].fillna("condominium")

# Filling NaN values in the other non-numeric columns
# with a randomly chosen existing value
for column in ["type", "buildingMaterial", "condition", "hasElevator"]:
    unique_values = data_frame[column].dropna().unique()
    data_frame[column] = data_frame[column].apply(
        lambda x: np.random.choice(unique_values) if pd.isna(x) else x
    )

# Convert non-numeric data into numeric
# id column type 'str' (hexadecimal text) into 'int'
data_frame["id"] = data_frame["id"].apply(
    lambda x: int(x, 16) if isinstance(x, str) else x
)

# columns with 'str' yes/no into 1/0
yes_no_columns = [
    "hasParkingSpace", "hasBalcony", "hasElevator", "hasSecurity", "hasStorageRoom",
]
for column in yes_no_columns:
    data_frame[column] = data_frame[column].map({"yes": 1, "no": 0})

# X - training input samples, features
X = data_frame.drop("price", axis=1)

# Turn the categories into numbers
categorical_features = ["city", "type", "ownership", "buildingMaterial", "condition"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer(
    [("one_hot", one_hot, categorical_features)],
    remainder="passthrough"
)
transformed_X = transformer.fit_transform(X)
transformed_df = pd.DataFrame(transformed_X)
transformed_df.to_csv("saved_transformed_df.csv")

# y - training input labels, the desired result, the target value
y = data_frame["price"]
```
Below we can find all the source code necessary for creating a Machine Learning Model based on the prepared data.
```python
# Import 'train_test_split()' function
# "Split arrays or matrices into random train and test subsets."
from sklearn.model_selection import train_test_split
# Import the LinearRegression estimator class
from sklearn.linear_model import LinearRegression

# Setup random seed before splitting - to have the same results, me and you
np.random.seed(42)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2)

# Instantiate LinearRegression to create a Machine Learning Model
model = LinearRegression()

# 'fit()' - Fit the linear model to the training set (X, y).
model.fit(X_train, y_train)

# 'predict()' - Predict target values for X.
y_preds = model.predict(X_test)
```
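After fitting, a natural next step is to check how well the model does on the test set. The sketch below uses synthetic, perfectly linear data (an assumption made purely for illustration, not the apartments data) and the estimator's `score()` method, which returns the R² coefficient for regressors:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: price = 50 * area + 300, with no noise
area = np.arange(20, 120, dtype=float).reshape(-1, 1)
price = 50 * area.ravel() + 300

# random_state fixes the split so the result is repeatable
X_train, X_test, y_train, y_test = train_test_split(
    area, price, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

# R^2 on data the model has not seen during training
r2 = model.score(X_test, y_test)
print(r2)
```

On noise-free synthetic data the score is practically 1.0. On real apartment prices a noticeably lower value should be expected.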
NOTE: In this article, I’m just barely scratching the surface. This topic needs more reading and research on your own. I’m still at the beginning of my learning process of AI & ML!
Image generated with Midjourney, edited in GIMP. Screenshots made by the author.