How to create a data set for machine learning?

Steps to constructing your dataset:
1. Collect the raw data.
2. Identify feature and label sources.
3. Select a sampling strategy.
4. Split the data.

What is the process of preparing data for use in a machine learning model called?

Data preprocessing: preparing the data for use in the model by normalizing or scaling it and transforming it into a format that the model can understand.
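As a rough illustration of the scaling mentioned above, the sketch below standardizes two numeric columns with scikit-learn's StandardScaler. The values are invented for the example, and standardization is only one of several possible scaling choices.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features on very different scales
features = pd.DataFrame({
    "squareMeters": [30.0, 45.0, 60.0, 75.0],
    "buildYear": [1970.0, 1990.0, 2005.0, 2020.0],
})

# Rescale each column to zero mean and unit variance
scaler = StandardScaler()
scaled = scaler.fit_transform(features)

scaled_df = pd.DataFrame(scaled, columns=features.columns)
print(scaled_df.round(2))
```

After scaling, both columns live on a comparable range, which matters for models that are sensitive to feature magnitude.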

How do I feed new data to ML model?

If the model requires numerical input, non-numerical data such as text or images will need to be converted to a numerical format. Once the data is prepared, it can be fed to the machine learning model in several ways. One common method is batch learning, where the model is trained on a fixed set of data.

How to get data ready for AI?

Data preparation typically works through these stages:
1. Data export. Data from different providers arrives from a wide variety of sources, a well-known problem especially in marketing.
2. Data cleansing.
3. Storage.
4. Compatibility.
5. Scope.
6. Data quality.
7. Data transformation.
8. Model training.

How to process data for machine learning?

Data preprocessing in machine learning includes these steps:
1. Acquire the dataset.
2. Import all the crucial libraries.
3. Import the dataset.
4. Identify and handle the missing values.
5. Encode the categorical data.
6. Handle outliers.

What is the format of data in machine learning?

Tabular data is the most common and familiar format for machine learning. It consists of rows and columns, where each row represents an observation or a sample, and each column represents a feature or a variable.

How much dataset is required for machine learning?

A common rule of thumb is that you need at least ten times as many data points as there are features in your dataset. For example, if your dataset has 10 columns or features, you should have at least 100 rows. The rule is only a rough lower bound meant to ensure that the model sees enough input.
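The arithmetic behind this rule of thumb can be written as a tiny check. The function name and the `factor` parameter are made up for this sketch.

```python
def meets_rule_of_thumb(n_rows: int, n_features: int, factor: int = 10) -> bool:
    """Return True when there are at least `factor` rows per feature."""
    return n_rows >= factor * n_features


# 10 features need at least 100 rows under the 10x rule
print(meets_rule_of_thumb(n_rows=100, n_features=10))  # True
print(meets_rule_of_thumb(n_rows=99, n_features=10))   # False
```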

How to get data for machine learning?

Well-known sources such as Kaggle, the UCI Machine Learning Repository, AWS, Google's Dataset Search, Microsoft Datasets, and government datasets give data researchers and specialists access to a wide variety of datasets for their machine learning projects.

What should you do after preparing a dataset and before training the machine learning model?

After cleaning the dataset, you'll want to select the features you'll use to train your model. Some features can negatively impact the performance of the model, so you'll want to identify and remove them.

How to process data in ML?

The seven data preprocessing steps in machine learning:
1. Acquire the dataset.
2. Import libraries.
3. Import datasets.
4. Check for missing values.
5. Encode the data.
6. Scaling.
7. Split the dataset into training, evaluation and validation sets.
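The last step, splitting into three sets, is often done with two calls to scikit-learn's train_test_split. The sketch below uses invented data and a 60/20/20 split, which is just one possible choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 3 features
X = np.arange(300).reshape(100, 3)
y = np.arange(100)

# First split off 20% of all rows as the test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Then take 25% of the remaining 80% (20% of all rows) for validation
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```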

How do I prepare data for the Machine Learning Model? Make it all numbers! (2024)

In this article, I will go deeper into the step “The data” from the Machine Learning Workflow. In the previous article, How do I work with data using a Machine Learning Model? I described three of the six steps.

All the data for the Machine Learning Model needs to be numerical. Preparing the data involves filling in missing values, and changing all non-numerical values into numbers, e.g.: text into categories or integers, string dates split into days, months, and years as integers, and boolean yes/no as 0 and 1.


In the previous articles, I was working with perfect data where no preparation was needed. In the real world, perfect data doesn’t exist; you will always have to work on the data first. This step, named Exploratory Data Analysis (EDA), is the most important task to conduct at the beginning of every data science project.

Loading the data
Dealing with missing values
1. Identify missing values
2. Filling missing values
  1. Numeric values
  2. Non-numeric values
Convert non-numeric data into numeric
1. Text into numbers
2. Dates into numbers
3. Categories into numbers
The source code

We need to load the data for Exploratory Data Analysis (EDA). The data set used for this article is the “Apartment Prices in Poland” – https://www.kaggle.com/datasets/krzysztofjamroz/apartment-prices-in-poland/data.

# Importing the tools
import pandas as pd

# Load the data into pandas DataFrame
data_frame = pd.read_csv("apartments_rent_pl_2024_01.csv")

# Let's see what data we have
data_frame.head()


As we can see even in the first five rows we have missing values, in the form of empty cells.

Dealing with missing values

First, we need to identify the data type of each column in the loaded data set. We need numeric values, so any data type other than object is good.

# Checking columns data types to know how to handle missing values
data_frame.dtypes

Below we have a list of column names and their data types, e.g. column id is of type object.

id                       object
city                     object
type                     object
squareMeters            float64
rooms                   float64
floor                   float64
floorCount              float64
buildYear               float64
latitude                float64
longitude               float64
centreDistance          float64
poiCount                float64
schoolDistance          float64
clinicDistance          float64
postOfficeDistance      float64
kindergartenDistance    float64
restaurantDistance      float64
collegeDistance         float64
pharmacyDistance        float64
ownership                object
buildingMaterial         object
condition                object
hasParkingSpace           int64
hasBalcony                int64
hasElevator               int64
hasSecurity               int64
hasStorageRoom            int64
price                     int64
dtype: object

Identify missing values

Before we even start filling missing values we need to know in which columns and how many missing values we have.

# Checking data types vs NaN values - before, after filling missing data
info_df = pd.DataFrame({
    'Data Type': data_frame.dtypes,
    'Missing Values': data_frame.isna().sum()
})
print(info_df)

Below we have a list of columns with the data type and the number of missing values for each column.

                     Data Type  Missing Values
id                      object               0
city                    object               0
type                    object            2203
squareMeters           float64               0
rooms                  float64               0
floor                  float64            1030
floorCount             float64             171
buildYear              float64            2492
latitude               float64               0
longitude              float64               0
centreDistance         float64               0
poiCount               float64               0
schoolDistance         float64               2
clinicDistance         float64               5
postOfficeDistance     float64               5
kindergartenDistance   float64               7
restaurantDistance     float64              24
collegeDistance        float64             104
pharmacyDistance       float64              13
ownership               object               0
buildingMaterial        object            3459
condition               object            6223
hasParkingSpace         object               0
hasBalcony              object               0
hasElevator             object             454
hasSecurity             object               0
hasStorageRoom          object               0
price                    int64               0

As we can see, we have many missing values, e.g. column buildYear has 2492 missing values.
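Raw counts are easier to judge as a share of all rows. The sketch below, on a tiny made-up DataFrame standing in for the apartments data, shows one way to get the percentage of missing cells per column.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the apartments data
sample = pd.DataFrame({
    "floor": [1.0, np.nan, 3.0, np.nan],
    "condition": ["premium", None, None, None],
    "price": [2000, 2500, 3000, 3500],
})

# isna().mean() gives the fraction of missing cells per column
missing_share = sample.isna().mean().mul(100).sort_values(ascending=False)
print(missing_share)
```

Columns at the top of this list are the ones where the choice of filling strategy matters most.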

When we try to create a Machine Learning Model based on the DataFrame …

import numpy as np

# Setup random seed before splitting so the split is reproducible
np.random.seed(42)

X = data_frame.drop("price", axis=1)
y = data_frame["price"]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X_train, y_train)

… we will get an exception.

ValueError                                Traceback (most recent call last)
<ipython-input-15-345ee3a9038d> in <cell line: 17>()
     15 model = RandomForestClassifier()
     16 # 'fit()' - Build a forest of trees from the training set (X, y).
---> 17 model.fit(X_train, y_train)
     18 # 'predict()' - Predict class for X.
     19 y_preds = model.predict(X_test)

/usr/local/lib/python3.10/dist-packages/pandas/core/generic.py in __array__(self, dtype)
   1996     def __array__(self, dtype: npt.DTypeLike | None = None) -> np.ndarray:
   1997         values = self._values
-> 1998         arr = np.asarray(values, dtype=dtype)
   1999         if (
   2000             astype_is_view(values.dtype, arr.dtype)

ValueError: could not convert string to float: '1e1ec12d582075085f740f5c7bdf4091'
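The exception is raised by a text column, so it helps to list every non-numeric column up front. A minimal sketch on invented sample rows (the ids are shortened and made up):

```python
import pandas as pd

sample = pd.DataFrame({
    "id": ["1e1ec12d", "a3f0"],  # shortened, invented ids
    "city": ["wroclaw", "krakow"],
    "squareMeters": [45.0, 60.5],
    "price": [2500, 3100],
})

# Columns that are not numeric and would break model.fit()
non_numeric = sample.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

Every column in this list has to be converted to numbers, or dropped, before training.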

Filling missing values

Before we create a Machine Learning Model, we need to fill in missing values, even in numeric columns.

Numeric values

Filling missing numeric columns with mean() values isn’t the best idea, but as a starting point, it is good enough.

For this task, I’ve used two methods, fillna() and mean(), on specific columns, e.g. the column floor, data_frame["floor"]. I assign the result back to the column. The variant with the parameter inplace=True avoids the reassignment, but on a selected column it triggers a chained-assignment warning in recent pandas versions and may not modify the DataFrame at all.

# Dealing with missing values
# Filling NaN values with the column mean
data_frame["floor"] = data_frame["floor"].fillna(data_frame["floor"].mean())
data_frame["floorCount"] = data_frame["floorCount"].fillna(data_frame["floorCount"].mean())
data_frame["buildYear"] = data_frame["buildYear"].fillna(data_frame["buildYear"].mean())

# Variant with parameter inplace=True (warns in recent pandas versions)
# data_frame["buildYear"].fillna(data_frame["buildYear"].mean(), inplace=True)
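Because the mean is sensitive to extreme values, one commonly used alternative is the median. The sketch below compares both on invented build years; it is an illustration of the trade-off, not a recommendation for this particular data set.

```python
import numpy as np
import pandas as pd

# Made-up build years with one extreme value and two gaps
sample = pd.DataFrame(
    {"buildYear": [1990.0, 2000.0, 2010.0, 1500.0, np.nan, np.nan]}
)

mean_value = sample["buildYear"].mean()      # pulled down by the outlier
median_value = sample["buildYear"].median()  # less affected by the outlier

# Fill the gaps with the median instead of the mean
sample["buildYear"] = sample["buildYear"].fillna(median_value)
print(mean_value, median_value)
```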

Non-numeric values

When we deal with non-numeric values the worst thing we can do is to fill in missing values with the same value. What do I mean by that?

First, I check the unique values for a specific column.

# Checking non-numeric columns unique data to fill NaN
print(f"Condition: {data_frame['condition'].unique()}")

Condition: ['premium' 'low']

We don’t want all our apartments to be only premium or low. Filling missing values with a single value is a very bad idea.

That’s why I use the code below to find the unique values of a specific column and then randomly assign one of them to each missing cell.

import numpy as np

unique_conditions = data_frame["condition"].dropna().unique()
data_frame["condition"] = data_frame["condition"].apply(
    lambda x: np.random.choice(unique_conditions) if pd.isna(x) else x
)
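Picking uniformly among the unique values gives rare categories the same chance as common ones. A possible refinement, sketched here on invented data, draws replacements in proportion to how often each value already occurs and uses a seeded generator so the fill is reproducible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seeded so the fill is reproducible

sample = pd.DataFrame(
    {"condition": ["premium", "low", "low", "low", None, None, None, None]}
)

# Observed categories and their relative frequencies
freq = sample["condition"].value_counts(normalize=True)

# Draw one replacement per missing cell, weighted by frequency
missing = sample["condition"].isna()
sample.loc[missing, "condition"] = rng.choice(
    freq.index.to_numpy(), size=int(missing.sum()), p=freq.to_numpy()
)
print(sample["condition"].value_counts())
```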

Convert non-numeric data into numeric

Text into numbers

# Convert non-numeric data into numeric
# id column type 'str' (hexadecimal text) into 'int'
data_frame["id"] = data_frame["id"].apply(
    lambda x: int(x, 16) if isinstance(x, str) else x
)
# columns with 'str' yes/no into integers 1/0
data_frame['hasParkingSpace'] = data_frame['hasParkingSpace'].map({'yes': 1, "no": 0})
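One thing to watch with map(): any value missing from the dictionary silently becomes NaN. The sketch below, on an invented column containing a stray "Yes", normalizes the text first and then checks that nothing was left unmapped.

```python
import pandas as pd

# Note the stray capitalized "Yes"
sample = pd.DataFrame({"hasBalcony": ["yes", "no", "Yes", "no"]})

# Normalize the text first, then map; unknown values would become NaN
cleaned = sample["hasBalcony"].str.strip().str.lower()
sample["hasBalcony"] = cleaned.map({"yes": 1, "no": 0})

# Guard against values the mapping did not cover
assert sample["hasBalcony"].notna().all(), "unmapped value found"
print(sample["hasBalcony"].tolist())
```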

Dates into numbers

Dates are also stored in text form, e.g. 2024-06-10, so we need to split each part (the year, the month, and the day) into separate variables/columns. The date column used here comes from a different file, city_rentals_wro_2007_2023.csv, in the same “Apartment Prices in Poland” data set – https://www.kaggle.com/datasets/krzysztofjamroz/apartment-prices-in-poland/data.

# Convert non-numeric data into numeric
# changing column 'date_listed' of type 'str' into separate numbers
data_frame['date_listed'] = pd.to_datetime(data_frame['date_listed'])

# create new columns for year, month, and day
data_frame['year'] = data_frame['date_listed'].dt.year
data_frame['month'] = data_frame['date_listed'].dt.month
data_frame['day'] = data_frame['date_listed'].dt.day

# drop the original 'date_listed' column if you wish
data_frame = data_frame.drop('date_listed', axis=1)

IMG

The above table shows added columns after converting the column date_listed.
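Real date columns sometimes contain values that cannot be parsed. As a hedged sketch on invented rows, errors="coerce" turns such values into NaT instead of raising, and the .dt accessor can also provide extra numeric features such as the day of the week (the day_of_week column name is made up here).

```python
import pandas as pd

sample = pd.DataFrame(
    {"date_listed": ["2024-06-10", "2023-12-31", "not a date"]}
)

# errors="coerce" turns unparseable strings into NaT instead of raising
parsed = pd.to_datetime(sample["date_listed"], errors="coerce")

sample["year"] = parsed.dt.year
sample["month"] = parsed.dt.month
sample["day_of_week"] = parsed.dt.dayofweek  # Monday=0 ... Sunday=6
print(sample)
```

Rows that end up as NaT then need the same missing-value treatment as any other column.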

Categories into numbers

Text data can be changed into categories and then into numbers; the code below does exactly that. I’m not going into many details: I use two existing classes, OneHotEncoder and ColumnTransformer, both available in scikit-learn.

How did I figure out which columns may be treated as categories? It’s related to the process described earlier in “Filling missing values – Non-numeric values”, and it’s a part of the Exploratory Data Analysis (EDA).


from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Turn the categories into numbers
categorical_features = ["city", "type", "ownership", "buildingMaterial", "condition"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer(
    [("one_hot", one_hot, categorical_features)],
    remainder="passthrough"
)
transformed_X = transformer.fit_transform(X)
transformed_df = pd.DataFrame(transformed_X)


Before transforming the DataFrame we had 28 columns; now we have 44 columns. They no longer have human-readable names: the column names are just numbers.

ALL the data is now numerical. We have accomplished the EDA and ended up with a DataFrame ready to be used in the Machine Learning Model.

Below we can find all the source code necessary for preparing the data for using it with a Machine Learning Model.

Steps covered:

Loading the data
Dealing with missing values
Identify missing values
Filling missing values
- Numeric values
- Non-numeric values
Convert non-numeric data into numeric
Text into numbers
Dates into numbers
Categories into numbers

# Importing the tools
import pandas as pd
import numpy as np

csv_file_name = "apartments_rent_pl_2024_01.csv"
data_frame = pd.read_csv(csv_file_name)

# Dealing with missing values
# Filling NaN values in numeric columns with the column mean
data_frame["floor"] = data_frame["floor"].fillna(data_frame["floor"].mean())
data_frame["floorCount"] = data_frame["floorCount"].fillna(data_frame["floorCount"].mean())
data_frame["buildYear"] = data_frame["buildYear"].fillna(data_frame["buildYear"].mean())
data_frame["schoolDistance"] = data_frame["schoolDistance"].fillna(data_frame["schoolDistance"].mean())
data_frame["clinicDistance"] = data_frame["clinicDistance"].fillna(data_frame["clinicDistance"].mean())
data_frame["postOfficeDistance"] = data_frame["postOfficeDistance"].fillna(data_frame["postOfficeDistance"].mean())
data_frame["kindergartenDistance"] = data_frame["kindergartenDistance"].fillna(data_frame["kindergartenDistance"].mean())
data_frame["restaurantDistance"] = data_frame["restaurantDistance"].fillna(data_frame["restaurantDistance"].mean())
data_frame["collegeDistance"] = data_frame["collegeDistance"].fillna(data_frame["collegeDistance"].mean())
data_frame["pharmacyDistance"] = data_frame["pharmacyDistance"].fillna(data_frame["pharmacyDistance"].mean())

# Filling NaN values in non-numeric columns with randomly chosen existing values
unique_types = data_frame["type"].dropna().unique()
data_frame["type"] = data_frame["type"].apply(
    lambda x: np.random.choice(unique_types) if pd.isna(x) else x
)
data_frame["ownership"] = data_frame["ownership"].fillna("condominium")
unique_bms = data_frame["buildingMaterial"].dropna().unique()
data_frame["buildingMaterial"] = data_frame["buildingMaterial"].apply(
    lambda x: np.random.choice(unique_bms) if pd.isna(x) else x
)
unique_conditions = data_frame["condition"].dropna().unique()
data_frame["condition"] = data_frame["condition"].apply(
    lambda x: np.random.choice(unique_conditions) if pd.isna(x) else x
)
unique_hes = data_frame["hasElevator"].dropna().unique()
data_frame["hasElevator"] = data_frame["hasElevator"].apply(
    lambda x: np.random.choice(unique_hes) if pd.isna(x) else x
)

# Convert non-numeric data into numeric
# id column type 'str' (hexadecimal text) into 'int'
data_frame["id"] = data_frame["id"].apply(
    lambda x: int(x, 16) if isinstance(x, str) else x
)
# columns with 'str' yes/no into integers 1/0
data_frame['hasParkingSpace'] = data_frame['hasParkingSpace'].map({'yes': 1, "no": 0})
data_frame['hasBalcony'] = data_frame['hasBalcony'].map({'yes': 1, "no": 0})
data_frame['hasElevator'] = data_frame['hasElevator'].map({'yes': 1, "no": 0})
data_frame['hasSecurity'] = data_frame['hasSecurity'].map({'yes': 1, "no": 0})
data_frame['hasStorageRoom'] = data_frame['hasStorageRoom'].map({'yes': 1, "no": 0})

# X - training input samples, features
X = data_frame.drop("price", axis=1)

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Turn the categories into numbers
categorical_features = ["city", "type", "ownership", "buildingMaterial", "condition"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer(
    [("one_hot", one_hot, categorical_features)],
    remainder="passthrough"
)
transformed_X = transformer.fit_transform(X)
transformed_df = pd.DataFrame(transformed_X)
transformed_df.to_csv("saved_transformed_df.csv")

# y - training input labels, the desired result, the target value
y = data_frame["price"]

Below we can find all the source code necessary for creating a Machine Learning Model based on the prepared data.

# Import 'train_test_split()' function
# "Split arrays or matrices into random train and test subsets."
from sklearn.model_selection import train_test_split

# Setup random seed before splitting - to have the same results, me and you
np.random.seed(42)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2)

# Import the LinearRegression estimator class
from sklearn.linear_model import LinearRegression

# Instantiate LinearRegression to create a Machine Learning Model
model = LinearRegression()

# 'fit()' - Fit the linear model to the training set (X, y).
model.fit(X_train, y_train)

# 'predict()' - Predict target values (prices) for X.
y_preds = model.predict(X_test)
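Once predictions exist, a natural next step is to measure how good they are. The sketch below is self-contained and uses synthetic data rather than the apartments data set; it reports the R^2 score and the mean absolute error, two commonly used regression metrics.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Synthetic stand-ins for "squareMeters" and "price"
X_demo = rng.uniform(20, 100, size=(200, 1))
y_demo = 50 * X_demo[:, 0] + rng.normal(0, 100, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
y_preds = model.predict(X_test)

# R^2: share of variance explained; MAE: average absolute error in price units
print("R^2:", model.score(X_test, y_test))
print("MAE:", mean_absolute_error(y_test, y_preds))
```

Evaluating on the held-out test set, not on the training data, is what makes these numbers meaningful.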

NOTE: In this article, I’m just barely scratching the surface. This topic needs more reading and research on your own. I’m still at the beginning of my learning process of AI & ML!

Image generated with Midjourney, edited in GIMP. Screenshots made by the author.