The AI Process
Guide to the AI engineering process
Table of contents
- What is AI
- The AI Process
- Define the problem
- Data Preparation
- How to Choose an AI Model
- Why Simple Models
- Multinomial Logistic Regression
- Understand AI Algorithms
- Feature Engineering Tools
- AutoML Tools
Cover photo: Jason Leung on Unsplash
AI is still considered a relatively new field, so there are really no guides or standards comparable to SWEBOK. In fact, AI/ML graduate textbooks do not provide a clear and consistent description of the AI software engineering process. Therefore, I thought it would be helpful to give a complete description of the AI Engineering Process (or AI Process), consolidating the steps that are described in most AI/ML textbooks.
85% or more of AI projects fail.
34% of scientists and researchers admitted to questionable research practices.
In general, the results of current journal articles on AI (even peer-reviewed) are irreproducible.
What is AI
Artificial intelligence (AI) focuses on the design and implementation of intelligent systems that perceive, act, and learn in response to their environment.
In AI, an agent is something that acts. All computer programs can be considered to do something, but computer agents are expected to do more complex tasks: operate autonomously, perceive their environment, persist over a prolonged period of time, adapt to change, and create and pursue goals. A rational agent is one that acts so as to achieve the best outcome or, when there is uncertainty, the best expected outcome.
In a nutshell, AI is focused on the study and construction of agents that act rationally, that is, agents that do the right thing as defined by the objective given to them. In fact, the standard model is defined in terms of rational agents. This model has limitations, such as limited rationality and the value alignment problem, which lead to the concept of agents that are provably beneficial to humans. Even so, the standard model is a good reference point for theoretical analysis.
Figure 1: An agent interacts with its environment through its sensors and actuators. Source: Gungor Basa Technology of Me
There is often confusion between the terms artificial intelligence and machine learning. An agent is learning if it improves its performance based on previous experience. When the agent is a computer, the learning process is called machine learning (ML) [6, p. 651].
Thus, machine learning is a subfield of AI. Some AI systems use machine learning methods and some do not.
AI engineering is the discipline focused on developing tools, systems, and processes that enable the application of artificial intelligence in real-world contexts. It combines the principles of systems engineering, software engineering, and computer science to create AI systems.
Model-centric vs Data-centric
There are two approaches to AI/ML, model-centric and data-centric, that are mutually exclusive. Either you are letting the dataset drive model selection (data-centric) or you are not (model-centric). We can apply a data-centric approach by using AutoML or by coding a custom test harness to evaluate many algorithms (say 20–30) on the dataset and then choosing the top performers (perhaps the top 3) for further study, being sure to give preference to simpler algorithms (Occam’s Razor).
Thus, we would choose more complex SOTA algorithms only if all simpler algorithms failed miserably. In a research project, we would likely be using a model-centric approach to evaluate new algorithms and compare the results to previous results on the same toy dataset, assuming that previous research has already obtained baselines for simpler models.
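A custom test harness of the kind described above can be quite small. The following is a minimal sketch, assuming scikit-learn is available; the `spot_check` helper, the four candidate models, and the Iris toy dataset are illustrative choices rather than anything prescribed by the process.

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def spot_check(X, y, candidates, n_splits=5, seed=0):
    """Cross-validate every candidate on the same folds; return results best first."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    results = []
    for name, model in candidates.items():
        scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
        results.append((name, scores.mean(), scores.std()))
    return sorted(results, key=lambda r: r[1], reverse=True)


X, y = load_iris(return_X_y=True)

# Hypothetical short list; a real harness might hold 20-30 entries.
candidates = {
    "baseline (most frequent)": DummyClassifier(strategy="most_frequent"),
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "k-nearest neighbors": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

ranking = spot_check(X, y, candidates)
for name, mean, std in ranking:
    print(f"{name:28s} {mean:.3f} +/- {std:.3f}")
```

Including a trivial baseline in the candidate list makes it obvious when a model is not learning anything, and sorting by mean score makes it easy to pick the top few for further study.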
The AI Process
We can define an AI Engineering Process or AI Process (AIP) which can be used to solve almost any AI problem:
Define the problem: This step includes the following tasks: defining the scope, value definition, timelines, governance, and resources associated with the deliverable.
Dataset selection: This step can take a few hours or a few months depending on the project. It is crucial to obtain the correct and reliable dataset for an AI/ML project.
Data description: This step includes the following tasks: describe the dataset including the input features and target feature(s); include summary statistics of the data and counts of any discrete or categorical features including the target feature.
Data preparation: This step includes the following tasks: data preprocessing, data cleaning, and exploratory data analysis (EDA). For image data, we would resize images to a lower dimension such as (299 x 299) to allow mini-batch learning and to stay within compute limitations. For text data, we would remove newlines and tabs, strip HTML tags, remove links, remove extra whitespace, and apply other possible steps listed in NLP Text Preprocessing on my GitHub repo.
Feature engineering: This step includes the following tasks: quantization or binning; mathematical transforms; scaling and normalization; modify and/or create new features. For image data, we would perform image augmentation which is described in Image Augmentation on my GitHub repo. For text data, we would convert text data features into vectors and perform Tokenization, Stemming, and Lemmatization as well as other possible steps described in Natural Language Processing on my GitHub repo.
Design: This step includes the following tasks: feature selection, decomposing the problem, and building and evaluating models. We can use AutoML or create a custom test harness to build and evaluate many models to determine what algorithms and views of the data should be chosen for further study.
Training: This step includes building the model which may include cross-validation.
Evaluation: This step includes the evaluation of well-performing models on a hold-out test dataset and model selection.
Tuning: This step involves algorithm tuning of the few selected well-performing models which may include evaluation of ensembles of models to obtain further improvement in accuracy.
Finalize: This step is to finalize the chosen model by training using the entire dataset and making sure that the final solution meets the original business requirements for model accuracy and other performance metrics.
Deployment: The model is now ready for deployment. There are two common approaches to the deployment of ML models to production: embed models into a web server or offload the model to an external service. Both model serving approaches have pros and cons.
Monitoring: This is the post-deployment phase which involves observing the model and pipelines, refreshing the model with new data, and tracking success metrics in the context of the original problem.
More detail on each step is given below; you can also refer to the AI Checklist, Applied ML Checklist, Data Preparation, and Feature Engineering on my LearnAI GitHub repo.
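The text-cleaning tasks named in the data preparation step (remove newlines and tabs, strip HTML tags, remove links, remove extra whitespace) can be sketched with the Python standard library alone. This is an illustration only: the `clean_text` helper and its regular expressions are assumptions made here, and a regex is a rough way to strip tags that would not be robust for arbitrary HTML.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")                  # naive HTML tag matcher
URL_RE = re.compile(r"(?:https?://|www\.)\S+")   # http(s) and www links
WHITESPACE_RE = re.compile(r"\s+")               # newlines, tabs, repeated spaces


def clean_text(text: str) -> str:
    """Apply the basic text-cleaning steps in a fixed order."""
    text = TAG_RE.sub(" ", text)         # strip HTML tags
    text = html.unescape(text)           # decode entities such as &amp;
    text = URL_RE.sub(" ", text)         # remove links
    text = WHITESPACE_RE.sub(" ", text)  # collapse newlines, tabs, and spaces
    return text.strip()


sample = "<p>Hello\tworld!</p>\n\nSee https://example.com/page for   more &amp; more."
cleaned = clean_text(sample)
print(cleaned)
```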
Define the problem
The first step in an AI project is to define the problem. In a few sentences, describe the following:
Describe the problem to be solved.
Describe the part(s) of the problem that can be solved by machine learning.
Describe the goal of the project.
Describe the goal of the model: classify, predict, detect, translate, etc.
Define the loss function and/or performance and error metrics for the project.
This step should include an extensive literature review of the same or very similar AI problems. If you cannot find any scholarly research studies of the problem then you really have a research project rather than an AI project. Keep in mind that AI is not a good field for the Star Trek approach.
When designing an agent, one of the first steps is to specify the task environment, which is called the PEAS description (Performance, Environment, Actuators, Sensors). In a nutshell, the task environment is the “problem” and the rational agent(s) are the “solution”.
PEAS description for robot vacuum
Consider the classic toy example of a simple robot vacuum.
What is the performance measure? cleanness, efficiency: distance traveled to clean, battery life, security
What is known about the environment? room, table, wood floor, carpet, different obstacles
What actuators does the agent have? wheels, different brushes, vacuum extractor
What sensors does the agent have? camera, dirt detection sensor, cliff sensor, bump sensors, infrared wall sensors
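One way to keep a PEAS description next to the code is to record it as a small data structure. The sketch below is only an illustration; the `PEAS` class and its `summary` method are hypothetical names introduced here, and the values are copied from the robot vacuum example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PEAS:
    """Task environment description: Performance, Environment, Actuators, Sensors."""

    performance: tuple[str, ...]
    environment: tuple[str, ...]
    actuators: tuple[str, ...]
    sensors: tuple[str, ...]

    def summary(self) -> str:
        """Return one line per PEAS component."""
        parts = {
            "Performance": self.performance,
            "Environment": self.environment,
            "Actuators": self.actuators,
            "Sensors": self.sensors,
        }
        return "\n".join(f"{name}: {', '.join(items)}" for name, items in parts.items())


robot_vacuum = PEAS(
    performance=(
        "cleanness",
        "efficiency (distance traveled to clean)",
        "battery life",
        "security",
    ),
    environment=("room", "table", "wood floor", "carpet", "different obstacles"),
    actuators=("wheels", "different brushes", "vacuum extractor"),
    sensors=(
        "camera",
        "dirt detection sensor",
        "cliff sensor",
        "bump sensors",
        "infrared wall sensors",
    ),
)

print(robot_vacuum.summary())
```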
In addition, we can define seven dimensions along which task environments can be categorized:
Fully observable vs Partially observable
Single-agent vs Multiagent
Deterministic vs Nondeterministic
Episodic vs Sequential
Static vs Dynamic
Discrete vs Continuous
Known vs Unknown
After we decompose the problem into parts (subproblems), there may be multiple components that can be handled using traditional software engineering rather than machine learning. We could develop the overall system and then go back later and optimize it, replacing some components with more sophisticated machine learning models.
Part of problem formulation is deciding whether we are dealing with supervised, unsupervised, or reinforcement learning. However, the distinctions are not always so definite.
Data Preparation
The data preparation stage involves three steps that may overlap:
Data preprocessing: format adjustments; correct inconsistencies; handle errors in variables.
Exploratory data analysis and visualization: check if data is normally distributed or heavy-tailed; check for outliers; check if clustering of the data will help; check for imbalanced data.
Data cleaning: check data types; handle missing or invalid values; handle outliers; handle categorical values; encoding class labels; parsing dates; character encodings; handle imbalanced data.
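Split first, normalize later: perform the train-test split first, then fit the normalization on the training set only and apply the same transformation to both the training and test sets. Otherwise, statistics from the test set leak into training.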
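The "split first, normalize later" rule can be sketched as follows, assuming NumPy and scikit-learn are available. The synthetic data, the 75/25 split, and the choice of `StandardScaler` are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration: 200 rows, 3 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = (X[:, 0] > 50.0).astype(int)

# 1. Split first.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# 2. Normalize later: fit the scaler on the training rows only ...
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# ... and reuse the training statistics for the test rows.
X_test_scaled = scaler.transform(X_test)

print("training means after scaling:", X_train_scaled.mean(axis=0).round(6))
print("test means after scaling:    ", X_test_scaled.mean(axis=0).round(3))
```

The scaled test means are generally close to zero but not exactly zero, which is expected: the test set is transformed with statistics it did not contribute to.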
Remove leading and trailing spaces
Standardize types (decimal separators, date formats, or measurement units)
Replace unrecognizable or corrupted characters
Check for truncated entries (data entries that are cut off at a certain position)
Check for invalid values (age is 200 or negative)
Check for wrong categories in categorical data (similar products should not be put into different categories)
Handle errors in variables
High Cardinality: the number of different labels in categorical data is very high which causes problems for the model to learn.
Outliers: the extreme cases that may be due to error but not in every case.
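Several of the checks listed above can be sketched in plain Python on a handful of records. Everything here is illustrative: the column names, the sample rows, the decimal-comma convention, and the 0 to 120 age range are assumptions made for this example, not rules from the text.

```python
from collections import Counter

# Hypothetical raw records showing the problems listed above.
raw_rows = [
    {"name": "  Alice ", "age": "34", "price": "1,5", "category": "laptop"},
    {"name": "Bob", "age": "200", "price": "2.0", "category": "Laptop "},
    {"name": "Carol\ufffd", "age": "-3", "price": "3.25", "category": "phone"},
]


def clean_row(row):
    """Return a cleaned copy of one record; invalid ages become None."""
    cleaned = {key: value.strip() for key, value in row.items()}  # trim spaces
    cleaned["name"] = cleaned["name"].replace("\ufffd", "")       # drop corrupted characters
    cleaned["category"] = cleaned["category"].lower()             # merge near-duplicate categories
    cleaned["price"] = float(cleaned["price"].replace(",", "."))  # standardize decimal separator
    age = int(cleaned["age"])
    cleaned["age"] = age if 0 <= age <= 120 else None             # flag invalid values (assumed range)
    return cleaned


clean_rows = [clean_row(row) for row in raw_rows]

# Number of rows per label; the number of distinct labels is the cardinality.
category_counts = Counter(row["category"] for row in clean_rows)

for row in clean_rows:
    print(row)
print("categories:", dict(category_counts))
```

Invalid values are replaced with `None` rather than dropped so that the decision about how to handle them (impute, remove, or investigate) stays explicit.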
How to Choose an AI Model
Every new AI engineer finds that they need to decide what model to use for a problem.
There are many models to choose from, but there are usually only slight alterations needed to change a regression model into a classification model and vice versa.
First, remember to take a data-centric approach, so avoid asking “what model should I use”. Thus, the first step in the AI/ML process is to perform EDA to understand the properties of your dataset, such as whether the classes are balanced (classification) or the target is Gaussian (regression).
There are two approaches to model selection: data-centric and model-centric. Either you are letting the data drive model selection (data-centric) or you are not (model-centric).
In a model-centric approach, you are basically throwing models at the dataset and hoping something will work. Similar to throwing bologna at the wall and hoping it will stick, model-centric is an unscientific approach with a low probability of success.
The second step in solving an AI problem is to try simple algorithms (such as Linear or Logistic Regression) as baseline models. These baselines are used later to evaluate your model choice(s), which should perform better than all baseline models.
There are a lot of models to choose from, so consider starting with classification/regression models which can be done easily using scikit-learn.
Next, the best practice is to evaluate many algorithms (say 10–20) using an AutoML tool such as Orange, PyCaret, or AutoGluon and narrow the choices to a few models based on accuracy and error metrics. Then, create a test harness to fully explore the candidates.
In general, you should have evaluated many models before trying to evaluate more complex models such as neural networks. A similar approach is used to evaluate and compare algorithms in mathematics, engineering, and other fields.
The rule of thumb is that a deep learning model should be your last choice (Occam’s Razor).
Keep in mind that for a balanced binary classification problem, an accuracy of 50% is equivalent to random guessing (coin toss). As a rule of thumb, your models should have an accuracy of at least 75–80% before hyperparameter tuning. Otherwise, you need to select a different model and/or spend more time on data preparation and feature engineering.
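The 50% figure applies to a balanced two-class problem; for other class distributions the "no skill" level is different. The sketch below uses two hypothetical helpers introduced here to compute the accuracy of always predicting the most frequent class and the expected accuracy of guessing uniformly at random.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def uniform_guess_accuracy(labels):
    """Expected accuracy of guessing uniformly at random among the classes."""
    return 1 / len(set(labels))


balanced_binary = ["spam"] * 50 + ["ham"] * 50
imbalanced_binary = ["fraud"] * 10 + ["ok"] * 90
balanced_four_class = ["a"] * 25 + ["b"] * 25 + ["c"] * 25 + ["d"] * 25

for name, labels in [
    ("balanced binary", balanced_binary),
    ("imbalanced binary", imbalanced_binary),
    ("balanced 4-class", balanced_four_class),
]:
    print(
        f"{name:18s} majority={majority_baseline_accuracy(labels):.2f} "
        f"uniform={uniform_guess_accuracy(labels):.2f}"
    )
```

On the imbalanced example, a model with 80% accuracy is worse than always predicting the majority class, which is why any accuracy threshold should be read against a baseline for the dataset at hand.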
A more detailed discussion of the AI engineering process can be found in .
The goal of ML is to conduct experiments and analyze the results so that we can eliminate the effect of chance and obtain conclusions that we can consider statistically significant.
Thus, we want to find a learner that has the highest generalization accuracy, has minimal complexity (the implementation is cheap in time and space), and is robust (unaffected by external sources of variability).
There are three basic principles of experimental design:
1. Randomization requires that the order in which the runs are carried out should be randomly determined so that the results are independent. However, order is usually not a problem in software experiments.
2. Replication implies that for the same configuration of (controllable) factors, the experiment should be run a number of times to average over the effect of uncontrollable factors.
In machine learning, replication is typically done by running the same algorithm on a number of resampled versions of the same dataset, which is called cross-validation.
3. Blocking is used to reduce or eliminate the variability due to nuisance factors that influence the response but in which we are not interested.
When we are comparing learning algorithms, we need to make sure the algorithms all use the same resampled subsets of data. Therefore, the different training sets in replicated runs should be identical across algorithms, which is what we mean by blocking. In statistics, if there are two populations, this approach is called pairing, which is used in paired testing.
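Replication and blocking together can be sketched by materializing one set of cross-validation folds and reusing it for every algorithm, so that the per-fold scores are paired. This assumes scikit-learn; the dataset and the two models are arbitrary choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Replication: 10 resampled train/validation pairs of the same dataset.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# Blocking: build the folds once so both algorithms see identical splits.
folds = list(cv.split(X, y))

model_a = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model_b = DecisionTreeClassifier(random_state=0)

scores_a = cross_val_score(model_a, X, y, cv=folds, scoring="accuracy")
scores_b = cross_val_score(model_b, X, y, cv=folds, scoring="accuracy")

# Paired differences: entry i compares both models on the same fold i.
diffs = scores_a - scores_b
print("mean accuracy A:", scores_a.mean().round(3))
print("mean accuracy B:", scores_b.mean().round(3))
print("mean paired difference:", diffs.mean().round(3))
```

Because the differences are computed fold by fold, they are the natural input for a paired comparison of the two algorithms.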
Model Selection Process
In model selection, we are concerned with two questions about learning algorithms:
How can we assess the expected error of a learning algorithm on a problem?
How can we say one model has less error than the other for a given application?
The error rate on the training set is almost always smaller than the error rate on a test set containing instances unseen during training, because the model was fit to the training set. Thus, we cannot choose between algorithms based on training set errors. Therefore, we need a validation set that is distinct from the training set.
We also need to have several runs on the validation set to compute the average error rates since noise, outliers, and other random factors will affect generalization. Then, we base our evaluation of the learning algorithm on the distribution of these validation errors to assess the expected error of the learning algorithm for the given problem or compare it to the error rate distribution of another learning algorithm.
During model selection, it is important to keep in mind several important points:
1. Whatever conclusion we draw from our analysis is conditioned on the dataset we are given.
As stated by the No Free Lunch Theorem, there is no such thing as the best learning algorithm; for any learning algorithm, there is a dataset where it is very accurate and another dataset where it is very poor.
2. The division of a given dataset into a number of training and validation set pairs is only for testing purposes.
Once all the tests are complete and we have made our decision as to the final method or hyperparameters, we can use all the labeled data that we have previously used for training or validation to train the final learner, which is called finalizing the model.
3. Since we also use the validation set(s) for testing purposes (such as choosing the better of two learning algorithms or deciding where to stop learning), it becomes part of the data we use.
Therefore, given a dataset, we should first leave some part of it aside as the test set and then use the rest for training and validation.
4. In general, we compare learning algorithms by their error rates, but it should be kept in mind that in real life, the error is only one of the criteria that will affect our decision.
Some other criteria for comparing learning algorithms:
risks when errors are generalized using loss functions, instead of 0/1 loss
training time and space complexity
testing time and space complexity
interpretability which means whether the method allows knowledge extraction which can be checked and validated by experts
However, the relative importance of these factors changes depending on the application.
When we train a learner on a dataset using a training set and test its accuracy on a validation set and try to draw conclusions, what we are doing is experimentation. Statistics defines a methodology to design experiments correctly and analyze the collected data in a manner so as to be able to extract significant conclusions.
Model Selection Criteria
The following seven criteria can help in selecting a model:
1. Explainability
There is a trade-off between explainability and model performance.
Using a more complex model will often increase the performance but it will be more difficult to interpret.
If there is no need to explain the model and its output to a non-technical audience, more complex models could be used such as ensemble learners and deep neural networks.
2. In-memory vs out-of-memory
It is important to consider the size of your data and the amount of RAM available on the computer where training will occur.
If the RAM can handle all of the training data, you can choose from a wide variety of machine learning algorithms.
If the RAM cannot handle the training data, you can explore incremental learning algorithms which can improve the model by gradually adding more training data.
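Incremental (out-of-core) learning can be sketched with scikit-learn's `SGDClassifier`, whose `partial_fit` method updates the model one batch at a time. The `batches` generator below is a hypothetical stand-in for reading chunks from disk; the synthetic data and batch sizes are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)


def batches(n_batches=20, batch_size=500):
    """Yield synthetic (X, y) chunks, standing in for chunks read from disk."""
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, 5))
        y = (X[:, 0] + X[:, 1] > 0).astype(int)
        yield X, y


model = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # all labels must be declared on the first call

# Only one batch is held in memory at any time.
for X_batch, y_batch in batches():
    model.partial_fit(X_batch, y_batch, classes=classes)

# Evaluate on fresh data drawn from the same synthetic process.
X_new = rng.normal(size=(1000, 5))
y_new = (X_new[:, 0] + X_new[:, 1] > 0).astype(int)
accuracy = model.score(X_new, y_new)
print(f"accuracy on unseen data: {accuracy:.3f}")
```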
3. Number of features and examples
The number of training samples and the number of features per sample is also important in model selection.
If you have a small number of examples and features, a simple learner would be a great choice such as a decision tree or k-nearest neighbors.
If you have a small number of examples and a large number of features, SVMs and Gaussian processes would be a good choice since they can handle a large number of features but require fewer resources.
If you have a large number of examples then deep neural networks and boosting algorithms would be a good choice since they can handle millions of samples and features.
4. Categorical vs numerical features
The type of features is important when choosing a model.
Some machine learning algorithms, such as linear regression, cannot handle categorical features, so you have to convert them into numerical features. Other algorithms, such as decision trees and random forests, can handle categorical features, although some implementations still require them to be encoded numerically.
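Converting a categorical feature into numerical features is commonly done with one-hot encoding. The sketch below is a plain-Python illustration with a hypothetical `one_hot_encode` helper; in practice a library encoder such as scikit-learn's `OneHotEncoder` would normally be used instead.

```python
def one_hot_encode(values):
    """Return (categories, rows), where each row has a single 1 marking its category."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return categories, rows


# Hypothetical categorical feature with three labels.
temperatures = ["cool", "hot", "warm", "cool"]
categories, encoded = one_hot_encode(temperatures)

print(categories)
for value, row in zip(temperatures, encoded):
    print(f"{value:5s} -> {row}")
```

Each category becomes its own column, so a feature with very many distinct labels produces a very wide table, which is one reason high cardinality needs attention during data preparation.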
5. Normality of data
If your data is normally distributed, SVM with linear kernel, logistic regression, or linear regression could be used.
If your data is not normally distributed, deep neural networks or ensemble learners would be a good choice.
6. Training speed
The available time for training is important when choosing a model.
Simple algorithms such as logistic/linear regression or decision trees can be trained in a short time.
Complex algorithms such as neural networks and ensemble learners are slow to train.
If you have access to a multi-core machine, this could significantly reduce the training time of more complex algorithms.
7. Prediction speed
The speed of generating the results is another important criterion for choosing a model.
If your model will be used in a real-time or production environment, it should be able to generate the results with very low latency.
Algorithms such as SVMs, linear/logistic regression, and some types of neural networks are extremely fast at prediction time.
You should also consider where you will deploy your model. If you are using the models for analysis or theoretical purposes, your prediction time can be longer which means you could use ensemble algorithms and very deep neural networks.
Why Simple Models
The two most common regression algorithms are:
Linear Regression (Regression)
Logistic Regression (Classification)
You should start with these simple models because:
It is likely that your problem does not need a complex algorithm
These two models have been studied thoroughly and are some of the most well-understood models in ML.
They are easy to implement and test.
They are easily interpretable since they are linear models.
To convert a regression problem to a classification problem, there are two common solutions:
Logistic Regression: binary classification
Softmax Regression: multiclass classification
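Both options apply a squashing function to linear scores: logistic regression uses the sigmoid to produce one probability, and softmax regression uses the softmax to produce one probability per class. The sketch below, using only the standard library, shows the two functions and checks that softmax over two classes reduces to the sigmoid. The example scores are made up for illustration.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Map a list of real-valued scores to probabilities that sum to 1."""
    largest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Multiclass: hypothetical linear scores for three classes.
probabilities = softmax([2.0, 1.0, 0.1])
predicted_class = probabilities.index(max(probabilities))
print("class probabilities:", [round(p, 3) for p in probabilities])
print("predicted class:", predicted_class)

# Binary: softmax over the scores [z, 0] gives the same value as sigmoid(z).
z = 1.5
print("sigmoid(z):        ", round(sigmoid(z), 6))
print("softmax([z, 0])[0]:", round(softmax([z, 0.0])[0], 6))
```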
In fact, I have recently worked on many projects in which the developers spent weeks or months trying to implement state-of-the-art DL algorithms from research papers only to have me show how Linear Regression and/or XGBoost outperformed all their complex models (in many cases achieving 95-98% accuracy on the test dataset). You should evaluate many algorithms to obtain baselines for comparison to justify your final model selection, so you should always know how simpler models perform on your dataset.
If you are doing research, a model-centric approach is acceptable provided someone has done an extensive evaluation of various models (including simpler models) on the same toy dataset. When you are using a custom dataset and/or solving real-world problems then you are performing AI Engineering (not research), so the rule of thumb is Occam’s Razor (“simpler is better” or “there is no such thing as best, just good enough”).
Multinomial Logistic Regression
Multinomial Logistic Regression (MLR) is a classification algorithm that extends logistic regression to support multiclass classification problems.
The primary assumptions of linear regression (simple and multiple) are:
Linearity: There is a linear relationship between the outcome and predictor variable(s).
Normality: The residuals (error calculated by subtracting the predicted value from the actual value) follow a normal distribution.
Homoscedasticity: The variability in the dependent variable is equal for all values of the independent variable(s).
With many independent variables, we often encounter other problems, such as multicollinearity, where variables that are supposed to be independent vary with each other, and the presence of categorical variables, such as an ocean temperature being classified as cool, warm, or hot instead of quantified in degrees.
Here are some tips for working with MLR:
When your MLR models get complicated, avoid trying to use coefficients to interpret changes in the outcome versus changes in individual predictors.
Create predictions while varying a sole predictor and observe how the prediction changes and use these changes to form your conclusions.
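The tip about varying a sole predictor can be sketched on a small regression model. The example below is an assumption-laden illustration: it uses NumPy, synthetic data, and a linear regression with an interaction term, chosen because in that case the coefficient of `x1` alone does not describe how the prediction responds to `x1`.

```python
import numpy as np

# Synthetic data: the effect of x1 on y depends on x2 through an interaction.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 1.5 * x1 * x2 + rng.normal(scale=0.1, size=n)

# Fit y ~ b0 + b1*x1 + b2*x2 + b3*x1*x2 by ordinary least squares.
design = np.column_stack([np.ones(n), x1, x2, x1 * x2])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)


def predict(x1_value, x2_value):
    """Prediction of the fitted model for a single point."""
    features = np.array([1.0, x1_value, x2_value, x1_value * x2_value])
    return float(features @ coef)


def effect_of_x1(x2_value, start=0.0, stop=1.0):
    """Change in prediction when only x1 moves from start to stop."""
    return predict(stop, x2_value) - predict(start, x2_value)


effect_low = effect_of_x1(x2_value=-1.0)
effect_high = effect_of_x1(x2_value=1.0)
print(f"coefficient of x1:            {coef[1]:.2f}")
print(f"effect of x1 when x2 = -1.0:  {effect_low:.2f}")
print(f"effect of x1 when x2 = +1.0:  {effect_high:.2f}")
```

Reading the coefficient of `x1` in isolation would suggest a single fixed effect, while the predictions show that the effect changes with the value at which `x2` is held.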
Some good tutorials on MLR are given in  and .
Understand AI Algorithms
You need to know what algorithms are available for a given problem, how they work, and how to get the most out of them. However, this does not mean you need to hand-code the algorithms from scratch.
Even if you are an experienced AI/ML engineer, you should know the performance of simpler models on your dataset/problem.
Here are some more topics that should be considered for model selection:
Parametric vs Nonparametric Algorithms
Supervised vs Unsupervised Algorithms
The Bias-Variance Trade-Off
How to Diagnose/Fix Overfitting and Underfitting?
How to create a data pipeline?
How to deal with small datasets?
How to deal with imbalanced datasets?
Feature Engineering Tools
Feature engineering (FE) is a fundamental ML topic, but one that is often overlooked or treated as deceptively simple.
There are many tools that will help you to automate the entire FE process and produce a large pool of features in a short period of time for both classification and regression tasks.
AutoML Tools
Automated Machine Learning (AutoML) is an emerging field in which the process of building machine learning models is automated.
A good complete example using PyCaret is given in A Beginner’s Guide to End to End Machine Learning
There are a plethora of AutoML Tools and ML Tools such as Orange, AutoGluon, and PyCaret that can be used to easily and quickly evaluate many models on a dataset.
The AI process discussed here can be used for solving almost any AI problem with some modifications, of course. There does not currently seem to be a clearly defined approach to solving AI problems, so this article attempts to present a consolidated approach from several textbooks and articles as well as discuss some of the issues involved such as model selection criteria and simpler models as well as provide some guidance to understanding AI algorithms. I plan to write some followup articles with end-to-end examples using the AI process. I also have a GitHub repo called LearnAI that some students and practitioners of AI may find useful.
[1] Nedgu BM, “Why 85% of AI projects fail,” Towards Data Science, Nov. 11, 2020.
[2] S. Reisner, “Why most AI implementations fail and what enterprises can do to beat the odds,” Venture Beat, June 28, 2021.
[3] J. F. DeFranco and J. Voas, “Reproducibility, Fabrication, and Falsification,” IEEE Computer, vol. 54, no. 12, 2021.
[4] T. Shin, “4 Reasons Why You Shouldn’t Use Machine Learning,” Towards Data Science, Oct. 5, 2021.
[5] E. Alpaydin, “Design and Analysis of Machine Learning Experiments,” in Introduction to Machine Learning, 3rd ed., MIT Press, ISBN: 978-0262028189, 2014.
[6] S. Russell and P. Norvig, “Developing Machine Learning Systems,” in Artificial Intelligence: A Modern Approach, 4th ed. Upper Saddle River, NJ: Prentice Hall, ISBN: 978-0-13-604259-4, 2021.
[7] S. Raschka and V. Mirjalili, Python Machine Learning, 2nd ed. Packt, ISBN: 978-1787125933, 2017.
[8] W. McKinney, Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, 2nd ed., O’Reilly Media, ISBN: 978-1491957660, 2017.
[9] J. Brownlee, “Applied Machine Learning Process,” Machine Learning Mastery, Feb. 12, 2014.
[10] J. Brownlee, “How to Evaluate Machine Learning Algorithms,” Machine Learning Mastery, Aug. 16, 2020.
[11] Y. Hosni, “Brief Guide for Machine Learning Model Selection,” MLearning.ai, Dec. 4, 2021.
[12] Z. Warnes, “How to Select an ML Model,” KD Nuggets, Aug. 2021.
[13] M. LeGro, “Interpreting Confusing Multiple Linear Regression Results,” Towards Data Science, Sep. 12, 2021.
[14] J. Brownlee, “Multinomial Logistic Regression With Python,” Machine Learning Mastery, Jan. 1, 2021.
[15] W. Xie, “Multinomial Logistic Regression in a Nutshell,” Data Science Student Society @ UC San Diego, Dec. 8, 2020.
[16] P. Bourque and R. E. Fairley, Guide to the Software Engineering Body of Knowledge, v. 3, IEEE, 2014.
[17] J. S. Damji and M. Galarnyk, “Considerations for Deploying Machine Learning Models in Production,” Towards Data Science, Nov. 19, 2021.
[18] J. Rodriguez, “7 Dimensions to Evaluate an AI Environment,” Towards AI, May 17, 2022.