This is the endgame: automating the automation process itself!
AutoML is the next major disruptor in the AI-ML world. With its ability to perform data pre-processing, ETL tasks, and transformations, it is likely to become the most popular trend in the coming years.
In this blog post, we will explore the need for AutoML based solutions and their capabilities.
What is AutoML?
A Machine Learning cycle has the following sub-processes:
- Data pre-processing
- Feature engineering
- Feature selection
- Feature extraction
- Algorithm selection
- Hyperparameter tuning
AutoML is the process of automating the above sub-processes and finding the best combination of parameters to create an efficient, optimized ML model.
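To make the idea concrete, here is a minimal sketch of the "algorithm selection plus hyperparameter tuning" part of that loop, written in plain Python. Everything in it is an illustrative assumption rather than the internals of any real AutoML tool: the toy dataset, the two candidate algorithms (a one-feature linear fit and k-nearest-neighbours), and the helper names `cv_error` and `auto_select` are all hypothetical.

```python
import random
from statistics import mean


def knn_predict(train, x, k):
    """Predict by averaging the targets of the k nearest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return mean(y for _, y in nearest)


def fit_linear(train):
    """Ordinary least squares for a single feature; returns a predict function."""
    xs = [x for x, _ in train]
    ys = [y for _, y in train]
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in train) / var if var else 0.0
    return lambda x: my + slope * (x - mx)


def cv_error(data, make_predictor, folds=4):
    """Mean squared error averaged over simple k-fold cross-validation."""
    errors = []
    for i in range(folds):
        test = data[i::folds]
        train = [p for j, p in enumerate(data) if j % folds != i]
        predict = make_predictor(train)
        errors.append(mean((predict(x) - y) ** 2 for x, y in test))
    return mean(errors)


def auto_select(data):
    """Score every candidate (algorithm, hyperparameter) pair and keep the best."""
    candidates = {"linear": fit_linear}
    for k in (1, 3, 5):  # hypothetical hyperparameter grid for k-NN
        candidates[f"knn(k={k})"] = (
            lambda train, k=k: (lambda x: knn_predict(train, x, k))
        )
    scores = {name: cv_error(data, make) for name, make in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Toy data (an assumption for illustration): a noisy straight line.
rng = random.Random(0)
data = [(x, 2.0 * x + 1.0 + rng.gauss(0, 0.5)) for x in range(40)]

best, scores = auto_select(data)
print("best candidate:", best)
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name:10s} cv-mse = {score:.3f}")
```

Real AutoML tools search far larger spaces and also automate pre-processing and feature engineering, but the core pattern is the same: enumerate candidates, score each with cross-validation, and keep the winner.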
Why is there a need for AutoML?
The different stages of the Machine Learning process listed above require optimization of values and configurations, which is a highly time-intensive task. For a high-performing model of medium complexity, the optimization process may take several weeks to months. AutoML substantially reduces the time and effort needed for some (but not all) phases of the above pipeline.
In the past few years, the landscape of data science libraries and methodologies has expanded substantially. With this comes a greater need for robust, scalable solutions for data science research activities. AutoML can play a major role in this area.
Additionally, AutoML promotes the democratization of the AI-ML space by bringing ML capabilities to companies with limited data science skills. Such organizations can leverage AutoML to set up their processes without deep knowledge of Machine Learning.
Is AutoML worth your time?
With the increasing need for more insights from big data, organizations are trying to improve their predictive power by leveraging the abilities of complex automated machine learning.
Let us explore the use of AutoML tools based on various roles in the data science domain to understand who can better leverage the capabilities of these tools.
1. Business Analyst:
In a typical Machine Learning pipeline, the role of a business analyst is to:
- Identify business use cases
- Collect data
- Hypothesize features based on subject matter expertise
- Validate the visualization outcomes of the model
Consider a typical problem where an e-commerce organization wants to use past sales data to predict the sales volumes of different products over a year.
Typically, a business analyst would spend a significant amount of time learning which features influence the result, visualizing and creating new features from existing ones, calculating feature correlations, and coming up with a combination that gives good results. This process is slow and still does not guarantee the best results, since the solution is limited by the expertise of the analyst.
Here, AutoML solutions can make a huge difference. AutoML tools can provide a business analyst with automatic feature selection, feature extraction, and relevant feature combinations, paired with the algorithm best suited to the use case, all of which help produce strong results in far less time.
AutoML tools have been revolutionary for business analysts and subject matter experts, who cannot dedicate adequate time to diving deep into ML algorithmic implementations.
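As a rough illustration of the correlation step in the sales example above, the sketch below ranks candidate features by the absolute Pearson correlation with the target. The feature names (`discount`, `ad_spend`, `store_id`), the synthetic data, and the helper `rank_features` are all hypothetical, and correlation ranking is only one simple selection technique; it is not a claim about what any particular AutoML tool does internally.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def rank_features(columns, target):
    """Return (feature, |correlation with target|) pairs, strongest first."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Synthetic sales data (an assumption for illustration only).
rng = random.Random(42)
n = 200
discount = [rng.uniform(0.0, 0.5) for _ in range(n)]
ad_spend = [rng.uniform(0.0, 100.0) for _ in range(n)]
store_id = [rng.randint(1, 50) for _ in range(n)]  # unrelated to sales by design
sales = [
    100 + 300 * d + 1.0 * a + rng.gauss(0, 5)
    for d, a in zip(discount, ad_spend)
]

columns = {"discount": discount, "ad_spend": ad_spend, "store_id": store_id}
ranked = rank_features(columns, sales)
for name, score in ranked:
    print(f"{name:10s} |r| = {score:.2f}")
```

Pearson correlation only captures linear relationships, which is one reason automated tools usually combine several selection strategies instead of relying on a single score.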
2. Data Scientist:
Although data scientists have the knowledge to build basic as well as advanced machine learning models spanning from statistical models to deep learning ones, there are certain constraints that affect the final performance of these models:
- Time constraints: Finding the best algorithm for your problem in the pool of algorithms and then determining the best set of hyperparameters for those algorithms is a time-intensive task. Basic models can take days, but deep learning models can span from weeks to months.
- Lack of knowledge of all algorithms: To tune the hyperparameters of these algorithms, one should have in-depth knowledge of every basic and advanced state-of-the-art algorithm. Since algorithms in the machine learning domain keep improving, keeping abreast of the latest solutions is equally important.
- Lack of expertise in Feature Engineering: Though model selection and hyperparameter tuning are equally important, we know that a model is only as good as its data. Feature engineering demands sufficient time and expertise so that algorithms can make the best use of the data and their results are more relatable to the real world.
AutoML can work as a helper tool for a data scientist, assisting with:
- Necessary data pre-processing for structured and unstructured data
- Automating the process of model selection from a pool
- Creating an end-to-end pipeline quickly for baselining the initial results after receiving the data
After a baseline model is created, it can be further tuned manually for more refined results.
The benefits of AutoML tools can be gauged from the fact that H2O (a very popular AutoML tool) can create and tune approximately 400 models in one hour for a medium-complexity dataset.
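The "quick baseline, then manual refinement" workflow can be sketched as a budget-limited random search. This is a generic technique shown purely for illustration, not H2O's actual algorithm. The search space, the parameter names `learning_rate` and `max_depth`, and especially `validation_loss` (a stand-in formula where a real pipeline would train and score a model) are hypothetical.

```python
import math
import random


def validation_loss(params):
    """Stand-in for 'train a model and score it on held-out data'.

    Hypothetical formula: lowest near learning_rate = 0.01 and max_depth = 6.
    """
    lr, depth = params["learning_rate"], params["max_depth"]
    return (math.log10(lr) + 2) ** 2 + 0.1 * (depth - 6) ** 2


def random_search(space, evaluate, max_trials, seed=0):
    """Sample configurations at random and keep the best one within the budget."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    history = []  # best loss seen after each trial
    for _ in range(max_trials):
        params = {name: sample(rng) for name, sample in space.items()}
        loss = evaluate(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
        history.append(best_loss)
    return best_params, best_loss, history


# Hypothetical search space: each entry draws one value from a random generator.
space = {
    "learning_rate": lambda rng: 10 ** rng.uniform(-4, 0),  # log-uniform
    "max_depth": lambda rng: rng.randint(2, 12),
}

best_params, best_loss, history = random_search(space, validation_loss, max_trials=50)
print("baseline configuration:", best_params)
print("baseline loss:", round(best_loss, 4))
```

The returned `best_params` plays the role of the baseline: a data scientist can then narrow the search ranges around it or tune individual parameters by hand.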
Benchmarking Study by Sopra Steria on AutoML tools:
We conducted an extensive study of AutoML tools with the aim of identifying the most appropriate tool for a given use case. The following parameters were considered when comparing the tools:
- Cost efficiency
- Time Efficiency
- Impact on Productivity
- Manual Modelling vs AutoML
- Usability for Data Citizens
- Usability for Data Scientists
- Data Preprocessing
- Feature Engineering
- Feature Selection
- Model Selection
- Hyperparameter Optimisation
- Model Interpretability
- Model Explainability
- Offline Deployment
The study covered several popular AutoML tools and gave us insight into each of them:
- Amazon SageMaker
- Azure AutoML
- IBM AutoML
- GCP AutoML
Usage of AutoML tools at Sopra Steria:
At Sopra Steria, we are developing solutions for different business problems using AI-ML technologies. For this, we are using AutoML tools to enable us to arrive at baseline conclusions quickly.
Selecting a suitable AutoML tool is a use case-specific task, though a few tools, such as H2O and H2O Driverless AI, are suitable for most use cases.
For example, our product for Automatic Ticket Resolution uses a text classification approach to predict solutions. We decided to use both H2O and TPOT to get an optimised AutoML pipeline for quick prototyping.
H2O provides extensive features for creating state-of-the-art ensemble models and cross-validation techniques. TPOT is a great tool for automatically creating pipelines that cover different pre-processing and feature engineering steps. Using both, we quickly experimented with hundreds of state-of-the-art classification models to arrive at the best possible combination of algorithm and results.
Similarly, for a project that required delivery date prediction, we used a deep learning-based approach to predict the appropriate delivery date. For this use case, we used AutoKeras to create a problem-specific deep learning architecture (neural network) and optimised it for best results. Here, AutoKeras was the best fit since it uses advanced neural architecture search algorithms, and it saved us a tremendous amount of time and manual effort.
AutoML tools are a big asset for the research and development activities being performed at Sopra Steria.
AutoML holds great promise in helping companies with limited data science expertise build their ML applications. Since it reduces the time needed to implement ML processes, data scientists can automate mundane tasks and spend their time on more complex ones.
So, yes, it is the right time to start Automating the Automation Process!
Senior Software Engineer