Is AutoML Useful for Professional Data Scientists?

With the context above in mind, let us review the practical cases where certain AutoML tools can help automate the daily routines of Data Science / ML Engineering professionals. Practical case studies based on the public datasets of the Kaggle Tabular Playground competitions of Jan-Mar 2021 will illustrate the points made.

Since a huge variety of AutoML tools (both freeware and commercial) exists in the market nowadays, the question of whether AutoML is useful for ML/Data Science professionals should rather read, "What kind of automation is helpful to Data Science and ML professionals in completing their daily routines?"

A practical classification groups AutoML tools by the specific tasks they automate. From that standpoint, we can talk about

In the sections below, I am going to share my own experience with using some of the AutoML tools. As I go through, I will offer my opinions on what can be really helpful for professionals in the field, as opposed to less ML-savvy users.

Rapid EDA is one of the low-hanging fruits for AutoML tools to help professional Data Scientists in their daily routines. Obviously, a good enough Rapid EDA tool should meet a few basic criteria below

I have already completed several solid case studies to prove the usefulness of various freeware Rapid EDA tools.

As you can see, invoking AutoViz is as simple as writing the lines of code below
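The original post embedded the exact code; below is a minimal sketch of a typical AutoViz invocation, assuming a Tabular Playground train.csv with a dependent variable named 'target' (both the path and the column name are assumptions).

```python
import pandas as pd
from autoviz.AutoViz_Class import AutoViz_Class

# Hypothetical path to the competition training data
train_df = pd.read_csv("train.csv")

AV = AutoViz_Class()
report = AV.AutoViz(
    filename="",             # pass an empty filename when supplying a dataframe
    dfte=train_df,           # the dataframe to explore
    depVar="target",         # the dependent (target) variable
    verbose=1,               # 1 renders the charts inline
    lowess=False,            # disable lowess smoothing on larger datasets
    chart_format="svg",
    max_rows_analyzed=150000,
    max_cols_analyzed=30,
)
```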

AutoViz has a few parameters to tune before you run it on your dataset. However, there is a comprehensive guide on how to set it up on various datasets/problems to achieve the best exploratory insights at the end.

Note: to see AutoViz in action end-to-end, you can refer to the source code in the following notebooks

Automated feature selection and feature importance detection tools are extremely helpful when you try to explain your ML model or address the curse of dimensionality in your data (when the dataset has too many features for the model to deliver the most accurate prediction).

At the same time, the simplicity and power of featurewiz make it a more efficient tool to use compared to the analytical feature importance detection algorithms (see one of my earlier blog posts for more details on those algorithms).

The additional bonus of using featurewiz is its relatively new capabilities to automate the routines in feature engineering. Not only will it help you save some time developing your feature engineering pipelines, but it will also ensure the fast execution of your data preprocessing and feature engineering flows.

Below are the step-by-step descriptions of two featurewiz-backed feature selection experiments for the dataset of the Mar 2021 Tabular Playground competition. The first is a fairly basic feature importance detection run (without any heavy feature engineering or data preprocessing applied to the raw dataset). The second experiment demonstrates how featurewiz can be an instrumental assistant in situations where you must implement complex data preprocessing and feature engineering flows.

In this experiment, we are going to detect the feature importance for raw feature variables only.

First, we are going to read the competition datasets into the memory.
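A minimal sketch of this step, assuming the competition files have been downloaded locally (the paths are hypothetical):

```python
import pandas as pd

# Hypothetical local paths to the Mar 2021 Tabular Playground competition files
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
```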

As a next step, we will pass the datasets with the raw features only to featurewiz to detect feature importance
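A hedged sketch of such a basic featurewiz run follows; the target column name ('target') and the corr_limit value are assumptions, and the call shape follows the featurewiz 0.0.x API.

```python
from featurewiz import featurewiz

# Basic feature importance detection on the raw features only
selected_features, train_selected = featurewiz(
    train_df,
    target="target",
    corr_limit=0.70,      # drop one of any pair of features correlated above this
    verbose=2,
    feature_engg="",      # no automated feature engineering in this basic run
    category_encoders="",
)
print(selected_features)  # the list of features featurewiz deems important
```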

It took less than 3 minutes to run on my local machine. featurewiz was quite instrumental at detecting the important features that quickly.

As a side note, featurewiz allows you to quickly assess the impact of various category variable encoding techniques on the resulting feature importance.
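For example, a hedged sketch of re-running the selection with a different encoder; the 'OneHotEncoder' value is an assumption about the encoder options featurewiz accepts.

```python
# Re-run feature selection with a different category encoder to compare the
# resulting feature importance; 'OneHotEncoder' is an assumed option value.
selected_features_ohe, train_selected_ohe = featurewiz(
    train_df,
    target="target",
    corr_limit=0.70,
    verbose=2,
    category_encoders="OneHotEncoder",
)
```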

In a series of feature importance experiments (with AutoViz engaged in the way described above), we got the following insights regarding feature engineering and data preprocessing for the raw dataset of the Mar 2021 Tabular Playground competition in Kaggle

1/ The following continuous variables proved useful in the feature importance experiment and are left 'as is'

2/ The rest of the continuous variables to be binned as follows

3/ New groupby features to be added

4/ New interaction cat variables (feature crosses) to be added

5/ New interaction continuous variables to be added, then binned as follows

6/ Boolean cat variables to be left as is

7/ log transform cont5, cont8, and cont7

As we are going to see, the above-mentioned feature engineering pipeline can be easily facilitated with featurewiz. Its recent versions (namely, version 0.0.33 as of the writing of this blog post) provide powerful functions to help you automate all of the above-mentioned preprocessing and feature engineering steps with just a few lines of code. Let's see it in action.

We can add the new interaction features as follows
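Since the exact featurewiz helper calls from the original post are not preserved here, the snippet below is an illustrative plain-pandas sketch of this step; the column names are assumptions.

```python
# Illustrative interaction features between continuous variables
for df in (train_df, test_df):
    df["cont1_x_cont2"] = df["cont1"] * df["cont2"]
    df["cont1_div_cont2"] = df["cont1"] / (df["cont2"] + 1e-6)
```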

Binning the continuous variables can be achieved as follows
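Again, in place of the original featurewiz call, here is an illustrative pandas sketch; the column name (cont3) and the number of bins (10) are assumptions.

```python
import pandas as pd

# Equal-frequency binning: learn the bin edges on train, reuse them on test
train_df["cont3_bin"], bin_edges = pd.qcut(
    train_df["cont3"], q=10, labels=False, retbins=True, duplicates="drop"
)
test_df["cont3_bin"] = pd.cut(
    test_df["cont3"], bins=bin_edges, labels=False, include_lowest=True
)
```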

Categorical feature crosses can be added as new features in the following manner
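An illustrative sketch of a feature cross (the original featurewiz call is not reproduced; the column names cat0/cat1 are assumptions):

```python
# Categorical feature cross: concatenate two category columns into one feature
for df in (train_df, test_df):
    df["cat0_x_cat1"] = df["cat0"].astype(str) + "_" + df["cat1"].astype(str)
```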

The groupby aggregate features can be added as follows
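An illustrative pandas equivalent of this step (column names are assumptions):

```python
# Group-by aggregate feature: mean of a continuous variable per category level,
# computed on train and merged into both train and test
group_means = train_df.groupby("cat0")["cont3"].mean().rename("cont3_mean_by_cat0")
train_df = train_df.join(group_means, on="cat0")
test_df = test_df.join(group_means, on="cat0")
```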

Last but not least, the easy-to-apply log transform (with all the little tricks like "plus one" on non-positive values, etc.) can also be easily facilitated with featurewiz, with just a few lines of code
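An illustrative numpy/pandas equivalent of the shifted log transform for the columns named in the pipeline above:

```python
import numpy as np

# Offset each column so its minimum maps to zero, then apply log1p
# (the "plus one" trick that keeps the transform defined for non-positive values)
for col in ["cont5", "cont7", "cont8"]:
    shift = train_df[col].min()
    for df in (train_df, test_df):
        df[col + "_log"] = np.log1p(df[col] - shift)
```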

As we can see, featurewiz is quite instrumental at automating the common routine steps in data preprocessing and feature engineering, along with its core mission of detecting the important features for ML modelling down the road.

After the above-mentioned feature engineering and data preprocessing, we are ready to launch feature importance detection with featurewiz, similar to what has been demonstrated in the basic experiment above.

It took less than 9 minutes on my local computer to run this pipeline and feature selection experiment end-to-end. I should say, that is an amazingly short time, factoring in the complexity of the data transformations and the set of new features we obtained.

Automated hyperparameter tuning tools designate one more area of useful automation of daily routines for professional Data Scientists. They help shrink the time spent on tedious model parameter tuning while being more effective than the classical grid search algorithms.

It is especially helpful when you tune GBDT models (like lightgbm, xgboost, catboost, etc.), where there is a huge number of essential hyperparameters to tweak. Therefore, doing it manually or via a classic grid search may not be the best way to spend your time.

I found it equally helpful to use hyperopt and optuna in this capacity.

Tools like hyperopt or optuna can be successfully leveraged to speed up parameter tuning of both individual models and ensembles of different learners.

Below is a case study demonstrating how you can combine the power of ensemble learning with hyperopt to tune the hyperparameters of each of the three models in the ensemble.

First, let’s review the targets of this ML experiment. They are as follows

This will allow us to pass an instance of the ensemble modeler class wherever a scikit-learn Classifier object can be passed (including the search function for hyperopt).

We will rely on the scikit-learn Classifier interfaces provided by the maintainers of the lightgbm, xgboost, and catboost libraries (rather than their native interfaces), for unification and simplicity.
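The original post defined the custom ensemble class in a code gist that is not preserved here. Below is a hypothetical minimal reconstruction: it wraps the three scikit-learn-compatible classifiers, averages their predicted probabilities, and routes a single flat parameter dictionary to the right model via prefixed keys (the class name, the averaging strategy, and the 'lgb__'/'xgb__'/'cat__' prefixes are assumptions).

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier


class ThreeModelEnsembleClassifier(BaseEstimator, ClassifierMixin):
    """Hypothetical sketch: average the predicted probabilities of three GBDTs.

    Hyperparameters arrive in one flat dict with prefixed keys
    (e.g. 'lgb__num_leaves'), so a single search space can drive all models.
    """

    def __init__(self, params=None):
        self.params = params

    def _params_for(self, prefix):
        params = self.params or {}
        return {key.split("__", 1)[1]: value
                for key, value in params.items()
                if key.startswith(prefix + "__")}

    def fit(self, X, y):
        self.models_ = [
            LGBMClassifier(**self._params_for("lgb")),
            XGBClassifier(**self._params_for("xgb")),
            CatBoostClassifier(verbose=0, **self._params_for("cat")),
        ]
        for model in self.models_:
            model.fit(X, y)
        self.classes_ = self.models_[0].classes_
        return self

    def predict_proba(self, X):
        return np.mean([m.predict_proba(X) for m in self.models_], axis=0)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]
```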

Our next step would be building the hyperopt search function as per the code fragment below
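A hedged sketch of such a search function, assuming the hypothetical ThreeModelEnsembleClassifier above, training data already loaded into X_train/y_train, a binary target scored with ROC AUC, and a search_space dictionary defined in the next snippet (all of these names are assumptions):

```python
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK
from sklearn.model_selection import cross_val_score


def objective(space):
    # Build the ensemble with the sampled hyperparameters and cross-validate it
    model = ThreeModelEnsembleClassifier(params=space)
    score = cross_val_score(
        model, X_train, y_train, cv=3, scoring="roc_auc", n_jobs=-1
    ).mean()
    # hyperopt minimizes the loss, so return the negated AUC
    return {"loss": -score, "status": STATUS_OK}


trials = Trials()
best_params = fmin(
    fn=objective,
    space=search_space,   # the prefixed search space shown below
    algo=tpe.suggest,
    max_evals=50,
    trials=trials,
)
```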

There are several highlights of the implemented search function

The appropriate parameter space is handled via the special naming convention we adopt, which is supported by the constructor of our custom ensemble class (see above). As a result, we have to prepare the dictionary with the hyperopt search space attributes in a special way, as displayed below
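A hypothetical example of such a search-space dictionary; the specific parameters, value ranges, and the 'lgb__'/'xgb__'/'cat__' prefixes are assumptions that match the sketch of the ensemble constructor above.

```python
from hyperopt import hp

search_space = {
    # LightGBM parameters
    "lgb__num_leaves": hp.choice("lgb__num_leaves", list(range(16, 256, 16))),
    "lgb__learning_rate": hp.loguniform("lgb__learning_rate", -5, -1),
    # XGBoost parameters
    "xgb__max_depth": hp.choice("xgb__max_depth", list(range(3, 12))),
    "xgb__learning_rate": hp.loguniform("xgb__learning_rate", -5, -1),
    # CatBoost parameters
    "cat__depth": hp.choice("cat__depth", list(range(4, 10))),
    "cat__learning_rate": hp.loguniform("cat__learning_rate", -5, -1),
}
```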

Once hyperopt detects the optimal set of hyperparameters for each model in the ensemble, we do the usual model prediction as per the scikit-learn contract.
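For instance, continuing the hypothetical sketch above:

```python
from hyperopt import space_eval

# Map hyperopt's choice indices back to actual values, refit on the full
# training data, and predict per the usual scikit-learn contract
best_space = space_eval(search_space, best_params)
final_model = ThreeModelEnsembleClassifier(params=best_space).fit(X_train, y_train)
test_predictions = final_model.predict_proba(X_test)[:, 1]
```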

So far so good? Does it sound like AutoML is going to be paramount for professional Data Scientists and ML Engineers? Unfortunately, not everything is so bright. We can see certain drawbacks if we touch on two more types of AutoML products

I am going to review them in the next blog post of this series. That post will also cover the eternal myth about AutoML: that AI and AutoML will replace the highly skilled (and often highly paid) ML Engineers and Data Science professionals in the medium or long run.

Despite the original intention of AutoML to serve the needs of non-Machine Learning professionals, it will affect (and is affecting) the experts in the field as well.

I tend to share the second point of view myself. I therefore use AutoML instruments for what they are (that is, auxiliary tools), with no expectation of a magic wand that solves any ML/DL problem for me. I also put in continual effort to fill my own toolbox with additional knowledge.

As shown in the sections above, the use cases where you can leverage AutoML tools include the scenarios below

Obviously, junior Data Scientists / ML Engineers still must learn what is under the hood of the methods automated by the AutoML tools above. In particular, this implies doing everything mentioned above manually a couple of times, using the standard tools offered by pandas, scikit-learn, and other lower-level Python libraries for ML and Data Science. In such a way, you will actually comprehend what is going on behind the AutoML magic, as well as control which tools you use to tackle a specific problem.

After that, you can fearlessly leverage AutoML capabilities to save time on routine operations for the sake of tackling more complex or unstructured problems.

However, regardless of the involvement of model and pipeline automation tools beyond model training, the ML field is going through an explainability crisis now. That is where the real opportunity for ML professionals/Data Scientists lies.

If you can establish yourself as a data professional who understands the datasets you are working with (data cleaning, identifying data leaks, etc.) and creates models that are explainable (and/or statistically sound), you will find yourself on the right track. Creating explainable ML models, however, is not just about using 'explainable' ML algorithms alone. Nowadays it is also about using analytical feature importance methods or tools like SHAP that bring explainability to any modern model trained with 'non-explainable' yet powerful algorithms (like modern neural networks, GBDT variations, etc.).

If you would like to delve deeper into AutoML technologies, you are welcome to continue your research with the resources below

You can refer to the public datasets for the respective Tabular Playground competitions in Kaggle in Jan — Mar 2021 per the links below

You can find my repositories with the code for the various experiments with the datasets of these competitions by navigating to the github repos below
