
Hyperparameter tuning


What is Hyperparameter Tuning?

When creating a machine learning model, you'll be presented with design choices as to how to define your model architecture. Often, we don't immediately know what the optimal model architecture should be for a given problem, and thus we'd like to be able to explore a range of possibilities. In true machine learning fashion, we'll ideally ask the machine to perform this exploration and select the optimal model architecture automatically. Parameters which define the model architecture are referred to as hyperparameters, and thus this process of searching for the ideal model architecture is referred to as hyperparameter tuning.

Benefits of tuning the hyperparameters

Hyperparameters are the knobs or settings that can be tuned before running a training job to control the behavior of an ML algorithm. They can have a big impact on model training as it relates to training time, infrastructure resource requirements (and as a result cost), model convergence and model accuracy. Model parameters are learned as part of the training process, whereas the values of hyperparameters are set before running the training job and do not change during training.

Tuning your hyperparameters helps you answer questions like the following (a short sketch after the list shows where some of these appear as estimator arguments):

  • What degree of polynomial features should I use for my linear model?
  • What should be the maximum depth allowed for my decision tree?
  • What should be the minimum number of samples required at a leaf node in my decision tree?
  • How many trees should I include in my random forest?
  • How many neurons should I have in my neural network layer?
  • How many layers should I have in my neural network?
  • What should I set my learning rate to for gradient descent?
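A minimal sketch, assuming scikit-learn, of where the hyperparameters from the questions above show up as constructor arguments. The specific values are illustrative, not recommendations.

```python
# Hypothetical values shown only to illustrate which argument controls which knob.
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier

poly = PolynomialFeatures(degree=2)                       # degree of polynomial features
tree = DecisionTreeClassifier(max_depth=5,                # maximum depth of the decision tree
                              min_samples_leaf=10)        # minimum samples required at a leaf node
forest = RandomForestClassifier(n_estimators=200)         # number of trees in the random forest
sgd = SGDClassifier(learning_rate="constant", eta0=0.01)  # learning rate for gradient descent
```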

Points to keep in mind

  • To be absolutely clear: hyperparameters are not model parameters, and they cannot be directly trained from the data. Model parameters are learned during training, when we optimize a loss function using something like gradient descent.
  • It is important to avoid optimizing the hyperparameters with the same data you train on, because that fits both stages of your model to the same source of data and can lead to overfitting; use a separate validation set (or cross-validation) for tuning.
  • Be careful about how you sample the original data source. When dealing with highly skewed categorical features, random sampling can lead to categories in the test set which are never observed during training, which can cause some models to break. Numerical features should also have a similar distribution between the training and the test set; a short sketch below shows one way to guard against this with a stratified split.
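A minimal sketch, assuming scikit-learn and a pandas DataFrame `df` with a label column `"label"` (both hypothetical names), of a stratified split that keeps class proportions similar across the training and test sets:

```python
# `df` and the column name "label" are assumptions for illustration only.
from sklearn.model_selection import train_test_split

X = df.drop(columns=["label"])
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,        # preserve the label distribution in both splits
    random_state=42,
)
```

The same `stratify=` argument can be pointed at a skewed categorical column instead of the label when that column is the one at risk of missing categories in the test set.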

Different types of Hyperparameters

Hyperparameters can be broadly divided into three categories:

  1. Model hyperparameters: define the fundamental construct of the model itself, for example attributes of a neural network architecture such as filter size, pooling, stride, and padding.

  2. Optimizer hyperparameters: relate to how the model learns patterns from the data. These include the choice of optimizer, such as gradient descent, stochastic gradient descent (SGD), Adam, RMSprop, or Adadelta, and its settings such as the learning rate. More details are available on the Keras Optimizers page.

  3. Data hyperparameters: relate to attributes of the data, and are often used when you don't have enough data or enough variation in it. In such cases, data augmentation techniques like cropping, resizing, and binarization come into play (see the sketch after this list).
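A minimal sketch, assuming TensorFlow/Keras, that puts the three categories side by side: model hyperparameters (filter size, stride, padding), optimizer hyperparameters (choice of optimizer and its learning rate), and data hyperparameters (augmentation settings). All values are illustrative assumptions, not tuned recommendations.

```python
import tensorflow as tf

# Model hyperparameters: filter count, kernel size, stride, padding, pooling.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(filters=32, kernel_size=3, strides=1, padding="same",
                           activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(pool_size=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Optimizer hyperparameters: which optimizer, and its learning rate.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Data hyperparameters: augmentation settings such as rotations and shifts.
augmenter = tf.keras.preprocessing.image.ImageDataGenerator(
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
)
```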

Hyperparameter tuning methods

Hyperparameter tuning methods relate to how we sample candidate model architectures from the space of possible hyperparameter values. This is often referred to as "searching" the hyperparameter space for the optimal values.

Grid search is arguably the most basic hyperparameter tuning method. With this technique, we simply build a model for each possible combination of the hyperparameter values provided, evaluate each model, and select the architecture which produces the best results. Each model is fit to the training data and evaluated on the validation data. This is an exhaustive sampling of the hyperparameter space and can be quite inefficient.
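A minimal sketch of grid search, assuming scikit-learn's GridSearchCV and pre-existing `X_train`, `y_train` arrays; every combination of the listed values is fit and scored with cross-validation.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Discrete candidate values for each hyperparameter.
param_grid = {
    "n_estimators": [100, 200, 400],
    "max_depth": [4, 8, None],
    "min_samples_leaf": [1, 5, 10],
}

grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)   # fits one model per combination: 3 * 3 * 3 = 27 candidates
print(grid.best_params_, grid.best_score_)
```

The cost grows multiplicatively with each hyperparameter added to the grid, which is exactly the inefficiency noted above.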

Random search differs from grid search in that we no longer provide a discrete set of values to explore for each hyperparameter; rather, we provide a statistical distribution for each hyperparameter from which values are randomly sampled. The main theoretical motivation for using random search in place of grid search is that, in most cases, hyperparameters are not equally important, so the method works best when only a few of them dominate model performance. While this isn't always the case, the assumption holds true for most datasets.
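A minimal sketch of random search, assuming scikit-learn's RandomizedSearchCV, SciPy distributions, and pre-existing `X_train`, `y_train`; instead of a fixed grid, we pass distributions and a fixed sampling budget.

```python
from scipy.stats import randint, uniform
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Distributions to sample from, rather than discrete value lists.
param_distributions = {
    "n_estimators": randint(100, 500),
    "max_depth": randint(3, 12),
    "min_samples_leaf": randint(1, 20),
    "max_features": uniform(0.1, 0.9),   # continuous values in [0.1, 1.0)
}

search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                            param_distributions,
                            n_iter=30,    # sampling budget: 30 random candidates
                            cv=5, scoring="accuracy", random_state=42)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```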

Bayesian Optimization

The previous two methods performed individual experiments, building models with various hyperparameter values and recording the model performance for each. Because each experiment is performed in isolation, it's very easy to parallelize this process. However, it also means we're not able to use the information from one experiment to improve the next. Bayesian optimization belongs to a class of sequential model-based optimization (SMBO) algorithms that use the results of previous iterations to improve the sampling for the next experiment: a probabilistic surrogate model of the objective (for example, a Gaussian process) is updated after each experiment, yielding a posterior over how different hyperparameter values are expected to perform. We then choose the next candidate hyperparameter values according to this posterior expectation, and iteratively repeat the process until converging to an optimum.
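A minimal sketch of Bayesian optimization using scikit-optimize's `gp_minimize` (a Gaussian-process surrogate), assuming pre-existing `X_train`, `y_train`; the surrogate of the cross-validation score guides which hyperparameters to evaluate next.

```python
from skopt import gp_minimize
from skopt.space import Integer, Real
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Search space for the surrogate model.
space = [Integer(50, 500, name="n_estimators"),
         Integer(2, 12, name="max_depth"),
         Real(0.1, 1.0, name="max_features")]

def objective(params):
    n_estimators, max_depth, max_features = params
    model = RandomForestClassifier(n_estimators=n_estimators,
                                   max_depth=max_depth,
                                   max_features=max_features,
                                   random_state=42)
    # gp_minimize minimizes, so return the negative mean CV accuracy.
    return -cross_val_score(model, X_train, y_train, cv=5).mean()

result = gp_minimize(objective, space, n_calls=25, random_state=42)
print(result.x, -result.fun)
```

Unlike grid or random search, each of the 25 calls here depends on the results of the previous ones, which is why this approach is harder to parallelize.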

Libraries used for tuning the parameters

There are many wonderful open-source libraries to experiment with. Try out the following and see if they fit your requirements:

  1. Ray Tune: Hyperparameter Optimization Framework
  2. Optuna
  3. Bayesian Optimization
  4. scikit-optimize
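As one example of the define-by-run style these libraries use, here is a minimal sketch with Optuna, assuming pre-existing `X_train`, `y_train`; Optuna's default sampler (TPE) is itself a form of sequential model-based optimization.

```python
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Hyperparameters are sampled inside the objective function.
    model = RandomForestClassifier(
        n_estimators=trial.suggest_int("n_estimators", 50, 500),
        max_depth=trial.suggest_int("max_depth", 2, 12),
        random_state=42,
    )
    return cross_val_score(model, X_train, y_train, cv=5).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=25)
print(study.best_params, study.best_value)
```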
