Application Optimization

In this tutorial, we show how to optimize the performance of a toy application (but it could be a real-life application like Pl@ntNet, see our article). The optimization algorithm aims to find the infrastructure parameters (e.g., the number of workers/machines that process the application) and the application parameters (e.g., the number of cores and the amount of memory per worker) that minimize the execution time.

In this example you will learn how to:

  • Define an optimization problem (mathematical definition) and then express it in E2Clab as a User-Defined Optimization

  • Use the Bayesian Optimization method and the Extra Trees Regressor algorithm provided by scikit-optimize (users can also use other libraries such as Ax, BayesOpt, BOHB, Dragonfly, etc.). Which search algorithm to choose?

  • Define the parallelism level of the application deployment on Grid’5000 (but it could be on FIT IoT LAB, Chameleon, or a combination of resources from various testbeds)

  • Use a User-Defined Optimization to manage the optimization, for instance, to change the infrastructure (layers_services.yaml) and application (my_application.py) parameters

  • Execute experiments and analyze the optimization results

The optimization problem

What is the infrastructure configuration and software configuration that minimizes the user response time?

The optimization problem to be solved can be stated as follows (Equation 1):

Find \((num\_workers, cores\_per\_worker, memory\_per\_worker)\), in order to
Minimize \(user\_response\_time\)
Subject to
\(1 \leq num\_workers \leq 10\)
\(20 \leq cores\_per\_worker \leq 50\)
\(1 \leq memory\_per\_worker \leq 3\)

Experiment Artifacts

$ cd ~/git/
$ git clone https://gitlab.inria.fr/E2Clab/examples/workflow_optimization

In this repository you will find:

  • the E2Clab configuration files (layers_services.yaml, network.yaml, and workflow.yaml), as well as the UserDefinedOptimization.py

  • the toy application my_application.py

Defining the Experimental Environment

Layers & Services Configuration

This configuration file defines the layers and services that compose this example. We request resources from the Grid’5000 environment: g5k. We define the cloud layer and add the myapplication service to it. The service runs on a single machine (quantity: 1). In our optimization problem, num_workers will change quantity: so that the application is deployed on multiple machines (\(1 \leq num\_workers \leq 10\)).

 1environment:
 2  job_name: optimization
 3  walltime: "00:05:00"
 4  g5k:
 5    job_type: ["allow_classic_ssh"]
 6    cluster: ecotype
 7layers:
 8- name: cloud
 9  services:
10  - name: myapplication
11    quantity: 1
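
For example, if the search algorithm suggests num_workers = 6, the User-Defined Optimization (shown later in this tutorial) rewrites the service entry before the deployment, so the relevant part of layers_services.yaml would become:

layers:
- name: cloud
  services:
  - name: myapplication
    quantity: 6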

The toy application

All the optimization variables (\(1 \leq num\_workers \leq 10\), \(20 \leq cores\_per\_worker \leq 50\), and \(1 \leq memory\_per\_worker \leq 3\)) are passed to the application as follows:

$ python my_application.py --config "{{ optimization_config }}"
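
For illustration, if the search algorithm suggests num_workers = 6, cores_per_worker = 48, and memory_per_worker = 2, the rendered command would look as follows (the application parses the dictionary string with ast.literal_eval):

$ python my_application.py --config "{'num_workers': 6, 'cores_per_worker': 48, 'memory_per_worker': 2}"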

To emulate the application behavior as a function of the infrastructure and software configuration, we defined the equation presented in lines 20 to 23, with workload_size = 100 and communication_cost = 2 (the cost of communication between workers). A worked example follows the listing.

 1import time
 2import argparse
 3import ast
 4
 5parser = argparse.ArgumentParser()
 6parser.add_argument(
 7    "--config",
 8    type=str,
 9    required=True,
10    help="Application configuration suggested by the optimization algorithm",
11)
12args = parser.parse_args()
13
14_config = ast.literal_eval(args.config)
15
16
17print(f" ******* optimization config = {_config}")
18workload_size = 100
19communication_cost = 2
20user_response_time = \
21    _config['num_workers'] * communication_cost + \
22    workload_size/(_config['cores_per_worker']*_config['num_workers']) + \
23    workload_size/(_config['memory_per_worker']*_config['num_workers'])
24
25print(" Running...")
26time.sleep(user_response_time)
27print(f" ******* user_response_time = {user_response_time}")
28
29with open('results.txt', 'w') as f:
30    f.write(f'user_response_time,{user_response_time},{args.config}')
31
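
As a worked example, applying the equation in lines 20 to 23 to the best configuration reported later in this tutorial (num_workers = 6, cores_per_worker = 48, memory_per_worker = 2) gives \(6 \times 2 + 100/(48 \times 6) + 100/(2 \times 6) = 12 + 0.35 + 8.33 \approx 20.68\) seconds, which matches the user_response_time of the best trial in the results table.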

Network Configuration

In this example, we do not have an optimization variable related to the network configuration, but we could define one, just as we did for layers_services.yaml (see the sketch after the listing). In this case, no changes are required in the network.yaml file.

1networks:
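
If the network were part of the search space, the User-Defined Optimization could rewrite network.yaml before each deployment in the same way it rewrites layers_services.yaml. The sketch below is hypothetical: the edge layer and the delay and rate values are assumptions, not part of this example.

networks:
- src: cloud
  dst: edge
  delay: "50ms"
  rate: "1gbit"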

Workflow Configuration

This configuration file describes the application workflow.

  • The Cloud application cloud.*:

prepare copies the application from the local machine to the remote machine.

launch executes the application using the configuration suggested by the search algorithm.

finalize, after the experiment ends, copies the results from the remote machine to the local machine. The results.txt file contains the user_response_time (its value depends on the infrastructure and software configuration).

 1- hosts: cloud.*
 2  prepare:
 3    - debug:
 4        msg: "Copying files"
 5    - copy:
 6        src: "{{ working_dir }}/my_application.py"
 7        dest: "/tmp/my_application.py"
 8  launch:
 9    - debug:
10        msg: "Starting app: optimization_config = {{ optimization_config }}"
11    - shell: cd /tmp/ && python my_application.py --config "{{ optimization_config }}"
12  finalize:
13    - debug:
14        msg: "Backing up data"
15    - fetch:
16        src: "/tmp/results.txt"
17        dest: "{{ working_dir }}/results/"
18        flat: true
19        validate_checksum: no

User-Defined Optimization

run() function:

  • We use Bayesian Optimization as the optimization method, with the Extra Trees Regressor as the surrogate model, see line 12 algo = SkOptSearch(). A sketch of selecting the surrogate explicitly follows this list.

  • The parallelism level of the workflow deployments is 3, see line 13 algo = ConcurrencyLimiter(algo, max_concurrent=3)

  • We define the optimization problem (Equation 1) in lines 15 to 27, see objective = tune.run(…)
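
SkOptSearch() wraps scikit-optimize; a pre-configured skopt.Optimizer may also be passed to it to choose the surrogate model explicitly. The sketch below is an illustration under that assumption (the dimensions and parameter names mirror Equation 1), not the code used in this tutorial:

import skopt
from ray.tune.search.skopt import SkOptSearch

# Hypothetical sketch: an skopt Optimizer with an Extra Trees ("ET")
# surrogate model, wrapped by Ray Tune's SkOptSearch.
optimizer = skopt.Optimizer(
    dimensions=[(1, 10), (20, 50), (1, 3)],  # search space of Equation 1
    base_estimator="ET",  # Extra Trees Regressor
)
algo = SkOptSearch(
    optimizer,
    ["num_workers", "cores_per_worker", "memory_per_worker"],
    metric="user_response_time",
    mode="min",
)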

run_objective() function:

  • prepare() creates an optimization directory. Each application deployment evaluation has its own directory. In lines 40 to 47, we update the layers_services.yaml file to set quantity: to _config["num_workers"].

  • launch(optimization_config=_config) makes a new deployment with the infrastructure and application configurations suggested by the search algorithm. It executes all the E2Clab commands of a deployment: layers_services, network, workflow (prepare, launch, finalize), and finalize. In this example, we have 3 parallel deployments and the search algorithm is trained asynchronously.

  • finalize() saves the optimization results in the optimization directory.

 1from e2clab.optimizer import Optimization
 2from ray import tune
 3from ray.tune.search import ConcurrencyLimiter
 4from ray.tune.schedulers import AsyncHyperBandScheduler
 5from ray.tune.search.skopt import SkOptSearch
 6import yaml
 7
 8
 9class UserDefinedOptimization(Optimization):
10
11    def run(self):
12        algo = SkOptSearch()
13        algo = ConcurrencyLimiter(algo, max_concurrent=3)
14        scheduler = AsyncHyperBandScheduler()
15        objective = tune.run(
16            self.run_objective,
17            metric="user_response_time",
18            mode="min",
19            name="my_application",
20            search_alg=algo,
21            scheduler=scheduler,
22            num_samples=9,
23            config={
24                'num_workers': tune.randint(1, 11),  # upper bound is exclusive: 1..10
25                'cores_per_worker': tune.randint(20, 51),  # 20..50
26                'memory_per_worker': tune.randint(1, 4)  # 1..3
27            },
28            fail_fast=True
29        )
30
31        print("Hyperparameters found: ", objective.best_config)
32
33    def run_objective(self, _config):
34        # '_config' is the configuration suggested by the algorithm
35        # create an optimization directory using "self.prepare()"
36        self.prepare()
37        # update the parameters of your application configuration files
38        # using 'self.optimization_dir' you can locate your files
39        # update your files with the values in '_config' (suggested by the algorithm)
40        with open(f'{self.optimization_dir}/layers_services.yaml') as f:
41            config_yaml = yaml.load(f, Loader=yaml.FullLoader)
42        for layer in config_yaml["layers"]:
43            for service in layer["services"]:
44                if service["name"] in ["myapplication"]:
45                    service["quantity"] = _config["num_workers"]
46        with open(f'{self.optimization_dir}/layers_services.yaml', 'w') as f:
47            yaml.dump(config_yaml, f)
48
49        # deploy the configurations using 'self.launch()'
50        self.launch(optimization_config=_config)
51
52        # after the application ends the execution, save the optimization results
53        # using 'self.finalize()'
54        self.finalize()
55        # get the metric value generated by your application after its execution
56        # this metric is what you want to optimize
57        # for instance, the 'user_response_time' is saved in the 'self.experiment_dir'
58        user_response_time = 0
59        with open(f'{self.experiment_dir}/results/results.txt') as file:
60            for line in file:
61                user_response_time = float(line.rstrip().split(',')[1])
62
63        # report the metric value to Ray Tune, so it can suggest a new configuration
64        # to explore. Do it as follows:
65        tune.report(user_response_time=user_response_time)

Running & Verifying Experiment Execution

Find below the command to deploy this application and check its execution.

$ e2clab optimize ~/git/workflow_optimization/ ~/git/workflow_optimization/

Deployment Validation & Experiment Results

As we defined num_samples=9 (see line 22), we have 9 evaluations of the search space (9 application deployments on G5K). The table below summarizes the results. The configuration found by the algorithm that minimizes the user_response_time consists of 6 machines ('num_workers': 6), each with 48 cores ('cores_per_worker': 48) and 2 units of memory ('memory_per_worker': 2). This configuration gives a user response time of about 20.68 seconds.

╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name               status         num_workers     cores_per_worker     memory_per_worker     iter     total time (s)     user_response_time │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ run_objective_21bdbea8   TERMINATED               3                   34                     1        1            76.8974                40.3137 │
│ run_objective_d8b3048b   TERMINATED               7                   29                     1        1            79.81                  28.7783 │
│ run_objective_f44d0fd6   TERMINATED               4                   49                     1        1           129.91                  33.5102 │
│ run_objective_aa1e053e   TERMINATED               6                   48                     2        1            66.3485                20.6806 │
│ run_objective_9adcfe45   TERMINATED               9                   38                     1        1           134.151                 29.4035 │
│ run_objective_45150746   TERMINATED               5                   31                     1        1           190.87                  30.6452 │
│ run_objective_9eaf2742   TERMINATED               8                   38                     1        1           529.561                 28.8289 │
│ run_objective_a118f595   TERMINATED               2                   22                     1        1           127.343                 56.2727 │
│ run_objective_214fc572   TERMINATED               3                   40                     1        1           316.947                 40.1667 │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Hyperparameters found:  {'num_workers': 6, 'cores_per_worker': 48, 'memory_per_worker': 2}

Find below the 9 directories generated, one per deployment and experiment execution.

$ ls -la optimization/

    ... Jul 27 16:30 20230727-162917-3f998bb2a73846439fbb1e480a2cb22a
    ... Jul 27 16:30 20230727-162921-aab11d8263654143b6a10bed0d8fd14f
    ... Jul 27 16:31 20230727-162926-c6abfed0dfd84b13b8868d74f39666bc
    ... Jul 27 16:31 20230727-163034-cc4d8bbd9ade4206b7daee8b6be531b6
    ... Jul 27 16:32 20230727-163041-36d62067b97b47d7839be6c61f40ecdc
    ... Jul 27 16:34 20230727-163136-cc9f6537cade4771a8bba2fdbf269e93
    ... Jul 27 16:40 20230727-163140-3aa41ba4e95b4900ab6ff981beab3318
    ... Jul 27 16:35 20230727-163255-a81909fb7869450095f8b83053760542
    ... Jul 27 16:40 20230727-163447-c25333d7554b418b8a3f37f1a0ce6097

The generated files consist of:

$ ls -la optimization/20230727-162917-3f998bb2a73846439fbb1e480a2cb22a/

    20230727-162917/        # validation files generated from each deployment
    optimization-results/   # the optimization results
    layers_services.yaml    # E2Clab config files
    network.yaml
    workflow.yaml

For each deployment, in 20230727-162917/, we have the validation files layers_services-validate.yaml and workflow-validate.out, as well as the results/ directory.

$ ls -la optimization/20230727-162917-3f998bb2a73846439fbb1e480a2cb22a/20230727-162917/

    layers_services-validate.yaml
    results/
    workflow-validate.out

In optimization-results/, we have:

$ ls -la optimization/20230727-162917-3f998bb2a73846439fbb1e480a2cb22a/optimization-results/

    params.json     # the parameters explored by the algorithm
    params.pkl      # contains state information of the algorithm (for checkpoint)

Note

Checkpoints: users can snapshot the search progress and resume it from a saved state, as sketched below.
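
A minimal hypothetical sketch of resuming the search, assuming params.pkl was written with the searcher's save() method (the checkpoint path is illustrative):

from ray.tune.search import ConcurrencyLimiter
from ray.tune.search.skopt import SkOptSearch

algo = SkOptSearch()
# Illustrative checkpoint path: restore() reloads the searcher state
# previously written by algo.save(...)
algo.restore("optimization/<run-dir>/optimization-results/params.pkl")
algo = ConcurrencyLimiter(algo, max_concurrent=3)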