Application Optimization
In this tutorial, we show how to optimize the performance of a toy application (though it could be a real-life application like Pl@ntNet, see our article). The optimization algorithm aims to find the infrastructure parameters (e.g., the number of workers/machines that process the application) and the application parameters (e.g., the number of cores and the memory available per worker) that minimize the execution time.
In this example you will learn how to:
Define an optimization problem (mathematical definition) and then express it in E2Clab as a User-Defined Optimization
Use the Bayesian Optimization method and the Extra Trees Regressor algorithm provided by scikit-optimize (users can use other libraries such as Ax, BayesOpt, BOHB, Dragonfly, etc.; see "Which search algorithm to choose?")
Define the parallelism level of the application deployment on Grid’5000 (but it could be in FIT IoT LAB, Chameleon, or combining resources from various testbeds)
Use User-Defined Optimization to manage the optimization, for instance, change the infrastructure (layers_services.yaml) and application (my_application.py) parameters
Execute experiments and analyze the optimization results
The optimization problem
What infrastructure configuration and software configuration minimize the user response time?
The optimization problem to be solved can be stated as follows (Equation 1):
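Equation 1, reconstructed here from the cost model used in my_application.py (with \(w\) = num_workers, \(c\) = cores_per_worker, \(m\) = memory_per_worker, workload size \(S = 100\), and communication cost \(C = 2\)):

\[
\begin{aligned}
\min_{w,\,c,\,m}\quad & w \cdot C \;+\; \frac{S}{c \cdot w} \;+\; \frac{S}{m \cdot w} \\
\text{subject to}\quad & 1 \leq w \leq 10,\quad 20 \leq c \leq 50,\quad 1 \leq m \leq 3
\end{aligned}
\]

The first term models the communication overhead between workers, and the last two terms model the compute-bound and memory-bound fractions of the workload, which shrink as workers are added.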
Experiment Artifacts
$ cd ~/git/
$ git clone https://gitlab.inria.fr/E2Clab/examples/workflow_optimization
In this repository you will find:
the E2Clab configuration files, such as layers_services.yaml, network.yaml, and workflow.yaml, as well as UserDefinedOptimization.py
the toy application my_application.py
Defining the Experimental Environment
Layers & Services Configuration
This configuration file presents the layers and services that compose this example. We request resources from the Grid'5000 environment (g5k). We define the cloud layer and add the myapplication service to it. The service runs on a single machine (quantity: 1). In our optimization problem, num_workers will change quantity: to deploy the application on multiple machines (\(1 \leq num\_workers \leq 10\)).
environment:
  job_name: optimization
  walltime: "00:05:00"
  g5k:
    job_type: ["allow_classic_ssh"]
    cluster: ecotype
layers:
- name: cloud
  services:
  - name: myapplication
    quantity: 1
The toy application
All the optimization variables, such as \(1 \leq num\_workers \leq 10\), \(20 \leq cores\_per\_worker \leq 50\), and \(1 \leq memory\_per\_worker \leq 3\) are passed to the application as follows:
$ python my_application.py --config "{{ optimization_config }}"
To emulate the application behavior based on the infrastructure and software configuration, we defined the cost model shown in lines 20 to 23, where workload_size = 100 and communication_cost = 2 (the cost of communication between workers).
 1 import time
 2 import argparse
 3 import ast
 4
 5 parser = argparse.ArgumentParser()
 6 parser.add_argument(
 7     "--config",
 8     type=str,
 9     required=True,
10     help="Application configuration suggested by the optimization algorithm",
11 )
12 args = parser.parse_args()
13
14 _config = ast.literal_eval(args.config)
15
16
17 print(f" ******* optimization config = {_config}")
18 workload_size = 100
19 communication_cost = 2
20 user_response_time = \
21     _config['num_workers'] * communication_cost + \
22     workload_size/(_config['cores_per_worker']*_config['num_workers']) + \
23     workload_size/(_config['memory_per_worker']*_config['num_workers'])
24
25 print(" Running...")
26 time.sleep(user_response_time)
27 print(f" ******* user_response_time = {user_response_time}")
28
29 with open('results.txt', 'w') as f:
30     f.write(f'user_response_time,{user_response_time},{args.config}')
31
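As a quick sanity check of this cost model, the snippet below (a standalone sketch; the user_response_time helper function is ours, not part of the tutorial code) re-evaluates the best configuration reported in the results section:

```python
# Standalone re-implementation of the toy application's cost model
workload_size = 100
communication_cost = 2

def user_response_time(num_workers, cores_per_worker, memory_per_worker):
    # execution time = communication overhead + compute-bound time + memory-bound time
    return (num_workers * communication_cost
            + workload_size / (cores_per_worker * num_workers)
            + workload_size / (memory_per_worker * num_workers))

# best configuration found by the search: 6 workers, 48 cores, 2 memory slots
print(round(user_response_time(6, 48, 2), 4))  # → 20.6806
```

The value matches the 20.68 s reported in the results table, as does, e.g., the first trial (3 workers, 34 cores, 1 memory slot → 40.3137 s).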
Network Configuration
In this example, we do not have an optimization variable related to the network configuration, but we could, as we did for layers_services.yaml. In this case, no changes are required in the network.yaml file.
networks:
Workflow Configuration
This configuration file presents the application workflow configuration. For the Cloud application (cloud.*):
prepare copies the application from the local machine to the remote machine.
launch executes the application using the configuration suggested by the algorithm.
finalize copies the result from the remote machine to the local machine after the experiment ends. The results.txt file contains the user_response_time (its value depends on the infrastructure and software configuration).
- hosts: cloud.*
  prepare:
  - debug:
      msg: "Copying files"
  - copy:
      src: "{{ working_dir }}/my_application.py"
      dest: "/tmp/my_application.py"
  launch:
  - debug:
      msg: "Starting app: optimization_config = {{ optimization_config }}"
  - shell: cd /tmp/ && python my_application.py --config "{{ optimization_config }}"
  finalize:
  - debug:
      msg: "Backing up data"
  - fetch:
      src: "/tmp/results.txt"
      dest: "{{ working_dir }}/results/"
      flat: true
      validate_checksum: no
User-Defined Optimization
run() function:
We use Bayesian Optimization as the optimization method with the Extra Trees Regressor algorithm, see line 12: algo = SkOptSearch()
3 is the parallelism level of the workflow deployments, see line 13: algo = ConcurrencyLimiter(algo, max_concurrent=3)
We define the optimization problem (Equation 1) in lines 15 to 27, see objective = tune.run(...)
run_objective() function:
prepare() creates an optimization directory; each application deployment evaluation has its own directory. In lines 40 to 47 we update the layers_services.yaml file to set quantity: to _config["num_workers"].
launch(optimization_config=_config) makes a new deployment with the infrastructure and application configurations suggested by the search algorithm. It executes all the E2Clab commands for the deployment: layers_services, network, workflow (prepare, launch, finalize), and finalize. In this example, we have 3 parallel deployments and the search algorithm is trained asynchronously.
finalize() saves the optimization results in the optimization directory.
 1 from e2clab.optimizer import Optimization
 2 from ray import tune
 3 from ray.tune.search import ConcurrencyLimiter
 4 from ray.tune.schedulers import AsyncHyperBandScheduler
 5 from ray.tune.search.skopt import SkOptSearch
 6 import yaml
 7
 8
 9 class UserDefinedOptimization(Optimization):
10
11     def run(self):
12         algo = SkOptSearch()
13         algo = ConcurrencyLimiter(algo, max_concurrent=3)
14         scheduler = AsyncHyperBandScheduler()
15         objective = tune.run(
16             self.run_objective,
17             metric="user_response_time",
18             mode="min",
19             name="my_application",
20             search_alg=algo,
21             scheduler=scheduler,
22             num_samples=9,
23             config={
24                 'num_workers': tune.randint(1, 10),  # note: tune.randint's upper bound is exclusive
25                 'cores_per_worker': tune.randint(20, 50),
26                 'memory_per_worker': tune.randint(1, 3)
27             },
28             fail_fast=True
29         )
30
31         print("Hyperparameters found: ", objective.best_config)
32
33     def run_objective(self, _config):
34         # '_config' is the configuration suggested by the algorithm
35         # create an optimization directory using "self.prepare()"
36         self.prepare()
37         # update the parameters of your application configuration files
38         # using 'self.optimization_dir' you can locate your files
39         # update your files with the values in '_config' (suggested by the algorithm)
40         with open(f'{self.optimization_dir}/layers_services.yaml') as f:
41             config_yaml = yaml.load(f, Loader=yaml.FullLoader)
42         for layer in config_yaml["layers"]:
43             for service in layer["services"]:
44                 if service["name"] in ["myapplication"]:
45                     service["quantity"] = _config["num_workers"]
46         with open(f'{self.optimization_dir}/layers_services.yaml', 'w') as f:
47             yaml.dump(config_yaml, f)
48
49         # deploy the configurations using 'self.launch()'
50         self.launch(optimization_config=_config)
51
52         # after the application ends its execution, save the optimization results
53         # using 'self.finalize()'
54         self.finalize()
55         # get the metric value generated by your application after its execution
56         # this metric is what you want to optimize
57         # for instance, the 'user_response_time' is saved in 'self.experiment_dir'
58         user_response_time = 0
59         with open(f'{self.experiment_dir}/results/results.txt') as file:
60             for line in file:
61                 user_response_time = float(line.rstrip().split(',')[1])
62
63         # report the metric value to Ray Tune, so it can suggest a new configuration
64         # to explore. Do it as follows:
65         tune.report(user_response_time=user_response_time)
Running & Verifying Experiment Execution
Find below the commands to deploy this application and check its execution.
$ e2clab optimize ~/git/workflow_optimization/ ~/git/workflow_optimization/
Deployment Validation & Experiment Results
As we defined num_samples=9 (see line 22), we have 9 evaluations of the search space (9 application deployments on G5K). The table below summarizes the results. The configuration found by the algorithm that minimizes the user_response_time consists of 6 machines ('num_workers': 6), each with 48 cores ('cores_per_worker': 48) and 2 memory slots ('memory_per_worker': 2). This configuration gives a user response time of 20.68 seconds.
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name status num_workers cores_per_worker memory_per_worker iter total time (s) user_response_time │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ run_objective_21bdbea8 TERMINATED 3 34 1 1 76.8974 40.3137 │
│ run_objective_d8b3048b TERMINATED 7 29 1 1 79.81 28.7783 │
│ run_objective_f44d0fd6 TERMINATED 4 49 1 1 129.91 33.5102 │
│ run_objective_aa1e053e TERMINATED 6 48 2 1 66.3485 20.6806 │
│ run_objective_9adcfe45 TERMINATED 9 38 1 1 134.151 29.4035 │
│ run_objective_45150746 TERMINATED 5 31 1 1 190.87 30.6452 │
│ run_objective_9eaf2742 TERMINATED 8 38 1 1 529.561 28.8289 │
│ run_objective_a118f595 TERMINATED 2 22 1 1 127.343 56.2727 │
│ run_objective_214fc572 TERMINATED 3 40 1 1 316.947 40.1667 │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Hyperparameters found: {'num_workers': 6, 'cores_per_worker': 48, 'memory_per_worker': 2}
Find below the 9 directories generated, one per deployment and experiment execution.
$ ls -la optimization/
... Jul 27 16:30 20230727-162917-3f998bb2a73846439fbb1e480a2cb22a
... Jul 27 16:30 20230727-162921-aab11d8263654143b6a10bed0d8fd14f
... Jul 27 16:31 20230727-162926-c6abfed0dfd84b13b8868d74f39666bc
... Jul 27 16:31 20230727-163034-cc4d8bbd9ade4206b7daee8b6be531b6
... Jul 27 16:32 20230727-163041-36d62067b97b47d7839be6c61f40ecdc
... Jul 27 16:34 20230727-163136-cc9f6537cade4771a8bba2fdbf269e93
... Jul 27 16:40 20230727-163140-3aa41ba4e95b4900ab6ff981beab3318
... Jul 27 16:35 20230727-163255-a81909fb7869450095f8b83053760542
... Jul 27 16:40 20230727-163447-c25333d7554b418b8a3f37f1a0ce6097
The generated files consist of:
$ ls -la optimization/20230727-162917-3f998bb2a73846439fbb1e480a2cb22a/
20230727-162917/ # validation files generated from each deployment
optimization-results/ # the optimization results
layers_services.yaml # E2Clab config files
network.yaml
workflow.yaml
For each deployment, in 20230727-162917/, we have the validation files, such as layers_services-validate.yaml, results/, and workflow-validate.out.
$ ls -la optimization/20230727-162917-3f998bb2a73846439fbb1e480a2cb22a/20230727-162917/
layers_services-validate.yaml
results/
workflow-validate.out
In optimization-results/, we have:
$ ls -la optimization/20230727-162917-3f998bb2a73846439fbb1e480a2cb22a/optimization-results/
params.json # the parameters explored by the algorithm
params.pkl # contains state information of the algorithm (for checkpoint)
Note
Checkpoints: users can snapshot the training progress (the algorithm state stored in params.pkl) and resume the optimization from it.