************************
Application Optimization
************************
In this tutorial, we show how to optimize the performance of a toy application
(but it could be a real-life application like Pl@ntNet, as described in our article).
The optimization algorithm **aims to find the infrastructure parameters** (*e.g.,* the
number of workers/machines that process the application) and the **application
parameters** (*e.g.,* the number of cores per worker and the memory available) to
**minimize the execution time**.
In this example **you will learn how to**:
- Define an optimization problem (mathematical definition) and then express it in E2Clab
as a **User-Defined Optimization**
- Use the **Bayesian Optimization** method and the **Extra Trees Regressor** algorithm
  provided by scikit-optimize
  (users can use other libraries such as Ax, BayesOpt, BOHB, Dragonfly, *etc.*, and may
  consult *Which search algorithm to choose?* for guidance)
- Define the **parallelism level** of the application deployment on Grid'5000 (but it
could be in FIT IoT LAB, Chameleon, or combining resources from various testbeds)
- Use **User-Defined Optimization** to manage the optimization, for instance, changing the
  infrastructure (``layers_services.yaml``) and application (``my_application.py``) parameters
- Execute experiments and analyze the optimization results
The optimization problem
========================
**What is the infrastructure configuration and software configuration that minimizes the
user response time?**
The optimization problem to be solved can be stated as follows (**Equation 1**):
| **Find** :math:`(num\_workers, cores\_per\_worker, memory\_per\_worker)`, **in order to**
| **Minimize** `UserResponseTime`
| **Subject to**
| :math:`1 \leq num\_workers \leq 10`
| :math:`20 \leq cores\_per\_worker \leq 50`
| :math:`1 \leq memory\_per\_worker \leq 3`
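The search space of **Equation 1** can be expressed with Ray Tune, the library that
provides the ``SkOptSearch`` and ``ConcurrencyLimiter`` wrappers used later in this
tutorial. Below is a minimal sketch, assuming Ray Tune 2.x import paths; note that
``tune.randint`` excludes its upper bound.

.. code-block:: python

    from ray import tune

    # Search space of Equation 1 (tune.randint's upper bound is exclusive)
    search_space = {
        "num_workers": tune.randint(1, 11),        # 1 <= num_workers <= 10
        "cores_per_worker": tune.randint(20, 51),  # 20 <= cores_per_worker <= 50
        "memory_per_worker": tune.randint(1, 4),   # 1 <= memory_per_worker <= 3
    }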
Experiment Artifacts
====================
.. code-block:: bash
$ cd ~/git/
$ git clone https://gitlab.inria.fr/E2Clab/examples/workflow_optimization
In this repository, you will find:
- the **E2Clab configuration files**, such as ``layers_services.yaml``, ``network.yaml``,
  and ``workflow.yaml``, as well as the **UserDefinedOptimization.py** script
- the toy application **my_application.py**
Defining the Experimental Environment
=====================================
Layers & Services Configuration
-------------------------------
This configuration file presents the **layers** and **services** that compose this example.
We request resources from Grid'5000 (``environment: g5k``). We define the ``cloud`` layer
and add a service named ``myapplication`` to it. The service runs on a single machine
(``quantity: 1``). In our optimization problem, ``num_workers`` updates ``quantity:`` to
deploy the application on multiple machines (:math:`1 \leq num\_workers \leq 10`); a sketch
of this update is shown in the run_objective() section below.
.. literalinclude:: ./application_optimization/layers_services.yaml
:language: yaml
:linenos:
The toy application
-------------------
All the **optimization variables**, namely :math:`1 \leq num\_workers \leq 10`,
:math:`20 \leq cores\_per\_worker \leq 50`, and :math:`1 \leq memory\_per\_worker \leq 3`,
are passed to the application as follows:
.. code-block:: bash
$ python my_application.py --config "{{ optimization_config }}"
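The application must then parse this dictionary. A minimal sketch of how this could be
done follows; the actual parsing lives in ``my_application.py``, and the use of
``ast.literal_eval`` is an assumption, chosen because ``{{ optimization_config }}``
renders as a Python-style dict.

.. code-block:: python

    import argparse
    import ast

    # Hedged sketch: read the configuration injected via --config, e.g.
    # "{'num_workers': 6, 'cores_per_worker': 48, 'memory_per_worker': 2}"
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", required=True)
    args = parser.parse_args()

    config = ast.literal_eval(args.config)
    num_workers = config["num_workers"]
    cores_per_worker = config["cores_per_worker"]
    memory_per_worker = config["memory_per_worker"]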
To emulate the application behavior based on the infrastructure configuration and
software configuration, we defined the equation presented in **lines 20 to 23**, with
**workload_size = 100** and **communication_cost = 2** (the cost of communication
between workers).
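For intuition, the sketch below shows a formula that is consistent with all nine trials
reported in the results section: computation time shrinks as workers, cores, and memory
grow, while communication overhead grows linearly with the number of workers. The
authoritative definition is in the file included next (**lines 20 to 23**).

.. code-block:: python

    workload_size = 100
    communication_cost = 2

    # Hedged reconstruction of the emulated response time: it matches the
    # nine trials reported below, but my_application.py remains authoritative.
    def user_response_time(num_workers, cores_per_worker, memory_per_worker):
        computation = (workload_size / (num_workers * memory_per_worker)
                       + workload_size / (num_workers * cores_per_worker))
        communication = communication_cost * num_workers
        return computation + communication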
.. literalinclude:: ./application_optimization/my_application.py
:language: python
:linenos:
Network Configuration
---------------------
In this example, we do not have an **optimization variable** related to the network
configuration, but we could define one, as we did in **layers_services.yaml**.
In this case, no changes are required in the ``network.yaml`` file.
.. literalinclude:: ./application_optimization/network.yaml
:language: yaml
:linenos:
Workflow Configuration
----------------------
This configuration file presents the application workflow configuration.

- **The Cloud application** ``cloud.*``:

  - ``prepare`` copies the application from the local machine to the remote machine.
  - ``launch`` executes the application using the configuration suggested by the search
    algorithm.
  - ``finalize`` copies the results from the remote machine to the local machine after the
    experiment ends. The **result.txt** file contains the **user_response_time** (its value
    depends on the infrastructure and software configuration).
.. literalinclude:: ./application_optimization/workflow.yaml
:language: yaml
:linenos:
User-Defined Optimization
-------------------------
The run() function
^^^^^^^^^^^^^^^^^^
- We use **Bayesian Optimization** as the optimization method and the **Extra Trees
  Regressor** algorithm (*see line 12*, **algo = SkOptSearch()**)
- The parallelism level of the workflow deployments is 3 (*see line 13*,
  **algo = ConcurrencyLimiter(algo, max_concurrent=3)**)
- We define the optimization problem (**Equation 1**) in **lines 15 to 27**, see
  **objective = tune.run(...)**; a condensed sketch follows this list
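The sketch below condenses this structure. It assumes Ray Tune 2.x import paths and that
the Extra Trees Regressor is selected through ``scikit-optimize``'s ``base_estimator="ET"``
option; the authoritative version is **UserDefinedOptimization.py**, included further below.

.. code-block:: python

    import skopt
    from ray import tune
    from ray.tune.search import ConcurrencyLimiter
    from ray.tune.search.skopt import SkOptSearch

    def run_objective(config):
        # Placeholder: the real function (next section) deploys E2Clab with
        # ``config`` and reports the measured user_response_time.
        tune.report(user_response_time=0.0)

    # Bayesian Optimization with an Extra Trees Regressor surrogate ("ET")
    optimizer = skopt.Optimizer(
        dimensions=[(1, 10), (20, 50), (1, 3)],  # bounds from Equation 1
        base_estimator="ET",
    )
    algo = SkOptSearch(
        optimizer,
        ["num_workers", "cores_per_worker", "memory_per_worker"],
        metric="user_response_time",
        mode="min",
    )
    # At most 3 deployments are evaluated in parallel
    algo = ConcurrencyLimiter(algo, max_concurrent=3)

    objective = tune.run(
        run_objective,
        search_alg=algo,
        num_samples=9,  # 9 evaluations of the search space
    )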
The run_objective() function
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- **prepare()** creates an optimization directory; each application deployment evaluation
  has its own directory. In **lines 40 to 47** we update the ``layers_services.yaml`` file
  to set ``quantity:`` to ``_config["num_workers"]`` (a sketch of this update follows this
  list).
- **launch(optimization_config=_config)** makes a new deployment with the infrastructure
  and application configurations suggested by the search algorithm. It executes all the
  *E2Clab commands* for the deployment, such as *layers_services*, *network*, *workflow
  (prepare, launch, finalize)*, and *finalize*. In this example, we have **3 parallel
  deployments** and the search algorithm is **trained asynchronously**.
- **finalize()** saves the optimization results in the ``optimization`` directory.
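A hedged sketch of the ``quantity:`` update follows. It assumes the usual E2Clab layout of
``layers_services.yaml`` (a ``layers:`` list in which each layer holds a ``services:``
list); the actual logic is in **lines 40 to 47** of the file below.

.. code-block:: python

    import yaml

    def set_num_workers(layers_services_path, num_workers):
        """Rewrite the cloud service's ``quantity:`` with the suggested value."""
        with open(layers_services_path) as f:
            conf = yaml.safe_load(f)
        for layer in conf["layers"]:
            if layer["name"] == "cloud":
                for service in layer["services"]:
                    service["quantity"] = num_workers
        with open(layers_services_path, "w") as f:
            yaml.dump(conf, f)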
.. literalinclude:: ./application_optimization/UserDefinedOptimization.py
:language: python
:linenos:
Running & Verifying Experiment Execution
========================================
Find below the command to deploy this application and check its execution.
.. code-block:: bash
$ e2clab optimize ~/git/workflow_optimization/ ~/git/workflow_optimization/
Deployment Validation & Experiment Results
==========================================
As we defined ``num_samples=9`` (see line 22), we have 9 evaluations of the search space
(9 application deployments on G5K). The table below summarizes the results. The
configuration found by the algorithm that minimizes the ``user_response_time`` consists of
6 machines (``'num_workers': 6``), each one with 48 cores (``'cores_per_worker': 48``) and
2 memory slots (``'memory_per_worker': 2``). This configuration gives a user response time
of ``20.68`` seconds.
.. code-block:: text
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name status num_workers cores_per_worker memory_per_worker iter total time (s) user_response_time │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ run_objective_21bdbea8 TERMINATED 3 34 1 1 76.8974 40.3137 │
│ run_objective_d8b3048b TERMINATED 7 29 1 1 79.81 28.7783 │
│ run_objective_f44d0fd6 TERMINATED 4 49 1 1 129.91 33.5102 │
│ run_objective_aa1e053e TERMINATED 6 48 2 1 66.3485 20.6806 │
│ run_objective_9adcfe45 TERMINATED 9 38 1 1 134.151 29.4035 │
│ run_objective_45150746 TERMINATED 5 31 1 1 190.87 30.6452 │
│ run_objective_9eaf2742 TERMINATED 8 38 1 1 529.561 28.8289 │
│ run_objective_a118f595 TERMINATED 2 22 1 1 127.343 56.2727 │
│ run_objective_214fc572 TERMINATED 3 40 1 1 316.947 40.1667 │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Hyperparameters found: {'num_workers': 6, 'cores_per_worker': 48, 'memory_per_worker': 2}
Find below the 9 directories generated, one for each deployment and experiment execution.
.. code-block:: text
$ ls -la optimization/
... Jul 27 16:30 20230727-162917-3f998bb2a73846439fbb1e480a2cb22a
... Jul 27 16:30 20230727-162921-aab11d8263654143b6a10bed0d8fd14f
... Jul 27 16:31 20230727-162926-c6abfed0dfd84b13b8868d74f39666bc
... Jul 27 16:31 20230727-163034-cc4d8bbd9ade4206b7daee8b6be531b6
... Jul 27 16:32 20230727-163041-36d62067b97b47d7839be6c61f40ecdc
... Jul 27 16:34 20230727-163136-cc9f6537cade4771a8bba2fdbf269e93
... Jul 27 16:40 20230727-163140-3aa41ba4e95b4900ab6ff981beab3318
... Jul 27 16:35 20230727-163255-a81909fb7869450095f8b83053760542
... Jul 27 16:40 20230727-163447-c25333d7554b418b8a3f37f1a0ce6097
The generated files consist of:
.. code-block:: bash
$ ls -la optimization/20230727-162917-3f998bb2a73846439fbb1e480a2cb22a/
20230727-162917/ # validation files generated from each deployment
optimization-results/ # the optimization results
layers_services.yaml # E2Clab config files
network.yaml
workflow.yaml
For each deployment, in ``20230727-162917/``, we have the **validation files** such as
``layers_services-validate.yaml``, ``results/``, and ``workflow-validate.out``.
.. code-block:: text
$ ls -la optimization/20230727-162917-3f998bb2a73846439fbb1e480a2cb22a/20230727-162917/
layers_services-validate.yaml
results/
workflow-validate.out
In ``optimization-results/``, we have:
.. code-block:: bash
$ ls -la optimization/20230727-162917-3f998bb2a73846439fbb1e480a2cb22a/optimization-results/
params.json # the parameters explored by the algorithm
params.pkl # contains state information of the algorithm (for checkpoint)
.. note::
    **Checkpoints:** users can snapshot the training progress and resume the search from a
    saved state; a minimal restore sketch follows.
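A minimal sketch, assuming Ray Tune's generic ``Searcher.save()``/``Searcher.restore()``
checkpointing API (the file paths are illustrative):

.. code-block:: python

    from ray.tune.search.skopt import SkOptSearch

    # Hedged sketch: Ray Tune searchers expose save()/restore() to snapshot
    # and resume the search state.
    algo = SkOptSearch(metric="user_response_time", mode="min")
    algo.save("./search-checkpoint.pkl")  # snapshot the search state

    # ... later, possibly in a new process:
    restored = SkOptSearch(metric="user_response_time", mode="min")
    restored.restore("./search-checkpoint.pkl")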