************************
Application Optimization
************************

In this tutorial, we show how to optimize the performance of a toy application (but it could be a real-life application like `Pl@ntNet `_, see `our article `_). The optimization algorithm **aims to find the infrastructure parameters** (*e.g.,* the number of workers/machines to process the application) and the **application parameters** (*e.g.,* the number of cores per worker and the memory available) to **minimize the execution time**.

In this example **you will learn how to**:

- Define an optimization problem (mathematical definition) and then express it in E2Clab as a **User-Defined Optimization**
- Use the **Bayesian Optimization** method and the **Extra Trees Regressor** algorithm provided by `scikit-optimize `_ (users can use other libraries such as Ax, BayesOpt, BOHB, Dragonfly, *etc.*). `Which search algorithm to choose? `_
- Define the **parallelism level** of the application deployment on Grid'5000 (but it could be on FIT IoT LAB, Chameleon, or combine resources from various testbeds)
- Use a **User-Defined Optimization** to manage the optimization, for instance, to change the infrastructure (`layers_services.yaml`) and application (`my_application.py`) parameters
- Execute experiments and analyze the optimization results

The optimization problem
========================

**What is the infrastructure configuration and software configuration that minimizes the user response time?**

The optimization problem to be solved can be stated as follows (**Equation 1**):

| **Find** :math:`(num\_workers, cores\_per\_worker, memory\_per\_worker)`, **in order to**
| **Minimize** `UserResponseTime`
| **Subject to**
| :math:`1 \leq num\_workers \leq 10`
| :math:`20 \leq cores\_per\_worker \leq 50`
| :math:`1 \leq memory\_per\_worker \leq 3`

Experiment Artifacts
====================

.. code-block:: bash

    $ cd ~/git/
    $ git clone https://gitlab.inria.fr/E2Clab/examples/workflow_optimization

In this repository you will find:

- the **E2Clab configuration files**, such as ``layers_services.yaml``, ``network.yaml``, and ``workflow.yaml``, as well as the **UserDefinedOptimization.py**
- the toy application **my_application.py**

Defining the Experimental Environment
=====================================

Layers & Services Configuration
-------------------------------

This configuration file presents the **layers** and **services** that compose this example. We request resources from Grid'5000 ``environment: g5k``. We define the ``cloud`` layer and add a ``myapplication`` service to it. The service runs on a single machine ``quantity: 1``. In our optimization problem, ``num_workers`` will change ``quantity:`` to deploy the application on multiple machines (:math:`1 \leq num\_workers \leq 10`).

.. literalinclude:: ./application_optimization/layers_services.yaml
    :language: yaml
    :linenos:

The toy application
-------------------

All the **optimization variables**, that is :math:`1 \leq num\_workers \leq 10`, :math:`20 \leq cores\_per\_worker \leq 50`, and :math:`1 \leq memory\_per\_worker \leq 3`, are passed to the application as follows:

.. code-block:: bash

    $ python my_application.py --config "{{ optimization_config }}"

To emulate the application behavior based on the infrastructure configuration and software configuration, we defined the equation presented in **lines 20 to 23**. We set **workload_size = 100** and **communication_cost = 2** (the cost of communication between workers).
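As an illustration of this mechanism only (the real implementation is in the listing below), here is a minimal sketch of how such an application can parse ``--config`` and write its result; the variable names mirror the tutorial, but the cost model is a placeholder, not the actual equation from ``my_application.py``:

.. code-block:: python

    # Illustration only: parsing the configuration suggested by the search
    # algorithm and reporting a result. The cost model below is a placeholder,
    # NOT the equation from lines 20 to 23 of my_application.py.
    import argparse
    import ast

    parser = argparse.ArgumentParser()
    parser.add_argument("--config", required=True,
                        help="configuration suggested by the search algorithm")
    args = parser.parse_args()

    # We assume {{ optimization_config }} renders as a dict literal such as
    # "{'num_workers': 6, 'cores_per_worker': 48, 'memory_per_worker': 2}".
    config = ast.literal_eval(args.config)

    workload_size = 100      # fixed workload, as in the tutorial
    communication_cost = 2   # cost of communication between workers

    # Placeholder model: compute time shrinks as resources grow, while the
    # communication term grows with the number of workers.
    user_response_time = (
        workload_size
        / (config["num_workers"] * config["cores_per_worker"] * config["memory_per_worker"])
        + communication_cost * config["num_workers"]
    )

    # workflow.yaml's finalize step copies this file back to the local machine.
    with open("result.txt", "w") as out:
        out.write(str(user_response_time))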
.. literalinclude:: ./application_optimization/my_application.py
    :language: python
    :linenos:

Network Configuration
---------------------

In this example, we do not have an **optimization variable** related to the network configuration, but we could define one, as we did in ``layers_services.yaml``. Hence, no changes are required in the ``network.yaml`` file.

.. literalinclude:: ./application_optimization/network.yaml
    :language: yaml
    :linenos:

Workflow Configuration
----------------------

This configuration file presents the application workflow configuration.

- **The Cloud application** ``cloud.*``: ``prepare`` copies the application from the local machine to the remote machine. ``launch`` executes the application using the configuration suggested by the search algorithm. ``finalize`` copies the result from the remote machine to the local machine after the experiment ends. The **result.txt** file contains the **user_response_time** (its value depends on the infrastructure and software configuration).

.. literalinclude:: ./application_optimization/workflow.yaml
    :language: yaml
    :linenos:

User-Defined Optimization
-------------------------

run() function:
---------------

- We use **Bayesian Optimization** as the optimization method and the **Extra Trees Regressor** algorithm, see line 12: **algo = SkOptSearch()**
- 3 is the parallelism level of the workflow deployments, see line 13: **algo = ConcurrencyLimiter(algo, max_concurrent=3)**
- We define the optimization problem (**Equation 1**) in lines 15 to 27, see **objective = tune.run(...)** (a sketch of this setup follows the next list)

run_objective()
---------------

- **prepare()** creates an optimization directory. Each application deployment evaluation has its own directory. In **lines 40 to 47** we update the ``layers_services.yaml`` file to set ``quantity:`` to ``_config["num_workers"]``.
- **launch(optimization_config=_config)** makes a new deployment with the infrastructure and application configurations suggested by the search algorithm. It executes all the *E2Clab commands* of the deployment, such as *layers_services*, *network*, *workflow (prepare, launch, finalize)*, and *finalize*. In this example, we have **3 parallel deployments** and the search algorithm is **trained asynchronously**.
- **finalize()** saves the optimization results in the ``optimization directory``.
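The listing below is a minimal sketch of this setup, assuming the Ray Tune 2.x API and scikit-optimize; ``run_objective`` here is a stub standing in for the real one in ``UserDefinedOptimization.py``, which deploys the application with E2Clab and measures the actual response time:

.. code-block:: python

    # A minimal sketch of the search setup, assuming Ray Tune 2.x and
    # scikit-optimize; not the tutorial's exact UserDefinedOptimization.py.
    import skopt
    from ray import tune
    from ray.tune.search import ConcurrencyLimiter
    from ray.tune.search.skopt import SkOptSearch


    def run_objective(config):
        # Stub: the real objective deploys the application with E2Clab
        # (prepare/launch/finalize) and reports the measured value.
        dummy = (100.0 / (config["num_workers"] * config["cores_per_worker"])
                 + 2.0 * config["num_workers"])
        tune.report(user_response_time=dummy)


    # Search space from Equation 1 (bounds are inclusive in scikit-optimize).
    dimensions = [(1, 10), (20, 50), (1, 3)]
    names = ["num_workers", "cores_per_worker", "memory_per_worker"]

    # Bayesian Optimization with the Extra Trees Regressor ("ET") surrogate.
    optimizer = skopt.Optimizer(dimensions, base_estimator="ET")
    algo = SkOptSearch(optimizer, names, metric="user_response_time", mode="min")

    # Parallelism level: at most 3 concurrent workflow deployments; the
    # surrogate model is trained asynchronously as results come in.
    algo = ConcurrencyLimiter(algo, max_concurrent=3)

    objective = tune.run(
        run_objective,
        search_alg=algo,
        num_samples=9,  # 9 evaluations of the search space
    )
    print("Hyperparameters found:",
          objective.get_best_config(metric="user_response_time", mode="min"))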
.. literalinclude:: ./application_optimization/UserDefinedOptimization.py
    :language: python
    :linenos:

Running & Verifying Experiment Execution
========================================

Find below the command to deploy this application and check its execution.

.. code-block:: bash

    $ e2clab optimize ~/git/workflow_optimization/ ~/git/workflow_optimization/

Deployment Validation & Experiment Results
==========================================

As we defined ``num_samples=9`` (see line 22), we have 9 evaluations of the search space (9 application deployments on G5K). The table below summarizes the results. The configuration found by the algorithm that minimizes the ``user_response_time`` consists of 6 machines (``'num_workers': 6``), each one with 48 cores (``'cores_per_worker': 48``) and 2 memory slots (``'memory_per_worker': 2``). This configuration gives a user response time of ``20.68 seconds``.

.. code-block:: text

    ╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
    │ Trial name              status      num_workers  cores_per_worker  memory_per_worker  iter  total time (s)  user_response_time │
    ├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
    │ run_objective_21bdbea8  TERMINATED  3            34                1                  1     76.8974         40.3137            │
    │ run_objective_d8b3048b  TERMINATED  7            29                1                  1     79.81           28.7783            │
    │ run_objective_f44d0fd6  TERMINATED  4            49                1                  1     129.91          33.5102            │
    │ run_objective_aa1e053e  TERMINATED  6            48                2                  1     66.3485         20.6806            │
    │ run_objective_9adcfe45  TERMINATED  9            38                1                  1     134.151         29.4035            │
    │ run_objective_45150746  TERMINATED  5            31                1                  1     190.87          30.6452            │
    │ run_objective_9eaf2742  TERMINATED  8            38                1                  1     529.561         28.8289            │
    │ run_objective_a118f595  TERMINATED  2            22                1                  1     127.343         56.2727            │
    │ run_objective_214fc572  TERMINATED  3            40                1                  1     316.947         40.1667            │
    ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

    Hyperparameters found:  {'num_workers': 6, 'cores_per_worker': 48, 'memory_per_worker': 2}

Find below the 9 directories generated from each deployment and experiment execution.

.. code-block:: text

    $ ls -la optimization/
    ... Jul 27 16:30 20230727-162917-3f998bb2a73846439fbb1e480a2cb22a
    ... Jul 27 16:30 20230727-162921-aab11d8263654143b6a10bed0d8fd14f
    ... Jul 27 16:31 20230727-162926-c6abfed0dfd84b13b8868d74f39666bc
    ... Jul 27 16:31 20230727-163034-cc4d8bbd9ade4206b7daee8b6be531b6
    ... Jul 27 16:32 20230727-163041-36d62067b97b47d7839be6c61f40ecdc
    ... Jul 27 16:34 20230727-163136-cc9f6537cade4771a8bba2fdbf269e93
    ... Jul 27 16:40 20230727-163140-3aa41ba4e95b4900ab6ff981beab3318
    ... Jul 27 16:35 20230727-163255-a81909fb7869450095f8b83053760542
    ... Jul 27 16:40 20230727-163447-c25333d7554b418b8a3f37f1a0ce6097

The generated files consist of:

.. code-block:: bash

    $ ls -la optimization/20230727-162917-3f998bb2a73846439fbb1e480a2cb22a/
    20230727-162917/       # validation files generated from each deployment
    optimization-results/  # the optimization results
    layers_services.yaml   # E2Clab config files
    network.yaml
    workflow.yaml

For each deployment, in ``20230727-162917/``, we have the **validation files**, such as ``layers_services-validate.yaml``, ``results/``, and ``workflow-validate.out``.

.. code-block:: text

    $ ls -la optimization/20230727-162917-3f998bb2a73846439fbb1e480a2cb22a/20230727-162917/
    layers_services-validate.yaml
    results/
    workflow-validate.out

In ``optimization-results/``, we have:

.. code-block:: bash

    $ ls -la optimization/20230727-162917-3f998bb2a73846439fbb1e480a2cb22a/optimization-results/
    params.json  # the parameters explored by the algorithm
    params.pkl   # contains state information of the algorithm (for checkpoint)

.. note::

    **Checkpoints:** users can `snapshot the training progress `_.
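These artifacts can also be inspected directly with the Python standard library. A minimal sketch, assuming one of the deployment directories above (the path is illustrative; adjust it to your own run):

.. code-block:: python

    # Quick inspection of the saved optimization artifacts; the path below is
    # illustrative, taken from the example listing above.
    import json
    import pickle

    base = ("optimization/20230727-162917-3f998bb2a73846439fbb1e480a2cb22a"
            "/optimization-results")

    # Parameters explored by the algorithm for this deployment.
    with open(f"{base}/params.json") as f:
        print(json.load(f))

    # State information of the algorithm (usable as a checkpoint).
    with open(f"{base}/params.pkl", "rb") as f:
        state = pickle.load(f)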