Smart Surveillance Application

This example is depicted in Figure 1: Anatomy and was provided by Pedro Silva. The example consists of data producers at the Edge (8 nodes, each one with 40 cameras); gateways at the Fog (4 nodes, each one running a Java application to process images); and a Flink Cluster (1 node) and a Kafka Cluster (1 node) at the Cloud. The Flink Cluster is composed of one Job Manager and one Task Manager. The Kafka Cluster consists of a Zookeeper server and a Kafka Broker. Lastly, a single node hosts the metrics collector, a Java application that collects metrics such as the end-to-end latency.

In this example you will learn how to:

  • Configure a Flink Cluster and a Kafka Cluster.

  • Deploy 4 gateways and 320 data producers and interconnect them in a round-robin fashion.

  • Define network constraints (delay, loss, and bandwidth) between the Edge, Fog, and Cloud infrastructures.

  • Manage and run the Flink Job, the Kafka Broker, the gateways, and the data producers.

  • Check end-to-end execution.

../_images/cctv_anatomy.png

Figure 1: Anatomy

Experiment Artifacts

Following the instructions below you can access the experiment artifacts required to run this application. These artifacts comprise applications, libraries, a dataset, and configuration files.

# G5K frontend
$ ssh rennes.grid5000.fr
$ tmux
$ cd git/
$ git clone https://gitlab.inria.fr/E2Clab/examples/cctv
$ cd cctv/
$ ls
artifacts/          # contains dataset/, libs/, libs-opencv/, and Flink job
cluster-2020/       # contains the code used to generate the charts
getting_started/    # contains layers_services.yaml, network.yaml, and workflow.yaml

Defining the Experimental Environment

Layers & Services Configuration

This configuration file presents the layers and services that compose the Smart Surveillance example. The Flink Cluster (a single node, quantity: 1) is composed of one Job Manager and one Task Manager. The Kafka Cluster (a single node, quantity: 1) consists of a Zookeeper server and a Kafka Broker. Lastly, we have the gateways (4 nodes: quantity: 1 plus repeat: 3) and the data producers (8 nodes: quantity: 1 plus repeat: 7). Note that all services tagged with roles: [monitoring] will be monitored during the execution of experiments if you define a monitoring service (monitoring: type: dstat).

environment:
  job_name: cctv-example
  walltime: "01:30:00"
  g5k:
    job_type: ["deploy"]
    env_name: "debian10-x64-big"
    cluster: parasilo
monitoring:
  type: dstat
layers:
- name: cloud
  services:
  - name: Flink
    quantity: 1
    roles: [monitoring]
    env:
      FLINK_PROPERTIES: "jobmanager.heap.size: 8000m\n
                         parallelism.default: 16\n
                         taskmanager.numberOfTaskSlots: 32\n
                         taskmanager.heap.size: 7000m\n
                         env.java.opts.taskmanager: -Djava.library.path=/opt/flink/lib/"
  - name: Kafka
    quantity: 1
    roles: [monitoring]
    env:
      KAFKA_ZOOKEEPER_CONNECTION_TIMEOUT_MS: '30000'
      KAFKA_BATCH_SIZE: '200000'
      KAFKA_LINGER_MS: '50'
- name: fog
  services:
  - name: Mosquitto
    quantity: 1
    roles: [monitoring]
    repeat: 3
- name: edge
  services:
  - name: Producer
    quantity: 1
    roles: [monitoring]
    repeat: 7
- name: experiment_manager
  services:
  - name: Metrics_collector
    quantity: 1

Network Configuration

The file below presents the network configuration between the cloud, fog, and edge infrastructures. Between the cloud and the fog we have the following configuration: delay: 5ms, loss: 2%, rate: 1gbit, while between the fog and the edge we emulate a 4G LTE network: delay: 50ms, loss: 5%, rate: 150mbit.

networks:
- src: cloud
  dst: fog
  delay: "5ms"
  rate: "1gbit"
  loss: "2%"
- src: fog
  dst: edge
  delay: "50ms"
  rate: "150mbit"
  loss: "5%"
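
Once deployed, you can sanity-check the emulated constraints directly from the nodes. A minimal check, using illustrative hostnames (take the real ones from the layers_services-validate.yaml file presented later); note that netem may apply the delay in each direction, so the observed RTT can be roughly twice the configured delay:

# From a producer node (edge), measure the RTT towards its gateway (fog)
$ ping -c 5 parasilo-25.rennes.grid5000.fr

# Inspect the tc/netem rules emulating delay, loss, and rate on a node
$ sudo tc qdisc show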

Workflow Configuration

This configuration file presents the application workflow. It is explained below in the following order: prepare, launch, and finalize.

prepare

  • Regarding Kafka cloud.kafka.*.leader, we are creating two topics, in-uni-data and out-uni-data, each with 32 partitions.

  • Regarding Flink Job Manager cloud.flink.*.job_manager, we are copying the Flink Job from the local machine to the remote machine.

  • Regarding gateways fog.mosquitto.*, we are copying its libraries.

  • Regarding producers edge.producer.*, we are copying the producer Java application and its libraries.

  • Regarding metrics collector experiment_manager.metrics_collector.*, we are copying its libraries.

launch

  • Regarding Flink Job Manager cloud.flink.*.job_manager, we are copying the Flink Job into the container and submitting it.

  • Regarding gateways fog.mosquitto.*, we are starting the Java application.

  • Regarding producers edge.producer.*, we are starting 40 Java processes per node. We use the depends_on attribute to interconnect the 320 producers with the 4 gateways in round robin grouping: "round_robin" (see the sketch after this list). This configuration may be seen in the shell command, where we use the prefix mosquitto prefix: "mosquitto" to access the gateway URL {{ mosquitto.url }}.

  • Regarding metrics collector experiment_manager.metrics_collector.*, we are starting the Java applications that collect the metrics.
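
To make the round-robin interconnection concrete, the sketch below approximates how the 8 producer nodes are mapped to the 4 gateways (the exact host ordering is an assumption; E2Clab computes this assignment internally). Since each producer node starts 40 processes, each gateway ends up serving 2 nodes, that is, 80 producers:

# Illustrative only: approximate round-robin mapping of producer nodes to gateways
$ for n in 0 1 2 3 4 5 6 7; do echo "producer node $n -> gateway $((n % 4))"; done
producer node 0 -> gateway 0
producer node 1 -> gateway 1
...
producer node 7 -> gateway 3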

finalize

  • Regarding Kafka cloud.kafka.*.leader, we are stopping and removing Kafka and Zookeeper docker containers.

  • Regarding Flink Job Manager cloud.flink.*.job_manager, we are stopping and removing the Task Manager container.

  • Regarding gateways fog.mosquitto.*, we are collecting the metrics of all 4 gateways, such as their latency and throughput.

  • Regarding producers edge.producer.*, we are collecting the metrics of all 320 producers, such as their throughput.

  • Regarding metrics collector experiment_manager.metrics_collector.*, we are collecting the end-to-end processing latency.

# KAFKA
- hosts: cloud.kafka.*.leader
  depends_on:
    service_selector: "cloud.kafka.*.zookeeper"
    grouping: "round_robin"
    prefix: "zookeeper"
  prepare:
    - debug:
        msg: "Creating my Kafka topics"
    - shell: "sudo docker exec leader /usr/bin/kafka-topics --create --partitions 32 --replication-factor 1 --if-not-exists --zookeeper {{ zookeeper.url }} --topic in-uni-data"
    - shell: "sudo docker exec leader /usr/bin/kafka-topics --create --partitions 32 --replication-factor 1 --if-not-exists --zookeeper {{ zookeeper.url }} --topic out-uni-data"
  finalize:
    - debug:
        msg: "Stopping Kafka and Zookeeper"
    - shell: "sudo docker container stop leader && sudo docker container rm leader"
    - shell: "sudo docker container stop zookeeper && sudo docker container rm zookeeper"
# FLINK
- hosts: cloud.flink.*.job_manager
  depends_on:
    service_selector: "cloud.kafka.*.leader"
    grouping: "round_robin"
    prefix: "kafka"
  prepare:
    - debug:
        msg: "Copying my Flink job dependencies"
    - copy:
        src: "{{ working_dir }}/university-people-count-flink-partial-on-cloud-0.0.1-SNAPSHOT.jar"
        dest: "/opt/university-people-count-flink-partial-on-cloud-0.0.1-SNAPSHOT.jar"
    - copy:
        src: "{{ working_dir }}/libs/"
        dest: "/opt/lib/"
    - copy:
        src: "{{ working_dir }}/libs-opencv/"
        dest: "/opt/lib/"
  launch:
    - debug:
        msg: "Starting my Flink job"
    - shell: "sudo docker cp /opt/university-people-count-flink-partial-on-cloud-0.0.1-SNAPSHOT.jar {{ self_container }}:/opt/ &&
              sudo docker cp /opt/lib/ task_manager:/opt/flink/"
    - shell: "sudo docker exec {{ self_container }} flink run -d -q /opt/university-people-count-flink-partial-on-cloud-0.0.1-SNAPSHOT.jar
                32 {{ kafka.url }} in-uni-data out-uni-data {{ kafka.url }} 20"
  finalize:
    - debug:
        msg: "Stopping Flink"
    - shell: "sudo docker container stop task_manager && sudo docker container rm task_manager"
# GATEWAYS
- hosts: fog.mosquitto.*
  depends_on:
    - service_selector: "fog.mosquitto.*"
      grouping: "address_match"
      prefix: "mosquitto"
    - service_selector: "cloud.kafka.*.leader"
      grouping: "round_robin"
      prefix: "kafka"
  prepare:
    - debug:
        msg: "Copying my active gateways libs (Mosquitto + Edgent) from: {{ working_dir }}"
    - copy:
        src: "{{ working_dir }}/libs/"
        dest: "/opt/lib/"
    - copy:
        src: "{{ working_dir }}/libs-opencv/"
        dest: "/opt/lib/"
  launch:
    - debug:
        msg: "Starting my active gateways (Mosquitto + Edgent)"
    - shell: "java -Djava.library.path=/opt/lib/ -cp /opt/lib/*:/opt/lib university_people_count.university_people_count_fog_to_cloud.UniversityPeopleCountActiveMosquittoToKafka
              tcp://{{ mosquitto.url }} in-uni-data 0 tcp://{{ kafka.url }} in-uni-data 0 10 100 /tmp/mosquitto /opt/metrics 2>&1 | tee -a /opt/cctv-ag.log"
      async: 260
      poll: 0
  finalize:
    - debug:
        msg: "Backup gateways metrics to {{ working_dir }} + /experiment-results/gateways/"
    - fetch:
        src: "/opt/metrics/latency"
        dest: "{{ working_dir }}/experiment-results/gateways/"
        validate_checksum: no
    - fetch:
        src: "/opt/metrics/throughput"
        dest: "{{ working_dir }}/experiment-results/gateways/"
        validate_checksum: no
# PRODUCERS
- hosts: edge.producer.*
  depends_on:
    service_selector: "fog.mosquitto.*"
    grouping: "round_robin"
    prefix: "mosquitto"
  prepare:
    - debug:
        msg: "Copying producer libs"
    - copy:
        src: "{{ working_dir }}/libs/"
        dest: "/opt/lib/"
    - copy:
        src: "{{ working_dir }}/dataset/"
        dest: "/opt/dataset/"
  launch:
    - debug:
        msg: "Starting producers and connecting them with gateways (round robin)"
    - shell: "java -Xms256m -Xmx1024m -cp /opt/lib/*:/opt/lib university_people_count.university_people_count_data_source.UniversityPeopleCountDataSourceToMosquitto
              /opt/dataset/data-{{ item }} 10 {{ item }} tcp://{{ mosquitto.url }} in-uni-data /tmp/mosquitto 220 0 /opt/metrics-{{ item }}
              2>&1 | tee -a /opt/cctv-producer-{{ item }}.log"
      async: 220
      poll: 0
      loop: "{{ range(1, 41)|list }}"
  finalize:
    - debug:
        msg: "Backup producers metrics"
    - fetch:
        src: "/opt/metrics-{{ item }}/throughput"
        dest: "{{ working_dir }}/experiment-results/producers/"
        validate_checksum: no
      loop: "{{ range(1, 41)|list }}"
# METRICS COLLECTOR
- hosts: experiment_manager.metrics_collector.*
  depends_on:
    service_selector: "cloud.kafka.*.leader"
    grouping: "round_robin"
    prefix: "kafka"
  prepare:
    - debug:
        msg: "Copying application to collect the experiment metrics"
    - copy:
        src: "{{ working_dir }}/libs/"
        dest: "/opt/lib/"
  launch:
    - debug:
        msg: "Starting my application to collect the experiment metrics"
    - shell: "sleep 20 && java -cp /opt/lib/*:/opt/lib university_people_count.university_people_count_sink.UniversityPeopleCountCloudInSink {{ kafka.url }}
              in-uni-data 0 /opt/metrics/in-sink 2>&1 | tee -a /opt/cctv-sink-input-cloud.log"
      async: 230
      poll: 0
    - shell: "sleep 20 && java -cp /opt/lib/*:/opt/lib university_people_count.university_people_count_sink.UniversityPeopleCountOutSink {{ kafka.url }}
              out-uni-data 0 /opt/metrics/out-sink 2>&1 | tee -a /opt/cctv-sink-output.log"
      async: 230
      poll: 0
  finalize:
    - debug:
        msg: "Backup sinks metrics"
    - fetch:
        src: "/opt/metrics/in-sink/latency"
        dest: "{{ working_dir }}/experiment-results/sinks/in-sink/"
        validate_checksum: no
    - fetch:
        src: "/opt/metrics/out-sink/latency"
        dest: "{{ working_dir }}/experiment-results/sinks/out-sink/"
        validate_checksum: no
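
A note on the producer launch step above: loop: "{{ range(1, 41)|list }}" runs the shell task once per value of {{ item }} (1 to 40), while async: 220 with poll: 0 starts each process in the background without waiting for it to finish. Under those semantics, each edge node executes the rough shell equivalent below ({{ mosquitto.url }} is resolved by E2Clab; the <gateway-url> placeholder stands in for it here):

# Rough shell equivalent of the expanded producer launch loop on one edge node
$ for item in $(seq 1 40); do
    java -Xms256m -Xmx1024m -cp /opt/lib/*:/opt/lib \
      university_people_count.university_people_count_data_source.UniversityPeopleCountDataSourceToMosquitto \
      /opt/dataset/data-$item 10 $item tcp://<gateway-url> in-uni-data /tmp/mosquitto 220 0 /opt/metrics-$item \
      >> /opt/cctv-producer-$item.log 2>&1 &
  done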

Running & Verifying Experiment Execution

Find below the commands to deploy this application and check its execution. Once deployed, you can check in the Flink WebUI the configurations defined in the layers_services.yaml and workflow.yaml configuration files, such as the number of Task Managers, the total task slots, the Job Manager and Task Manager heap sizes, the job parallelism, etc. (see Figure 2: Flink WebUI, Figure 3: Flink configurations, and Figure 4: Flink Job). Besides, you can also check the data sent to the Mosquitto in-uni-data topic and to the Kafka in-uni-data and out-uni-data topics. For the producers and the metrics collector, you can check the Java processes running on them.

  • Running multiple experiments: in this example we are running two experiments (--repeat 1 repeats the run once beyond the initial one), each with a duration of four minutes (--duration 240, in seconds).

# G5K frontend
$ ssh rennes.grid5000.fr
$ tmux
$ cd git/e2clab/

# G5K interactive usage
$ oarsub -p "cluster='parasilo'" -l host=1,walltime=2 -I

# As soon as the host becomes available, you can run your experiments
$ source ../venv/bin/activate
$ e2clab deploy --repeat 1 --duration 240
                /home/drosendo/git/cluster-2020-artifacts/getting_started/
                /home/drosendo/git/cluster-2020-artifacts/artifacts/
  • Verifying experiment execution.

# Discover hosts in layers_services-validate.yaml file
$ cat layers_services-validate.yaml

# Tunnel to Flink Job Manager
$ ssh -NL 8081:localhost:8081 parasilo-12.rennes.grid5000.fr

# Kafka Cluster
$ ssh parasilo-14.rennes.grid5000.fr

# Kafka input topic
$ sudo docker exec -it leader /usr/bin/kafka-console-consumer --bootstrap-server localhost:9092 --topic in-uni-data

# Kafka output topic
$ sudo docker exec -it leader /usr/bin/kafka-console-consumer --bootstrap-server localhost:9092 --topic out-uni-data

# Gateway input topic
$ ssh parasilo-25.rennes.grid5000.fr
$ sudo docker exec -it mosquitto mosquitto_sub -t "in-uni-data"

# Producer
$ ssh parasilo-3.rennes.grid5000.fr
$ ps -C java | wc -l
41
$ ps -xau | grep java
java -Xms256m -Xmx1024m -cp /opt/lib/*:/opt/lib university_people_count.university_people_count_data_source.UniversityPeopleCountDataSourceToMosquitto /opt/dataset/data-1 10 1 tcp://parasilo-25.rennes.grid5000.fr:1883 in-uni-data /tmp/mosquitto 220 0 /opt/metrics-1
...
java -Xms256m -Xmx1024m -cp /opt/lib/*:/opt/lib university_people_count.university_people_count_data_source.UniversityPeopleCountDataSourceToMosquitto /opt/dataset/data-40 10 40 tcp://parasilo-25.rennes.grid5000.fr:1883 in-uni-data /tmp/mosquitto 220 0 /opt/metrics-40

# Metrics Collector
$ ssh parasilo-9.rennes.grid5000.fr
$ ps -xau | grep java
java -cp /opt/lib/*:/opt/lib university_people_count.university_people_count_sink.UniversityPeopleCountCloudInSink parasilo-14.rennes.grid5000.fr:9092 in-uni-data 0 /opt/metrics/in-sink
java -cp /opt/lib/*:/opt/lib university_people_count.university_people_count_sink.UniversityPeopleCountOutSink parasilo-14.rennes.grid5000.fr:9092 out-uni-data 0 /opt/metrics/out-sink
  • In Figure 2: Flink WebUI you can check the single Task Manager configured with 32 task slots (taskmanager.numberOfTaskSlots: 32).

../_images/cctv_webui.png

Figure 2: Flink WebUI

  • In Figure 3: Flink configurations you can check env.java.opts.taskmanager: -Djava.library.path=/opt/flink/lib/, jobmanager.heap.size: 8000m, taskmanager.heap.size: 7000m, and parallelism.default: 16.

../_images/cctv_configs.png

Figure 3: Flink configurations

../_images/cctv_running_job.png

Figure 4: Flink Job

Deployment Validation & Experiment Results

Find below the files generated after the execution of each experiment. They consist of validation files (layers_services-validate.yaml, network-validate/, and workflow-validate.out), monitoring data (dstat/), and experiment-results/ (files generated by the producers producers/, the gateways gateways/, and the metrics collector sinks/). Note that a new directory (e.g., 20200614-155815/) is generated for each experiment.

$ ls /home/drosendo/git/cluster-2020-artifacts/getting_started/20200614-155815/
dstat/                          # Monitoring data for each physical machine
layers_services-validate.yaml   # Mapping between layers and services with physical machines
network-validate/               # Network configuration for each physical machine
workflow-validate.out           # Commands used to deploy application (prepare, launch, and finalize)
experiment-results/             # Defined in workflow file
    gateways/                        # latency and throughput of all gateways (4 in this example)
    producers/                       # throughput of all producers (320 in this example)
    sinks/                           # latency: fog-to-cloud and end-to-end
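
As a quick post-processing sketch, assuming each latency file holds one numeric sample per line (the exact format depends on the sink application), the end-to-end latency could be averaged across all fetched files like this:

# Hypothetical: average the samples in all fetched out-sink latency files
$ find experiment-results/sinks/out-sink/ -name latency \
    -exec awk '{ sum += $1; n++ } END { if (n) print sum/n, "avg over", n, "samples" }' {} +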

Note

Providing a systematic methodology to define the experimental environment, together with access to the methodology artifacts (layers_services.yaml, network.yaml, and workflow.yaml), supports the Repeatability, Replicability, and Reproducibility of experiments; see the ACM Digital Library Terminology.