then build and tag images "docker tag <orig>:latest localhost:5000/<orig>:latest" and follow with "docker push localhost:5000/<orig>:latest" then kubernetes should be able to find the image
then build and tag images "docker tag <orig>:latest localhost:5000/<orig>:latest" and follow with "docker push localhost:5000/<orig>:latest" then Kubernetes should be able to find the image
test darknet: ./darknet detector test cfg/coco.data cfg/yolov3.cfg yolov3.weights -ext_output shopping-crowded-mall-17889570.jpg
Go into the [docker][docker_folder] directory for Dockerfiles and bash scripts; each type of image has its own directory.
- Building the base image is relatively straightforward:
`docker build -t <tag_name> -f Dockerfile .`
- This should load the image into your local docker registry (or whatever registry you're using). To build the remote images, designed to be run on NVIDIA Jetsons mounted on a UAV, the build command is slightly more complex due to the different architectures:
- This should load the image into your local docker registry (or whatever registry you're using). To cross-build the remote images, designed to be run on NVIDIA Jetsons mounted on a UAV, the build command is slightly more complex due to the different architectures, though you can of course simply build the images on the Jetson they're intended for:
- Then the tarball can be loaded into the docker registry, on the control plane node or (more likely) the ARM processor. There is an extra step in that the tarball needs to be sent to the remote processor and loaded into the relevant registry:
...
...
@@ -55,51 +54,45 @@ Go into the [docker][docker_folder] directory for Dockerfiles and bash scripts,
- And we're looking for the most recently added un-tagged image. Copy that ID and run:
`docker image tag <image_ID> <image_name>`
- The image names used in this project are `llh/basestation:v0` for the base image, and `llh/drone:v0` for the drone image. Keeping the image names the same avoids the need to edit the deployment YAML files. Once tagged, we can verify that the image is in fact the correct architecture with:
`docker image inspect <tag_name>`
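A minimal sketch of that architecture check, assuming the drone image targets `linux/arm64` (the usual platform string for a Jetson; adjust if yours differs). The helper names `image_platform` and `platform_matches` are hypothetical and not part of this repo; `Os` and `Architecture` are fields of the standard `docker image inspect` output.

```shell
# Print the OS/architecture recorded in an image's metadata, e.g. linux/arm64.
# $1: image name or ID
image_platform() {
    docker image inspect --format '{{.Os}}/{{.Architecture}}' "$1"
}

# Succeed only when the reported platform equals the expected one.
# $1: reported platform, $2: expected platform
platform_matches() {
    [ "$1" = "$2" ]
}

# Example (not run here, needs Docker and the tagged image):
#   platform_matches "$(image_platform llh/drone:v0)" linux/arm64 && echo "architecture OK"
```

This avoids scanning the full JSON output by eye, which is easy to get wrong on a long inspect dump.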
Some notes on this process: the image you'll want to build on the Jetsons is the YOLO detector image stored in the [yolo directory](yolo_folder). The Dockerfile is fairly lengthy, including ROS, OpenCV, and darknet itself, so the build process takes approximately 1-2 hours.
### NFS Setup
Currently the base-station image is stored on Dockerhub and does not yet have a Dockerfile to build it. You can find the AMD architecture image [here](base_station_image).
The pod deployments also make use of an NFS hosted from the control-plane node (the machine on which the cluster setup script is run). The NFS allows all pods to save images or otherwise to the same shared directory across the cluster. The setup steps are as follows:
### Pushing To Local Registry
- Install relevant NFS tools (usually already installed on standard linux distros):
`sudo apt-get install nfs-kernel-server nfs-common`
Sometimes Kubernetes does not find the images available to Docker unless they are pushed to a locally run registry or Dockerhub; this shows up as an `ErrImagePull` (or `ImagePullBackOff`) status when attempting to run a deployment in Kubernetes. The steps to solve this are below:
- To configure which directories are included in the filesystem, edit the `/etc/exports` file on the control-plane to allow connections from all pods as well as the host itself (note, these rules might be overkill for the application, and may need edits for a larger cluster). Also, the shared folder certainly does not need to be in the home directory but it's the example the author used. The contents of the exports file should look similar to:
`docker run -d -p 5000:5000 --restart=always --name registry registry:2`
- The only other thing to do on the host machine is to check that the NFS service is running with:
`service nfs-server status`
- Retag the image you wish to push onto this registry to include the local registry's address at the front:
`docker tag <original_image_name>:<image_tag> localhost:5000/<original_image_name>:<image_tag>`
- Fortunately, kubernetes handles the mounting process for remote storage, so the `image\*` files specified in the [kubernetes][kubernetes_folder] directory should handle the details of the NFS.
- The image can then be pushed as follows; this will take a moment depending on how large the image is:
Once mounted without errors, create a file in the mounted directory from the remote machine (e.g. `touch /home/user/imageNFS/text.txt`) and check if it appears in the host machine's directory. The same test should also be done from both pods once scheduled on the cluster (though if they do indeed reach scheduling, chances are the storage has been mounted).
A similar idea is used to push an image to Dockerhub but the `localhost` is replaced with your username.
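The retag-and-push steps for the local registry can be sketched as a small helper. This is an illustration only: the function names `registry_ref` and `push_image` are hypothetical, and it assumes the registry container started above is listening on `localhost:5000`.

```shell
# Build the registry-qualified reference for an image.
# $1: registry address (localhost:5000 in the steps above)
# $2: image name with tag (e.g. llh/drone:v0)
registry_ref() {
    printf '%s/%s\n' "$1" "$2"
}

# Retag a locally built image with the registry address and push it.
# Needs Docker and a reachable registry at the given address.
push_image() {
    target="$(registry_ref "$1" "$2")"
    docker tag "$2" "$target" && docker push "$target"
}

# Example (not run here):
#   push_image localhost:5000 llh/drone:v0
```

The deployment's `image:` field would then need to use the same registry-qualified name for Kubernetes to pull it.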
## Cluster Setup
After the images are built, the next step is likely to bring up a cluster. One note about nomenclature: both ROS and kubernetes have the concept of nodes, but they refer to two different objects in our cluster. ROS-nodes are processes that run on kubernetes pods, and kubernetes nodes are remote machines that run pods.
After the images are built or obtained, the next step is likely to bring up a cluster. One note about nomenclature: both ROS and Kubernetes have the concept of nodes, but they refer to two different objects in our cluster. ROS nodes are processes that run on Kubernetes pods, and Kubernetes nodes are remote machines that run pods.
The `clusterSetup.sh` script should set up a kubernetes cluster, with some assumptions about the application. It can be run via (should request password due to swap memory):
`/clusterSetup.sh`
The `clusterSetup.sh` script should set up a Kubernetes cluster, with some assumptions about the application. It can be run via (should request password due to swap memory):
`./clusterSetup.sh`
Once the cluster is running, the next step is likely to add remote nodes to the cluster. The cluster setup script also copies the node join command generated when creating a cluster to make it easier to add remote machines to the cluster. To add a node, start an SSH session on the remote machine and run:
`sudo <pasted_join_command>` which should look similar to:
It is worth noting that the cluster setup script is very much a work in progress and includes many commands that are only used in testing a highly-available cluster architecture, however the default settings will work fine for general use. Once the cluster is running, the next step is likely to add remote nodes to the cluster. The cluster setup script also copies the node join command generated when creating a cluster to make it easier to add remote machines to the cluster. To add a node, start an SSH session on the remote machine and run:
`sudo swapoff -a` then the pasted join command which should look similar to:
The cluster can also be brought up using the highly-available paradigm (more details and steps [here][HAtutorialK8] and on the official source [here][HAtutorial]), allowing multiple control-plane nodes to be in the cluster. To achieve this setup, we will be creating an external etcd cluster with all the nodes to be used in the K8 cluster, configure `kubeadm` to use that external etcd option, create a load balancer with kube-vip pods so there is a single point of contact, and finally add each node to the K8 cluster as a control-plane.
We can create an external etcd cluster using files from this repo with the following steps:
- Make sure etcd is installed on the machine, steps [here][etcdInstall], though we will be using a different `etcd.conf` file so stop after step 1. Confirm installation with `etcd --version` (expect a warning about unsupported ARM arch if using jetson, `export ETCD_UNSUPPORTED_ARCH=arm64` fixes it). This also installs the `etcdctl` tool which we'll need later.
- Create the file `/etc/systemd/system/etcd3.service`, making it a copy of the `etcd3.service` template in the [HA cluster][ha_folder] folder. Then we can remove any other `etcdx.service` files to reduce confusion, can also edit the existing `etcd.service` file. Do this for all nodes to be added to the cluster. No changes need to be made to the service file as we will be using an environment file to specify addresses.
- Create the environment file specified in `etcd3.service` as `/etc/etcd.env`, with the contents of the template `etcd.env` file in the [HA cluster][ha_folder] folder. In this environment file we specify each host's IP address and hostname, and the current host's IP address and hostname. Note that only the last two lines (the lines with `THIS_*`) need to be changed per host; in this way each host knows the others' addresses. On ARM architecture nodes, the line `ETCD_UNSUPPORTED_ARCH=arm64` should be added to the top of the environment file.
- Bringing up the etcd cluster can be tricky because each host needs to be able to detect the others during start up, otherwise errors get thrown, so we first need to stop the etcd service (`sudo systemctl stop etcd3.service` on each host prior) and reload the service definition on all hosts one after another:
`sudo systemctl stop etcd3.service
sudo rm -r /var/lib/etcdCluster/
sudo systemctl daemon-reload
sudo systemctl enable etcd3.service
sudo systemctl start etcd3.service`<!-- rm will be /etcdCluster btw -->
Note, if a cluster has already been created in this way, we need to remove the data directory of that cluster by `rm -r /var/lib/etcd/` whilst the `etcd3.service` is stopped (may require super user shell). Also `journalctl -xe` and `systemctl status etcd3.service` can be helpful for troubleshooting. Also disabling the existing etcd service can make for less work on reboot.
- If all is well we should be able to confirm all nodes are present in the cluster by first specifying the addresses in the command line then requesting a member list:
Where `HOST_x` is the ip address of each host, the same values as in `etcd.env`. If all members are accounted for, we can move on to the load balancer steps.
<!-- in our case it'll be: export ENDPOINTS=192.168.111.200:2379,192.168.111.202:2379,192.168.111.201:2379 -->
In order to use multiple nodes as control planes, we need to create a single point of access for the kubernetes API server (there are about a million ways to do this, keepalived, google, aws, but we'll be using the fancy new [kube-vip][load-balancer] option). We'll be running a `kube-vip` pod on each node by placing the `kube-vip.yaml` file in the `/etc/kubernetes/manifests/` directory of each node (`kube-vip.yaml` template can be found in the [HA cluster][ha_folder] folder as per usual). In `kube-vip.yaml` there are some networking options, in our case we will likely need to change the interface option depending on the node's configuration. There are other options as well, the default values are usually fine, we should note the load-balancer ip address however. Note that the pod definition needs to be in the manifests directory before starting the cluster.
### Starting HA Cluster
To start the cluster, we can still use the cluster startup script however we now need to specify how to reach the etcd cluster and what nodes it uses. In `kubeadm-config.yaml` (in root directory) change the external etcd nodes' ip addresses to the value used in the etcd cluster setup step. We can now run
`./clusterStartup` on the first node (order doesn't matter, probably).
@@ -191,7 +139,7 @@ Once the cluster is up, either with the cluster setup bash script or otherwise,
To get a bash shell in a pod (find the pod name via `kubectl get pods`), run `kubectl exec -it <pod_name> -- bash`. This starts an interactive bash terminal in the specified pod. Once inside a pod, the file system can be browsed like any Linux system (`ls`, `cd` and so on).
- To start using ROS commands, the workspace needs to be sourced first. The source command is `source /opt/ros_ws/devel/setup.bash` (depending on which image is in use, the location of ros_ws folder may change but it should be close to where the bash script started you).
- To start using ROS commands, the workspace needs to be sourced first. The Darknet pods do this automatically via the `~/.bashrc` file, but in case something goes wrong, the source command is `source /opt/ros_ws/devel/setup.bash` (depending on which image is in use, the location of the ros_ws folder may change, but it should be close to where the bash script started you).
- Now that the ROS workspace has been sourced, all ROS commands should work.
* To check what topics are available and/or check whether a ROS master is running, use `rostopic list`, which returns a list of all topics currently being published.
...
...
@@ -205,34 +153,35 @@ Alternately, an SSH connection can be created if using the built images discusse
`kubectl get pods -o wide`
- The pod in question can then be SSH'ed into via (pod IPs are usually of the form 10.244.\*.\*, depending on your particular cluster setup):
`ssh ssher@<pod_ip>`
`ssh root@<pod_ip>`
- If prompted for a password, use `Meepp973`, which is the default password for the `root` user for now (security is not important for this project, as you can see). There is also the problem of network identity for the pods: since the identity can change between deployments of the same pod (also the reason we need to look up the IP each time), the SSH tool on Linux can sometimes throw a man-in-the-middle error. To get around this, delete the entry in the `~/.ssh/known_hosts` file as suggested by the error message. Instead of having the cluster setup script delete the known hosts file, we leave it to the user to delete specific entries for now.
- If prompted for a password, use `password`, which is the default password for the `ssher` user for now (security is not important for this project, as you can see). There is also the problem of network identity for the pods: since the identity can change between deployments of the same pod (also the reason we need to look up the IP each time), the SSH tool on Linux can sometimes throw a man-in-the-middle error. To get around this, delete the entry in the `~/.ssh/known_hosts` file as suggested by the error message. Instead of having the cluster setup script delete the known hosts file, we leave it to the user to delete specific entries for now.
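Removing a single stale entry can also be done with `ssh-keygen -R` rather than editing the file by hand. The wrapper below is a sketch, not something shipped in this repo; the name `forget_pod_key` is hypothetical, and the pod IP in the usage comment is a placeholder.

```shell
# Drop the cached host key for a pod IP so that a redeployed pod reusing
# the address no longer trips the SSH host-key check.
# $1: pod IP (from `kubectl get pods -o wide`)
# $2: known_hosts file (defaults to the current user's file)
forget_pod_key() {
    ssh-keygen -R "$1" -f "${2:-$HOME/.ssh/known_hosts}"
}

# Example (not run here):
#   forget_pod_key <pod_ip>
```

`ssh-keygen -R` only removes entries matching the given host, so the rest of the known hosts file is left alone.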
The advantage of using the SSH command instead of `kubectl exec ...` is that the `ssher` user specified already has the correct ROS environment sourced and ready to use. In normal use, only the base pod (also referred to as the master pod in some places) should be needed to control the cluster although both pods should be available via SSH.
The advantage of using the SSH command instead of `kubectl exec ...` is that the `roslaunch` command can work straight out of the box rather than first having to ssh between the Darknet pod and the Base Station pod. In normal use, only the base pod (also referred to as the master pod in some places) should be needed to control the cluster although both pods should be available via SSH. Headless services in the Base Station and Darknet deployments are used in combination with a `machines.launch` file in the Base Station image to allow seamless ROS launching.
## Imaging ROS Test
A relatively simple imaging pipeline has been built into the images for testing and characterization purposes. A usb webcam, or action camera with a usb interface, is needed in the current configuration although it can be easily switched to work with existing image files. For the pipeline, two ROS-nodes are launched, one capturing images from the camera and one saving those images to a networked file system (NFS). Note that it only matters where the image capturing ROS-node is running as it requires the camera or other image source to be present, the image saving ROS-node can be scheduled anywhere in the cluster as the NFS and raw image are provided to the cluster as a whole.
A relatively simple imaging pipeline has been built into the images for testing and characterization purposes. A usb webcam, or action camera with a usb interface, is needed in the current configuration although it can be easily switched to work with existing image files. For the pipeline, two ROS-nodes are launched, one capturing images from the camera and one saving those images to a storage volume. Note that it only matters where the image capturing ROS-node is running as it requires the camera or other image source to be present, the image saving ROS-node can be scheduled anywhere in the cluster as the storage volume is included in the deployment itself.
The steps to run the image pipeline are below:
- Start the cluster and add the base and drone deployments from the [kubernetes folder][kubernetes_folder], discussed above.
- Start the cluster and add the base and drone deployments from the [Kubernetes folder][Kubernetes_folder], discussed above.
- Start an SSH session on the base pod, get the pod's IP via:
`kubectl get pods -o wide`
and start the session by:
`ssh ssher@<base pod ip>`
`ssh root@<base pod ip>`
- Run the imaging ROS-nodes via a roslaunch file from the base node:
`roslaunch base_station image_test.launch`
The pods are already configured to have access to each other via services as well as the NFS available on the control-plane node so we don't need to specify anything more.
If the pipeline is working the NFS directory shared from the control-plane should be filling up with images. Not every image received is saved so there may be a delay of a few seconds between images at the moment (kubernetes-based processing pipeline is TBD).
If the pipeline is working the storage directory shared from the control-plane should be filling up with images. Not every image received is saved so there may be a delay of a few seconds between images at the moment (Kubernetes-based processing pipeline is TBD).
## Handy Troubleshooting Commands
There are a number of non-intuitive kubernetes commands that are super helpful for specific tasks:
There are a number of non-intuitive Kubernetes commands that are super helpful for specific tasks:
- Drain a particular node so that it can be removed:
- List all admin cluster events (can change namespace or omit):
`kubectl get events -n kube-system`
- Regenerate token join command:
`kubeadm token create --print-join-command`
Regenerate join command and save it to clipboard:
...
...
@@ -265,6 +217,15 @@ Alternate:
- It has happened that the cluster networking fails entirely, making services unreachable from inside pods. The solution seemed to be to delete the flannel deployment on the cluster and re-apply it afterwards. This error can be especially difficult to detect.
- If running the cluster in a scenario without internet access, please follow the guide [here](coredns-fix). Start with `export KUBE_EDITOR="nano"` to edit with nano if you don't know how to use Vim. Then run `kubectl edit configmap coredns -n kube-system`; we're trying to delete the line (don't ask me why or how this works, it just does):
The pod deployments can also make use of an NFS hosted from the control-plane node (the machine on which the cluster setup script is run). The NFS allows all pods to save images or otherwise to the same shared directory across the cluster. The setup steps are as follows:
- Install relevant NFS tools (usually already installed on standard linux distros):
`sudo apt-get install nfs-kernel-server nfs-common`
- To configure which directories are included in the filesystem, edit the `/etc/exports` file on the control-plane to allow connections from all pods as well as the host itself (note, these rules might be overkill for the application, and may need edits for a larger cluster). Also, the shared folder certainly does not need to be in the home directory but it's the example the author used. The contents of the exports file should look similar to:
- The only other thing to do on the host machine is to check that the NFS service is running with:
`service nfs-server status`
- Fortunately, Kubernetes handles the mounting process for remote storage, so the `image\*` files specified in the [Kubernetes][Kubernetes_folder] directory should handle the details of the NFS.
- A quick test can be performed to make sure the NFS is working as intended. Mount the directory from a remote machine with:
Once mounted without errors, create a file in the mounted directory from the remote machine (e.g. `touch /home/user/imageNFS/text.txt`) and check if it appears in the host machine's directory. The same test should also be done from both pods once scheduled on the cluster (though if they do indeed reach scheduling, chances are the storage has been mounted).
### Highly Available Cluster Setup
The cluster can also be brought up using the highly-available paradigm (more details and steps [here][HAtutorialK8] and on the official source [here][HAtutorial]), allowing multiple control-plane nodes to be in the cluster. To achieve this setup, we will be creating an external etcd cluster with all the nodes to be used in the K8 cluster, configure `kubeadm` to use that external etcd option, create a load balancer with kube-vip pods so there is a single point of contact, and finally add each node to the K8 cluster as a control-plane.
We can create an external etcd cluster using files from this repo with the following steps:
- Make sure etcd is installed on the machine, steps [here][etcdInstall], though we will be using a different `etcd.conf` file so stop after step 1. Confirm installation with `etcd --version` (expect a warning about unsupported ARM arch if using jetson, `export ETCD_UNSUPPORTED_ARCH=arm64` fixes it). This also installs the `etcdctl` tool which we'll need later.
- Create the file `/etc/systemd/system/etcd3.service`, making it a copy of the `etcd3.service` template in the [HA cluster][ha_folder] folder. Then we can remove any other `etcdx.service` files to reduce confusion, can also edit the existing `etcd.service` file. Do this for all nodes to be added to the cluster. No changes need to be made to the service file as we will be using an environment file to specify addresses.
- Create the environment file specified in `etcd3.service` as `/etc/etcd.env`, with the contents of the template `etcd.env` file in the [HA cluster][ha_folder] folder. In this environment file we specify each host's IP address and hostname, and the current host's IP address and hostname. Note that only the last two lines (the lines with `THIS_*`) need to be changed per host; in this way each host knows the others' addresses. On ARM architecture nodes, the line `ETCD_UNSUPPORTED_ARCH=arm64` should be added to the top of the environment file.
- Bringing up the etcd cluster can be tricky because each host needs to be able to detect the others during start up, otherwise errors get thrown, so we first need to stop the etcd service (`sudo systemctl stop etcd3.service` on each host prior) and reload the service definition on all hosts one after another:
`sudo systemctl stop etcd3.service
sudo rm -r /var/lib/etcdCluster/
sudo systemctl daemon-reload
sudo systemctl enable etcd3.service
sudo systemctl start etcd3.service`<!-- rm will be /etcdCluster btw -->
Note, if a cluster has already been created in this way, we need to remove the data directory of that cluster by `rm -r /var/lib/etcd/` whilst the `etcd3.service` is stopped (may require super user shell). Also `journalctl -xe` and `systemctl status etcd3.service` can be helpful for troubleshooting. Also disabling the existing etcd service can make for less work on reboot.
- If all is well we should be able to confirm all nodes are present in the cluster by first specifying the addresses in the command line then requesting a member list:
Where `HOST_x` is the ip address of each host, the same values as in `etcd.env`. If all members are accounted for, we can move on to the load balancer steps.
<!-- in our case it'll be: export ENDPOINTS=192.168.111.200:2379,192.168.111.202:2379,192.168.111.201:2379 -->
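A minimal sketch of that membership check, assuming etcd's default client port 2379 and the v3 `etcdctl` API. The helper names `make_endpoints` and `list_members` are hypothetical; the host IPs in the usage comment are the ones from the note above.

```shell
# Join host IPs into the comma-separated endpoint list etcdctl expects.
# Port 2379 is etcd's default client port (an assumption; match etcd.env).
make_endpoints() {
    out=""
    for host in "$@"; do
        out="${out:+$out,}${host}:2379"
    done
    printf '%s\n' "$out"
}

# Ask the cluster for its member list.
# Needs etcdctl installed and the etcd3 service running on the hosts.
list_members() {
    ETCDCTL_API=3 etcdctl --endpoints="$(make_endpoints "$@")" member list
}

# Example (not run here):
#   list_members 192.168.111.200 192.168.111.202 192.168.111.201
```

Every host listed in `etcd.env` should appear in the output before moving on to the load balancer.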
In order to use multiple nodes as control planes, we need to create a single point of access for the Kubernetes API server (there are about a million ways to do this, keepalived, google, aws, but we'll be using the fancy new [kube-vip][load-balancer] option). We'll be running a `kube-vip` pod on each node by placing the `kube-vip.yaml` file in the `/etc/kubernetes/manifests/` directory of each node (`kube-vip.yaml` template can be found in the [HA cluster][ha_folder] folder as per usual). In `kube-vip.yaml` there are some networking options; in our case we will likely need to change the interface option depending on the node's configuration. There are other options as well, and the default values are usually fine, but we should note the load-balancer IP address. Note that the pod definition needs to be in the manifests directory before starting the cluster.
#### Starting HA Cluster
To start the cluster, we can still use the cluster startup script however we now need to specify how to reach the etcd cluster and what nodes it uses. In `kubeadm-config.yaml` (in root directory) change the external etcd nodes' ip addresses to the value used in the etcd cluster setup step. We can now run
`./clusterStartup` on the first node (order doesn't matter, probably).