image pipeline readme updates

Commit efa90a1c authored Feb 11, 2022 by Your Name
8 changed files
--- a/README.md
+++ b/README.md
@@ -14,46 +14,99 @@ Which should load the image into your local docker registry (or whatever registr

 `docker buildx build --platform linux/arm64 -o type=oci,dest=<image_name>.tar -t <tag_name> .`

-Then the tarball can be loaded into the docker registry, on the control plane node or (more likely) the ARM processor:
+Then the tarball can be loaded into the docker registry, on the control plane node or (more likely) the ARM processor. There is an extra step in that the tarball needs to be sent to the remote processor and loaded into the relevant registry:

-`<ssh to processor and send image> docker load < <image_name>.tar`
+`scp <tarball_filename>.tar username@remote:/home/username/Downloads`

-Verify the image is in fact the correct architecture:
+Then the loading command on the remote processor is (from directory where image is):
+
+`docker image load -i <image_name>.tar`
+
+Since an image loaded from a tarball comes with no tag (that has been the author's experience in any case), we can now tag the newly generated image. However, first we need the ID of said new image:
+
+`docker image ls`
+
+And we're looking for the most recently added un-tagged image. Copy that ID and run:
+
+`docker image tag <image_ID> <image_name>`
+
+The image names used in this project are `llh/basestation:v0` for the base image, and `llh/drone:v0` for the drone image. Keeping the image names the same removes the need to edit the deployment yaml files. Once tagged, we can verify the image is in fact the correct architecture with:

 `docker image inspect <tag_name>`

+### NFS Setup
+
+The pod deployments also make use of an NFS hosted from the control-plane node (the machine on which the cluster setup script is run). The NFS allows all pods to save images or otherwise to the same shared directory across the cluster. The setup steps are as follows:
+
+- Install relevant NFS tools (usually already installed on standard linux distros):
+
+`sudo apt-get install nfs-kernel-server`
+
+- To configure which directories are included in the filesystem, edit the `/etc/exports` file on the control-plane to allow connections from all pods as well as the host itself (note, these rules might be overkill for the application, and may need edits for a larger cluster). Also, the shared folder certainly does not need to be in the home directory but it's the example the author used. The contents of the exports file should look similar to:
+
+`/home/<user_name>/imageNFS 10.244.0.0/24(rw)
+/home/<user_name>/imageNFS 10.244.1.0/24(rw)
+/home/<user_name>/imageNFS <control_plane_ip>/24(rw)
+`
+
+- The only other thing to do on the host machine is to check that the NFS service is running with: `service nfs-server status`
+
+- Fortunately, kubernetes handles the mounting process for remote storage, so the `image\*` files specified in the [kubernetes][kubernetes_folder] directory should handle the details of the NFS.
+
+- A quick test can be performed to make sure the NFS is working as intended. Mount the directory from a remote machine with: `mount -t nfs <host_machine_ip>:/home/<user_name>/imageNFS /home/<remote_user_name>/imageNFS`. Once mounted without errors, create a file in the mounted directory from the remote machine (e.g. `touch /home/user/imageNFS/text.txt`) and check if it appears in the host machine's directory. The same test should also be done from both pods once scheduled on the cluster (though if they do indeed reach scheduling, chances are the storage has been mounted).
+
 ## Cluster Setup

-After the images are built, the next step is likely to bring up a cluster. The `clusterSetup.sh` script does this, with some assumptions about your application. The high level steps are as follows:
+After the images are built, the next step is likely to bring up a cluster.
+
+The `clusterSetup.sh` script should set up a kubernetes cluster, with some assumptions about the application. It can be run with the command below (it should prompt for a password, since turning off swap memory requires elevated privileges):
+
+`./clusterSetup.sh`
+
+Once the cluster is running, the next step is likely to add remote nodes to the cluster. The cluster setup script also copies the node join command generated when creating a cluster to make it easier to add remote machines to the cluster. To add a node, start an SSH session on the remote machine and run:
+
+`sudo <pasted_join_command>` which should look similar to `kubeadm join <control_plane_ip>:<port> --token qwf631.i58dqjxebj01u3h9 --discovery-token-ca-cert-hash sha256:5701ff981a1b22145b4e8a631bd9567525adb76643bece39c0533173b5de579a`. And now you see why the command gets copied by the setup script.
+
+Deployments can be added by running (note that the `nodeSelector` and `image` fields may require editing depending on hostnames and image names):
+
+`kubectl apply -f kubernetes/`

- turn off swap: `swapoff -a`
+Which will add all objects/pods listed in the [kubernetes directory][kubernetes_folder].

- start cluster, config is for pod subnet (may take time): `kubeadm init --config kubeadm-config.yml`
+### Without Setup Script

- move config over:
+If the cluster setup script does not work for the application, the low-level steps are as follows:
+
+- Turn off swap: `swapoff -a`
+
+- Start cluster, config is for pod subnet (may take time): `kubeadm init --config kubeadm-config.yml`
+
+- Move config over:
 `mkdir -p $HOME/.kube
 sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
 sudo chown $(id -u):$(id -g) $HOME/.kube/config`

- apply flannel networking: `kubectl apply -f flannel.yml`
+- Apply flannel networking: `kubectl apply -f flannel.yml`

- remove taint from control plane: `kubectl taint nodes --all node-role.kubernetes.io/master-`
+- Remove taint from control plane: `kubectl taint nodes --all node-role.kubernetes.io/master-`

- label node(s) if deployment calls for it (ones in this repo do via `nodeSelector: <tag>=<value>` in the deployment): `kubectl label nodes <node_name> <tag=<value>`
+- Label node(s) if deployment calls for it (ones in this repo only rely on hostname, but eh): `kubectl label nodes <node_name> <tag>=<value>`

- start applying deployments starting with master: `kubectl apply -f kubernetes/master-deployment.yml`
+- Start applying deployments starting with master: `kubectl apply -f kubernetes/master-deployment.yml`

- continue applying deployments for as many as needed
+- Continue applying deployments for as many as needed

- turn off cluster: `kubeadm reset` potentially followed by [clean-up][cleanupCluster]
+- Turn off cluster: `kubeadm reset` potentially followed by [clean-up][cleanupCluster]

 [cleanupCluster]: [https://kubernetes.io/docs/reference/setup-tools/kubeadm/kubeadm-reset/]

 ## ROS Commands

-Once the cluster is up, either with the cluster setup bash script or otherwise, there are several helpful testing commands in ROS that we can take advantage of.
+Once the cluster is up, either with the cluster setup bash script or otherwise, there are several helpful built-in testing commands to take advantage of.

- Get into bash shell of pod (usually find which pod is acting up via `kubectl get pods`), can be done with `kubectl exec -it <pod_name> -- bash`. This will start a persistent bash terminal in the pod specified. Once inside a pod the file system can be viewed as any linux system (using `ls` and `cd`).
+### Bash Shell in Pods
+
+To get into a bash shell in a pod (find the pod name via `kubectl get pods`), run `kubectl exec -it <pod_name> -- bash`. This will start a persistent bash terminal in the pod specified. Once inside a pod, the file system can be viewed as on any linux system (`ls`, `cd` and so on).

 - To start using ROS commands, the workspace needs to be sourced first. The source command is `source /opt/ros_ws/devel/setup.bash` (depending on which image is in use, the location of ros_ws folder may change but it should be close to where the bash script started you).

@@ -63,32 +116,64 @@ Once the cluster is up, either with the cluster setup bash script or otherwise,
  * To check what nodes are currently running `rosnode list` can be used, this can be helpful for debugging.
  * If any of the above commands come back with a `Unable to communicate with ROS_MASTER` it is likely that the environment was not sourced correctly, or that `roscore` has not been run yet (i.e. the base station pod is not up or can not be detected from the current pod).

+### SSH Server Bash shell
+
+Alternatively, an SSH connection can be created if using the built images discussed above (as they include the OpenSSH server environment already set up). Note that the pods will need to be fully scheduled in the cluster and in the running state for the IPs to be available and the SSH servers running. Get the relevant IPs via:
+
+`kubectl get pods -o wide`
+
+- The pod in question can then be SSH'ed into via (the pod's IPs are usually in the form of 10.244.\*.\* depending on your particular cluster set up):
+
+`ssh ssher@<pod_ip>`
+
+If prompted for a password, use `password`, which is the default password for the `ssher` user for now (security is not important for this project, as you can see). There is also the problem of network identity for the pods: since the identity can change between deployments of the same pod (which is also why the IP has to be looked up each time), the SSH client on linux can sometimes throw a man-in-the-middle error. To get around this, delete the entry in the `~/.ssh/known_hosts` file as suggested by the error message. Instead of having the cluster setup script delete the known hosts file, we leave it to the user to delete specific entries for now.
+
+The advantage of using the SSH command instead of `kubectl exec ...` is that the `ssher` user specified already has the correct ROS environment sourced and ready to use. In normal use, only the base pod (also referred to as the master pod in some places) should be needed to control the cluster although both pods should be available via SSH.
+
+## Imaging ROS Test
+
+A relatively simple imaging pipeline has been built into the images for testing and characterization purposes. A USB webcam, or an action camera with a USB interface, is needed in the current configuration, although it can easily be switched to work with existing image files. For the pipeline, two ROS-nodes are launched: one capturing images from the camera and one saving those images to a networked file system (NFS). Note that placement only matters for the image capturing ROS-node, as it requires the camera or other image source to be present; the image saving ROS-node can be scheduled anywhere in the cluster, as the NFS and raw image are provided to the cluster as a whole.
+
+The steps to run the image pipeline are below:
+
+- Start the cluster and add the base and drone deployments from the [kubernetes folder][kubernetes_folder], discussed above.
+
+- Start an SSH session on the base pod, get the pod's IP via: `kubectl get pods -o wide` and start the session by: `ssh ssher@<base pod ip>`.
+
+- Run the imaging ROS-nodes via a roslaunch file from the base node: `roslaunch base_station image_test.launch`. The pods are already configured to have access to each other via services as well as the NFS available on the control-plane node so we don't need to specify anything more.
+
+If the pipeline is working, the NFS directory shared from the control-plane should be filling up with images. Not every image received is saved, so there may be a delay of a few seconds between images at the moment (kubernetes-based processing pipeline is TBD).
+
 ## Handy Troubleshooting Commands

 There are a number of non-intuitive kubernetes commands that are super helpful for specific tasks:

- drain a particular node so that it can be removed: `kubectl drain <node name> --delete-emptydir-data --force --ignore-daemonsets` then `kubectl delete node <node_name>`
+- Drain a particular node so that it can be removed: `kubectl drain <node name> --delete-emptydir-data --force --ignore-daemonsets` then `kubectl delete node <node_name>`

- get bash terminal inside pod: `kubectl exec -it <pod_name> -- bash`
+- Get bash terminal inside pod: `kubectl exec -it <pod_name> -- bash`

- get which pods are on which nodes (a travesty this one): `kubectl get pod -o=custom-columns=NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName --all-namespaces`
+- Get which pods are on which nodes (a travesty this one): `kubectl get pod -o=custom-columns=NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName --all-namespaces`

- another option to see pod location: `kubectl get pods --all-namespaces -o wide`
+- Another option to see pod location: `kubectl get pods --all-namespaces -o wide`

- regenerate token join command: `kubeadm token create --print-join-command`
+- Regenerate token join command: `kubeadm token create --print-join-command`. Regenerate join command and save it to clipboard: `kubeadm token create --print-join-command | xclip -selection clipboard`. Note that `sudo` is not required for the token commands.

- add pod cidr to remote nodes, sometimes this is required depending on which interface is used to add a node: `kubectl patch node <NODE_NAME> -p '{"spec":{"podCIDR":"<SUBNET>"}}'` from this really niche [github page][podcidr-help]
-`
+- Add pod cidr to remote nodes, sometimes this is required depending on which interface is used to add a node: `kubectl patch node <NODE_NAME> -p '{"spec":{"podCIDR":"<SUBNET>"}}'` from this really niche [github page][podcidr-help]
+
+- Another option to get individual pod cluster IPs: `kubectl get pods -o wide`. Alternate: `kubectl cluster-info dump | grep IP`
+
+- It has happened that the cluster networking fails entirely, making services unreachable from inside pods. The solution seemed to be to delete the flannel deployment on the cluster and re-apply it afterwards. This error can be especially difficult to detect.

 ## Links With More Information

 Debugging clusters can be somewhat difficult as everything is stuffed into pods, the links below have some steps to try:

- if `CrashLoopBackoff` is thrown when running `kubectl get pods`, take a look [here][crashloop]
- general steps for building images with buildx [here][docker-buildx]
- there's some specific steps that got ARM64 builds working in my case, solution [here][docker-buildx-platform]
+- If `CrashLoopBackoff` is thrown when running `kubectl get pods`, take a look [here][crashloop]
+- General steps for building images with buildx [here][docker-buildx]
+- There's some specific steps that got ARM64 builds working in my case, solution [here][docker-buildx-platform]

 [podcidr-help]: https://github.com/flannel-io/flannel/blob/master/Documentation/troubleshooting.md
 [crashloop]: https://managedkube.com/kubernetes/pod/failure/crashloopbackoff/k8sbot/troubleshooting/2019/02/12/pod-failure-crashloopbackoff.html
 [docker-buildx]: https://collabnix.com/building-arm-based-docker-images-on-docker-desktop-made-possible-using-buildx/
 [docker-buildx-platform]: https://github.com/docker/buildx/issues/464
+[kubernetes_folder]: kubernetes/
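
Editor's sketch for the NFS Setup section added to the README above: the three `/etc/exports` entries follow a fixed pattern, so they can be generated instead of typed by hand. The helper name `print_exports` and its argument order are hypothetical and not part of this repository; the default pod subnets are simply the two shown in the README example, and may differ on another cluster.

```shell
#!/usr/bin/env bash
# Sketch only: print_exports is a hypothetical helper, not part of this repo.
# Usage: print_exports <user_name> <control_plane_ip> [pod_subnet ...]
print_exports() {
    if [ "$#" -lt 2 ]; then
        echo "usage: print_exports <user_name> <control_plane_ip> [pod_subnet ...]" >&2
        return 1
    fi
    local user="$1"
    local host_ip="$2"
    shift 2
    # Assumption: fall back to the two pod subnets used in the README example.
    if [ "$#" -eq 0 ]; then
        set -- "10.244.0.0/24" "10.244.1.0/24"
    fi
    local share="/home/${user}/imageNFS"
    local subnet
    # One read-write export line per pod subnet.
    for subnet in "$@"; do
        printf '%s %s(rw)\n' "$share" "$subnet"
    done
    # The README example also exports to the control plane's own /24.
    printf '%s %s/24(rw)\n' "$share" "$host_ip"
}
```

The output could be reviewed and then appended to `/etc/exports` on the control-plane node, for example with `print_exports <user_name> <control_plane_ip> | sudo tee -a /etc/exports`. Running `sudo exportfs -ra` afterwards should make the NFS server re-read the file.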
--- a/clusterStartup.sh
+++ b/clusterStartup.sh
@@ -29,7 +29,7 @@ kubeadm token create --print-join-command | xclip -selection clipboard
 # save that file to clipboard
 # cat ./tmpJoinCommand.txt | xclip -selection clipboard
 # rm ./tmpJoinCommand.txt
-echo "set up done, join command copied (can ctrl-c now)"
+echo "set up done, join command copied"
 exit 0

 # dang pod cidr decided not to be assigned correctly, so follow links to assign it manually:

--- a/docker/base/Dockerfile
+++ b/docker/base/Dockerfile
@@ -7,6 +7,8 @@ RUN apt-get update && apt-get install -y libsdl2-dev libusb-1.0-0-dev build-esse
 RUN pip3 install git+https://github.com/catkin/catkin_tools.git
 RUN pip install opencv-python

+RUN echo "freshenst"
+
 # clone ros package repo
 ENV ROS_WS /opt/ros_ws
 RUN mkdir -p $ROS_WS/src
@@ -32,6 +34,9 @@ COPY ros_entrypoint.sh /usr/local/bin/ros_entrypoint.sh

 RUN useradd -m -s /bin/bash -p $(openssl passwd -1 password) ssher
 RUN usermod -aG sudo ssher
+# add ssher user to the video group to allow streaming
+RUN adduser ssher video
+RUN usermod -a -G video ssher

 RUN chmod 755 /usr/local/bin/ros_entrypoint.sh

@@ -44,9 +49,7 @@ RUN echo "ssh-keyscan -H service-drone >> /home/ssher/.ssh/known_hosts" >> /home

 # put rsa key in image
 RUN mkdir -p /home/ssher/.ssh
-              # COPY selfkey.pub /home/ssher/.ssh/authorized_keys
 RUN chown -R ssher:ssher /home/ssher/.ssh
-              # RUN chmod 600 /home/ssher/.ssh/authorized_keys
 RUN service ssh start
 EXPOSE 22


--- a/docker/drone/Dockerfile
+++ b/docker/drone/Dockerfile
@@ -38,6 +38,8 @@ RUN cmake .. \
 WORKDIR /usr/local/include/
 RUN git clone https://github.com/libigl/eigen.git && mv eigen/ Eigen3/ && cp -r ./Eigen3/Eigen/ ./Eigen/

+RUN echo "refreshest"
+
 # clone ros package repo
 ENV ROS_WS /opt/ros_ws
 RUN mkdir -p $ROS_WS/src
@@ -68,6 +70,10 @@ COPY ros_entrypoint.sh /usr/local/bin/ros_entrypoint.sh

 RUN useradd -m -s /bin/bash -p $(openssl passwd -1 password) ssher
 RUN usermod -aG sudo ssher
+# add ssher user to the video group to allow streaming
+RUN adduser ssher video
+RUN usermod -a -G video ssher
+
 RUN chmod 755 /usr/local/bin/bashCheckRoscore.sh && chmod 755 /usr/local/bin/ros_entrypoint.sh

 # make ssh-ing easier
@@ -100,9 +106,7 @@ RUN chown ssher /opt/ros/noetic/kube_setup.sh \

 # set up ssh stuff
 RUN mkdir -p /home/ssher/.ssh
-                      # COPY selfkey.pub /home/ssher/.ssh/authorized_keys
 RUN chown -R ssher:ssher /home/ssher/.ssh
-                      # RUN chmod 600 /home/ssher/.ssh/authorized_keys
 RUN service ssh start
 EXPOSE 22


--- a/kubernetes/drone-deployment.yaml
+++ b/kubernetes/drone-deployment.yaml
@@ -22,14 +22,9 @@ spec:
      - name: image-storage
        persistentVolumeClaim:
          claimName: image-volume-claim
-      # - name: video-source
-      #   persistentVolumeClaim:
-      #     readOnly: true
-      #     claimName: video-volume-claim
      - name: video-source
        hostPath:
          path: /dev/video0
-
      - name: ttyacm
        hostPath:
          path: /dev/ttyACM0
@@ -49,11 +44,6 @@ spec:
          mountPath: /home/ssher/imageNFS
        - name: video-source
          mountPath: /dev/video99
-        # volumeDevices:
-        # - name: video-source
-        #   devicePath: /dev/videoSource
-        # - name: video-source
-        #   mountPath: /home/ssher/videoSource
        image: llh/drone:v0
        command: [ "/bin/bash"]       # You need to run some task inside a
        args: ["-c", "source /opt/ros_ws/devel/setup.bash && sudo service ssh restart && /usr/local/bin/ros_entrypoint.sh && while true; do sleep 10; done;"] # Our simple program just sleeps inside
@@ -66,8 +56,8 @@ spec:
        - name: ROS_HOSTNAME
          value: service-drone
      nodeSelector:
-        kubernetes.io/hostname: pop-os
-        # kubernetes.io/hostname: neruda # temp for testing purposes
+        # kubernetes.io/hostname: pop-os # to run on control plane
+        kubernetes.io/hostname: neruda
 ---

 apiVersion: v1

--- a/kubernetes/videoPV.yaml
+++ b/kubernetes/videoPV.yaml
-apiVersion: v1
-kind: PersistentVolume
-metadata:
-  name: video-volume
-spec:
-  storageClassName: video
-  volumeMode: Block
-  capacity:
-    storage: 500Gi
-  local:
-    path: /dev/video0
-  accessModes:
-  - ReadOnlyMany
-  persistentVolumeReclaimPolicy: Retain
-  nodeAffinity:
-    required:
-      nodeSelectorTerms:
-      - matchExpressions:
-        - key: kubernetes.io/hostname
-          operator: In
-          values:
-          - pop-os
--- a/kubernetes/videoPVC.yaml
+++ b/kubernetes/videoPVC.yaml
-apiVersion: v1
-kind: PersistentVolumeClaim
-metadata:
-  name: video-volume-claim
-spec:
-  storageClassName: video
-  accessModes:
-  - ReadOnlyMany
-  volumeName: video-volume
-  volumeMode: Block
-  resources:
-    requests:
-      storage: 500Gi
--- a/kubernetes/videoSC.yaml
+++ b/kubernetes/videoSC.yaml
-apiVersion: storage.k8s.io/v1
-kind: StorageClass
-metadata:
-  name: video
-provisioner: kubernetes.io/no-provisioner
-reclaimPolicy: Retain
-allowVolumeExpansion: true
-volumeBindingMode: WaitForFirstConsumer