This blog details the work completed by Emma Pabich from Aachen, Ewoud Crolle from Delft and Karen Archer from Hatfield for DLRC 2018, which took place in September/October 2018. This challenge, initiated by Volkswagen Group AI Research, is an annual event that invites students to join for a month and work on a scientific problem combining machine learning and robotics. The contents of this post are not related to the research work done at https://argmax.ai.
The Deep Learning Research Challenge centres around teaching a Franka Emika Panda 7DoF Robotic Arm to classify sensor readings according to SELF, ENVIRONMENT and OTHER.
The Panda is equipped with an Intel RealSense depth camera mounted adjacent to the end effector. It captures RGB images at 480 x 640 pixels and depth information at 240 x 320 locations in the field of view of the camera. The upper limit of depth detection is 10m, with accuracy dropping off at short range and towards 10m. Given the lab environment, depth information has typically been thresholded to the region we are interested in.
In addition the Panda is also equipped with nine LiDAR single point sensors distributed equally on the fourth and sixth joint. These LiDAR sensors have a range of approximately 2m. If nothing is detected within this range, a value of 8191 is received. Throughout the project, these have been interpreted to mean no object detected within 2m, and therefore readings were capped at a value of 2m.
The classification of sensor readings into SELF, ENVIRONMENT and OTHER agents can have multiple purposes, including to complete a task in the working environment, avoid obstacles while performing that task or even detecting another robot in a collaborative setting. As an additional task we could choose our own "brownie task".
LiDAR classification into SELF has been achieved using a supervised approach, supplemented with an additional neural network to predict the expected mean LiDAR reading for a given joint configuration. For the RGB and depth information we have used another supervised approach, built on the convolutional encoder decoder SegNet [@@badrinarayanan2015segnet] which segments sensor output according to SELF using both the RGB and depth information, making the classification more robust then only using the RGB or the depth information separately. The classification into OTHER agents has been done by detecting motion in the environment using the pretrained neural network Flownet2 [@@ilg2017flownet]
As a brownie task we have elected that the robot clear up coloured Jenga blocks from the table and put them in a box. The rectangular blocks are identified using simple colour masking and the depth information is used to estimate location, size and orientation to enable the robot to complete the task.
We have two possibilities to move towards the blocks, first by the internal controller and then secondly by our implemented torque controller which can do collision avoidance while moving to an object. This collision avoidance is done by probabilistic task prioritisation.
Classifying LiDAR Sensors
There are nine single LiDAR sensors mounted on the Panda, on two of the upper joints, facing outwards typically symmetrical and perpendicular to the joint. The LiDAR reading supplies a relative distance from the sensor itself in a certain direction towards a maximum distance of 2m. Hence, the LiDAR sensors provide us a measure of distance, primarily used to gather real time information about the environment for the sake of collision avoidance. As a single measurement, it can not be used for recognising what the object is. This gives rise to one problem: some of the sensors give a close by reading when it detects a part of the robot itself. Acting on this reading as if there was an obstacle would cause the robot to try to move away from itself resulting in erratic behaviour.
Crazy Panda avoiding itself in simulation resulting in sporadic behaviour.
Transformation to World Frame
Because the LiDARs are mounted on the robot itself, the position and orientation changes with the motion of the robot. The position of each joint in cartesian coordinates can be calculated from the current angle of each joint using transformations and the Denavit-Hartenberg parameters supplied by the manufacturer. Additional transformations were applied to estimate the position of the LiDAR sensors by measuring the offset of each LiDAR compared to its adjacent joint. We estimated the direction of each LiDAR according to the axis of the coordinate system of the attached joint. Using this information LiDAR readings were calculated in the 3D real world coordinates relative to the base frame of the robot.
During normal operation of the robot, the end effector of the robot was constrained to remain within 'virtual walls', essentially a box directly in front of the robot arm. These constraints were adhered to when collecting data and later designing a suitable task.
Supervised Classification of Self
An analytical solution of the space occupied by the robot would be intractable given the dependency on the joint angles and complicated curves and geometry of the Panda. An approximation would be to model the links as cylinders, however this would be time consuming without any guarantees of sufficient accuracy. Knowing the position and direction of LiDAR sensors in the world frame we favoured an approach of developing heuristics to establish whether a LiDAR was directed at itself. Similiar heuristics were used when considering output from the depth camera as discussed later.
These heuristics enabled us to develop supervised classification of the LiDAR sensors using the joint angles as input. Each LiDAR reading was then classified as to whether the Panda sees itself or not. We were therefore able to automatically label sufficient LiDAR data to train neural networks. The heuristics based in the 3D world coordinates are as follows:
Given the maximum distance of the LiDAR sensors, we ensured the visible environment was empty except for the table and mounting block for the robot.
LiDAR readings and joint angles were recorded while the Panda moved to random locations within the virtual wall constraints.
All data where LiDAR sensor distances exceeded 60 cm were labeled as ENVIRONMENT.
All data where the calculated 3D height was below 0 cm was the table and hence labeled as ENVIRONMENT.
All remaining data was labeled as SELF.
Classification was achieved using sequential feedforward networks with three dense layers of size 7, a dropout layer and a final layer with sigmoid activation function. During training the class weights were balanced since seeing itself while moving within the virtual walls was a rare event and in steep minority in comparison to events when it did not see itself. Without this, the network simply learnt to never predict seeing itself.
For this approach to be successful, it was crucial that the environment was empty, anything detected closer than 60 cm and above table level was classified as self. The figures below show the clear and consistent peaks in the distribution of LiDAR readings in an empty environment. In addition, the training data was augmented with LiDAR readings artificially shortened to represent objects in the environment, enabling the network to learn rules and generalise to a cluttered environment. Later data was collected where particular LiDAR sensors were obstructed and used in combination with other data to train final versions of the neural network classifiers.
Pearson Correlation Coefficients were calculated to test the association between LiDAR sensors. A value of zero implies no correlation and 1 strongly correlated. As the maximum value is around 0.3, there appears to be little or no correlation between LiDAR sensors. Thus it made sense to treat LiDARs separately and model with separate neural networks.
Unsupervised Prediction of LiDAR readings
Another set of feed forward networks were trained for each sensor to predict the mean of the expected LiDAR reading in the absence of objects, assuming that the data is normally distributed with a variance of 1. The inputs to the network are the 7 joint angles, training continued over 100 epochs with the learning rate decaying over the training period, maximising the likelihood of achieving the ground truth LiDAR value given the predicted mean and an assumed variance of 1.
Observing LiDAR output in a constant joint angle configuration showed several significant outliers over time. In order to ensure the accuracy of the training data when learning to predict actual LiDAR readings, batches of LiDAR data were recorded and the median of these was taken to represent the expected, ground truth LiDAR reading for each joint configuration randomly selected from within the virtual walls.
The figure below shows the results of visualising the classification and LiDAR reading prediction. There is a high degree of accuracy when the end effector is within the virtual walls, where training data was collected. As anticipated, a lower level of accuracy is achieved outside the virtual walls, however it is still able to approximate readings and predict values within around 20cm. Predicting self using the classified model was tested and trained in awkward positions where it achieves an acceptable level of accuracy.
When testing the model on the real robot we noticed that it worked well for the usual configurations, but did not perform so well in more awkward configurations. This can be explained by the lack of training data in these configurations. We improved the model by added some more of the unusual cases.
The Panda has two basic control modes. The first is POSITION CONTROL, in which the user provides a desired end-effector position in cartesian coordinates and optionally a quaternion-orientation. An internal controller is used to move to this target position. The second possibility is JOINT CONTROL. For this control you either send joint positions or torques for each joint. For our collision avoidance it was necessary to control the robot in joint space which required that we write a torque controller for the joints.
Collision avoidance means doing two tasks at once, reaching a goal state while at the same time avoiding obstacles. In our approach these obstacles are detected by the seven LiDAR sensors distributed on the robot joints. This is achieved by computing the torques to reach the target position while conducting collision avoidance in the nullspace of the joints. This means that even though the robot has reached the target position with its end-effector it still has degrees of freedom available in the joints. The joints can then be moved without moving away from the target position. Since both tasks cannot be executed with the same accuracy priorities are defined for the tasks. The task with the lower priority will then be executed with a lower accuracy. For the prioritisation we have used Bayesian task prioritisation as defined in [@paraschos2017probabilistic]. This allows the computation of the combined torque control signal for multiple tasks, which in our case is reaching a target position and avoiding obstacles seen by the LiDAR sensors.
In our approach we consider obstacles that are closer than 30 cm. Whenever a LiDAR sensor has a close reading it defines a repulsion field from this reading. The repulsion force depends on the distance from the obstacle, a closer obstacle results in a stronger force. The LiDAR sensors are always approximately along one axis of the respective joint coordinate system, thus the force is also only along that axis. The video below shows obstacle avoidance working.
Video showing the Panda avoiding obstacles simulated by blocking the LiDAR sensors.
Communication between the object detection and the control is via the internal messaging system. The object detection sends a target position to the torque control. The torque control then sends back commands to the gripper control once the target is reached.
When implementing joint control in this way, several issues were overcome. We learnt that the controller is very sensitive to its parameters. Finding the correct parameters to give it enough torque so that it actually reaches the the target position but at the same time still reacts avoids obstacles with enough force was very difficult. Additionally a torque that might be sufficient to move from the initial position (go0) to a target position might not be sufficient to move from a different position to the same target. The parameters are especially hard to define when specifying the end-effector orientation in our target position in order to be able to grasp an object.
Our controller is designed on the principle of a coiled spring, the greater the distance to the goal point, the stronger the force that attracts it to this point. Since the force then decreases the closer we get to the target, we have found the need to increase the proportional gain of our controller once we are very close to the target position.
Controlling the torques directly presents a safety risk which was reduced by closely limiting the applied torques. We are restricting both the change in the torques with a rate limit and the actual maximum torque.
Task Object Detection
To be able to perform our chosen "brownie" task it is necessary to distinguish Jenga blocks within the task environment. The Jenga blocks were chosen with have a distinct colour and they are all the same shape. The detection system uses the RGB and depth information to estimate a target position in the world coordinate frame as well as the orientation to be able to pick up the block.
Region Detection using Colours
Regions of colours matching the blocks are identified using HSV colour masks where the hue is separated from the saturation and value (a measure for darkness). For the three coloured blocks a range was determined to identify the blocks in a range of lighting conditions, while still excluding surroundings and shadows. This principle works very well in a regulated situation where the surroundings don't have the same colour as the blocks.
To remove the visual artifacts a filter is applied which removes all the blobs smaller than a threshold value. Size is estimated based on the coordinates of region bounding boxes in 3D world coordinates, and large items are ignored. In the case where blocks are close or overlapping, the blocks could be considered to be one region and the centroid of the region will no longer be centred on one block, but possibly the space between two blocks. This was considered ancillary to the central task of classifying sensors.
Region Detection using Depth Information
Block detection would be more robust if depth information was incorporated, helping in situations where the background has a similar colour to that of the blocks. However, owing to the size of the Jenga blocks, there is a relatively small change in depth in contrast to other depth changes within the environment. Segmentation of the depth information was attempted using the Felzenswalbe algorithm [@@felzenszwalb2004efficient]. This worked well on large items, however when selecting small blocks, the small scale parameter resulted in over segmenting the table and even Jenga blocks. In cases when the table sloped away in the field of view, the sloping gradients resulted in striated regions. Strategies attempted to incorporate depth segmentation did not improve upon the HSV region detection.
HSV colour masks are converted into separate regions using the Scikit Image library. These are supplemented with simple algorithms to estimate the size of objects once the bounding box and centroid of each region proposal are converted to coordinates in the camera frame using the camera's intrinsic matrix. The depth information obviates the need for a stereo camera reconstruction. These points are then further transformed to the 3D coordinates in the world frame relative to the base frame of the robot using the transformation matrices and the current robot joint angles.
We choose to pickup the object that is closest to the camera, based on the lowest distance read at the pixel location identified as the centroid of each region. This should make sure that we keep focussed on the same object while moving in to grab it. In practice this method does not guarantee the robot keeps moving to the same object because the camera has an offset to the gripper and therefor the closest object is subject to change. The robot would also not detect any changes to the blocks as no tracking has been implemented.
It is important to know the orientation of the block to align the gripper to be able to grasp the object. We assume that the objects lie flat on the table so we only have to look for an angle in the table frame, grasping the object on its shortest axis. This axis is identified by establishing the edge points of a chosen object and then selecting the point closest to the centroid. The vector between this point and the centroid gives the required gripping direction.
Controlling robot to achieve task
We use the built in position and orientation control from the Panda. To make sure the object is not hit from the side we first position the gripper right above the object. At this stage, the field of view is re-segmented, taking a closer look to better estimate the position and required orientation to successfully grasp the Jenga block.
Video of Panda carrying out the Jenga clearing task.
There are various ways in which object detection could be improved for future work, including:
Refine the regions by identifying rectangular shapes within region proposals to identify individual Jenga blocks.
Track objects of interest.
Make use of a neural network such as the pre-trained network YOLOv3 to detect task objects.
Segmenting Image Data
One of the objectives of the challenge was to classify image data pixel-wise which has been achieved for SELF segmenting RGB-D image information. A similar pixel-wise approach was taken to classify OTHER using motion as discussed in the next section.
SegNet is a convolutional autoencoder developed by members of the Computer Vision and Robotics Group at the University of Cambridge [@@badrinarayanan2015segnet]. The figure below shows the network architecture. The network learns to predict pixel-wise classification of images through supervised learning.
Creating training masks
The basis of automatically creating the training masks is done in the same fashion as with the Lidars. In this case the depth camera can be seen as a large collection of lidar sensors. The depth image is transformed into a 3d map of pixels in the frame of the base of the robot. Everything that is far away is removed just like everything below the table level. The remaining pixels are transformed back to the pixel frame which gives a mask image of the robot. The masks were further refined using colour masking to remove areas where the base of the robot was included in the mask, examples are shown below. RGB images were downsampled to ensure the same resolution across all images. Depth images were converted to single channel grayscale encoding.
Images were collected to ensure an adequate proportion of images with the Panda in the field of view of the camera. These were scaled ensuring an upper limit of 1.0 for the RGB-D channels and used to train the SegNet network without using any pre-trained weights.
Results using only the RGB data were susceptible to classifying other fixtures in the environment falsely as the Panda. To counter this, a parallel SegNet architecture was implemented to process the depth information, similar to an approach taken by [@eitel2015multimodal]. The two networks were concatenated and combined with a final fusion layer before predicting the pixel-wise one hot encoding of not self and self.
Pixelwise segmentation is demonstrated in the visualisation of the Panda field of view as shown above. The RGB network learnt features based on colour and texture, where as the depth network learnt features based on shape, particularly the distinctive curvature of the elbow joint. As to be expected, it can be fooled by objects exhibiting similar features.
It would be more beneficial if the depth and RGB information could be learnt in the context of the current configuration of the robot, specifically the joint angles of the robot. The Panda may then be able to learn to distinguish between itself and another Panda, or even another object with properties similar to the features that have been learnt by the network. To achieve this, the joint angles could be incorporated as inputs after the encoder to ensure sufficient emphasis is placed on the relatively few inputs representing the state of the robot.
Identifying Other agents
Identifying other agents in the environment is an important aspect of collaborating robots. In our approach another robot is defined by the fact that it is moving. A stationary robot can be treated as any other object in the environment. Once it starts moving however the robot has to be a lot more careful around it and as a later step maybe even plan its trajectory differently. This is why we wanted to be able to detect motion in the environment. We have then used a pre-trained neural network called FlowNet2 [@@ilg2017flownet] developed by Freiburg to detect the movement between to camera frames.
The video below shows the detected flow of a moving robot. On the left you can see the Panda picking up objects and on the right the detected flow, which can be interpreted as the velocity in x and y direction. To visualize it the output flow is graded by both magnitude and direction of velocity.
Video showing the detection of motion whilst the Panda is picking up blocks.
One disadvantage of FlowNet2 is that it can only detect movement if the camera itself is stationary. Deducting the camera movement from the frames as a preprocessing step might be possible for small movements of the camera but is not feasible in our scenario.
This means that we would have to stop the robots movement every now and again to scan the environment for moving objects. If the robot detects a moving object close to its next target position it should then plan its movement slower and stop more often to check for moving objects.
Even though the detection of movement works quite well for us, we have not yet integrated this classification into our brownie tasks since we are not collaborating with another robot yet. For future work it would be very interesting to try to add as a part of the collision avoidance.
Conclusions and Future Work
While completing a task with the Panda we have been successful in using deep learning to classify SELF pixel-wise in images using SegNet, as well as with each of the 9 LiDAR sensors using a combination of feedforward neural networks. We are able to estimate live LiDAR readings anticipated within the default ENVIRONMENT. Based on this information, we can deduce OTHER objects in the environment. Future work includes combining the task and joint control modes into one controller, where this classification can be used to inform collision avoidance. In the absence of knowing SELF, the Panda would attempt to avoid itself resulting in unpredictable behaviour.
Detecting movement provides a strategy for detecting OTHER agents. We are able to detect motion in the field of view of a stationary end effector using FlowNet2 returning pixel wise classification of OTHER based on movement. This was demonstrated using footage of our Panda completing a task to show how this can be used in practise. Future work includes incorporating this into trajectory planning to the point where collaborating robots could apply turn taking when completing a task. There are additional demonstrations using footage from the Panda depth camera.