Project stereo

From MAE/ECE 148 - Introduction to Autonomous Vehicles
Assembled car

The goal of this project was to implement stereo vision in the Donkeycar framework using two USB cameras, train the neural network to drive our robot car around a track, and compare its performance to that of a Donkeycar vehicle trained on the same track with only one camera.

Mounting USB Cameras to the Car

Closeup of mount

A 3D printed mount was created for the front of the car to hold both USB cameras. The mount was designed so that the seats for both cameras would move as a single unit, pointing them at the same angle and minimizing the possibility of them pointing in different directions.

The mount also contains screw holes in the center that were intended to attach the original Picamera mount that we used in the first half of the quarter. See the section "Possible Improvements" for more details on how this was meant to be used.


The template file provided by Tawn Kramer (here) already contains code to implement stereo vision with the USB cameras. This code uses the pygame library, which does not come preinstalled in the Donkeycar framework. To install pygame, log onto the RPi and type in the following:

  > sudo apt-get install python-pygame

After this, create a vehicle with the template file, then change into the new directory:

  > donkey createcar --path ~/stereoCar --template tawn_donkey
  > cd stereoCar

Note: stereoCar is the name of the new vehicle; replace it in the commands above with a name of your choosing. Next, calibrate the steering and throttle as shown in the class tutorial, with the additional step of changing the following parameters in the vehicle's configuration file:

  IMAGE_W = 160
  IMAGE_H = 120

Now Donkeycar is set to use stereo vision. Make sure the cameras are plugged into the USB ports of the RPI, and start driving the car to collect data by typing the following:

  > python manage.py drive --camera=stereo

To train the neural network with your data, follow the instructions given in the class tutorial. To run the model on your RPi, remember to add the "--camera=stereo" flag:

  > python manage.py drive --model=models/*your model goes here*.h5 --camera=stereo

Tawn Kramer's stereo code works by taking the images from the two stereo cameras (Left, Right), converting them to monochrome, producing an image equal to the difference of intensities (Left minus Right), then layering the three images and saving them as a single RGB image. This allows Donkeycar to utilize two cameras without having to change any other code in the architecture.

[Figure: left camera image, right camera image, difference image, and resulting combined image]

The combined images are saved into tubs in the data directory, just as images taken by the single Picamera would be.


The implementation of stereo appears to have substantially increased the efficiency of training the car as well as the speed at which it drives autonomously. Initially, when training the car with only the Picamera, about 60 laps (68,891 images) were needed before the car could finally drive autonomously at 10% - 15% of the max throttle speed. In addition, autonomous driving only worked for a specific position of the sun, around 2:00 pm. In contrast, implementing stereo vision appears to give a much more robust model for training the car. Only 40 laps (40,367 images) were needed to achieve autonomous operation, and the average speed when driving autonomously increased to 20% of the max throttle speed. Furthermore, the car was able to drive autonomously at a variety of times throughout the day even though data had only been collected on heavily overcast mornings.

Depth Perception

The tawn_donkey stereo code uses the difference of the two camera images as one of the layers of the saved image. This proved useful in our experiment, as the markings on the track contrasted well with the surrounding pavement. However, on other tracks or in other environments this might not provide enough data for the neural network to get a sense of its surroundings. Our group proposed that replacing the difference layer with a depth map would be more versatile.

A depth map is an image produced from a pair of stereo images in which the intensity at a given point is proportional to the distance from that point to the cameras.

[Figure: left camera image, right camera image, and resulting depth map]

Source: University of Tsukuba

To expand on the functionality of the stereo code, our team attempted to implement depth perception, saving the resulting depth map as the third layer of each saved image.


OpenCV already has built-in functions that implement depth mapping from two images.

The code that our group wrote consists of three files: a capture script, a calibration script, and a depth-mapping script.


1. The capture script

  > Basic procedure
    1. set up the VideoCapture objects
    2. set up the storage path
    3. take a set number of photos with both the left and right cameras; use cv2.VideoCapture.grab() followed by retrieve() so that the photos from the two cameras are taken simultaneously
    4. exit and release the cameras


2. The calibration script

  > Basic procedure
    1. set up the parameters for calibration and rectification (it is easiest to copy these directly from existing example code, then modify them slightly for your own use)
    2. set up the objectpoint array (which will contain the position of each corner) and the imagepoint array (which will contain the detected chessboard corners themselves)
    3. run findChessboardCorners and cornerSubPix on the captured images, then append the corners you get to the imagepoints
    4. run calibrateCamera on both cameras, plug their output into stereoCalibrate, plug that output into stereoRectify, and finally plug that output into initUndistortRectifyMap to produce the distortion maps
    5. output the maps as a set or a tuple

3. The depth-mapping script (or its alternate, if you have the older version)

  > Basic procedure
   1. set up the stereoMatcher object and its parameters (no VideoCapture object is needed unless you are testing this script independently of the Donkeycar library)
   2. use the photos you receive from Donkeycar and the calibration maps to rectify them, then grayscale them
   3. call stereoMatcher.compute() to produce the disparity map, then use a trivial equation (you need the camera focal length and the baseline distance) to produce the depth map
   4. to visualize it, divide it by some visualization scale and imshow it; return the original depth map to the Donkeycar pipeline

Comment and Caution

1. When capturing the calibration photos, move the camera so that the chessboard is inside every photo, but not in exactly the same position in each one.

2. The objectpoint array needs to be in a certain format. Be sure to set it up so that it fits your chessboard.

3. In some cases the original images are needed, and in other cases the grayscale images are needed.

4. You may want to find a stereoMatcher tuner on Linux to make sure you are using the right parameters.

5. The image size in the Donkeycar library is 160x120. You need to expand the images to a larger scale in order to fit the calibration maps.

6. If the output is terrible, do not focus on trivial details such as the termination or interpolation criteria; there is probably a fundamental parameter error.

7. cv2 has many versions, so function names and syntax may differ. Please do not waste your time copying until something works (T-T). Use help(cv2) directly to get the right function names and syntax.

8. Please do not be scared by these maps/photo transformations; they are only NumPy arrays or matrices.

9. If you follow our code, the biggest problem you will meet is speed when the Donkeycar library is running.

Possible Improvements

For a more accurate comparison of the performance between the stereo and mono regimes, we considered using the Picamera concurrently with the two USB cameras to record data through all three cameras simultaneously. This would ensure that both the stereo and mono models were trained in the same conditions with similar data, giving us a mono model that would act as a proper control group. We didn't have the time to program this functionality, but it might be something worth implementing in future projects.

The code we wrote for depth perception was functional, but it ran too slowly on the Raspberry Pi hardware. Future teams could work to improve our existing code, or implement depth perception on faster hardware instead.