Processing 10000 Pictures Using Many Computers With Oropo (Debian/Ubuntu)

Submitted by ms209495 on Mon, 2010-04-26 15:36.


Introduction

Have you ever had a lot of data to process? After a while of processing you realize that it will take ages to complete. It would be faster if you could use two, three, or even more computers. You might think that using several computers means a lot of configuration - you would be wrong. With Oropo it's easy. Let's see.

It's difficult to talk about processing without an example, so let's discuss the problem of processing a large number of pictures. The first approach is to process the pictures sequentially on one computer. The second approach is to process them in parallel on many computers.

 

Problem description

The problem is to process 10000 pictures. Each picture is in high quality; the goal is to create a smaller version of each one. The libjpeg library provides suitable programs.

 

Useful programs from libjpeg:

djpeg - decompress a JPEG file to an image file

cjpeg - compress an image file to a JPEG file

Script signature for processing a single picture:

  • argument: path to the picture

  • result: smaller version of the picture

Sample script in bash:

 

Script make_smaller.sh

#!/bin/bash
# Resize one JPEG: decompress it, then recompress it at lower quality.
QUALITY=30
if [ $# -ne 1 ]; then
	echo "usage: $0 picture.jpg" 1>&2
	exit 1
fi
FILE_PATH=$1
djpeg "$FILE_PATH" | cjpeg -quality $QUALITY

 

Processing sequentially

All pictures can be processed by invoking the make_smaller.sh script for each picture.

 

Processing sequentially

#!/bin/bash
MAKE_SMALLER=$PWD/make_smaller.sh
IMGS_DIR=$PWD/imgs
TARGET_DIR=$PWD/imgs_smaller
mkdir -p "$TARGET_DIR"
for file in "$IMGS_DIR"/*; do
	bash "$MAKE_SMALLER" "$file" > "$TARGET_DIR/${file##*/}"
done
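The target filename is built with bash parameter expansion: ${file##*/} removes the longest prefix matching */, leaving only the file name. A quick illustration (the path is a hypothetical example):

```shell
file=/home/user/imgs/photo.jpg
echo "${file##*/}"    # prints just the basename: photo.jpg
```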

 

Processing parallelly

We can process all the pictures using the Oropo Executor system. Tasks for processing pictures are added to a queue and processed in parallel on many computers. Each picture is processed with the make_smaller.sh script.

 

Processing parallelly

#!/bin/bash
MAKE_SMALLER=$PWD/make_smaller.sh
IMGS_DIR=$PWD/imgs
for file in "$IMGS_DIR"/*; do
	oropo-system-pusher -p "string:bash" -p "path:$MAKE_SMALLER" -p "path:$file"
done

Processing results can be found in /var/lib/oropo/response/*/0 files.
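The result files are named after Oropo's internal task IDs, so you may want to copy them back under one directory. A minimal collection sketch, assuming each /var/lib/oropo/response/&lt;task-id&gt;/0 file holds one resized picture (the function name and target directory are hypothetical):

```shell
# Copy every response file <resp_dir>/<task-id>/0 to <target_dir>/<task-id>.jpg
collect_results() {
    resp_dir=$1
    target_dir=$2
    mkdir -p "$target_dir"
    for result in "$resp_dir"/*/0; do
        [ -f "$result" ] || continue   # skip if the glob matched nothing
        id=${result%/0}                # drop the trailing /0 ...
        id=${id##*/}                   # ... and the leading directories
        cp "$result" "$target_dir/$id.jpg"
    done
}
# Usage: collect_results /var/lib/oropo/response "$PWD/imgs_smaller"
```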

 

Summary

The previous sections presented two approaches to processing pictures: the first uses a single computer, the second uses many. Deploying either solution takes almost the same effort, but with the second approach the processing completes faster.

 

Oropo Project

General

Oropo Project home page: http://www.oropo.org.

 

Installation

To install Oropo you need to install Oropo System on the central node and Oropo Executor on each node that will be used for processing (which may include the central node).

Oropo packages are located in the oropo repository; the following steps make the packages installable.

 

Configuration on each node:

Add this entry to /etc/apt/sources.list file:

deb http://students.mimuw.edu.pl/~ms209495/oropo/debian sid main

Execute command:

apt-get update

 

Installing Oropo System on central node:

Execute command:

apt-get install oropo-system

 

Installing Oropo Executor on processing nodes:

Execute command:

apt-get install oropo-executor

 

Configuration

Configuration on central node:

Add yourself to the oropo group to get sufficient permissions (run as root; log in again for the group change to take effect):

adduser `whoami` oropo

Add the processing nodes' addresses to the Oropo System:

oropo-monitor-ctl --id_prefix oropomonitor --add node1_ip_address

oropo-monitor-ctl --id_prefix oropomonitor --add node2_ip_address

oropo-monitor-ctl --id_prefix oropomonitor --add nodeN_ip_address
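Registering many nodes can be scripted. A sketch using the same oropo-monitor-ctl invocation as above; the addresses are hypothetical, and echo only prints each command so you can review it first (drop the echo to actually run them):

```shell
# Hypothetical processing-node addresses - replace with your own
NODES="192.168.1.10 192.168.1.11 192.168.1.12"
for node in $NODES; do
    echo oropo-monitor-ctl --id_prefix oropomonitor --add "$node"
done
```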


Submitted by Ole Tange (not registered) on Tue, 2010-06-15 01:09.

Using GNU Parallel (http://www.gnu.org/software/parallel/) it would look like this:

find imgs | parallel 'djpeg {} | cjpeg -quality 30 > {.}.smaller.jpg'

To run on the local machine and multiple remote computers with one job per CPU core, transferring data to and from the remote computers, do:

find imgs | parallel -j+0 --trc {.}.smaller.jpg -Scomputer1,computer2,: 'djpeg {} | cjpeg -quality 30 > {.}.smaller.jpg'

Watch the intro video for more http://www.youtube.com/watch?v=LlXDtd_pRaY