eHive production system - Long multiplication example pipeline

Long multiplication pipeline as our toy example

  1. The long multiplication pipeline solves the problem of multiplying two very long integers by pretending the computation has to be done in parallel on the farm (roughly: one multiplier is split into digits, each partial product is computed as a separate subtask, and the shifted partial products are then added together). While performing the task it demonstrates the use of the following features:
    1. A pipeline can have multiple analyses (this one has three: 'start', 'part_multiply' and 'add_together').
    2. A job of one analysis can create jobs of other analyses by 'flowing the data' down numbered channels or branches. These branches are then assigned specific analysis names in the pipeline configuration file (one 'start' job flows partial multiplication subtasks down branch #2 and a task of adding them together down branch #1).
    3. Execution of one analysis can be blocked until all jobs of another analysis have been successfully completed ('add_together' is blocked both by 'start' and by 'part_multiply').
    4. As filesystems are frequently a bottleneck for big pipelines, it is advised that eHive processes store intermediate and final results in a database (in this pipeline, 'intermediate_result' and 'final_result' tables are used).
  2. The pipeline is defined in 4 files: the pipeline configuration module PipeConfig/LongMult_conf.pm and three runnable modules, RunnableDB/LongMult/Start.pm, RunnableDB/LongMult/PartMultiply.pm and RunnableDB/LongMult/AddTogether.pm (one per analysis), all under modules/Bio/EnsEMBL/Hive in the ensembl-hive checkout.
  3. The main part of any PipeConfig file, the pipeline_analyses() method, defines the pipeline graph whose nodes are analyses and whose arcs are control and dataflow rules. A simplified sketch of such a method is shown right after this list.
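
To give a flavour of how the three analyses and their rules are wired together, here is a minimal sketch in the spirit of pipeline_analyses() from LongMult_conf.pm. It is simplified and abridged: the option keys (-logic_name, -module, -input_ids, -flow_into, -wait_for) are standard eHive PipeConfig vocabulary, but the input values shown are just examples and the real file may differ in detail.

        sub pipeline_analyses {
            my ($self) = @_;
            return [
                {   -logic_name => 'start',
                    -module     => 'Bio::EnsEMBL::Hive::RunnableDB::LongMult::Start',
                    # seed the pipeline with one job (parameter names and values are illustrative):
                    -input_ids  => [ { 'a_multiplier' => '9650156169', 'b_multiplier' => '327358788' } ],
                    -flow_into  => {
                        2 => [ 'part_multiply' ],    # partial multiplication subtasks flow down branch #2
                        1 => [ 'add_together' ],     # the summing task flows down branch #1
                    },
                },
                {   -logic_name => 'part_multiply',
                    -module     => 'Bio::EnsEMBL::Hive::RunnableDB::LongMult::PartMultiply',
                },
                {   -logic_name => 'add_together',
                    -module     => 'Bio::EnsEMBL::Hive::RunnableDB::LongMult::AddTogether',
                    -wait_for   => [ 'start', 'part_multiply' ],   # control rule: blocked until both analyses finish
                },
            ];
        }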

Initialization and running the long multiplication pipeline

  1. Before running the pipeline you will have to initialize it using the init_pipeline.pl script, supplying the PipeConfig module and the necessary parameters. Have another look at the LongMult_conf.pm file. The default_options() method returns a hash that pretty much defines which parameters you can or should supply to init_pipeline.pl (a sketch of this method is shown at the end of this step). You will probably need to specify the following:
    
            $ init_pipeline.pl Bio::EnsEMBL::Hive::PipeConfig::LongMult_conf \
                -ensembl_cvs_root_dir $ENS_CODE_ROOT \
                -pipeline_db -host=<your_mysql_host> \
                -pipeline_db -user=<your_mysql_username> \
                -pipeline_db -pass=<your_mysql_password>
    
    This should create a fresh eHive database and initialize it with the long multiplication pipeline data (the two numbers to be multiplied are taken from the defaults). Upon successful completion init_pipeline.pl will print several beekeeper commands and a mysql command for connecting to the newly created database. Copy and run the mysql command in a separate shell session to follow the progress of the pipeline.
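    The options you pass on the command line correspond to keys of the hash returned by default_options(). As a rough illustration (a simplified sketch, not the literal contents of LongMult_conf.pm; keys and default values may differ):

            sub default_options {
                my ($self) = @_;
                return {
                    # root directory of your ensembl checkouts (illustrative default)
                    'ensembl_cvs_root_dir' => $ENV{'HOME'}.'/work',

                    # connection parameters of the eHive database to be created;
                    # the nested hash is why '-pipeline_db -host=...' works on the command line
                    'pipeline_db' => {
                        -host   => 'localhost',
                        -port   => 3306,
                        -user   => 'ensadmin',
                        -pass   => '',
                        -dbname => 'long_mult_pipeline',
                    },
                };
            }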
  2. Run the first beekeeper command, the one that contains the '-sync' option. This will initialize the database's internal stats and determine which jobs can be run.
  3. Now you have two options: either run beekeeper.pl in automatic mode using the '-loop' option and wait until it completes, or run it in step-by-step mode, initiating every step by a separate execution of the 'beekeeper.pl ... -run' command. We will use the step-by-step mode in order to see what is going on.
  4. Go to the mysql window and check the contents of the analysis_job table:
            MySQL> SELECT * FROM analysis_job;
    
    It will only contain the jobs that set up the multiplication tasks, in 'READY' state - meaning 'ready to be taken by workers and executed'. Go to the beekeeper window and run the 'beekeeper.pl ... -run' command once. It will submit a worker to the farm that will at some point get the 'start' job(s).
  5. Go to the mysql window again and check the contents of the analysis_job table. Keep checking, as the worker may spend some time in the 'pending' state. After the first worker is done you will see that the 'start' jobs are now done and new 'part_multiply' and 'add_together' jobs have been created. Also check the contents of the 'intermediate_result' table; it should be empty at this point:
            MySQL> SELECT * FROM intermediate_result;
    
    Go back to the beekeeper window and run the 'beekeeper.pl ... -run' command for the second time. It will submit another worker to the farm that will at some point get the 'part_multiply' jobs.
  6. Now check both the 'analysis_job' and 'intermediate_result' tables again. At some point the 'part_multiply' jobs will have completed and their results will have gone into the 'intermediate_result' table; the 'add_together' jobs are still to be done. Check the contents of the 'final_result' table (it should be empty) and run the third and last round of 'beekeeper.pl ... -run'.
  7. Eventually you will see that all jobs have completed and the 'final_result' table contains the final result(s) of the multiplication. To get an idea of what the workers were actually running, see the sketch of a runnable module below.
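
For completeness, here is a rough sketch of what one of the runnable modules behind these jobs looks like. It is heavily simplified and assumes the standard eHive runnable lifecycle (fetch_input/run/write_output) and the param() accessor of Bio::EnsEMBL::Hive::Process; the package name and the helper function are illustrative, not the actual contents of PartMultiply.pm.

        package LongMultSketch::PartMultiply;   # illustrative; the real module is Bio::EnsEMBL::Hive::RunnableDB::LongMult::PartMultiply

        use strict;
        use warnings;
        use base ('Bio::EnsEMBL::Hive::Process');

        sub fetch_input {      # stage 1: fetch/check this job's input parameters
            my $self = shift;
            # nothing to fetch in this sketch
        }

        sub run {              # stage 2: the actual computation
            my $self = shift;
            my $a_mult = $self->param('a_multiplier');
            my $digit  = $self->param('digit');
            $self->param('partial_product', multiply_by_digit($a_mult, $digit));
        }

        sub write_output {     # stage 3: store the result
            my $self = shift;
            # the real module inserts the partial product into the 'intermediate_result' table here
        }

        # schoolbook multiplication of a long number (a string of digits) by a single digit
        sub multiply_by_digit {
            my ($number, $digit) = @_;
            my ($carry, $result) = (0, '');
            foreach my $d (reverse split //, $number) {
                my $prod = $d * $digit + $carry;
                $carry   = int($prod / 10);
                $result  = ($prod % 10) . $result;
            }
            return $carry ? $carry . $result : $result;
        }

        1;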
