Tuesday, April 26, 2016

Introduction to Java 8 Streams Part 2 - Parallel Streams



We have seen in the first part of this series how to use streams which are considered one of the key features of Java 8. In this post, we will walk through a particular feature of streams: Parallel streams. To give a rather simplistic definition, parallel streams involves doing the work in parallel using more resources to finish faster. They can be used by calling a magic method: parallel(). Java 8 hides all the complexity behind dividing the work and parallelizing it which relieves the developer from carring about things like threads and concurrency. However, many have reservations about using parallel streams for several reasons. Before going into the pitfalls of parallel streams, let's go through a quick example that illustrates their "good" side. Suppose we want to sum all even integers from 1 to one billion, one way of doing it using streams would be:

 static long sum = 0;
    
    public static void main(String[] args){
        
      IntStream.rangeClosed(1, 1000000000)
                .filter(e -> e % 2 == 0)
                .forEach(e -> addToSum(e));
                
    }
    
    public static void addToSum(int num){
      sum += num;
        
    }
The excution time gives:


adding paralllelization to our stream:

static long sum = 0;
    
    public static void main(String[] args){
        
      IntStream.rangeClosed(1, 1000000000)               
                .parallel()
                .filter(e -> e % 2 == 0)
                .forEach(e -> addToSum(e));
                
    }
    
    public static void addToSum(int num){
      sum += num;
        
    }

The excution time gives:

Notice that using parallel(), we have reduced the execution from 27 seconds to 18 seconds which seems like a legitimate advantage in the favor of parallel streams. Let's try and print the result sum for both cases :

Without parralel(): 250000000500000000
With parralel() (try 1): 144666330514402188
With parralel() (try 2): 146443671692846754
With parralel() (try 3): 142515120769803018


Any thoughts? We have gained on excution time, but the result is wrong, and it is not anywhere closer to the correct result. Why is that? The main reason why this occurs is that multiple threads are trying to access a shared variable in parallel. 

So think twice before using parallel stream. Because you can parallelize, it does not mean you should. Questions that the developer needs to ask before making using this construct are: is the collection/environement thread safe? is the problem large enough to be worth parralelizing? Are the resources enough? Hoping that this brief example shed light on what happens behind the curtains when using parallel(). 




2 comments:

  1. This is not a stream problem, this would happen with a 'traditional' parallelized version with threads too. You have to use either a synchronized block, a lock or just an AtomicLong for this.

    ReplyDelete
    Replies
    1. I do agree with your point. The problem is not specific to streams, but while using parallel(), the user may not know what does this function do exactly, and he does not have a grasp on the number threads, how they launched and executed, and that was the point of this post. The solutions you provided are valid solutions for this issue, but slow very much the processing of the list, to the point that paralellizing becomes useless.

      Delete