Refactoring to Eclipse Collections: Making Your Java Streams Leaner, Meaner, and Cleaner

Image result for Refactoring to Eclipse Collections: Making Your Java Streams Leaner, Meaner, and Cleaner

Java Streams, introduced in Java 8, are great – they allow you to take full advantage of lambda expressions to replace repetitive code with methods capturing common iteration patterns, resulting in more functional code.

However, as much of an improvement as the streams are, they are ultimately just an extension of the existing collection framework, and thus carry quite a bit of baggage.

Can we improve things further? Can we have even richer interfaces as well as cleaner, more readable code? Can we realize tangible memory savings as compared to legacy collection implementations? Can we have better, more seamless support for the functional programming paradigm?

The answer is yes! Eclipse Collections(formerly GS Collections), a drop-in replacement for the Java Collections framework, will do just that.

In this article, we will demonstrate several examples of refactoring standard Java code to Eclipse Collections data structures and APIs, and also demonstrate some of the memory savings you can achieve.

There will be a lot of code examples here, which will show how to change the code that uses the standard Java collections and streams, to code that uses the Eclipse Collection framework.

But before we dive into the code, we’ll spend some time to understand what Eclipse Collections is, why it is needed, and why you might want to refactor working idiomatic Java to Eclipse Collections.

Eclipse Collections History

Eclipse Collections was initially created at Goldman Sachs for an application platform with a very large distributed cache component. The system, still in production, stores hundreds of gigabytes of data in memory.

In fact a cache is simply a map; we store objects there and get them out. These objects can contain other maps and collections. Initially, the cache was based on the standard data structures from the package java.util. *. But it quickly became apparent that these collections have two significant drawbacks: inefficient use of memory, and very limited interfaces (a syndrome that leads to hard to read and repetitive code). Because the problems were rooted in the collection implementation, it was not possible to patch the cache code with utility libraries. To solve those two problems at once, a decision was made at Goldman Sachs engineering to create a new Collection framework completely from scratch.

At the time, it seemed to be a somewhat radical solution, but it worked. Now, this framework is under the umbrella of the Eclipse Foundation.

At the end of the article, we share links that will help you find out more about the project itself and ways to learn to use Eclipse Collections and to contribute to it.

Why Refactor to Eclipse Collections?

What are the benefits of Eclipse Collections? Thanks to its richer APIs, efficient memory usage, and better performance, Eclipse Collections is, in our opinion, the richest collection library for Java. It is also designed to be fully compatible with the collections that you get from the JDK.

Easy Migration

Before we dive into the benefits, it’s important to note that moving to Eclipse Collections is easy, and you don’t have to do it all at once. Eclipse Collections includes fully compatible implementations of JDK java.util.* List, Set and Map interfaces. It is also compatible with libraries from the JDK, such as Collectors. Our data structures inherit from the JDK interfaces for these types, so they are drop-in replacements for their JDK counterparts (with the exception of our Stack interface, which is not compatible, and our new primitive and Immutable collections, for which there are no equivalents in the JDK).

Richer APIs

The Eclipse Collections implementations of the java.util.List, Set and Map interfaces have much richer APIs, which we will explore in the code examples later on. There are also additional types that are not present in the JDK, such as Bag, Multimap and BiMap. A Bag is a multiset; a set with repeating elements. Logically, you can think of it as a map of items to the number of their occurrences. BiMap is an “inverted” map where not only can you find the value by key, but you can also find the key by value. Multimaps are maps where values themselves are Collections (Key -> List, Key -> Set, etc.).

Choice of Eager or Lazy

Eclipse Collections allows you to easily switch between lazy and eager implementations of collections, which greatly helps writing, understanding, and debugging of functional code in Java. Unlike the streams API, eager evaluation is the default. Should you want lazy evaluation, simply write .asLazy() on your data structure before continuing writing your logic.

Immutable Collection Interfaces

Immutable collections allow you to develop more correct code by enforcing the immutability at the API level. The correctness of the program in this case will be guaranteed by the compiler, which will avoid surprises during its execution. The combination of immutable collections and richer interfaces allows you to write pure functional code in Java.

Collections of Primitive Types

Eclipse Collections also has a full complement of primitive containers, and all primitive collection types have immutable equivalents. It is also worth noting that while the JDK streams provide support for int, long, and double streams, Eclipse Collections has support for all eight primitives, and allows you to have collections that hold their primitive values directly (as opposed to their boxed object counterparts, e.g. Eclipse Collections IntList, a list of ints, vs. JDK List<Integer>, a list of boxed primitive values).

No “Bun” Methods

What are “bun” methods? This is an analogy invented by Brian Goetz, chief architect of Java at Oracle. A hamburger (two buns with meat in between) represents the structure of typical stream code. Using Java streams, if you want to do something, it does not matter what, you must put your methods in between two buns – the stream() (or parallelStream() ) method at the beginning, and a collector method at the end. These buns are empty calories, which you do not really need, but you can’t get to the meat without them. In Eclipse Collections these methods are not required. Here is an example of bun methods in the JDK rendition: imagine that we have a list of people with their names and ages, and we want to extract  the names of the people over 21 years old:

Any Types You Need

In the Eclipse Collections, there are types and methods for every use case and they are easy to find by the features you require. It is not necessary to memorize their individual names – just think about what kind of structure you need. Do you need a collection that is mutable or immutable? Sorted? What type of data do we want to store in the collection – primitive or object? What kind of collection do you need? Lazy, eager, or parallel? Following the chart in the next section, it is quite easy to construct the data structure we require.

Instantiate Them Using Factories

This is similar to Java 9 collection factory methods on List, Set, and Map interfaces, with even more options!

Methods [just some of] by Category

Rich APIs are available directly on the collection types themselves, which inherit from the RichIterable interface (or PrimitiveIterable on the primitive side). We’ll take a look at some of these APIs in the coming examples.

Methods – Lots More…

Word clouds – those are so two years ago, aren’t they? However, this one is not entirely gratuitous – it illustrates a couple of important points. First, there are a lot of methods, covering every conceivable iteration pattern, available directly on the collection types. Second, word sizes in this word cloud are proportional to the number of implementations of the method. There are multiple implementations of methods on different collection types optimized for those specific types (so no lowest common denominator default methods here).

Code Examples

Example: Word Count

 

Let’s start with something simple.

Given a text (in this case, a nursery rhyme), count the number of occurrences of each word in the text. The result is a collection of words and the corresponding numbers of occurrences.

@BeforeClass
static public void loadData()
{
    words = Lists.mutable.of((
            "Bah, Bah, black sheep,\n" +           
            "Have you any wool?\n").split("[ ,\n?]+")   
    );
}

Note that we are using an Eclipse Collection factory method to populate words with a list of words. This is an equivalent of the Arrays.asList(…) method from the JDK, but it returns an instance of Eclipse Collections MutableList. Because MutableList interface is fully compatible with Listfrom the JDK, we can use this type for both the JDK and Eclipse Collections examples below.

First, let’s consider a naive implementation, which does not use streams:

@Test
public void countJdkNaive()
{
    Map<String, Integer> wordCount = new HashMap<>();

    words.forEach(w -> {
        int count = wordCount.getOrDefault(w, 0);
        count++;
        wordCount.put(w, count);
    });

    System.out.println(wordCount);

    Assert.assertEquals(2, wordCount.get(“Bah”).intValue());
    Assert.assertEquals(1, wordCount.get(“sheep”).intValue());
}

You can see here we create a new HashMap of String to Integer (mapping each word to the number of its occurrences), iterate through each word, and get its count from the map or default it to 0 if the word is not yet in the map. We then increment the value and store it back in the map. This is not a great implementation as we are focusing on “how” rather than the “what” of the algorithm, and the performance is not great either. Let’s try to rewrite it using idiomatic stream code:

@Test
public void countJdkStream()
{
   Map<String, Long> wordCounts = words.stream()
           .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
   Assert.assertEquals(2, wordCounts.get(“Bah”).intValue());
   Assert.assertEquals(1, wordCounts.get(“sheep”).intValue());
}

In this case the code is fairly readable, but it is still not really efficient (as you can confirm if you run microbenchmarks over it). You also need to be aware of the utility methods on the Collectors class – they are not easily discoverable as they are not available directly on the stream API.

A way to have a really efficient implementation is to introduce a separate counter class and store that as a value in the map. Let’s say we have a class called Counter, which stores an integer value and has a method increment(), which increments that value by 1. Then we can rewrite the code above as

@Test
public void countJdkEfficient()
{
   Map<String, Counter> wordCounts = new HashMap<>();

   words.forEach(
           w -> {
               Counter counter = wordCounts.computeIfAbsent(w, x -> new Counter());
               counter.increment();
           }
   );

   Assert.assertEquals(2, wordCounts.get(“Bah”).intValue());
   Assert.assertEquals(1, wordCounts.get(“sheep”).intValue());
}

This is in fact a very efficient solution, but we had to write an entire new class (Counter) to make it work.

The Eclipse Collection Bag offers a solution tailor made for this problem, and provides an implementation that is optimized for just this specific collection type.

    @Test
    public void countEc()
    {
        Bag<String> bagOfWords = wordList.toBag(); // toBag() is a method on MutableList

        Assert.assertEquals(2, bagOfWords.occurrencesOf(“Bah”));
        Assert.assertEquals(1, bagOfWords.occurrencesOf(“sheep”));
        Assert.assertEquals(0, bagOfWords.occurrencesOf(“Cheburashka”)); // null safe - returns a zero instead of throwing an NPE
    }

All we have to do here is take our collection and call toBag() on it! We were also able to avoid a potential NPE in our last assertion by not calling intValue() on the object directly.

Example: Zoo

Let’s say we run a zoo. In a zoo we keep animals, who eat different kinds of food.

We want to query a few facts about the animals and the food they eat:

  • Highest quantity food
  • List of animals and the number of their favorite foods
  • Unique foods
  • Types of food
  • Find the meat and non-meat eaters

These code snippets have been tested with the Java Microbenchmark Harness (JMH) framework. We will go through the code examples and then take a look at how they compare. For a performance comparison, please see the “JMH Benchmark Results” section below.

Here’s our domain – zoo animals and food items they like to eat (each food item has a name, category, and quantity).

private static final Food BANANA = new Food(“Banana”, FoodType.FRUIT, 50);
private static final Food APPLE = new Food(“Apple”, FoodType.FRUIT, 30);
private static final Food CAKE = new Food(“Cake”, FoodType.DESSERT, 22);
private static final Food CEREAL = new Food(“Cereal”, FoodType.DESSERT, 80);
private static final Food SPINACH = new Food(“Spinach”, FoodType.VEGETABLE, 26);
private static final Food CARROT = new Food(“Carrot”, FoodType.VEGETABLE, 27);
private static final Food HAMBURGER = new Food(“Hamburger”, FoodType.MEAT, 3);

private static MutableList<Animal> zooAnimals = Lists.mutable.with(
        new Animal(“ZigZag”, AnimalType.ZEBRA, Lists.mutable.with(BANANA, APPLE)),
        new Animal(“Tony”, AnimalType.TIGER, Lists.mutable.with(CEREAL, HAMBURGER)),
        new Animal(“Phil”, AnimalType.GIRAFFE, Lists.mutable.with(CAKE, CARROT)),
        new Animal(“Lil”, AnimalType.GIRAFFE, Lists.mutable.with(SPINACH)),



Ex. 1 – Determine the most popular food item: how in-demand are our different food options?

@Benchmark
public List<Map.Entry<Food, Long>> mostPopularFoodItemJdk()
{
    //output: [Hamburger=2]
    return zooAnimals.stream()
            .flatMap(animals -> animals.getFavoriteFoods().stream())
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
            .entrySet()
            .stream()
            .sorted(Map.Entry.<Food, Long>comparingByValue().reversed())
            .limit(1)
            .collect(Collectors.toList());
}

Step by step: we first stream our zooAnimals, and .flatMap() each animal to its favorite foods, returning a stream of the food items consumed by each animal. Next, we want to group the food items using the food’s identity as the key and the count as the value, so that we can determine the count of animals per food item. This is a job for Collectors.counting(). To sort it, we  get the .entrySet() of this map, stream it, sort it by reversed values (remember, the value is the count of each food, and if we’re interested in most popular, we want reverse order), limit(1)then returns the top value, and finally, we collect it into a List.

The output, the most popular food, is [Hamburger=2].

Phew! Let’s see how we can achieve the same using Eclipse Collections.

@Benchmark
public MutableList<ObjectIntPair<Food>> mostPopularFoodItemEc()
{
    //output: [Hamburger:2]
    MutableList<ObjectIntPair<Food>> intIntPairs = zooAnimals.asLazy()
            .flatCollect(Animal::getFavoriteFoods)
            .toBag()
            .topOccurrences(1);
    return intIntPairs;
}

We start off the same by flat mapping each animal to its favorite foods. Because what we really want is a map of item to count, a bag is a perfect use case to help us solve our problem. We call toBag, and call topOccurrences, which returns the most frequently occurring items. topOccurrences(1) returns the required single most popular item, as a list of ObjectIntPairs (note the int is primitive); [Hamburger:2].

Ex 2 – number of favorite food items to animal: how diverse is each animal’s food choice – how many animals eat just one thing? How many eat two?

First the JDK rendition:

@Benchmark
public Map<Integer, String> printNumberOfFavoriteFoodItemsToAnimalsJdk()
{
    //output: {1=[Lil, GIRAFFE],[Simba, LION], 2=[ZigZag, ZEBRA],[Tony, TIGER],[Phil, GIRAFFE]}
    return zooAnimals.stream()
            .collect(Collectors.groupingBy(
                    Animal::getNumberOfFavoriteFoods,
                    Collectors.mapping(
                            Object::toString, // Animal.toString() returns [name,  type]
                            Collectors.joining(,))));// Concatenate the list of animals for each count into a string
}

And again, using Eclipse Collections:

@Benchmark
public Map<Integer, String> printNumberOfFavoriteFoodItemsToAnimalsEc()
{
    //output: {1=[Lil, GIRAFFE], [Simba, LION], 2=[ZigZag, ZEBRA], [Tony, TIGER], [Phil, GIRAFFE]}
    return zooAnimals
            .stream()
            .collect(Collectors.groupingBy(
                    Animal::getNumberOfFavoriteFoods,
                    Collectors2.makeString()));
}

This example highlights the use of native Java Collectors together with Eclipse Collections Collectors2; the two are not mutually exclusive. In this example we want to obtain the number of food items per animal. How do we achieve this? In native Java, we’d first use Collectors.groupingBy to group each animal to the number of its favorite foods. Then, we’d use the Collectors.mapping function to map each object to its toString, and finally call Collectors.joining to join the strings, separated by commas.

In Eclipse Collections, we could also use the Collectors.groupingBy method, but instead we’ll call Collectors2.makeString to have a little less verbosity and achieve the same result (makeString collects a stream into a comma delimited string).

Ex 3 – unique foods: how many different types of food do we have, and what are they?

@Benchmark
public Set<Food> uniqueFoodsJdk()
{
    return zooAnimals.stream()
            .flatMap(each -> each.getFavoriteFoods().stream())
            .collect(Collectors.toSet());
}

@Benchmark
public Set<Food> uniqueFoodsEcWithoutTargetCollection()
{
    return zooAnimals.flatCollect(Animal::getFavoriteFoods).toSet();
}

@Benchmark
public Set<Food> uniqueFoodsEcWithTargetCollection()
{
    return zooAnimals.flatCollect(Animal::getFavoriteFoods, Sets.mutable.empty());
}

Here we have a few ways to solve this problem! Using the JDK, we simply stream the zooAnimals, then flatMap their favorite foods, and finally, collect them into a set. In Eclipse Collections, we have two options. The first is roughly the same as the JDK version; flatten the favorite foods, then call .toSet() to put them into a set. The second is interesting because it uses the concept of target collections. You’ll notice flatCollect is an overloaded method, so we have different constructors available to us. Passing in a set as a second parameter means that we will flatten directly into the set, and skip the intermediary list that would have been used in the first example. We could have called asLazy() instead to avoid this extra garbage; the evaluation would wait until the terminal operation and thus would avoid the intermediate state. But if you prefer fewer API calls or you need to accumulate the results into an existing collection instance, consider using target collections when converting from one type to another.

Ex 4 – Meat and non-meat eaters: how many meat eaters do we have? How many non meat eaters?

Note, in both of the following examples we chose to explicitly declare the Predicate lambda at the top rather than inlinining them, to emphasize the distinction between the JDK Predicate and the Eclipse Collections Predicate. Eclipse Collections has had its own definitions for Function, Predicate and many other functional types long before they appeared in the java.util.functionpackage in Java 8. Now functional types in Eclipse Collections extend the equivalent types from the JDK, thus maintaining interoperability with libraries relying on the JDK types.

@Benchmark
public Map<Boolean, List<Animal>> getMeatAndNonMeatEatersJdk()
{
    java.util.function.Predicate<Animal> eatsMeat = animal ->
            animal.getFavoriteFoods().stream().anyMatch(
                            food -> food.getFoodType()== FoodType.MEAT);

    Map<Boolean, List<Animal>> meatAndNonMeatEaters = zooAnimals
            .stream()
            .collect(Collectors.partitioningBy(eatsMeat));
    //returns{false=[[ZigZag, ZEBRA], [Phil, GIRAFFE], [Lil, GIRAFFE]], true=[[Tony, TIGER], [Simba, LION]]}
    return meatAndNonMeatEaters;
}

@Benchmark
public PartitionMutableList<Animal> getMeatAndNonMeatEatersEc()
{
    org.eclipse.collections.api.block.predicate.Predicate<Animal> eatsMeat = animal ->
            animal.getFavoriteFoods().anySatisfy(food -> food.getFoodType() == FoodType.MEAT);

    PartitionMutableList<Animal> meatAndNonMeatEaters = zooAnimals.partition(eatsMeat);
    // meatAndNonMeatEaters.getSelected() = [[Tony, TIGER], [Simba, LION]]
    // meatAndNonMeatEaters.getRejected() = [[ZigZag, ZEBRA], [Phil, GIRAFFE], [Lil, GIRAFFE]]
    return meatAndNonMeatEaters;
}

Here, we want to partition our elements by meat and non meat eaters. We construct a Predicate, “eatsMeat” that looks at each animal’s favorite foods and sees if anyMatch/anySatisfy (JDK and EC, respectively), the condition that the food type is of FoodType.MEAT.

From there, in our JDK example, we .stream() our zooAnimals, and collect them with a .partitioningBy() Collector, passing in our eatsMeat Predicate. The return type for this is a Map with a true key and a false key. The “true” key returns those animals that eat meat, while “false” returns those that do not.

In Eclipse Collections, we call .partition() on the zooAnimals, again passing through the Predicate. We are left with a PartitionMutableList, which has two API points – getSelected(), and getRejected(), which both return MutableLists. The selected elements are our meat eaters, and the rejected ones are non meat eaters.

Memory Usage Comparison

 

In the examples above, the focus was primarily on the types and interfaces of the collections. We did mention in the beginning that the transition to the Eclipse Collections will also allow you to optimize memory usage. The effect can be quite significant, depending on how extensively collections are used in your particular application and what type of collections they are.

On the graphs, you see memory usage comparison between Eclipse Collections and collections from java.util. *

[Click on the image to enarlge it]

The horizontal axis shows the number of elements stored in a collection and the vertical axis represents the storage overhead in kilobytes. Overhead here means that we track the allocated memory after subtracting the size of the collection payload (so we show only the memory that the data structures themselves occupy). The value we measure is simply totalMemory() – freeMemory(), after we politely ask for System.gc(). The results we observe are stable and coincide with the results obtained with the same examples in Java 8 using jdk.nashorn.internal.ir.debug.ObjectSizeCalculator from the Nashorn project. (That utility measures the size precisely, but was unfortunately not compatible with Java 9 and beyond.)

The first graph shows the advantage of a primitive list of int values ​​(integers) from Eclipse Collections, compared to a list of JDK Integer values. The graph shows that for one million values, the implementation of a list from java.util.* will use more than 15 megabytes more memory (about 20 MB storage overhead for the JDK vs about 5 MB for Eclipse Collections).

Maps in Java are extremely inefficient, and require a Map.Entry object, which inflates memory usage.

But if maps are not memory efficient, then sets are simply awful, since Set uses Map in the underlying implementation, which is just wasteful. Map.Entry there serves no useful purpose, since only one property is needed – the key, which is the element of the set. Therefore, you see that Set and Map in Java use the same amount of memory, although a Set can be made much more compact, which is done in the Eclipse Collections. It ends up using much less memory than the JDK Set, as illustrated above.

Finally, the fourth graph shows the advantages of specialized collection types. As previously mentioned, a Bag is simply a set, which allows for multiple instances for each of its elements, and can be thought of mapping  of elements to their number of occurrences. You would use Bag to count occurrences of an item. In java.util. * an equivalent data structure is a Map of Item to Integer, where the developer is tasked with keeping the value of total occurrences up to date. Again, look at how much the specialized data structure (Bag) has been optimized to minimize memory usage and garbage collection.

Of course, we recommend testing this for each individual case. If you replace the standard Java collections with the Eclipse Collections, you will surely get improved results, but the magnitude of the impact they have on the overall use of your program’s memory depends on the specific circumstances.

JMH Benchmark Results

In this section we will analyze the execution speed of the examples we covered before, comparing the performance of the code before and after rewriting it using Eclipse Collections. The graph shows the number of operations per second measured for the Eclipse Collection and JDK versions for each test. The longer bars represent better results. As you see, the speed-up is dramatic:

[Click on the image to enarlge it]

We want to emphasize that in contrast to memory usage, the results we are showing are only applicable to our specific examples. Again, your specific results will depend very much on your particular situation, so be sure to test it against the real scenarios that make sense for your application.

Conclusion

Eclipse Collections has been developed over the last 10+ years to optimize your Java code and applications. It’s easy to get started – data structures are drop in replacements, and the API is generally more fluent than traditional streams code. Have a use case we haven’t solved yet? We’re happy to take your contributions! Feel free to check us out on GitHub. Share your results with us! We’d love to see your experience refactoring to Eclipse Collections and how it has impacted your applications. Happy coding!as