Streaming CSV files with the Stream API in Java 8

Today I will show you how to utilize the Java 8 Stream API to parse the content of a CSV file of persons.

This is a follow-up to yesterday’s blog post, where I produced a comma-separated value string with the help of the Stream API.

Why use the Stream API to parse our CSV?

The Stream API is a handy abstraction for working with aggregated data. It becomes particularly handy when we need to perform multiple actions, such as transforming the content, applying some filters and perhaps grouping the results by a property. With the Stream API we can register all the actions we want to perform on each row in the CSV file, and do so at a decent level of abstraction. We want the framework to handle the low-level stuff, such as reading and looping over the data, while we stay in control of what we want to achieve.

The Stream API is a perfect fit for the task I want to solve today. I have a CSV file with a lot of persons, presented in the example below (simplified). The first task is to read all the “lines” of persons and turn them into a list of persons, a List<Person>.

Example CSV file

Name, Age, City, Country
Ivar Østhus, 28, Oslo, Norway
Petter Dass, 19, Hålogaland, Norway
Ola Nordmann, 61, Sandnes, Norway
Viswanathan Anand, 43, Mayiladuthurai, India
Magnus Carlsen, 22, Tønsberg, Norway
…(about 1 million rows)

This list can be tremendously long, and I only want to fetch the first 50 adults (age > 17). Luckily, the BufferedReader in Java 8 has been upgraded to provide the Stream abstraction. All I need to do is call the .lines() method on the BufferedReader, which returns a lazily populated Stream<String> with one element per line. (The Stream abstraction is where Java 8 keeps all its functional sweetness, such as map, filter, max, min, sum, etc.)

Solution with Stream API

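// Assumes: import static java.util.stream.Collectors.toList;
InputStream is = new FileInputStream(new File("persons.csv"));
BufferedReader br = new BufferedReader(new InputStreamReader(is));

List<Person> persons = br.lines()
    .skip(1)                                   // skip the header line
    .map(mapToPerson)                          // CSV line -> Person
    .filter(person -> person.getAge() > 17)    // keep adults only
    .limit(50)                                 // stop after 50 matches
    .collect(toList());

In the example we skip the first line (the header line in our CSV file) using the skip(1) function. (Early-access builds of Java 8 offered substream(1) for this; it was replaced by skip before the final release.)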

Next we map the person from a CSV line to a Person object. We use a predefined lambda function for this:

//A bit hackish
public static Function<String, Person> mapToPerson = (line) -> {
  String[] p = line.split(", ");
  return new Person(p[0], Integer.parseInt(p[1]), p[2], p[3]);
};

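Then the filter keeps only the adults, and finally we call the limit function, telling the Stream API that we only want the first 50 persons matching our criteria. Because streams are lazy, the pipeline stops processing lines once 50 adults have been found.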
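To try the pipeline without a real file, here is a minimal self-contained sketch. The nested Person class, the CsvAdults wrapper and the firstAdults helper are my own assumptions for illustration (the post does not show Person), and the CSV text is read from a String instead of persons.csv.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

class CsvAdults {
    // Hypothetical stand-in for the Person class used in the post.
    static final class Person {
        private final String name;
        private final int age;
        private final String city;
        private final String country;

        Person(String name, int age, String city, String country) {
            this.name = name;
            this.age = age;
            this.city = city;
            this.country = country;
        }

        String getName() { return name; }
        int getAge() { return age; }
        String getCity() { return city; }
        String getCountry() { return country; }
    }

    // Splits on a comma followed by optional whitespace.
    // Still "a bit hackish": quoted fields containing commas are not handled.
    static final Function<String, Person> MAP_TO_PERSON = line -> {
        String[] p = line.split(",\\s*");
        return new Person(p[0], Integer.parseInt(p[1].trim()), p[2], p[3]);
    };

    // Skips the header row and returns at most `max` persons older than 17.
    static List<Person> firstAdults(String csv, int max) {
        try (BufferedReader br = new BufferedReader(new StringReader(csv))) {
            return br.lines()
                    .skip(1)
                    .map(MAP_TO_PERSON)
                    .filter(person -> person.getAge() > 17)
                    .limit(max)
                    .collect(Collectors.toList());
        } catch (IOException e) {
            // Only close() can throw here; rethrow unchecked to keep the signature simple.
            throw new UncheckedIOException(e);
        }
    }
}
```

The try-with-resources block closes the reader when we are done, which the shorter snippet above leaves out.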

Another cool thing we can do fairly easily with the Stream abstraction is to compute the average age of all the persons in the list. (A stream can only be consumed once, so each of the following snippets assumes a freshly opened reader.)

double averageAge = br.lines()
  .skip(1)
  .map(mapToPerson)
  .collect(averagingInt(Person::getAge));

Or find the oldest person in the list:

Optional<Person> oldestPerson = br.lines()
  .skip(1)
  .map(mapToPerson)
  .max(byAge);

//Lambda expression:
static Comparator<Person> byAge =
  (p1, p2) -> Integer.compare(p1.getAge(), p2.getAge());

(Here we get an Optional of a person, in good functional manner, since the file might not contain any persons.)
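If you want both the average and the oldest person from a single read of the file, one option is to collect the persons into a list first and then stream that list twice. The sketch below assumes a minimal Person class and an AgeStats wrapper of my own; it uses Comparator.comparingInt, which is an alternative way of writing the byAge comparator.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

class AgeStats {
    // Hypothetical minimal Person; only name and age matter here.
    static final class Person {
        private final String name;
        private final int age;

        Person(String name, int age) {
            this.name = name;
            this.age = age;
        }

        String getName() { return name; }
        int getAge() { return age; }
    }

    // Same ordering as byAge in the post, built from a key extractor.
    static final Comparator<Person> BY_AGE = Comparator.comparingInt(Person::getAge);

    // averagingInt yields 0.0 for an empty list.
    static double averageAge(List<Person> persons) {
        return persons.stream().collect(Collectors.averagingInt(Person::getAge));
    }

    // Empty Optional when the list has no persons.
    static Optional<Person> oldest(List<Person> persons) {
        return persons.stream().max(BY_AGE);
    }
}
```

A list can be streamed as many times as you like, whereas calling lines() again on an already exhausted reader gives you nothing more to process.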

Summary

We could of course use all of the other cool Collectors, such as groupingBy and counting, which are ready to use with the Stream API. I will probably blog more about the Stream API soon.
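As a small taste of those collectors, here is a sketch that counts persons per country by combining Collectors.groupingBy with Collectors.counting. The CountryCounts wrapper and the reduced Person class are assumptions made for this example only.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class CountryCounts {
    // Hypothetical minimal Person; only the country is needed here.
    static final class Person {
        private final String name;
        private final String country;

        Person(String name, String country) {
            this.name = name;
            this.country = country;
        }

        String getName() { return name; }
        String getCountry() { return country; }
    }

    // Groups persons by country and counts the members of each group.
    static Map<String, Long> countByCountry(List<Person> persons) {
        return persons.stream()
                .collect(Collectors.groupingBy(Person::getCountry, Collectors.counting()));
    }
}
```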

Thanks for reading my blog. Some other posts about Java 8 and lambdas that I have written recently:

Java 8: joining strings with Stream API

In this brief blog post I show how to loop over a collection of persons and build a string from their names. The new Java 8 Stream API makes this really easy when combined with lambda expressions, explained in a previous post.

Problem description:
– Given a list of persons
– Build a string that follows this format: “age1:name1, age2:name2”
– We should only include adult persons (age > 18)
– Sort the names by age

//Data
List<Person> persons = new ArrayList<>();
persons.add(new Person("Ola Hansen", 21));
..

//Solution
String names = persons.stream()
  .filter(p -> p.getAge() > 18)
  .sorted((p1, p2) -> p1.getAge() - p2.getAge())
  .map(p -> p.getAge() + ":" + p.getName())
  .collect(Collectors.joining(", "));

//Result
"21:Ola Hansen, 28:Ivar Østhus, 29:Kari Normann, 42:Donald Duck"

This problem is really easy to solve using the new Stream API, as shown in the code example above. The key to the solution is the joining collector, provided as a ready-to-use Collector. The joining collector uses a StringBuilder under the hood to build up the resulting String. What would the solution look like with imperative-style for loops? How many throwaway variables would you need?
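To make the comparison concrete, here is one possible imperative version next to the stream version. This is only a sketch: the JoinNames wrapper, the method names and the minimal Person class are my own assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class JoinNames {
    // Hypothetical minimal Person with the two fields used in the post.
    static final class Person {
        private final String name;
        private final int age;

        Person(String name, int age) {
            this.name = name;
            this.age = age;
        }

        String getName() { return name; }
        int getAge() { return age; }
    }

    // Stream version: filter, sort, map and join in one pipeline.
    static String withStreams(List<Person> persons) {
        return persons.stream()
                .filter(p -> p.getAge() > 18)
                .sorted(Comparator.comparingInt(Person::getAge))
                .map(p -> p.getAge() + ":" + p.getName())
                .collect(Collectors.joining(", "));
    }

    // Imperative version: a temporary list, an explicit sort and manual separators.
    static String withLoops(List<Person> persons) {
        List<Person> adults = new ArrayList<>();
        for (Person p : persons) {
            if (p.getAge() > 18) {
                adults.add(p);
            }
        }
        adults.sort(Comparator.comparingInt(Person::getAge));

        StringBuilder sb = new StringBuilder();
        for (Person p : adults) {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            sb.append(p.getAge()).append(':').append(p.getName());
        }
        return sb.toString();
    }
}
```

The loop version needs a temporary list, a StringBuilder and separator bookkeeping that the joining collector handles for us.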

The example touches on multiple new concepts introduced in the Java 8 Stream API, such as filter, sorted, map and collect. I will write about the Stream API in more depth later.

What are lambdas in Java 8?

In this blog post I will briefly introduce lambdas which will be included as a new language feature in Java 8. I recently wrote a short introduction to functional programming support coming in Java 8. In this post I want to focus on lambda expressions, what they actually are and why they are awesome.

The motivation behind lambda expressions is to provide a simple, compact syntax for passing functionality as an argument to another method, such as what to do when someone clicks a button. Before JDK 8 we used anonymous inner classes for that, which typically implemented a functional interface (more details below). The problem with that approach is that the syntax was verbose and unclear, making it hard to write and read. Lambda expressions let you express instances of single-method classes more compactly [1]. We can think of lambda expressions as a way to define anonymous methods.

Take lambdas for a spin

Let’s start simple. We have a List of numbers and we want to print all of them using the new super awesome forEach method, which accepts a Consumer as its argument. Without lambdas we can achieve this by implementing a Consumer:

List<Integer> numbers = Arrays.asList(1,2,3,4,5,6);

numbers.forEach(new Consumer<Integer>() {
  @Override
  public void accept(Integer value) {
    System.out.println(value);
  }
});

Wow that anonymous class looks really ugly and verbose. I actually prefer the regular enhanced-for loop over this ugly thing!

Lambda to the rescue

Thankfully lambdas come to the rescue and allow us to define just the consumer function we want executed for each element:

List<Integer> numbers = Arrays.asList(1,2,3,4,5,6);
numbers.forEach((Integer value) -> System.out.println(value));

This is way better! But there is still some noise in my eyes. Why do I have to tell the compiler the type of value? Can value possibly be anything other than Integer?

The answer is: no, it must be an Integer. And it turns out that the Java 8 compiler can help us out by working out the type for us, so we don’t have to. Hurray! This concept is known as type inference. Let’s have a look:

List<Integer> numbers = Arrays.asList(1,2,3,4,5,6);
numbers.forEach(value -> System.out.println(value));

Wow, this is starting to look like something readable. We even got rid of the parentheses around the value parameter.

It’s important to notice that the forEach method still accepts a Consumer as input, and that it is the compiler that takes the provided lambda expression and converts it into a valid Consumer.

Method Reference

Even though we ended up with a simple and easy to read lambda expression there is still something bothering me. We have created a function which takes the input argument and just calls a new function with the same argument as input.

Can’t we just use the println function directly? The answer is a method reference, and here is an example:

List<Integer> numbers = Arrays.asList(1,2,3,4,5,6);
numbers.forEach(System.out::println);

We see that we use the special ‘::’ notation, which allows us to refer to methods defined elsewhere. The result in this example is that the forEach method will call the println method on System.out for each element in the list.

(Side note: it is also possible to refer to a constructor using new, for example User::new.)
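Here is a short sketch of the constructor reference from the side note. The User class and the toUsers helper are hypothetical; the point is that User::new can be used wherever a Function from String to User is expected, as long as User has a matching one-argument constructor.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

class ConstructorRefDemo {
    // Hypothetical User class with a single-argument constructor.
    static final class User {
        private final String name;

        User(String name) {
            this.name = name;
        }

        String getName() { return name; }
    }

    // User::new is shorthand for the lambda: name -> new User(name)
    static List<User> toUsers(List<String> names) {
        Function<String, User> create = User::new;
        return names.stream().map(create).collect(Collectors.toList());
    }
}
```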

Multiple blocks

Can we execute multiple lines of code in a lambda expression? Yes of course, just add some curly-brackets:

List<Integer> numbers = Arrays.asList(1,2,3,4,5,6);
numbers.forEach(value -> {
  String out = "Hi there value is: " + value;
  System.out.println(out);
});

Lexical Scoping & effectively final

A lambda expression closes over the scope in which it is defined (lexical scoping). From within a lambda expression we can only access local variables that are final or effectively final in the enclosing scope. Effectively final means that Java 8 relaxed the requirement to use the final keyword, but the variable still cannot change if we want to access it inside a lambda expression. If the compiler detects that the variable is reassigned, inside or outside of the lambda expression, it will complain.

List<Integer> numbers = Arrays.asList(1,2,3,4,5,6);
int someVal = 1;
numbers.forEach(value -> System.out.println(value+someVal));
someVal = 2;

The compiler will complain about the example above because someVal is reassigned after its declaration and is therefore not effectively final.
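For contrast, here is a sketch of two variants that do compile. In the first, the captured variable is never reassigned. In the second, the captured reference stays the same while the object behind it changes, which is a common workaround when a lambda needs to accumulate something. The class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class EffectivelyFinalDemo {
    // `offset` and `result` are never reassigned, so both are effectively final.
    static List<Integer> addOffset(List<Integer> numbers, int offset) {
        List<Integer> result = new ArrayList<>();
        numbers.forEach(value -> result.add(value + offset));
        return result;
    }

    // The reference `sum` never changes; only the state of the AtomicInteger does.
    static int sum(List<Integer> numbers) {
        AtomicInteger sum = new AtomicInteger();
        numbers.forEach(sum::addAndGet);
        return sum.get();
    }
}
```

For a plain sum, a stream reduction would normally be the nicer choice; the AtomicInteger is only there to illustrate the scoping rule.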

More examples

As a last paragraph I will include a few legal lambda examples:

(int x, int y) -> { return x + y; }

(x, y) -> x + y

x -> x + x

() -> x

value -> { System.out.println(value); }

Functional interfaces

We can use lambdas with methods that take a functional interface as an argument. This section briefly introduces what a functional interface is.
The only requirement for a functional interface is that it has exactly one abstract (unimplemented) method. It can have zero or more default methods. The example below shows a snippet from the Predicate interface, part of JDK 8. The “FunctionalInterface” annotation is optional, but when present the compiler will verify that the interface has exactly one abstract method.

@FunctionalInterface
public interface Predicate<T> {
  boolean test(T t);
  default Predicate<T> and(Predicate<? super T> other) {
    Objects.requireNonNull(other);
    return (t) -> test(t) && other.test(t);
  }
  //...
}

The predicate is used to check whether an input argument satisfies our requirements. It has one abstract method, “test”, which should be implemented to verify the requirement. This functional interface also comes with the default methods and, or and negate. The first two are used to combine multiple predicates, and the last one is used to invert a predicate. This allows us to reuse and build on top of existing predicates.
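A small sketch of how the three default methods compose. The two predicates and the helper methods are examples I made up; and, or and negate are the real default methods on java.util.function.Predicate.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

class PredicateDemo {
    static final Predicate<Integer> IS_EVEN = n -> n % 2 == 0;
    static final Predicate<Integer> IS_POSITIVE = n -> n > 0;

    // and: both predicates must hold.
    static List<Integer> evenAndPositive(List<Integer> numbers) {
        return numbers.stream().filter(IS_EVEN.and(IS_POSITIVE)).collect(Collectors.toList());
    }

    // or: at least one predicate must hold.
    static List<Integer> evenOrPositive(List<Integer> numbers) {
        return numbers.stream().filter(IS_EVEN.or(IS_POSITIVE)).collect(Collectors.toList());
    }

    // negate: inverts the predicate, so this keeps the odd numbers.
    static List<Integer> odd(List<Integer> numbers) {
        return numbers.stream().filter(IS_EVEN.negate()).collect(Collectors.toList());
    }
}
```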

Common functional interfaces found in the JDK 8 java.util.function package include:

  1. Consumer<T> – takes an input and performs an operation on it. Will cause side effects!
  2. Supplier<T> – a kind of factory. Will return a new instance or an existing instance.
  3. Predicate<T> – Checks if argument satisfies our requirements
  4. Function<T, R> – Used to transform an argument from type T to type R.
  5. BinaryOperator<T> – two T’s as argument, return one T as output
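The five interfaces in the list can each be filled with a lambda or a method reference. The sketch below puts them side by side; the variable names and the demo method are hypothetical and only there to show each interface’s single abstract method being called.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BinaryOperator;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.function.Supplier;

class FunctionalInterfacesDemo {
    static String demo() {
        List<String> log = new ArrayList<>();

        Consumer<String> remember = log::add;           // side effect: appends to log
        Supplier<String> greeting = () -> "hello";      // produces a value from nothing
        Predicate<String> isShort = s -> s.length() < 10;
        Function<String, Integer> length = String::length;
        BinaryOperator<Integer> add = (a, b) -> a + b;  // two Integers in, one Integer out

        String word = greeting.get();
        if (isShort.test(word)) {
            remember.accept(word);
        }
        int total = add.apply(length.apply(word), 1);

        // Combine the logged words and the computed number into one result string.
        return log + ":" + total;
    }
}
```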

Summary

Lambdas are awesome and the cornerstone of the introduction of functional programming in Java 8. It’s a clever design that makes functions super simple to write. Letting lambdas be defined via functional interfaces (already heavily used, e.g. for event listeners) allows existing code to be forward compatible with lambdas. Clever!

You might think that lambdas are just pretty syntax for creating anonymous inner classes under the hood, so that lambda capture simply becomes a constructor invocation. This is (thankfully) not the case. It would lead to performance issues (one class per lambda expression). Instead, the language team uses invokedynamic, the fifth bytecode method invocation instruction, which was introduced in Java 7. I want to cover lambdas under the hood in a later blog post.

Getting functional in Java 8

In September I attended JavaOne 2013 in San Francisco. Oracle was showing off Java 8, scheduled for GA in Q1 2014. The feature coming in Java 8 that excited me the most was the functional part, introduced with Project Lambda.

All the other major platforms, such as C#, have had this for years, and finally Java is growing up and will introduce functional programming in Java 8. In previous versions of Java we have been so used to imperative-style programming that it is hard to even see the alternatives. It has worked fine, but it is very low level and has an extremely verbose syntax compared to the alternatives. With the new functional features we are now able to express what we want to achieve more concisely, without worrying so much about how to actually do it. Java 8 enables us to use old-school, rock-solid OO design and combine it with functional patterns. Combined, we will be able to achieve more with less, meaning fewer bugs and more value delivered. This is a big change for Java, even bigger than generics, introduced back in Java 5.

In this post I will present a few simple examples on how you can utilize functions in Java 8.

Setup

As a basis for each example I will use a list of persons, as shown in the snippet below.

List<Person> persons = new ArrayList<>();
persons.add(new Person("Knerten Lillebror", 12));
persons.add(new Person("Kari Normann", 29));
persons.add(new Person("Ole Hansen", 32));

Passing functions in Java 8

Say you want to print the name of each person in the list. How do we do that in Java today (pre Java 8)? EASY! We loop over the list and print the name of each item. We even use the enhanced for loop. Pretty simple, right?

for(Person p : persons) {
  System.out.println(p.getName());
}

This is referred to as imperative code style. There are mainly two problems with this example:

  1. We have to introduce a temporary variable (p)
  2. We have to know HOW to iterate over a list (the for loop)

Not only do we express WHAT we want to do, we also have to express HOW to do it, iterating all the elements and introducing a mutable element. In Java 8, we now have a forEach method on collections, which allows us to pass a function. The underlying framework will take care of how to loop each element. We will need to pass a Consumer, which performs an operation on each element:

persons.forEach(p -> System.out.println(p.getName()));

Remove elements

The collection also makes it super simple to remove elements from a collection. We just use a lambda expression, a predicate, to express which elements we want removed. How would you implement this pre Java 8?

persons.removeIf(p -> p.getAge() > 20);

* We could also use the syntax “(Person p) -> p.getAge() > 20” to specify the type. This is optional, as the type is automatically inferred by the compiler.

WARNING. I generally do not feel it is good practice to use this function, as it mutates the actual list. In my opinion it would be better if it returned a new list/view without the elements matched by the predicate.
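If you prefer the non-mutating style, one way to get it is to stream the list and collect the elements you want to keep into a new list. This is a sketch under my own naming (RemoveIfDemo, withoutOlderThan and the minimal Person class are assumptions); note that the predicate is the inverse of the one passed to removeIf.

```java
import java.util.List;
import java.util.stream.Collectors;

class RemoveIfDemo {
    // Hypothetical minimal Person with the two fields used in the post.
    static final class Person {
        private final String name;
        private final int age;

        Person(String name, int age) {
            this.name = name;
            this.age = age;
        }

        String getName() { return name; }
        int getAge() { return age; }
    }

    // Returns a new list without persons older than `age`; the input list is left untouched.
    static List<Person> withoutOlderThan(List<Person> persons, int age) {
        return persons.stream()
                .filter(p -> p.getAge() <= age)
                .collect(Collectors.toList());
    }
}
```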

Method references

In Java 8 we will also be able to borrow functions from other classes using the “::” notation:

persons.forEach(System.out::println);

Function blocks

It’s even possible to pass a function block:

persons.forEach(p -> {
    System.out.print("hi there: ");
    System.out.println(p);
});

This is the first quick post with many more to come. The Stream API is especially interesting and I will cover a lot of it soon.

Why writing tests is worth it

Today I got myself thinking: what are the main benefits of system testing? We all think testing our code is important, but why? In this blog post I have collected a few points which, in my opinion, make writing good test cases worth it. And of course, we are talking about automated tests.

  • Documentation – Good tests form excellent documentation on how to use and understand the code. How should you instantiate a class? How should you call that service? What are the limit values? The list here can be very long…
  • Improved code quality – I like to believe that good tests provide at least some level of code quality. The fact that the coder bothered to write tests is an indication that he or she actually tried to do some quality work. Of course, if the tests suck, they are not worth the bytes used to store them.
  • Verification of requirements – Without tests, how can you be sure that the code actually solves the issues it was supposed to solve? How do you know we are building the right thing?
  • Safer to refactor – With many high-quality tests I feel much better about refactoring code. In my opinion refactoring is extremely important to ensure that we constantly improve the design of our solution. I can’t imagine how to do this without tests (at least if I must refactor others’ code).
  • Instant feedback – The best feedback we get is from our users. Tests, though, have the strength of giving us instant feedback while we develop. This instant feedback is important because it allows us to detect bugs earlier. The earlier we catch a bug, the less costly it is to fix.
  • Limit values – Some application states are hard to reach with manual testing. With tests we can just mock the relevant services and instruct them to return those hard-to-reach corner-case values.

The cost (time required) of fixing bugs rises if we discover them late in the development phase. The price is highest if the bug is discovered after the functionality is released. I usually visualize this in my head as the figure illustrated below. In my experience the cost of fixing a bug discovered in production is significantly higher than if we manage to catch it during development.

I guess there are plenty more benefits of automated system testing. Please leave me a comment to let me know what I left out.

If you enjoyed this blog post, you would probably also like the one I did about TDD by example. TDD is a great way to make sure you are testing your code.

TDD by example – Factorial

Test-driven development is a development style with a simple process:

1. Write a failing test
2. Write code that corrects the failing test
3. Clean up your code
> Go to 1.

The goal of TDD is not writing the tests first; it’s a design process where you iterate over the design as you develop new code. Instead of doing a full upfront design, you design little by little as you need more functionality. Writing the test first is just a tool to force you to focus on a small part of the code and how that part should work, and to improve the design all the time.

Generally, doing TDD at least guarantees tests and testability. It does not provide efficiency or quality in itself (I have not found any documented evidence of this). But it does provide some minimum level of quality and encourages developers to think about the problem in a modularized way. It also makes sure that developers write tests. We all know that writing tests after committing the code is hard because of all those excuses: late Friday night, pressure, the sprint is ending and we just need some other functionality done, etc.

Another big benefit of TDD is that it helps eliminate the waste created by developers implementing stuff that might be useful some day. No code can be written before there is a test case requiring that functionality.

You can find a full description of TDD on Wikipedia.

Benefits of TDD:

TDD will make sure your code is testable and well tested. The high test coverage will form excellent documentation of your code.

  • automatically gives you testable code, by definition
  • ensures high test coverage
  • eliminates waste from implementing stuff that might be useful some time
  • shorter feedback loop
  • higher confidence in the code
  • makes you focus on smaller parts of the problem
  • forces you to think about the API’s before implementing code.
  • helps with modularizing of the code
  • will make it easier to refactor your code, because of the high test coverage
  • makes you iterate and improve the design throughout the whole development process
  • high test-coverage provides excellent documentation of your code

Shortcomings of TDD

A higher number of tests cannot guarantee higher code quality; it can only provide you with a minimum level of quality in the resulting product.

  • can be time-consuming, especially in the beginning
  • can be hard, especially when dealing with frameworks which put constraints on your code
  • done wrong: can make it hard to change the code, because there are so many tests everywhere verifying every little part of your code
  • can be hard to prove that it actually is more cost-effective, especially in the beginning
  • can make you less productive, especially if you follow a very strict TDD model where you only make the smallest change possible to satisfy a failing test. When I do TDD, I often feel that I could have solved a larger part of the problem at once.

TDD in action

Now, after providing some background, let’s start doing TDD, iteration by iteration. I will show you all the steps required.

Problem description

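The task is to implement a factorial method in Java. The factorial of a non-negative integer n is the product of all positive integers up to n: n! = n x (n-1) x ... x 2 x 1, with 0! defined as 1.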

Examples:

0! = 1
1! = 1
2! = 2 x 1 = 2
3! = 3 x 2 x 1 = 6

Limitations: We will limit ourselves to the primitive int type in Java. This simplifies our problem, but limits the result to a 32-bit signed number. The largest factorial that fits is 12!, so we will only go up to 10!.

Iteration 1

Let’s start simple: 0! should be 1.

Write test

We write our first test-method:

    @Test
    public void shouldReturnOneWhenZeroIn(){
        assertEquals(1, factorial(0));
    }

As we write this test we will get a compilation error complaining about the missing factorial method. I auto-generate this method in my IDE, which gives:

    private int factorial(int i) {
        return 0;
    }

I execute the test and bam:

junit.framework.AssertionFailedError:
Expected :1
Actual :0

Fix the failing test

Let’s just fix it:

    private int factorial(int i) {
        return 1;
    }

Hurray our test is now running green!

Clean up code

I don’t see any reason to clean up the code yet. It’s simple and it solves the test case.

Iteration 2

Ok, let’s expand: 1! should also be 1.

Write test

We write our second test method:

    @Test
    public void shouldReturnOneWhenOneIn(){
        assertEquals(1, factorial(1));
    }

What, the test went green? I guess our previous implementation already covers this case, so let’s just head on to the next test case.

Iteration 3

Ok, in this iteration we want to make sure that 2! should be 2.

Write test

    @Test
    public void shouldReturnTwo() {
        assertEquals(2, factorial(2));
    }

Fix the failing test

    private int factorial(int i) {
        if(i < 2) return 1; 
        else return 2;
    }

Hurray, our tests are now green again.

Clean up code

I clean up the code by adding curly-braces.

    private int factorial(int i) {
        if (i < 2) {
            return 1;
        } else {
            return 2;
        }
    }

Iteration 4

In this iteration we want to make sure that 3! equals 6.

Write test

    @Test
    public void shouldReturnSix() {
        assertEquals(6, factorial(3));
    }

I execute it and verify that the test is failing.

Fix the failing test

    private int factorial(int i) {
        if (i < 2) {
            return 1;
        }

        if (i == 2) {
            return 1*i;
        } else {
            return 1*2*i;
        }
    }

Hmm.. do we start to see a pattern?

Clean up code

No clean-ups this time; I am happy with the code as it is.

Iteration 5

Write test

@Test
public void shouldReturnTwentyFour() {
  assertEquals(24, factorial(4));
}

Fix the failing test

We started to see a pattern in the last iteration. Let’s try a recursive function where we multiply i by factorial(i-1).

private int factorial(int i) {
  if (i < 2) {
    return 1;
  }
  return factorial(i-1)*i;
}

Hurray, it worked.

Iteration 6

Using a calculator, I find that 10! = 3628800. Let’s write a test for it.

Write test

    @Test
    public void shouldFindCorrectFactorialFor10() {
        assertEquals(3628800, factorial(10));
    }

Yes, my code returns the correct value.

Clean up test-code

Our test cases also contain duplicated code. We can remove this duplication by using a two-dimensional array to represent the input/output pairs.

    @Test
    public void shouldReturnCorrectFactorialValue() {
        int[][] values = {{0,1}, {1,1}, {2,2}, {3,6}, {4,24}, {10, 3628800}};
        for(int[] value: values) {
            assertEquals(value[1], factorial(value[0]));
        }
    }

For me this feels natural. I do this to avoid having a separate test method for every input value we are testing.
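A side note on the int limitation from the problem description: the sketch below shows what happens beyond 12!. The first method is the recursive solution from the last iteration, which silently wraps around on overflow. The second is a variant of my own (not part of the TDD exercise) that uses Math.multiplyExact to fail loudly instead. The FactorialLimits class name is an assumption.

```java
class FactorialLimits {
    // Same recursion as in the post: plain int multiplication wraps on overflow.
    static int factorial(int i) {
        if (i < 2) {
            return 1;
        }
        return factorial(i - 1) * i;
    }

    // Variant that throws ArithmeticException when the result no longer fits in an int.
    static int factorialExact(int i) {
        if (i < 2) {
            return 1;
        }
        return Math.multiplyExact(factorialExact(i - 1), i);
    }
}
```

With this, factorialExact(12) still returns 479001600, while factorialExact(13) throws, because 13! = 6227020800 is larger than Integer.MAX_VALUE.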

Summary

As you saw, we quickly found the recursive solution to this problem. Because the factorial operation is a well-known problem, I would probably have headed straight to a similar solution without a test-first approach.

The point of this example is just to show the process and how it is performed. The benefits are generally more “visible” when the task is larger and more complex, where it is harder to see upfront all the challenges that must be solved.

TDD gives us tested code, with a shorter feedback loop, higher confidence in the code and hope of code quality and improved design. At least the developers have been forced to implement with separation of concerns in mind. More tests do not provide quality in themselves; it all comes back to highly skilled developers.