
Friday, 15 April 2016

JLBH Examples 1 - Why Code Should be Benchmarked in Context

In this post:

  • A side by side example using JMH and JLBH for Date serialisation
  • Measuring Date serialisation in a microbenchmark
  • Measuring Date serialisation as part of a proper application
  • How to add a probe to your JLBH benchmark
  • Understanding the importance of measuring code in context


In the last post, 'Introducing JLBH', we introduced JLBH, the latency testing tool that Chronicle used to test Chronicle-FIX and which is now available as open source.

In the next few posts we are going to look at some example applications.

All the code for the examples can be found here in my GitHub project.

One of the points I made in the introduction to JLBH was that it is important to benchmark code in context. That means benchmarking code in an environment as close to how it will be running in real life as possible. This post demonstrates this in practice.

Let's look at a relatively expensive Java operation - Date Serialisation - and see how long it takes:

First here's a JMH benchmark:
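(The benchmark class isn't reproduced in full here; the sketch below is a minimal stand-in that keeps the DateSerialiseJMH.test naming from the results but uses plain ObjectOutputStream serialisation of a java.util.Date as the operation under test, so treat the body as illustrative rather than the exact code from the GitHub project.)

import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.util.Date;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@State(Scope.Thread)
@BenchmarkMode(Mode.SampleTime)           // sample mode produces the percentile breakdown shown below
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class DateSerialiseJMH {

    private final Date date = new Date();

    @Benchmark
    public byte[] test() throws Exception {
        // Stand-in operation: serialise the Date and return the bytes so the
        // JIT cannot optimise the work away.
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(baos)) {
            oos.writeObject(date);
        }
        return baos.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        Options opt = new OptionsBuilder()
                .include(DateSerialiseJMH.class.getSimpleName())
                .warmupIterations(3)
                .measurementIterations(5)
                .forks(1)
                .build();
        new Runner(opt).run();
    }
}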



Running on my laptop (MBP i7) these are the results I get:

Result "test":
  4.578 ±(99.9%) 0.046 us/op [Average]
  (min, avg, max) = (3.664, 4.578, 975.872), stdev = 6.320
  CI (99.9%): [4.533, 4.624] (assumes normal distribution)
  Samples, N = 206803
        mean =      4.578 ±(99.9%) 0.046 us/op
         min =      3.664 us/op
  p( 0.0000) =      3.664 us/op
  p(50.0000) =      4.096 us/op
  p(90.0000) =      5.608 us/op
  p(95.0000) =      5.776 us/op
  p(99.0000) =      8.432 us/op
  p(99.9000) =     24.742 us/op
  p(99.9900) =    113.362 us/op
  p(99.9990) =    847.245 us/op
  p(99.9999) =    975.872 us/op
         max =    975.872 us/op


# Run complete. Total time: 00:00:21

Benchmark                Mode     Cnt  Score   Error  Units

DateSerialiseJMH.test  sample  206803  4.578 ± 0.046  us/op

A mean time of about 4.5us for the operation.

We get pretty much the same results running the test with JLBH:
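(Again, a minimal sketch rather than the exact code from the GitHub project: the same stand-in serialisation wrapped in a JLBHTask. The sampleNanos() callback and the JLBHOptions builder-method names follow the description in the 'Introducing JLBH' post and may be spelled slightly differently, or live in different packages, depending on your Chronicle-Core version.)

import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.util.Date;

import net.openhft.chronicle.core.jlbh.JLBH;
import net.openhft.chronicle.core.jlbh.JLBHOptions;
import net.openhft.chronicle.core.jlbh.JLBHTask;

public class DateSerialiseJLBH implements JLBHTask {

    private final Date date = new Date();
    private JLBH jlbh;

    @Override
    public void init(JLBH jlbh) {
        this.jlbh = jlbh;
    }

    @Override
    public void run(long startTimeNS) {
        try {
            // The same stand-in operation as in the JMH version above.
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(baos)) {
                oos.writeObject(date);
            }
        } catch (Exception e) {
            throw new AssertionError(e);
        }
        // Record the end-to-end latency for this iteration (depending on the
        // Chronicle-Core version this call may be named sample(...) instead).
        jlbh.sampleNanos(System.nanoTime() - startTimeNS);
    }

    public static void main(String[] args) {
        JLBHOptions options = new JLBHOptions()
                .warmUpIterations(400_000)
                .iterations(1_000_000)
                .throughput(100_000)      // 100k/s = 1 message every 10us, as in the results below
                .runs(3)
                .recordOSJitter(true)
                .accountForCoordinatedOmission(true)
                .jlbhTask(new DateSerialiseJLBH());
        new JLBH(options).start();
    }
}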


These are the results:

Warm up complete (400000 iterations took 2.934s)
-------------------------------- BENCHMARK RESULTS (RUN 1) ---------
Run time: 10.0s
Correcting for co-ordinated:true
Target throughput:100000/s = 1 message every 10us
End to End: (1,000,000)                         50/90 99/99.9 99.99/99.999 - worst was 4.2 / 5.8  352 / 672  803 / 901 - 934
OS Jitter (13,939)                              50/90 99/99.9 99.99 - worst was 8.4 / 17  639 / 4,130  12,850 - 20,450
--------------------------------------------------------------------
-------------------------------- BENCHMARK RESULTS (RUN 2) ---------
Run time: 10.0s
Correcting for co-ordinated:true
Target throughput:100000/s = 1 message every 10us
End to End: (1,000,000)                         50/90 99/99.9 99.99/99.999 - worst was 4.2 / 5.8  434 / 705  836 / 934 - 967
OS Jitter (11,016)                              50/90 99/99.9 99.99 - worst was 8.4 / 17  606 / 770  868 - 1,340
--------------------------------------------------------------------
-------------------------------- BENCHMARK RESULTS (RUN 3) ---------
Run time: 10.0s
Correcting for co-ordinated:true
Target throughput:100000/s = 1 message every 10us
End to End: (1,000,000)                         50/90 99/99.9 99.99/99.999 - worst was 4.2 / 5.8  434 / 737  901 / 999 - 1,030
OS Jitter (12,319)                              50/90 99/99.9 99.99 - worst was 8.4 / 15  573 / 737  803 - 901
--------------------------------------------------------------------
-------------------------------- SUMMARY (end to end)---------------
Percentile   run1         run2         run3      % Variation   
50:             4.22         4.22         4.22         0.00    
90:             5.76         5.76         5.76         0.00    
99:           352.26       434.18       434.18         0.00    
99.9:         671.74       704.51       737.28         3.01    
99.99:        802.82       835.58       901.12         4.97    
worst:        901.12       933.89       999.42         4.47    
--------------------------------------------------------------------

A mean time of about 4.2us for the operation.

Note: this is a case where there is no advantage in using JLBH over JMH. I just include the code as a comparison.

Now we're going to run exactly the same operation but this time inside a TCP call. The code works like this:
  1. Client sends the server a FIX message over TCP loopback (localhost)
  2. Server reads the message
  3. Server does the date serialisation
  4. Server returns a message to the client
As explained in the previous post, JLBH allows us to produce a latency profile for any part of the code. We will add a probe for stage 3, as sketched below.
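
(The full client/server source is in the GitHub project; the condensed sketch below only shows the shape of the benchmark and, in particular, how the probe for stage 3 is added. It assumes jlbh.addProbe(String) returns a NanoSampler, as described in the previous post; the blocking-socket plumbing and the stand-in message are mine, not the project's actual code.)

import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.ObjectOutputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.Date;

import net.openhft.chronicle.core.jlbh.JLBH;
import net.openhft.chronicle.core.jlbh.JLBHTask;
import net.openhft.chronicle.core.util.NanoSampler;

public class DateSerialiseTcpSketch implements JLBHTask {

    private JLBH jlbh;
    private NanoSampler dateProbe;
    private DataOutputStream toServer;
    private DataInputStream fromServer;

    @Override
    public void init(JLBH jlbh) {
        this.jlbh = jlbh;
        dateProbe = jlbh.addProbe("date serialisation");
        try {
            ServerSocket serverSocket = new ServerSocket(0);

            Thread server = new Thread(() -> {
                try (Socket socket = serverSocket.accept()) {
                    DataInputStream in = new DataInputStream(socket.getInputStream());
                    DataOutputStream out = new DataOutputStream(socket.getOutputStream());
                    Date date = new Date();
                    while (true) {
                        in.readUTF();                                    // stage 2: read the message
                        long start = System.nanoTime();
                        ByteArrayOutputStream baos = new ByteArrayOutputStream();
                        try (ObjectOutputStream oos = new ObjectOutputStream(baos)) {
                            oos.writeObject(date);                       // stage 3: the same stand-in serialisation
                        }
                        dateProbe.sampleNanos(System.nanoTime() - start);
                        out.writeUTF("ack");                             // stage 4: reply to the client
                        out.flush();
                    }
                } catch (Exception e) {
                    // socket closed when the benchmark finishes
                }
            }, "server");
            server.setDaemon(true);
            server.start();

            Socket client = new Socket("localhost", serverSocket.getLocalPort());
            client.setTcpNoDelay(true);
            toServer = new DataOutputStream(client.getOutputStream());
            fromServer = new DataInputStream(client.getInputStream());
        } catch (Exception e) {
            throw new AssertionError(e);
        }
    }

    @Override
    public void run(long startTimeNS) {
        try {
            toServer.writeUTF("35=D|placeholder");   // stage 1: a stand-in for the FIX message
            toServer.flush();
            fromServer.readUTF();                    // wait for the server's response
            jlbh.sampleNanos(System.nanoTime() - startTimeNS);
        } catch (Exception e) {
            throw new AssertionError(e);
        }
    }
}

It would be launched with a JLBHOptions set-up like the one above, but with throughput(20_000) and iterations(100_000) to match the results below.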



This time the results look like this:

Warm up complete (50000 iterations took 3.83s)
-------------------------------- BENCHMARK RESULTS (RUN 1) ------------------------
Run time: 6.712s
Correcting for co-ordinated:true
Target throughput:20000/s = 1 message every 50us
End to End: (100,000)                           50/90 99/99.9 99.99 - worst was 822,080 / 1,509,950  1,711,280 / 1,711,280  1,711,280 - 1,711,280
date serialisation  (100,000)                   50/90 99/99.9 99.99 - worst was 11 / 19  31 / 50  901 - 2,420
OS Jitter (64,973)                              50/90 99/99.9 99.99 - worst was 8.1 / 16  40 / 1,540  4,850 - 18,350
--------------------------------------------------------------------
-------------------------------- BENCHMARK RESULTS (RUN 2) ---------
Run time: 6.373s
Correcting for co-ordinated:true
Target throughput:20000/s = 1 message every 50us
End to End: (100,000)                           50/90 99/99.9 99.99 - worst was 1,107,300 / 1,375,730  1,375,730 / 1,375,730  1,375,730 - 1,375,730
date serialisation  (100,000)                   50/90 99/99.9 99.99 - worst was 11 / 19  29 / 52  901 - 1,670
OS Jitter (40,677)                              50/90 99/99.9 99.99 - worst was 8.4 / 16  34 / 209  934 - 1,470
--------------------------------------------------------------------
-------------------------------- BENCHMARK RESULTS (RUN 3) ---------
Run time: 5.333s
Correcting for co-ordinated:true
Target throughput:20000/s = 1 message every 50us
End to End: (100,000)                           50/90 99/99.9 99.99 - worst was 55,570 / 293,600  343,930 / 343,930  343,930 - 343,930
date serialisation  (100,000)                   50/90 99/99.9 99.99 - worst was 9.0 / 16  26 / 38  770 - 1,030
OS Jitter (32,042)                              50/90 99/99.9 99.99 - worst was 9.0 / 13  22 / 58  737 - 934
--------------------------------------------------------------------
-------------------------------- SUMMARY (end to end)---------------
Percentile   run1         run2         run3      % Variation   
50:        822083.58   1107296.26     55574.53        92.66    
90:       1509949.44   1375731.71    293601.28        71.07    
99:       1711276.03   1375731.71    343932.93        66.67    
99.9:     1711276.03   1375731.71    343932.93        66.67    
99.99:    1711276.03   1375731.71    343932.93        66.67    
worst:    1711276.03   1375731.71    343932.93        66.67    
--------------------------------------------------------------------
-------------------------------- SUMMARY (date serialisation )------
Percentile   run1         run2         run3      % Variation   
50:            11.01        11.01         8.96        13.22    
90:            18.94        18.94        15.62        12.44    
99:            31.23        29.18        26.11         7.27    
99.9:          50.18        52.22        37.89        20.14    
99.99:        901.12       901.12       770.05        10.19    
worst:       2424.83      1671.17      1032.19        29.21    

--------------------------------------------------------------------

As can be seen, the very same Date serialisation takes over twice as long, going from ~4.5us to ~10us.

This isn't really the place to go into too much detail about why the code takes longer to run in context, but it comes down to CPU caches being filled up between calls to the date serialisation.

When all we are running (as in the micro benchmark) is the Date serialisation, it fits nicely into the CPU caches and never gets evicted. However, when there is a gap between calls to the Date serialisation, the code and data for the operation get evicted and have to be reloaded.

JLBH allows you to benchmark code in context and that's an important part of latency benchmarking.

Thursday, 14 April 2016

JLBH - Introducing Java Latency Benchmarking Harness

In this post:
  • What is JLBH
  • Why did we write JLBH
  • Differences between JMH and JLBH
  • Quick start guide

What is JLBH?

JLBH is a tool that can be used to measure latency in Java programs. It has these features:
  • Aimed at running code that would be larger than a micro benchmark.
  • Suitable for programs that use asynchronous activity like the producer consumer pattern. 
  • Ability to benchmark individual points within the program
  • Ability to adjust the throughput into the benchmark
  • Adjusts for coordinated omission, i.e. the end-to-end latency of an iteration reflects any delay caused by earlier iterations backing up
  • Runs its own jitter thread and reports on it

Why did we write JLBH?

JLBH was written because we needed a way of benchmarking Chronicle-FIX. We created it to benchmark and diagnose issues in our software. It has proved extremely useful and it is now available in the Chronicle open source libraries.

Chronicle-FIX is an ultra low latency Java FIX engine. It guarantees latencies; for example, parsing a NewOrderSingle message into the object model will not exceed 6us all the way to the 99.9th percentile. In fact we needed to measure all the way along the percentile range. This is a typical latency/percentile profile:

50     -> 1.5us
90     -> 2us
99     -> 2us
99.9   -> 6us
99.99  -> 12us
99.999 -> 35us
Worst  -> 500us

Chronicle-FIX guarantees these latencies at throughputs ranging from 10k messages/second to 100k messages/second, so we needed a test harness where we could easily vary the throughput.

We also needed to account for co-ordinated omission. In other words, we couldn't just ignore the effect of a slow run on the following run. If run A was slow and that caused run B to be delayed, then even if run B had no latency of its own, the fact that it was delayed still had to be recorded.
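
To make the idea concrete, here is a minimal sketch (not JLBH's actual implementation): latency is measured from the time each iteration was scheduled to start at the target throughput, rather than from when it actually started, so an iteration that is delayed by an earlier slow one is charged for the wait.

public class CoordinatedOmissionSketch {

    public static void main(String[] args) {
        int targetThroughput = 100_000;                      // messages per second
        long intervalNS = 1_000_000_000L / targetThroughput;
        long scheduledNS = System.nanoTime();

        for (int i = 0; i < 1_000_000; i++) {
            scheduledNS += intervalNS;
            while (System.nanoTime() < scheduledNS) {
                // busy-wait until this iteration is due to start
            }
            doWork();
            // Measuring from scheduledNS rather than the actual start time means
            // that if doWork() overran earlier, the backlog shows up as latency here.
            long latencyNS = System.nanoTime() - scheduledNS;
            recordSample(latencyNS);
        }
    }

    private static void doWork() {
        // the operation being benchmarked would go here
    }

    private static void recordSample(long latencyNS) {
        // in JLBH this feeds a histogram; omitted in this sketch
    }
}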

We needed to try to differentiate between OS jitter, JVM jitter, and jitter caused by our own code. For that reason we added the option to have a jitter thread that did nothing but sample jitter in the JVM. This shows up a combination of OS jitter (e.g. thread scheduling and general OS interrupts) and global JVM events such as GC pauses.
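
The idea is simple enough to sketch (again, not JLBH's actual code): a thread that does nothing but read the clock, so any large gap between two consecutive reads must have been imposed by the OS or by a global JVM event.

public class JitterSamplerSketch {

    public static void main(String[] args) throws InterruptedException {
        long[] worstGapNS = {0};

        Thread jitter = new Thread(() -> {
            long last = System.nanoTime();
            while (!Thread.currentThread().isInterrupted()) {
                long now = System.nanoTime();
                long gapNS = now - last;
                if (gapNS > worstGapNS[0]) {
                    worstGapNS[0] = gapNS;    // in JLBH each sample feeds a latency histogram
                }
                last = now;
            }
        }, "jitter-sampler");
        jitter.setDaemon(true);
        jitter.start();

        Thread.sleep(1_000);                  // sample for a second
        System.out.println("Worst jitter seen: " + worstGapNS[0] + " ns");
    }
}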

We needed to attribute the latencies as accurately as possible to individual routines or even lines of code; for this reason we also created the possibility of adding custom sampling into the program. The addition of NanoSamplers adds very little overhead to the benchmark and allows you to observe where your program introduces latency.

This is a schematic view of the benchmark we constructed to measure Chronicle-FIX.



We ended up with results like these:

This was a typical run:

Run time: 100.001s
Correcting for co-ordinated:true
Target throughput:50000/s = 1 message every 20us
End to End: (5,000,000)                         50/90 99/99.9 99.99/99.999 - worst was 11 / 15  17 / 20  121 / 385 - 541
Acceptor:1 init2AcceptNetwork (4,998,804)       50/90 99/99.9 99.99/99.999 - worst was 9.0 / 13  15 / 17  21 / 96 - 541
Acceptor:1.1 init2AcceptorNetwork(M) (1,196)    50/90 99/99.9 99.99 - worst was 22 / 113  385 / 401  401 - 401
Acceptor:2 socket->parse (4,998,875)            50/90 99/99.9 99.99/99.999 - worst was 0.078 / 0.090  0.11 / 0.17  1.8 / 2.1 - 13
Acceptor:2.0 remaining after read (20,649,126)  50/90 99/99.9 99.99/99.999 99.9999/worst was 0.001 / 0.001  0.001 / 0.001  0.001 / 1,800  3,600 / 4,590
Acceptor:2.1 parse initial (5,000,100)          50/90 99/99.9 99.99/99.999 - worst was 0.057 / 0.061  0.074 / 0.094  1.0 / 1.9 - 4.7
Acceptor:2.5 write To Queue (5,000,100)         50/90 99/99.9 99.99/99.999 - worst was 0.39 / 0.49  0.69 / 2.1  2.5 / 3.4 - 418
Acceptor:2.9 end of inital parse (5,000,000)    50/90 99/99.9 99.99/99.999 - worst was 0.17 / 0.20  0.22 / 0.91  2.0 / 2.2 - 7.6
Acceptor:2.95 on mid (5,000,000)                50/90 99/99.9 99.99/99.999 - worst was 0.086 / 0.10  0.11 / 0.13  1.4 / 2.0 - 84
Acceptor:3 parse NOS (5,000,000)                50/90 99/99.9 99.99/99.999 - worst was 0.33 / 0.38  0.41 / 2.0  2.2 / 2.6 - 5.5
Acceptor:3.5 total parse (5,000,000)            50/90 99/99.9 99.99/99.999 - worst was 1.1 / 1.2  1.8 / 3.0  3.5 / 5.8 - 418
Acceptor:3.6 time on server (4,998,804)         50/90 99/99.9 99.99/99.999 - worst was 1.1 / 1.2  1.8 / 3.1  3.8 / 6.0 - 418
Acceptor:4 NOS processed (5,000,000)            50/90 99/99.9 99.99/99.999 - worst was 0.21 / 0.23  0.34 / 1.9  2.1 / 2.8 - 121
Jitter (5,000,000)                              50/90 99/99.9 99.99/99.999 - worst was 0.035 / 0.035  0.035 / 0.037  0.75 / 1.1 - 3.3
OS Jitter (108,141)                             50/90 99/99.9 99.99 - worst was 1.2 / 1.4  2.5 / 4.5  209 - 217


All samples are summarised across runs at the end of the benchmark; here are a couple:

-------------------------------- SUMMARY (Acceptor:2.95 on mid)----------------------
Percentile     run1      run2      run3   run4         run5     % Variation var(log)
50:           0.09      0.09      0.09    0.09         0.09         0.00        3.32
90:           0.10      0.10      0.10    0.10         0.10         0.00        3.58
99:           0.11      0.11      0.11    0.11         0.11         2.45        3.69
99.9:         0.13      0.13      0.62    0.78         0.13        76.71        6.01
99.99:        1.50      1.38      1.82    1.89         1.70        19.88        9.30
worst:        1.95      2.02      2.11    2.24         2.24         6.90        9.90
-------------------------------------------------------------------------------------
-------------------------------- SUMMARY (Acceptor:3 parse NOS)----------------------
Percentile   run1   run2    run3    run4         run5      % Variation   var(log)
50:          0.33   0.33    0.34    0.36         0.36         6.11        5.75
90:          0.38   0.38    0.46    0.46         0.46        12.42        6.24
99:          0.41   0.41    0.50    0.53         0.50        16.39        6.47
99.9:        2.11   2.02    2.11    2.11         2.11         3.08        9.76
99.99:       2.37   2.24    2.37    2.37         2.37         3.67       10.05
worst:       2.88   2.62    3.14    3.14         2.88        11.51       10.67

-------------------------------------------------------------------------------------

Using JLBH we were able both to benchmark our application against the criteria in the specification and to diagnose some of the latency spikes.

By varying the throughput and the run time of the benchmark, and especially by adding sampling at various points in the code, patterns started to emerge which led us to the source of the latency. A particular example of this was an issue with DateTimeFormatter causing a TLB cache miss, but that will be the subject of another post.

Differences between JMH and JLBH

I would expect most of those reading this article to be familiar with JMH (the Java Microbenchmark Harness). It is an excellent tool for micro benchmarks and, if you haven't already used it, it's one every Java developer should have in their locker, especially those concerned with measuring latencies.

As you will see from the design of JLBH, much of it was inspired by JMH.

So if JMH is so great why did we have to create another benchmark harness? 

I guess at a high level the answer is in the name.  JMH is squarely aimed at micro benchmarks whilst JLBH is there to find latencies in larger programs.

But it's not just that. After reading the last section you will see that there are a number of reasons you might want to choose JLBH over JMH for a certain class of problem.

By the way, although you can always use JLBH instead of JMH, if you have a genuine micro benchmark that you want measured as cleanly and as exactly as possible, I would always recommend JMH over JLBH. JMH is an extremely sophisticated tool and does what it does really well; for example, JMH forks JVMs for each run, which at the present moment JLBH does not.

When you would use JLBH over JMH:
  • If you want to see your code running in context. (See JLBH Examples 1 - Why Code Should be Benchmarked in Context for a full treatment of this concept with examples.) The nature of JMH is to take a very small sample of your code, let's say in the case of a FIX engine just the parsing, and time it in isolation. In our tests the exact same FIX parsing took over twice as long when run in context (i.e. as part of the FIX engine) as it did when run out of context (i.e. in a micro benchmark). I have a good example of this in my Latency examples project, DateSerialise, where I demonstrate that serialising a Date object can take twice as long when run inside a TCP call. The reason for this is all to do with CPU caches, something we will return to in a later blog.
  • If you want to take into account coordinated omission. In JMH, by design, all iterations are independent of each other, so if one iteration of the code is slow it will have no effect on the next one. We can see a good example of this in my Latency examples SimpleSpike, where we see the huge effect that accounting for coordinated omission can have. Real world examples should almost always be measured accounting for coordinated omission.
    For example let's imagine you are waiting for a train and get delayed in the station for an hour because the train in front of you was late. Let's then imagine you get on the train an hour late and the train takes its usual half an hour to reach its destination. If you don't account for coordinated omission you will not consider yourself to have suffered any delay, as your journey took exactly the correct amount of time, even though you waited for an hour at the station before departing!
  • If you want to vary the throughput into your test. JLBH allows you to set the throughput as a parameter to your benchmark. The truth is that latency makes little sense without a defined throughput, so it's extremely important that you are able to see the effect of varying the throughput on your latency profile. JMH does not allow you to set throughput. (In fact this goes hand in hand with the fact that JMH does not account for coordinated omission.)
  • You want to be able to sample various points within your code. An end to end latency is great as a start, but then what? You need to be able to record a latency profile for many points within the code. With JLBH you can add probes into your code wherever you choose, at very little overhead to the program. JMH is designed so that you only measure from the start of your method (@Benchmark) to the end.
  • You want to measure OS and JVM global latencies. JLBH runs a separate jitter thread. This runs in parallel to your program and does nothing but sample latency by repeatedly calling System.nanoTime(). Whilst this doesn't in and of itself tell you all that much, it can be indicative of what is going on inside your JVM during the benchmark. Additionally you can add a probe which does nothing (this will be explained later on) where you can sample latency inside the thread that runs the code you are benchmarking. JMH does not have this sort of functionality.
As I mentioned earlier, if you don't want to use one or more of these functionalities then favour JMH over JLBH.

Quick Start Guide

The code for JLBH is part of the Chronicle-Core library, which can be found on GitHub over here.

To download it from Maven Central, include this in your pom.xml (check for the latest version):


<dependency>
    <groupId>net.openhft</groupId>
    <artifactId>chronicle-core</artifactId>
    <version>1.4.7</version>
</dependency>

To write a benchmark you have to implement the JLBHTask interface.

It has just two methods you need to implement:
  • init(JLBH jlbh): you are passed a reference to JLBH, which you will need to call back on (jlbh.sampleNanos()) when your benchmark iteration is complete.
  • run(long startTime): the code to run on each iteration. You will need the start time to work out how long your benchmark iteration has taken, and then call back on jlbh.sampleNanos(). JLBH counts the number of times sampleNanos() is called, and it must exactly match the number of times run() is called; this is not the case for other probes you can create.
  • There is a third, optional, method complete() that might be useful for cleanup in certain benchmarks.
All this is best seen in a simple example:

In this case we measure how long it takes to put an item on an ArrayBlockingQueue and to take it off again.

We add probes to see how long the calls to put() and poll() take, as in the sketch below.
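
The example class itself isn't embedded here, so the following is a sketch of a benchmark of that shape; as before, the builder-method names on JLBHOptions (including accountForCoordinatedOmission) and the package locations follow this post's description and should be checked against your Chronicle-Core version.

import java.util.concurrent.ArrayBlockingQueue;

import net.openhft.chronicle.core.jlbh.JLBH;
import net.openhft.chronicle.core.jlbh.JLBHOptions;
import net.openhft.chronicle.core.jlbh.JLBHTask;
import net.openhft.chronicle.core.util.NanoSampler;

public class SimpleQueueBenchmark implements JLBHTask {

    private final ArrayBlockingQueue<Long> queue = new ArrayBlockingQueue<>(2);
    private JLBH jlbh;
    private NanoSampler putProbe;
    private NanoSampler pollProbe;

    @Override
    public void init(JLBH jlbh) {
        this.jlbh = jlbh;
        putProbe = jlbh.addProbe("put");
        pollProbe = jlbh.addProbe("poll");
    }

    @Override
    public void run(long startTimeNS) {
        try {
            long beforePut = System.nanoTime();
            queue.put(startTimeNS);                              // put the item on the queue
            putProbe.sampleNanos(System.nanoTime() - beforePut);

            long beforePoll = System.nanoTime();
            queue.poll();                                        // and take it off again
            pollProbe.sampleNanos(System.nanoTime() - beforePoll);

            jlbh.sampleNanos(System.nanoTime() - startTimeNS);   // end-to-end latency
        } catch (InterruptedException e) {
            throw new AssertionError(e);
        }
    }

    public static void main(String[] args) {
        JLBHOptions options = new JLBHOptions()
                .warmUpIterations(40_000)
                .iterations(100_000)
                .throughput(10_000)                              // try varying this
                .runs(3)
                .recordOSJitter(true)
                .accountForCoordinatedOmission(true)             // try flipping this to false
                .jlbhTask(new SimpleQueueBenchmark());
        new JLBH(options).start();
    }
}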

I would encourage you to run this, varying the throughput and the size of the ArrayBlockingQueue, and see what difference it makes.

You can also see the difference it makes if you set accountForCoordinatedOmission to true or false.



Take a look at all the options with which you can set up your JLBH benchmark; they are contained in JLBHOptions.

In the next posts we will look at some more examples of JLBH benchmarks.

Please let me know if you have any feedback on JLBH - if you want to contribute feel free to fork Chronicle-Core and issue a pull request!

Monday, 11 January 2016

Writing 2 Characters into a Single Java char

Here's another nice trick we used when creating the ultra low latency Chronicle FIX-Engine.

When it comes to reading data off a stream of bytes, it's way more efficient, if possible, to store the data in a char rather than having to read it into a String. (At the very least you are avoiding creating a String object, although this can be mitigated by using a cache or by working with CharSequence rather than String, but that's the subject of another post.)

Using JMH benchmarks I've found these timings: (I haven't included the source code for this as this is going to be the subject of another post where I describe the different methodologies in more detail).

Reading 2 ascii characters off a bytes stream into:

String - 34.48ns
Pooled String - 28.57ns
StringBuilder - 21.27ns
char (using 2 chars method) - 6.75ns

(For comparison, reading a single char takes 3.27ns.)

The point is that it takes at least 3 times longer to read data into a String than a char, and that doesn't even take into account the garbage created.

So it goes without saying that when you know that you are expecting data that is always a single character, rather than reading that data into a String variable you should read it into a char.

Now what if you know that the data you are expecting on the stream is no more than 2 characters? (You find this situation, for example, in FIX 5.0 tag 35, msgType.) Do you have to use a String so that you can accommodate the extra character? At first thought it appears so; after all, a char can only contain a single character.

Or can it? 

A Java char is made up of 2 bytes, not one. Therefore if you know that your data is made up of ASCII characters, you know that only a single byte (of the 2 bytes in the char) will be used. For example, 'A' is 65 through to 'z' which is 122.

You can print out the values that fit into a single byte with this simple loop:

for (int i = 0; i < 256; i++) {
    char c = (char) i;
    System.out.println(i + ":" + c);
}

You are now free to use the other byte of the char to hold the second ASCII character.

This is the way to do it: 

In this example you have read 2 bytes, 'a' and 'b', and want to store them in a single char.


byte a = (byte)'a';
byte b = (byte)'b';
//Now place a and b into a single char
char ab = (char)((a << 8) + b);
//To retrieve the bytes individually see the code below
System.out.println((char) (ab >> 8) + "" + (char) (ab & 0xff));

To better understand this let's look at the binary:

byte a = (byte)'a'; // 01100001
byte b = (byte)'b'; // 01100010

As you can see below, when viewed as a char the top 8 bits are not being used:

char ca = 'a'; // 00000000 01100001
char cb = 'b'; // 00000000 01100010

Combine the characters, with a taking the top 8 bits and b the bottom 8 bits:
char ab = (char)((a << 8) + b); // 01100001 01100010
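
Putting it together, here is a small, hypothetical helper (the class and method names are mine, not from the FIX engine) showing how a one- or two-character ASCII field such as FIX tag 35 can be packed into a single char and then dispatched on with a switch, with no String in sight:

public final class TwoCharPacking {

    // Pack a one- or two-character ASCII value into a single char.
    // A single character is left in the low byte, so 'D' packs to 'D'.
    static char pack(CharSequence s) {
        if (s.length() == 1)
            return s.charAt(0);
        return (char) (((byte) s.charAt(0) << 8) + (byte) s.charAt(1));
    }

    public static void main(String[] args) {
        char msgType = pack("AE");                     // a two-character MsgType

        switch (msgType) {
            case 'D':                                  // single-character MsgType
                System.out.println("NewOrderSingle");
                break;
            case ('A' << 8) + 'E':                     // two characters packed into one char
                System.out.println("TradeCaptureReport");
                break;
            default:
                System.out.println("other: "
                        + (char) (msgType >> 8) + (char) (msgType & 0xff));
        }
    }
}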

Summary

It's more efficient to read data into a char rather than a String. If you know that you have a maximum of 2 ASCII characters, they can be combined into a single Java char. Of course, only use this technique if you really are worried about ultra low latency!