Performance impact of memory layout

Modern CPUs are so fast that they often wait for memory to be transferred. When memory is accessed in a regular pattern, the CPU prefetches the memory that is likely to be used next. We can help this along a little by arranging the numbers in memory so that they are easy to fetch. For 1D data, there is not much that we can do, but for ND data, we have the choice between two layouts:

  • x0, y0, …, x1, y1, …

  • x0, x1, …, y0, y1, …

Which one is more efficient is not obvious, so we try both options here. It turns out that the second option is faster, and it is also the layout used internally by the built-in cost functions.
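The two layouts correspond to two array shapes in NumPy. The following is a minimal sketch on a tiny array, not part of the original benchmark; the names `aos` and `soa` are made up for illustration. It inspects the strides, that is, the number of bytes between consecutive elements along each axis.

```python
import numpy as np

n = 4  # tiny number of points, for illustration only

# Layout 1: x0, y0, x1, y1, ... -> shape (n, 2) in C order
aos = np.arange(2 * n, dtype=np.float64).reshape(n, 2)

# Layout 2: x0, x1, ..., y0, y1, ... -> shape (2, n) in C order
soa = np.array([aos[:, 0], aos[:, 1]])

# Strides in bytes along each axis (a float64 takes 8 bytes)
print(aos.strides)  # (16, 8): consecutive x values are 16 bytes apart
print(soa.strides)  # (32, 8): consecutive x values are 8 bytes apart

x_aos = aos[:, 0]  # strided view, skips over the y values
x_soa = soa[0]     # contiguous block of memory

print(x_aos.flags["C_CONTIGUOUS"])  # False
print(x_soa.flags["C_CONTIGUOUS"])  # True
```

Both arrays hold the same numbers; only the order in which they are stored differs.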

[1]:
from iminuit import Minuit
from iminuit.cost import UnbinnedNLL
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)

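# layout 1: x0, y0, x1, y1, … (each row holds one point)
xy1 = rng.normal(size=(1_000_000, 2))
# layout 2: x0, x1, …, y0, y1, … (each row holds one coordinate)
xy2 = rng.normal(size=(2, 1_000_000))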


def cost1(x, y):
    return -np.sum(multivariate_normal.logpdf(xy1, (x, y)))

cost1.errordef = Minuit.LIKELIHOOD

def cost2(x, y):
    # xy2.T is a view with shape (N, 2); the memory layout of xy2 is unchanged
    return -np.sum(multivariate_normal.logpdf(xy2.T, (x, y)))

cost2.errordef = Minuit.LIKELIHOOD


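def logpdf(xy, x, y):
    # the data arrives with shape (2, N), but scipy expects shape (N, 2)
    return multivariate_normal.logpdf(xy.T, (x, y))

# log=True declares that the model returns the logarithm of the pdf
cost3 = UnbinnedNLL(xy2, logpdf, log=True)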
[2]:
%%timeit -n 1 -r 1
m = Minuit(cost1, x=0, y=0)
m.migrad()
1.68 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
[3]:
%%timeit -n 1 -r 1
m = Minuit(cost2, x=0, y=0)
m.migrad()
470 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
[4]:
%%timeit -n 1 -r 1
m = Minuit(cost3, x=0, y=0)
m.migrad()
528 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

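cost2 and cost3 use the “first all x, then all y” memory layout, and both are more than three times faster than cost1 in these single-run timings. The difference between cost2 and cost3 is the small overhead incurred by using the built-in cost function UnbinnedNLL instead of a hand-tailored one.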
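Transposing with `.T`, as done in cost2 and in logpdf, only creates a view and does not move any data, so the faster layout is preserved. For the same reason, data that is already stored as x0, y0, x1, y1, … is not reordered by `.T` alone. The sketch below, which assumes a hypothetical array `xy_rows` in that layout, shows one way to make a reordered copy with `np.ascontiguousarray`.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical input stored as x0, y0, x1, y1, ... (shape (N, 2))
xy_rows = rng.normal(size=(1000, 2))

# .T returns a view: shape (2, N), but the memory is still interleaved
view = xy_rows.T

# ascontiguousarray copies the data into the x0, x1, ..., y0, y1, ... order
reordered = np.ascontiguousarray(xy_rows.T)

print(np.shares_memory(xy_rows, view))       # True: no data was moved
print(view.flags["C_CONTIGUOUS"])            # False
print(np.shares_memory(xy_rows, reordered))  # False: this is a real copy
print(reordered.flags["C_CONTIGUOUS"])       # True
```

The copy costs time and memory once, which may pay off when the cost function is evaluated many times during minimisation.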