Performance impact of memory layout

Modern CPUs are so fast that they often wait for memory to be transferred. When memory is accessed in a regular pattern, the CPU prefetches the memory that is likely to be used next. We can exploit this by arranging the numbers in memory so that they are easy to fetch. For 1D data, there is not much that we can do, but for ND data, we can choose between two layouts:

  • x0, y0, … x1, y1, …

  • x0, x1, …, y0, y1, …

Which one is more efficient is not obvious, so we try both options here. In the timings below, the second option is faster, and it is also the layout used internally by the built-in cost functions.
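The two layouts can be made concrete with a small NumPy sketch. It assumes NumPy's default C (row-major) order; the tiny sample size `n` and the names `layout1`, `layout2`, `x_strided` and `x_contig` are hypothetical and chosen only for illustration. It shows how the same values end up in a different memory order, not which layout is faster.

```python
import numpy as np

n = 5  # hypothetical, tiny sample size so the numbers are easy to follow

# Layout 1: one row per point, memory order x0, y0, x1, y1, ...
layout1 = np.arange(2 * n, dtype=np.float64).reshape(n, 2)

# Layout 2: one row per coordinate, memory order x0, x1, ..., y0, y1, ...
# ascontiguousarray makes a copy so that the transposed data is
# physically rearranged, not just viewed differently.
layout2 = np.ascontiguousarray(layout1.T)

# Both arrays hold the same values, only the arrangement in memory differs.
assert np.array_equal(layout1.T, layout2)

# Strides are the byte steps between neighbouring elements along each axis.
print(layout1.strides)  # (16, 8): the next x value is 16 bytes away
print(layout2.strides)  # (40, 8): the next x value is 8 bytes away

# Selecting all x values: strided view in layout 1, contiguous in layout 2.
x_strided = layout1[:, 0]
x_contig = layout2[0]
print(x_strided.flags["C_CONTIGUOUS"])  # False
print(x_contig.flags["C_CONTIGUOUS"])   # True
```

If data arrives in the first layout, `np.ascontiguousarray(data.T)` as used above is one way to convert it once, before fitting, instead of transposing on every call of the cost function.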

from iminuit import Minuit
from iminuit.cost import UnbinnedNLL
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)

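# first layout: x0, y0, x1, y1, ... (one row per point)
xy1 = rng.normal(size=(1_000_000, 2))
# second layout: x0, x1, ..., y0, y1, ... (one row per coordinate)
xy2 = rng.normal(size=(2, 1_000_000))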

def cost1(x, y):
    return -np.sum(multivariate_normal.logpdf(xy1, (x, y)))

cost1.errordef = Minuit.LIKELIHOOD

def cost2(x, y):
    return -np.sum(multivariate_normal.logpdf(xy2.T, (x, y)))

cost2.errordef = Minuit.LIKELIHOOD

def logpdf(xy, x, y):
    return multivariate_normal.logpdf(xy.T, (x, y))

cost3 = UnbinnedNLL(xy2, logpdf, log=True)
m = Minuit(cost1, x=0, y=0)
906 ms ± 53.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
m = Minuit(cost2, x=0, y=0)
767 ms ± 22.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
m = Minuit(cost3, x=0, y=0)
788 ms ± 58.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)