Why We Built The Best Particle Physics Analysis Framework

There are three major timescales to consider when you’re trying to carry out a particle physics analysis:

  1. How long it takes to figure out what you want to measure or search for and how to do it.
  2. How long it takes you to code up the analysis.
  3. How long it takes you to run your code over your data.

There’s a multiplier on 2 and 3 based on how many mistakes you make. If you make a lot of mistakes, you have to spend a long time figuring out what they are and rerunning your analysis on the data. How often you make mistakes depends partly on how good a coder you are, but also on how good your framework is.

Most particle physics analyses follow a fairly similar workflow. These days, candidate particles (things like electrons, muons, photons, and jets) are reconstructed centrally by the experiment. Then, individual analyses apply their own selection criteria to the different types of particles.
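As a hypothetical sketch of that selection step, here is what applying per-analysis cuts to centrally reconstructed candidates might look like. The function name, candidate fields, and thresholds are all illustrative, not taken from any real experiment's framework:

```python
# Illustrative sketch: per-analysis selection applied to centrally
# reconstructed particle candidates. Field names and cut values are
# hypothetical.

def select_muons(muons, pt_min=25.0, eta_max=2.4):
    """Keep muon candidates passing simple kinematic cuts."""
    return [m for m in muons if m["pt"] > pt_min and abs(m["eta"]) < eta_max]

# Each candidate is a dict of reconstructed quantities.
event_muons = [
    {"pt": 30.2, "eta": 1.1},  # passes both cuts
    {"pt": 18.7, "eta": 0.4},  # fails the pt cut
    {"pt": 41.0, "eta": 2.9},  # fails the eta cut
]

selected = select_muons(event_muons)
```

The point is that the central reconstruction is fixed input; each analysis only chooses the cuts.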

We then classify events based on the particles that we’ve selected and various kinematic variables. There can be multiple categories of interest in a given analysis, which we call signal regions. There are usually other categories that are used to calculate background expectations or to check systematic effects that we call sideband or control regions. On the low end, an analysis will look at 4-5 different regions, and the average is probably around 15. Our analysis looks at roughly 10,000 different regions.
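The classification step can be sketched as a function from event-level quantities to a region label. The specific cuts below are made up for illustration; the structure (signal vs. control vs. sideband) is the part that matters:

```python
# Hypothetical sketch of event classification into analysis regions.
# The cut values are illustrative only.

def classify_event(n_leptons, missing_et):
    """Assign an event to a signal, control, or sideband region
    based on its selected particles and kinematics."""
    if n_leptons == 2 and missing_et > 100.0:
        return "signal_region"
    if n_leptons == 2 and missing_et <= 100.0:
        return "control_region"   # used to estimate backgrounds
    return "sideband_region"      # used to check systematic effects
```

A real analysis would have many such labels; ours has roughly 10,000.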

In these various regions, we look at different kinematic distributions in a number of histograms. These can either be used to introduce cuts or to do shape comparisons.
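A minimal sketch of the booking-and-filling pattern, assuming histograms are keyed by (region, variable) pairs with fixed-width bins. This is illustrative, not our actual histogramming code:

```python
# Minimal sketch: one histogram of a kinematic variable per region,
# with fixed-width bins. Keys and binning are hypothetical.
from collections import defaultdict

def fill(histograms, region, variable, value, bin_width=10.0):
    """Bin `value` into the (region, variable) histogram."""
    bin_index = int(value // bin_width)
    histograms[(region, variable)][bin_index] += 1

histos = defaultdict(lambda: defaultdict(int))
fill(histos, "signal_region", "muon_pt", 32.5)
fill(histos, "signal_region", "muon_pt", 37.0)
fill(histos, "control_region", "muon_pt", 32.5)
```

Keying by (region, variable) rather than hard-coding one histogram object per region is what keeps the bookkeeping manageable at thousands of regions.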

Now, you can probably see places where we might want to change things as an analysis evolves; here are a couple of examples.

Anytime we make a change, we are at risk of introducing a bug. Therefore, we should design our framework to minimize the amount of code that has to change whenever we want to update our analysis. For example, if I want to add a new histogram to all 10,000 different categories, I shouldn’t have to write 10,000 new lines of code. I should only have to write one.

Likewise, if I want to compare two (or more) selections, it should be no more difficult to write down the selection than it is to implement it in the code.
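One way to get both properties is to treat histograms and selections as data rather than code. The sketch below is a hypothetical illustration of that idea, not our actual framework: adding one entry to a list books a histogram in every region, and a selection is written down much as you would on paper:

```python
# Hypothetical sketch: histogram and selection definitions as data.
# All names, specs, and cut strings here are illustrative.
regions = {f"region_{i}": None for i in range(10_000)}  # 10,000 categories

# Adding ONE entry here books a new histogram in ALL regions:
histogram_specs = [
    ("muon_pt",  40, 0.0, 200.0),   # (variable, nbins, lo, hi)
    ("jet_mass", 30, 0.0, 300.0),
]

# Selections written down roughly as you'd write them on paper:
selections = {
    "tight": "pt > 30 and abs(eta) < 2.1",
    "loose": "pt > 20 and abs(eta) < 2.4",
}

def passes(candidate, selection):
    # Evaluate a trusted, analyst-written cut string against one
    # candidate's quantities (illustrative only; eval is unsafe for
    # untrusted input).
    return eval(selections[selection], {"abs": abs}, candidate)

# The cross product is generated, not hand-written:
booked = [(region, spec[0]) for region in regions for spec in histogram_specs]
```

With this structure, comparing two selections is just two dictionary entries, and the 10,000-region bookkeeping is a loop rather than 10,000 lines.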

So we designed a physics analysis framework that would be simple, fast, and flexible.

We ended up with something simple enough for a new graduate student (or a faculty member) to understand. It runs very fast: we’re limited by our data input speed, so even with 10,000 channels there is a lot of room for added complexity. The codebase is about a factor of 5 smaller, and the new framework is much more flexible than the old one.

Hopefully, I’ll be able to show some examples in the future, but for now we’re keeping it under wraps.

 
