The Importance and Challenges of “Open Data” at the Large Hadron Collider

A little while back I wrote a short post about some research that some colleagues and I did using “open data” from the Large Hadron Collider [LHC]. We used data made public by the CMS experimental collaboration — about 1% of their current data — to search for a new particle, using a couple of twists (as proposed over 10 years ago) on a standard technique.  (CMS is one of the two general-purpose particle detectors at the LHC; the other is called ATLAS.)  We had two motivations: (1) Even if we didn’t find a new particle, we wanted to prove that our search method was effective; and (2) we wanted to stress-test the CMS Open Data framework, to ensure that it really does provide all the information needed for a search for something unknown.

Recently I discussed (1), and today I want to address (2): to convey why open data from the LHC is useful but controversial, and why we felt it was important, as theoretical physicists (i.e. people who perform particle physics calculations, but do not build and run the actual experiments), to do something with it that is usually the purview of experimenters.

The Importance of Archiving Data

In many subfields of physics and astronomy, data from experiments is made public as a matter of routine. Usually this occurs after a substantial delay, to allow the experimenters who collected the data to analyze it first for major discoveries. That’s as it should be: the experimenters spent years of their lives proposing, building and testing the experiment, and they deserve an uninterrupted opportunity to investigate its data. To force them to release data immediately would create a terrible disincentive for anyone to do all the hard work!

Data from particle physics colliders, however, has not historically been made public. More worryingly, it has rarely been archived in a form that is easy for others to use at a later date. I’m not the right person to tell you the history of this situation, but I can give you a sense of why this still happens today.

The fundamental issue is the complexity of data sets from colliders, especially from hadron colliders such as the Tevatron and the LHC. (Archiving was partly done for LEP, a simpler collider, and was used in later studies including this search for unusual Higgs decays and this controversial observation, also discussed here.) What “complexity” are we talking about? Collisions of protons and/or anti-protons are intrinsically complicated; particles of all sorts go flying in all directions. The general-purpose particle detectors ATLAS and CMS have a complex shape and aren’t uniform. (Here’s a cutaway image showing CMS as a set of nested almost-cylinders. Note also that there are inherent weak points: places where cooling tubes have to run, where bundles of wires have to bring signals in and out of the machine, and where segments of the detector join together.) Meanwhile the interactions of the particles with the detector’s material are messy and often subtle (here’s a significantly oversimplified view). Not every particle is detected, and the probability of missing one depends on where it passes through the detector and what type of particle it is.

Even more important, 99.999% of ATLAS and CMS data is discarded as it comes in; only data which passes a set of filters, collectively called the “trigger,” will even be stored. The trigger is adjusted regularly as experimental conditions change. If you don’t understand these filters in detail, you can’t understand the data. Meanwhile the strategies for processing the raw data change over time, becoming more sophisticated, and introducing their own issues that must be known and managed.
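To make the idea concrete, here is a toy sketch, in Python, of how a chain of trigger filters might decide which events get stored. Everything here is invented for illustration (the trigger paths, thresholds, and event format bear no relation to the actual CMS trigger menu or software); the point is only the mechanism: each event is kept if any trigger path accepts it, and everything else is gone forever.

```python
# Toy sketch of trigger-style filtering (invented thresholds and event format,
# NOT the actual CMS trigger). An event survives if ANY trigger path accepts it.

def single_muon_trigger(event, pt_threshold=24.0):
    """Keep events with at least one muon above a pT threshold (GeV)."""
    return any(mu["pt"] > pt_threshold for mu in event["muons"])

def dimuon_trigger(event, pt_threshold=17.0):
    """Keep events with at least two muons above a lower threshold."""
    return sum(mu["pt"] > pt_threshold for mu in event["muons"]) >= 2

TRIGGER_PATHS = [single_muon_trigger, dimuon_trigger]

def passes_trigger(event):
    # Stored only if some path fires; otherwise discarded as it comes in.
    return any(path(event) for path in TRIGGER_PATHS)

# A handful of toy events:
events = [
    {"muons": [{"pt": 30.0}]},                 # passes the single-muon path
    {"muons": [{"pt": 18.0}, {"pt": 19.0}]},   # passes the dimuon path
    {"muons": [{"pt": 5.0}]},                  # discarded
    {"muons": []},                             # discarded
]

stored = [ev for ev in events if passes_trigger(ev)]
print(len(stored))  # 2 of the 4 toy events survive
```

The lesson the sketch makes visible: a real analysis must know exactly which paths and thresholds were active when each chunk of data was taken, because an event that fails every path never reaches the archive at all.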

I could easily go on (did I mention that at the LHC dozens of collisions occur simultaneously?). If, when you explore the data, you fail to account for all these issues, you can mistake a glitch for a new physical effect, or fail to observe a new physical effect because a glitch obscured it. Any experimentalist inside the collaborations is aware of most of these subtleties, and is surrounded by other experts who will be quick to complain if he or she forgets to account for one of them. That’s why it’s rare for the experimenters to report a result that has this type of error embedded in it.

Now, imagine writing a handbook that would encapsulate all of that combined knowledge, for use by people who will someday analyze the data without having access to that collective human library. This handbook would accompany an enormous data set taken in changing conditions, and would need to contain everything a person could possibly need to know in order to properly analyze data from an LHC experiment without making a technical error.

Not easy! But this is what the Open Data project at CERN, in which CMS is one of the participating experiments, aims to do. Because it’s extremely difficult, and therefore expensive in personnel and time, its value has to be clear.

I personally do think the value is clear, especially at the LHC. Until the last couple of decades, one could argue that data from an old particle physics experiment would go out of date so quickly, superseded by better experiments, that it really wasn’t needed. But this argument has broken down as experiments have become more expensive, with new ones less frequent. There is no guarantee, for instance, that any machine superseding the LHC will be built during my lifetime; it is a minimum of 20 and perhaps 40 years away. In all that time, the LHC’s data will be the state of the art in proton-proton collider physics, so it ought to be stored so that experts can use it 25 years from now. The price for making that possible has to be paid.

[This price was not paid for the Tevatron, whose data, which will remain the gold standard for proton-antiproton collisions for perhaps a century or more, is not well-archived.]

Was Using Open Data Necessary For Our Project?

Even if we all agree that it’s important to archive LHC data so that it can be used by future experimental physicists, it’s not obvious that today’s theorists should use it. There’s an alternative: a theorist with a particular idea can temporarily join one of the experimental collaborations, and carry out the research with like-minded experimental colleagues. In principle, this is a much better way to do things; it permits access to the full data set, it allows the expert experimentalists to handle and manage the data instead of amateurs like us, and it should in principle lead to state-of-the-art results.

I haven’t found this approach to work. I’ve been recommending the use of our technique [selecting events where the transverse momentum of the muon and antimuon pair is large, and often dropping isolation requirements] for over ten years, along with several related techniques. These remarks appear in papers; I’ve mentioned these issues in many talks, discussed them in detail with at least two dozen experimentalists at ATLAS and CMS (including many colleagues at Harvard), and even started a preliminary project with an experimenter to study them. But everyone had a reason not to proceed. I was told, over and over again, “Don’t worry, we’ll get to this next year.” After a decade of this, I came to feel that perhaps it would be best if we carried out the analysis ourselves.

Even then, there was an alternative: we could have just done a study of our method using simulated data, and this would have proved the value of our technique. Why spend the huge amount of time and effort to do a detailed analysis on a fraction of the real data?

First, I worried that a study on simulated data would be no more effective than all of the talks I gave and all the personal encouragement I offered over the previous ten years. I think seeing the study done for real has a lot more impact, because it shows explicitly how effective the technique is and how easily it can be implemented. [Gosh, if even theorists can do it…]

Second, one of the things we did in our study is include “non-isolated muons” — muons that have other particles close by — which are normally not included in theorists’ studies. Dropping the usual isolation criteria may be essential for discoveries of hidden particles, as Kathryn Zurek and I have emphasized since 2006 (and studied in a paper with Han and Si, in 2007). I felt it was important to show this explicitly in our study. But we needed the real data to do this; simulation of the background sources for non-isolated muons would not have been accurate. [The experimenters rarely use non-isolated muons in the type of analysis we carried out, but notably have been doing so here; my impression is that they were unaware of our work from 2007 and came to this approach independently.]
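For concreteness, here is a minimal sketch, in Python with an invented event format (not the actual CMS Open Data software, and with an arbitrary placeholder threshold), of the kind of selection described above: keep oppositely-charged muon pairs whose combined transverse momentum is large, with no isolation requirement imposed on either muon.

```python
import math

# Minimal sketch of the selection described in the text: high transverse
# momentum of the dimuon SYSTEM, and deliberately NO isolation cut.
# Event format and the 60 GeV threshold are invented for illustration.

def pair_pt(mu1, mu2):
    """Transverse momentum of the dimuon system, from each muon's (pt, phi)."""
    px = mu1["pt"] * math.cos(mu1["phi"]) + mu2["pt"] * math.cos(mu2["phi"])
    py = mu1["pt"] * math.sin(mu1["phi"]) + mu2["pt"] * math.sin(mu2["phi"])
    return math.hypot(px, py)

def select_pair(mu, antimu, min_pair_pt=60.0):
    # Opposite charges required; note what is deliberately absent: no
    # isolation requirement, so muons with nearby particles are kept.
    return mu["charge"] * antimu["charge"] == -1 and pair_pt(mu, antimu) > min_pair_pt

# Two back-to-back muons: their momenta nearly cancel, so the pair pT is
# tiny and the pair is rejected.
mu_a = {"pt": 50.0, "phi": 0.0, "charge": -1}
mu_b = {"pt": 50.0, "phi": math.pi, "charge": +1}
print(select_pair(mu_a, mu_b))  # False

# Two muons going in roughly the same direction: large pair pT, selected.
mu_c = {"pt": 50.0, "phi": 0.1, "charge": -1}
mu_d = {"pt": 50.0, "phi": -0.1, "charge": +1}
print(select_pair(mu_c, mu_d))  # True
```

The contrast between the two pairs is the point of the cut: a Z boson produced nearly at rest yields back-to-back muons and a small pair pT, while a Z recoiling against something energetic yields a boosted, high-pair-pT dimuon system.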

Stress Testing the Archive

A further benefit of using the real data was that we stress-tested the archiving procedure in the Open Data project, and to do this fully, we had to carry out our analysis to the very end. The success or failure of our analysis was a test of whether the CMS Open Data framework truly provides all the information needed to do a complete search for something unknown.

The test received a passing grade, with qualifications. Not only did we complete the project, we were able to repeat a rather precise measurement of the (well-known) cross-section for Z boson production, which would have failed if the archive and the accompanying information had been unusable. That said, there is room for improvement: small things were missing, including some calibration information and some simulated data. The biggest issue is perhaps the data storage format, which a typical user will find difficult to unpack and use.
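The arithmetic behind such a cross-section check is simple in outline; here is a schematic version in Python, with placeholder numbers that are not taken from our actual measurement: subtract the estimated background from the observed event count, then divide by the product of the selection efficiency and the integrated luminosity.

```python
# Schematic cross-section arithmetic (all numbers are invented placeholders,
# not our measurement): sigma = (N_observed - N_background) / (efficiency * L).

def cross_section(n_observed, n_background, efficiency, luminosity_inv_pb):
    """Cross-section in picobarns, for an integrated luminosity in pb^-1."""
    return (n_observed - n_background) / (efficiency * luminosity_inv_pb)

# Placeholder inputs: 60,000 selected events, 2,000 estimated background
# events, 50% combined efficiency-times-acceptance, 100 pb^-1 of data.
sigma = cross_section(60000, 2000, 0.50, 100.0)
print(sigma)  # 1160.0 (pb)
```

Each input is where the archive gets stress-tested: the event counts require usable data, the background estimate requires usable simulation, and the efficiency and luminosity require the calibration information that accompanies the data set.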

It’s important to recognize that the people in charge of Open Data at CMS have a huge and difficult job; they have to figure out how to write the nearly impossible handbook I referred to above. It’s therefore crucial that people like our group of theorists actually use the open data sets now, not just after the LHC is over. Now, when the open data sets are still small, is the time to figure out what information is missing, to determine how to improve the data storage, and to fill out the documentation and make sure it has no gaps. We hope we’ve contributed something to that process.

The Future

Should others follow in our footsteps? Yes, I think, though not lightly. In our case, five experts required two years to do the simplest possible study; we could have done it in one year if we’d been more efficient, but probably not much less. Do not underestimate what this takes, both in terms of understanding the data and of learning how to carry out a type of statistical analysis that most theorists rarely undertake.

But there are studies that simply cannot be done without real data, and unless you can convince an experimentalist to work with you, your only choice may be to dive in and do your best. And if you are already somewhat informed, but want to learn more about how real experimental analysis is done, so you can appreciate more fully what is typically hidden from view, you will not find a better self-training ground. If you want to take it on, I suggest, as an initial test, that you try to replicate our measurement of the Z boson cross-section. If you can’t, you’re not ready for anything else.

I should emphasize that Open Data is a resource that can be used in other ways, and several groups have already done this. In addition to detailed studies of jets carried out on the real data by my collaborators, led by Professor Jesse Thaler, there have been studies that have relied solely on the archive of simulated data also provided by the CMS Open Data project. These have value too, in that they offer proving grounds for techniques to be applied later to real data. Since exploratory studies of simulated data don’t require the extreme care that analysis of real data demands, there may be a lot of potential in this aspect of the Open Data project.

In the end, our research study, like most analyses, is just a drop in the huge bucket of information learned from the LHC. Its details should not obscure the larger question: how shall we, as a community, maintain the LHC data set so that it can continue to provide information across the decades? Maybe the Open Data project is the best approach.  If so, how can we best support it?  And if not, what is the alternative?

25 responses to “The Importance and Challenges of “Open Data” at the Large Hadron Collider”

  1. I am all for it if it helps to evaluate the data and bring it into competition with the original goals, for instance by judging the nature of what is thought to be examined. In other words, it helps to avoid falling into the naturalistic fallacy trap, meaning getting away from the original goals further and further.


  3. Thomas A. Modesto, MD

    Dr. Strassler, thank you for taking your time and energy to maintain this site. For someone who loves physics but realized I was not smart enough to continue therein, it is an excellent way to learn the state of things. I particularly thank you for not pretending that some of the speculative hypotheses presented by some other physicists are actual facts with evidence behind them. I have one question: are you going to write a book on these topics? If you are, might I request that you include some of the math involved, so I can better understand where the conclusions come from. I realize you do not wish to overwhelm people with the math, but even if you put it in optional sections it would be helpful. I have read several books on QM (though not QFT) so can follow to that level. Sincerely, Tom Modesto

  4. Michael Sarnowski

    A time limit on data hoarding would be nice. Sometimes ancient texts and artifacts have been kept in secret a long time. Sometimes hundreds of years.

  5. Karl Hoffman

    Trying to get someone to adopt your method for ten years is the same as my trying for ten years to get someone to examine the Enhanced Big Bounce theory.

    • No, it’s not the same. I’m not touting my own personal theory that contradicts everyone else’s, in order to overthrow an entire field; I’m trying to help my colleagues do their measurements better. And I have written dozens of papers that have been vetted and accepted by the physics community, including papers about methods that have in fact been adopted at the LHC. These include new LHC trigger techniques; see for example . So (a) I’ve proven my worth, and (b) my aims are far more modest than yours. If you aim extremely high without paying any dues first, people are going to assume you aren’t actually capable of paying the dues.

  6. Matt, I have tweeted fairly extensively from this most interesting post. Moving on – re: the Dead Sea Scroll travesty, I am reading about “The Temple Scroll” currently.

  7. Karl Hoffman

    It may be worth your while to look into the work Dr. Pace Vandevender is doing with quark theory.

  8. Michael Sarnowski

    I will share my data with you. It is a simple solution to other dimensions. I hope that you will look at it seriously. It predicts the gravitational constant based on constructing these other dimensions.

    • (1) This is not a site for advertising articles. (2) Real data comes from experiments, not somebody’s brain, no matter how good. You do need to understand the difference between data and theory in order to do proper science.

      • Michael Sarnowski

        Thanks for the courteous reply. I thought I would try, since I think it so important. Anyways, I look at it as putting the data through a symmetric equation and consider it still to be data.

  9. Karl Hoffman

    Real data is taken so that scientists’ brains (such as yours) can come up with a theory of how the universe works. I have never seen data publish a scientific paper.
    I am taking real data from telescopic observations that astronomers’ brains have interpreted and trying to piece it together into something meaningful. Yes, it is very controversial, but there does seem to be a developing consensus that a new idea is needed because of missing antimatter and unexplained dark matter.
    Matt, perhaps I have not “paid my dues”. I am only trying to get someone to critique the EBB theory, which has been developed from 50 years of studying astronomers’ data interpretations (my dues). Also, it is only an enhancement of Martin Bojowald’s (and others’) Big Bounce theory. He has paid his dues.

    • No, I don’t think you’re quite right about that. Real data exists so that scientists’ brains (such as mine and those of my colleagues) can come up with an understanding of how the universe works. Note the use of the plural, and the caution concerning the notion that we’re going to invent a new theory from scratch (rather than extend the ones we have). Not even Einstein invented a new theory from scratch on his own (as explained in detailed scientific history books); and quantum mechanics was developed by two dozen people building off each other’s work and lots of data.

      Also, working for 50 years does not count as dues-paying. Dues-paying involves doing a calculation that is relevant for someone else’s work, showing that you understand how to interpret other people’s research and are capable of building upon it in a way that is useful for other people. Every professional scientist has done this; even young Einstein had already done it before he wrote his famous papers.

  10. Professor,
    I am a long time supporter of your Ideas and Work.
    Thanks for writing your Book. I will be gladly one of the First who will buy it.
    Yours, bob-2

  11. Karl Hoffman

    I stand corrected. Analysis of the data is intended to increase our understanding of how the universe works. That has led to creating theories such as the Big Bang and Big Bounce. However, the ultimate end is to make something useful for mankind, such as solar cells to mitigate climate change by understanding the photoelectric effect.
    From my previous email it is obvious that I did not make the EBB theory from scratch. I questioned how it was statistically possible for a preexisting universe to have the exact amount of matter needed to collapse and bounce to create our universe (I call this the Bojowald limit). The statistical probability of a partial collapse of a preexisting universe is much higher. This needs no calculation as it is intuitively obvious. This led to the EBB theory where our universe is inside a preexisting universe and ….
    As an aside, I paid my dues in another way. I received a MSEE from the University of Michigan and founded MMATS Inc. after working for RCA’s computer division for 10 years. I placed the first microcomputer-based controller into commercial laundry equipment (a folder for GA Braun), then into washers and then into dryers. I had to gather and analyze a lot of information to design and manufacture them. Also, we took a lot of data from dryers to create an automatic dry cycle. I had to analyze the data to understand what it meant and make a useful algorithm for an auto dry cycle.
    At that time, the salesmen questioned why we needed a new method for controlling equipment. Today you can see the results as microcomputers control just about everything.
    I would have loved to be a cosmologist or a particle physicist, but I could not see my way financially to accomplish that. Therefore, I had to be satisfied with the next best thing.

  12. Karl Hoffman

    I did not indicate where I started with EBB so you would conclude I did not start from scratch. Read Martin Bojowald’s October 2008 article in Scientific American “Follow the Bouncing Universe”. He, Lee Smolin and others have proposed similar theories. I think their work centered on Loop Quantum Gravity, spin foam, etc.
    Ask yourself about the statistical probability that the preexisting universe had the exact amount of matter needed to bounce and form our universe with no matter left over.
    Sometimes it is helpful when looking for a new tree in the forest to come out and examine the entire forest collectively for clues.

  13. Karl Hoffman


    Can we cut through this “did not pay my dues” scenario?

    What is implausible about the Enhanced Big Bounce theory as explained some comments ago?

    I presented this to a particle physicist several years ago and his comment was “I guess it is possible that our universe is inside a preexisting universe, but how is that going to help me with my calculations?”

    An impartial scientist will evaluate a posit on its merits alone and ignore the emotional side issues, which have nothing to do with the posit, that you, for instance, have raised. Perhaps some other readers of this thread could give their evaluation of the EBB posit.

    Yes, EBB is a very upending posit, but that does not make it wrong.

  14. Michael Sarnowski

    A group of engineers died and went to the heaven for geniuses. Muhammad started taking them on a tour of heaven. Muhammad explained that some areas of heaven were segregated. The engineers arrived at a grand place made of ivory. At this point Muhammad said to be very quiet. One of the engineers whispered, “Why should we be quiet here?” Muhammad stated, “That’s where the PhDs are. They think they are the only intelligence here. We do not want to interrupt their delusions.”

  15. Matt,
    I have done some calculations.
    Plot the density of the planets and our moon vs. AU from the sun. Do this out to Saturn. You will find they all fit quite closely to a curve, except the earth and moon. Their density does not fit the curve.
    So I wondered if our sun could capture earth if it were a binary companion of a brown dwarf. I did an incremental 3-body analysis using an Excel spreadsheet; back in 2006, spreadsheets only had 65,000 rows. But I proved, mathematically by calculations, that with the right initial conditions our sun will capture our earth from the brown dwarf and the brown dwarf will get a slingshot boost. (I can email the spreadsheet to you.)
    Colliding with an existing body orbiting the sun only makes it easier.

  16. Karl Hoffman


    Read the February 2019 Sky & Telescope article “What Came Before the Big Bang?” Sentence 4: “The Big Bang theory doesn’t rule out the possibility that there was some pre-existing universe from which ours sprang …”. Remember Alan Guth? He is working on the pre-“Big Bang” brain contemplations (sorry, no calculations yet).

    So much for the missing antimatter. There was no need for it initially, as our universe did not come from nothing, but rather from a pre-existing universe. (Now we have shifted the origin problem: “From whence came the pre-existing universe?”)

    The problem is the sentence before this sentence: “As you extrapolate our expanding universe backwards, you eventually reach a point of infinite density where the known laws of physics break down.” The consequence of the mathematics (calculations) has caused many to accept the unproven posit – Infinite Density Matter.

    So my goodness, there it is again: Martin Bojowald’s, Lee Smolin’s, and others’ Big Bounce theory. There was no point of infinite-density matter. The pre-existing universe contracted until it reached what I call the Bojowald Limit, where space can hold no more energy, so the contraction ended and bounced to form our universe.

    This is where I came in (not from scratch).

    What is the statistical probability that the pre-existing universe had just the right amount of matter to form our universe with nothing left over?

    Compare the above with the statistical probability that there was only a partial collapse of a pre-existing universe.

    What conclusion can be reached?

    Search the internet for Tequestagrampy Youtube and click on Tequestagrampy – Youtube. You will find an EBB theory video.

    Any comments after watching the brief video?