A little while back I wrote a short post about research that some colleagues and I did using “open data” from the Large Hadron Collider [LHC]. We used data made public by the CMS experimental collaboration, about 1% of their current data, to search for a new particle, using a couple of twists (as proposed over 10 years ago) on a standard technique. (CMS is one of the two general-purpose particle detectors at the LHC; the other is called ATLAS.) We had two motivations: (1) Even if we didn’t find a new particle, we wanted to prove that our search method was effective; and (2) we wanted to stress-test the CMS Open Data framework, to ensure that it really does provide all the information needed for a search for something unknown.
Recently I discussed (1), and today I want to address (2): to convey why open data from the LHC is useful but controversial, and why we felt it was important, as theoretical physicists (i.e. people who perform particle physics calculations, but do not build and run the actual experiments), to do something with it that is usually the purview of experimenters.
The Importance of Archiving Data
In many subfields of physics and astronomy, data from experiments is made public as a matter of routine. Usually this occurs after a substantial delay, to allow the experimenters who collected the data to analyze it first for major discoveries. That’s as it should be: the experimenters spent years of their lives proposing, building and testing the experiment, and they deserve an uninterrupted opportunity to investigate its data. To force them to release data immediately would create a terrible disincentive for anyone to do all the hard work!
Data from particle physics colliders, however, has not historically been made public. More worrying, it has rarely been archived in a form that is easy for others to use at a later date. I’m not the right person to tell you the history of this situation, but I can give you a sense of why this still happens today.