DetSimCore: fix LCG109/Gaudi v40 random-engine self-deadlock (standard sim.py hangs forever at initialization)
TL;DR
HepRndm::SynchronizedEngine<T>, newly introduced in Gaudi v40 (LCG_109), has a single-threaded self-deadlock defect: any native simulation job that uses the Gaudi random engine in the documented way -- including the standard entry point Detector/DetCRD/scripts/TDR_o1_v01/sim.py -- hangs forever on LCG_109. This MR changes two places (~10 lines) inside CEPCSW's own Simulation/DetSimCore so that all existing options run unmodified, with random-number semantics bit-for-bit equivalent to the old LCG stacks. This is not a CEPCSW usage error; the root cause is upstream in Gaudi (worth reporting), but the CEPCSW-side fix remains correct even after Gaudi is fixed and never needs to be reverted.
Environment: LCG_109 (x86_64-el9-gcc13-opt), Gaudi v40r2, CLHEP 2.4.7.2, Geant4 11.4.0.
Symptom and impact
- Any job setting DetSimAlg.RandomSeeds = [...] hangs in DetSimAlg::initialize() (path A).
- Any job setting SetSingleton = True on the HepRndm engine (the standard sim.py idiom, which installs the Gaudi engine as the CLHEP singleton) hangs in DetSimSvc::initialize() inside new G4RunManager() even without RandomSeeds (path B: the G4RunManager constructor calls CLHEP::HepRandom::saveFullState()).
- The two paths cover sim.py's default configuration, so the LCG109 migration on master compiles, but native simulation cannot actually RUN with standard options. The hang shows as: process in S state, 0% CPU, single thread in futex_wait, no error output whatsoever (we once waited 11 hours on it).
Deadlock stacks (gdb, 2026-06-10/11)
Path A (RandomSeeds -> setSeeds re-entry):
#2 HepRndm::SynchronizedEngine<CLHEP::HepJamesRandom>::setSeed(long, int) <-- second lock
#3 CLHEP::HepJamesRandom::setSeeds(long const*, int) <-- internally calls virtual setSeed
#4 HepRndm::SynchronizedEngine<CLHEP::HepJamesRandom>::setSeeds(long const*, int) <-- first lock
#5 HepRndm::Engine<CLHEP::HepJamesRandom>::setSeeds(std::vector<long> const&)
#6 DetSimAlg::initialize() (origin: DetSimAlg.cpp randSvc()->engine()->setSeeds(...))
Path B (SetSingleton=True -> G4RunManager constructor -> put re-entry):
#2 HepRndm::SynchronizedEngine<CLHEP::HepJamesRandom>::put() const <-- second lock
#3 CLHEP::HepJamesRandom::put(std::ostream&) const <-- internally calls virtual put()
#4 HepRndm::SynchronizedEngine<CLHEP::HepJamesRandom>::put(std::ostream&) const <-- first lock
#5 CLHEP::HepRandom::saveFullState(std::ostream&)
#6 G4RunManager::G4RunManager()
#7 DetSimSvc::initialize()
Root cause (verified at source level)
Gaudi/GaudiSvc/src/RndmGenSvc/HepRndmBaseEngine.h: SynchronizedEngine<T> wraps the CLHEP engine by inheritance and overrides 14 virtual methods, each taking a non-recursive mutable std::mutex before delegating to the base implementation. But CLHEP engines legitimately self-delegate: HepJamesRandom::setSeeds() internally calls the virtual setSeed(), and put(ostream) internally calls the virtual put(). Under inheritance wrapping these internal calls dispatch through the vtable back into the locked overrides => the same thread locks the same non-recursive mutex twice => deadlock.
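The re-entry pattern can be sketched without Gaudi or CLHEP. Everything below is a stand-in written for illustration: ToyEngine plays the role of a self-delegating CLHEP engine, InheritingWrapper plays the role of SynchronizedEngine<T>, and ReentryDetectingLock is a hypothetical single-threaded substitute for the non-recursive std::mutex that throws where the real mutex would block forever. It is not the Gaudi code, only the same shape.

```cpp
#include <cassert>
#include <mutex>
#include <stdexcept>

// Hypothetical stand-in for a non-recursive mutex (single-threaded use only):
// it throws at the point where a real std::mutex would block forever.
class ReentryDetectingLock {
public:
  void lock() {
    if (held_) throw std::logic_error("re-entry on a non-recursive lock");
    held_ = true;
  }
  void unlock() { held_ = false; }
private:
  bool held_ = false;
};

// Hypothetical stand-in for a CLHEP engine: setSeeds() self-delegates
// to the *virtual* setSeed(), as described for HepJamesRandom above.
struct ToyEngine {
  virtual ~ToyEngine() = default;
  virtual void setSeed(long seed) { state = seed; }
  virtual void setSeeds(const long* seeds) {
    assert(seeds != nullptr);
    setSeed(seeds[0]);  // virtual call: dispatches to the most-derived override
  }
  long state = 0;
};

// Inheritance wrapping: every override takes the lock, then calls the base.
struct InheritingWrapper : ToyEngine {
  void setSeed(long seed) override {
    std::lock_guard<ReentryDetectingLock> guard(lock_);  // second lock on path A
    ToyEngine::setSeed(seed);
  }
  void setSeeds(const long* seeds) override {
    std::lock_guard<ReentryDetectingLock> guard(lock_);  // first lock on path A
    ToyEngine::setSeeds(seeds);  // re-enters InheritingWrapper::setSeed()
  }
  ReentryDetectingLock lock_;
};

// Returns true if seeding via setSeeds() tries to take the lock twice.
bool seedingReenters() {
  InheritingWrapper engine;
  const long seeds[] = {12340, 0};
  try {
    engine.setSeeds(seeds);
  } catch (const std::logic_error&) {
    return true;
  }
  return false;
}

// A direct setSeed() takes the lock only once and completes normally.
bool singleSeedIsFine() {
  InheritingWrapper engine;
  engine.setSeed(42);
  return engine.state == 42;
}
```

With a real std::mutex in place of the detecting lock, the inner lock attempt in setSeed() never returns, which matches the futex_wait picture in the stacks above.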
Why nobody hit this before: the class was introduced between v36 and v40 -- we verified that libGaudiSvc.so of Gaudi v36r14 (LCG_105 era) contains zero symbols of it, which is why the very same sim.py always worked on the old stack. LHCb/ATLAS have not reported it, presumably because they never install the Gaudi engine as the CLHEP singleton feeding a native G4RunManager -- which is exactly CEPCSW's standard usage.
Minimal reproducer needs no Geant4: in any LCG_109 Gaudi job, either call engine()->setSeeds({...}) on this engine (path A), or set SetSingleton=True and call CLHEP::HepRandom::saveFullState(...) (path B) -- both hang forever.
The fix in this MR (2 commits; idea: Geant4 only ever sees a pure CLHEP engine)
Isolate the Gaudi wrapper engine completely from Geant4 so that neither deadlock path is reachable:
Patch 1 (adc3115, DetSimSvc::initialize()): before new G4RunManager(), unconditionally install a process-lifetime pure CLHEP::HepJamesRandom (Geant4's default engine, and the engine sim.py selects anyway) as the CLHEP singleton => the G4RunManager constructor's saveFullState() lands on the pure engine; path B eliminated.
Patch 2 (099e138, DetSimAlg::initialize()): replace randSvc()->engine()->setSeeds(m_randomSeeds) with direct seeding of the CLHEP singleton via CLHEP::HepRandom::setTheSeeds(...) (zero-terminated array); path A eliminated. One crucial ordering detail: in the original code the seeding ran before service("DetSimSvc"), and DetSimSvc (not being in ExtSvc) only initializes on first retrieval, so an in-place replacement would seed while the singleton is still the Gaudi wrapper and deadlock all the same. This patch therefore moves the DetSimSvc retrieval (with its error check) above the seeding block. DetSimSvc::initialize() only installs the engine and creates the run manager and consumes no random numbers, so the move cannot affect the event stream.
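For readers unfamiliar with the zero-terminated convention, here is a minimal sketch of the buffer preparation. The helper name zeroTerminatedSeeds is hypothetical and is not taken from the MR; the actual patch may build the array differently. The CLHEP call appears only in a comment so that the snippet compiles without CLHEP.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not from the MR): copy the configured seeds and
// append the 0 terminator that a zero-terminated seed array requires.
std::vector<long> zeroTerminatedSeeds(const std::vector<long>& seeds) {
  std::vector<long> buffer;
  buffer.reserve(seeds.size() + 1);
  for (long seed : seeds) {
    // A zero inside the list would be read as the end of the array.
    assert(seed != 0);
    buffer.push_back(seed);
  }
  buffer.push_back(0);  // terminator
  return buffer;
}

// In a job that links CLHEP, the buffer would then be handed over as:
//   const std::vector<long> buffer = zeroTerminatedSeeds(m_randomSeeds);
//   CLHEP::HepRandom::setTheSeeds(buffer.data());
// and, per the ordering detail above, only after DetSimSvc has been retrieved.
```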
Why semantics are preserved:
- The design intent of SetSingleton (generator and G4 share one stream, controlled by the job seed) is fully kept -- same HepJamesRandom, same seeds => the random sequence is bit-for-bit identical to the old LCG stacks;
- Seeding happens after G4RunManager creation but before any random number is consumed (the constructor only saves state), equivalent to the seed-first semantics of the old stack;
- Jobs setting only SetSingleton=True without RandomSeeds (pure path-B victims) are rescued by patch 1 alone;
- No revert needed after the upstream Gaudi fix: seeding G4 was never supposed to detour through Gaudi service internals.
Validation
- Full LCG_109 build passes (only two files in DetSimCore touched, +34/-7).
- A 5-event native onlyTrackerCalo job with the exact sim.py-style seeding (HepJamesRandom + SetSingleton=True + Seeds + DetSimAlg.RandomSeeds -- the configuration that hits both deadlock paths and hangs forever without this MR) now runs to completion with normal hits (Ecal 366-1864 hits/event, Hcal 524-1079 hits/event), exit 0.
- Larger-scale indirect validation: this branch served as the LCG109 native reference leg of a native-vs-Gaussino comparison campaign (full TDR_o1_v01 geometry, 6 phase-space points, including one point with 5000 events) -- stable throughout, physics output consistent with the reference.
Notes for the migration documentation
With these patches, rndmengine.Seeds no longer propagates to Geant4 (the wrapper engine is isolated); DetSimAlg.RandomSeeds becomes the sole seed source of the G4 stream. sim.py sets the same seed in both places and is unaffected; non-standard jobs that set only Seeds without RandomSeeds need to add the latter.
Remaining audit item (reco chain): until Gaudi is fixed upstream, any component that directly calls IRndmEngine::setSeeds() will still deadlock (in the sim chain DetSimAlg was the only caller, covered by this MR); before enabling the reco chain on LCG109, a grep -rn "engine()->setSeeds" Reconstruction/ Service/ check is recommended.
Suggested upstream Gaudi fix (either option; worth reporting): (1) minimal: std::mutex -> std::recursive_mutex; (2) cleaner: replace inheritance wrapping by composition/forwarding (the wrapper holds a plain CLHEP instance internally, so self-calls dispatch inside the unwrapped instance and never re-enter the wrapper). Reference: https://gitlab.cern.ch/gaudi/Gaudi/-/blob/master/GaudiSvc/src/RndmGenSvc/HepRndmBaseEngine.h
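Option (2) can be illustrated with the same kind of stand-in types. This is a sketch only, not a proposed Gaudi patch: EngineInterface, SelfDelegatingEngine and ForwardingWrapper are hypothetical names, and a real wrapper would have to implement the full CLHEP engine interface. The point is that the wrapper keeps a real non-recursive std::mutex, yet the inner engine's virtual self-call resolves inside the inner object and never reaches the wrapper's locked methods.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical common interface (stands in for the CLHEP engine base class).
struct EngineInterface {
  virtual ~EngineInterface() = default;
  virtual void setSeed(long seed) = 0;
  virtual void setSeeds(const long* seeds) = 0;
};

// Hypothetical plain engine whose setSeeds() self-delegates to virtual setSeed().
struct SelfDelegatingEngine : EngineInterface {
  void setSeed(long seed) override { state = seed; }
  void setSeeds(const long* seeds) override {
    assert(seeds != nullptr);
    setSeed(seeds[0]);  // dynamic type is SelfDelegatingEngine: stays inside
  }
  long state = 0;
};

// Composition/forwarding: lock once, then forward to the unwrapped instance.
class ForwardingWrapper : public EngineInterface {
public:
  void setSeed(long seed) override {
    std::lock_guard<std::mutex> guard(mutex_);
    inner_.setSeed(seed);
  }
  void setSeeds(const long* seeds) override {
    std::lock_guard<std::mutex> guard(mutex_);
    inner_.setSeeds(seeds);  // inner self-call does not re-enter the wrapper
  }
  long state() const {
    std::lock_guard<std::mutex> guard(mutex_);
    return inner_.state;
  }
private:
  mutable std::mutex mutex_;  // plain non-recursive mutex is sufficient here
  SelfDelegatingEngine inner_;
};

// Seeds through the base-class interface and returns the resulting state.
long seedThroughForwardingWrapper(long seed) {
  ForwardingWrapper wrapper;
  const long seeds[] = {seed, 0};
  EngineInterface& engine = wrapper;
  engine.setSeeds(seeds);  // completes: the mutex is taken exactly once
  return wrapper.state();
}
```

Option (1) would leave the inheritance structure untouched and only swap the mutex type. That is the smaller diff, but it lets every self-delegating method take the lock more than once.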
Workaround without this MR (for reference; used in our reference jobs before the fix): do not set RandomSeeds, do not set SetSingleton, and seed via detsimalg.RunCmds = ["/random/setSeeds 12340 1"] instead.