Tests 2-GPU, 4-GPU, and 8-GPU configurations using peer-accessible memory. Cross-GPU buffers are allocated with raw cudaMalloc rather than the PyTorch caching allocator, to ensure they are P2P-accessible.
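Before allocating peer-accessible buffers, a test of this kind would typically confirm that every GPU can reach every other GPU. The sketch below is only an illustration, not the test's actual code: the helper names (`ordered_peer_pairs`, `blocked_pairs`, `torch_can_access`) are hypothetical, and it assumes PyTorch's `torch.cuda.can_device_access_peer` is used for the query. CUDA peer access is directional, so each ordered pair is checked separately.

```python
from itertools import permutations
from typing import Callable, List, Tuple

# GPU counts named in the description above.
GPU_COUNTS = (2, 4, 8)


def ordered_peer_pairs(num_gpus: int) -> List[Tuple[int, int]]:
    """Return every (device, peer) pair with device != peer.

    Peer access is directional, so (a, b) and (b, a) both appear.
    """
    return list(permutations(range(num_gpus), 2))


def blocked_pairs(
    num_gpus: int, can_access: Callable[[int, int], bool]
) -> List[Tuple[int, int]]:
    """Return the ordered pairs for which the supplied check reports no P2P access."""
    return [(d, p) for d, p in ordered_peer_pairs(num_gpus) if not can_access(d, p)]


def torch_can_access(device: int, peer: int) -> bool:
    """Hypothetical adapter: query P2P capability through PyTorch."""
    import torch  # imported lazily so the helpers above work without PyTorch

    return torch.cuda.can_device_access_peer(device, peer)
```

On a multi-GPU machine, `blocked_pairs(n, torch_can_access)` for each `n` in `GPU_COUNTS` (up to the number of visible devices) would list any pairs that cannot use P2P; an empty list suggests the configuration is fully peer-connected.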