Tests 2-GPU, 4-GPU, and 8-GPU configurations using peer-accessible memory. Cross-GPU buffers are allocated with raw cudaMalloc rather than the PyTorch caching allocator, so that each buffer is peer-to-peer (P2P) accessible from the other GPUs.
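
A minimal sketch of how a harness might decide which of the 2-, 4-, and 8-GPU configurations it can run on a given machine. It is not taken from the test itself: the name `runnable_configs` and the `can_access_peer` predicate are hypothetical, and the sketch assumes a configuration of size n uses devices 0 through n-1. In a PyTorch setup, `torch.cuda.can_device_access_peer` could likely serve as the predicate.

```python
from itertools import permutations
from typing import Callable, Iterable, List


def runnable_configs(
    device_count: int,
    can_access_peer: Callable[[int, int], bool],
    sizes: Iterable[int] = (2, 4, 8),
) -> List[int]:
    """Return the GPU counts whose first n devices are all mutually peer-accessible.

    `can_access_peer(src, dst)` is a hypothetical predicate supplied by the caller;
    it should report whether device `src` can directly access memory on `dst`.
    """
    runnable = []
    for n in sizes:
        if n > device_count:
            continue  # not enough GPUs on this machine for this configuration
        # Peer access is directional, so check every ordered pair (src, dst).
        if all(can_access_peer(src, dst) for src, dst in permutations(range(n), 2)):
            runnable.append(n)
    return runnable


if __name__ == "__main__":
    # Stand-in topology (an assumption for illustration): two islands of four
    # GPUs each, with peer access only inside an island.
    def same_island(src: int, dst: int) -> bool:
        return src // 4 == dst // 4

    print(runnable_configs(8, same_island))
```

Checking ordered pairs rather than unordered ones is a deliberate choice here, since peer access from one device to another does not by itself imply access in the reverse direction.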