Cloud Mercato tested CPU performance using a range of encryption speed tests.
Cloud Mercato also tested the I/O performance of this instance using a 100GB General Purpose SSD. Below are the results:
I/O rate testing is conducted with local and block storage attached to the instance. Cloud Mercato uses the well-known open-source tool FIO. To express IOPS, the following parameters are used: 4K block, random access, no filesystem (except for write access with the root volume), and avoidance of cache and buffer.
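
A representative FIO invocation matching those parameters might look like the following sketch. The device path `/dev/nvme1n1` is a placeholder for the attached block volume, and running FIO against a raw device is destructive to its data:

```sh
# Random-read IOPS: 4K blocks, direct I/O (bypasses cache and buffers),
# raw block device (no filesystem).
# WARNING: pointing FIO at a raw device destroys any data on it.
fio --name=rand-read-iops \
    --filename=/dev/nvme1n1 \
    --rw=randread \
    --bs=4k \
    --direct=1 \
    --ioengine=libaio \
    --iodepth=32 \
    --numjobs=4 \
    --runtime=60 \
    --time_based \
    --group_reporting
```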


He repeated the test against the largest of the P3 instances, the p3.16xlarge.

Thanks for your detailed suggestions, txbob. I certainly don't know what's involved in eliminating the zero-copy fallback just because non-NVLinked GPUs exist in the system; it just seems like it should be possible without hardware changes when those links have not even been requested. I will investigate NCCL and other solutions.
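
For reference, NCCL routes traffic over NVLink where it exists and falls back to PCIe where it doesn't, so a single-process all-reduce across all 8 GPUs is possible without hand-managing peer mappings. A minimal sketch, assuming NCCL 2.x is installed (buffer sizes are illustrative):

```cpp
// Minimal single-process NCCL all-reduce across all visible GPUs
// (up to 8, as on a p3.16xlarge). Assumes NCCL 2.x.
#include <cstdio>
#include <cuda_runtime.h>
#include <nccl.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);

    // One communicator, stream, and buffer pair per GPU.
    ncclComm_t comms[8];
    cudaStream_t streams[8];
    float *send[8], *recv[8];
    int devs[8];
    const size_t count = 1 << 20;

    for (int i = 0; i < n; ++i) {
        devs[i] = i;
        cudaSetDevice(i);
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&send[i], count * sizeof(float));
        cudaMalloc(&recv[i], count * sizeof(float));
    }
    ncclCommInitAll(comms, n, devs);  // Single-process, multi-GPU init.

    // NCCL picks NVLink where available and falls back to PCIe otherwise.
    ncclGroupStart();
    for (int i = 0; i < n; ++i)
        ncclAllReduce(send[i], recv[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < n; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    printf("all-reduce complete on %d GPUs\n", n);
    return 0;
}
```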

I have run my code on the p3.16xlarge instance configured to use only 4 GPUs, the same configuration that runs very fast on the p3.8xlarge instance. The result is the same glacial performance as before.

According to this: [CUDA Pro Tip: Control GPU Visibility with CUDA_VISIBLE_DEVICES | NVIDIA Technical Blog](https://devblogs.nvidia.com/cuda-pro-tip-control-gpu-visibility-cuda_visible_devices/) you are correct that CUDA_VISIBLE_DEVICES will let me run at full speed on 4 of the 8 GPUs. However, I have already verified that my code runs fast on 4 GPUs; thanks for that suggestion. What I need is for NVIDIA/AWS to provide a solution that allows me to use UVM and peer-to-peer at full speed on an 8-GPU system. Any suggestion on how to get this fixed?
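
The variable has to be set before the process initializes the CUDA runtime, either on the command line (e.g. `CUDA_VISIBLE_DEVICES=0,1,2,3 ./app`) or from within the process itself. A minimal sketch, assuming the 4 fast peers are devices 0-3:

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    // Must happen before the first CUDA call initializes the runtime.
    // Restricts this process to devices 0-3 (an assumption about which
    // 4 GPUs form the fast NVLink group on this instance).
    setenv("CUDA_VISIBLE_DEVICES", "0,1,2,3", 1);

    int n = 0;
    cudaGetDeviceCount(&n);  // Now reports 4; they are renumbered 0-3.
    printf("Visible GPUs: %d\n", n);
    return 0;
}
```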

I removed all the failed peer mapping requests and all the cudaMallocManaged calls, but kernel execution time is still as slow as before. Also, when requesting peer mappings for GPU 0, they succeed for GPUs 1-4; does that mean there are 5 GPUs on that CPU node? I was able to map GPUs 1-3 (or 4) to GPU 0 and GPUs 5-7 to GPU 4.
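
One way to see the actual link layout, rather than inferring it from which enables fail, is to query the P2P attributes pair by pair. A minimal sketch (lower performance rank means a faster path):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    // Print access support and relative link performance for every pair.
    for (int src = 0; src < n; ++src) {
        for (int dst = 0; dst < n; ++dst) {
            if (src == dst) continue;
            int access = 0, rank = 0;
            cudaDeviceGetP2PAttribute(&access, cudaDevP2PAttrAccessSupported,
                                      src, dst);
            cudaDeviceGetP2PAttribute(&rank, cudaDevP2PAttrPerformanceRank,
                                      src, dst);
            printf("GPU %d -> GPU %d : access=%d perfRank=%d\n",
                   src, dst, access, rank);
        }
    }
    return 0;
}
```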

OK, thanks txbob. The kernel is running 900x slower. Apparently I just need to stop requesting unavailable peer mappings, but I will also stop using managed memory for inputs, since that's not even necessary. It seems odd that a failed request would cause a GPU-wide fallback, and that zero-copy memory is so drastically slower.
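
A guarded enable loop along those lines, checking cudaDeviceCanAccessPeer before ever calling cudaDeviceEnablePeerAccess so that no request can fail, might look like this sketch:

```cpp
// Only request mappings the hardware actually supports, so no request
// fails and nothing silently falls back to zero-copy.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int dev = 0; dev < n; ++dev) {
        cudaSetDevice(dev);
        for (int peer = 0; peer < n; ++peer) {
            if (peer == dev) continue;
            int can = 0;
            cudaDeviceCanAccessPeer(&can, dev, peer);
            if (can) {
                cudaError_t err = cudaDeviceEnablePeerAccess(peer, 0);
                printf("GPU %d -> GPU %d: %s\n", dev, peer,
                       err == cudaSuccess ? "enabled"
                                          : cudaGetErrorString(err));
            }
        }
    }
    return 0;
}
```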

I've allocated all GPU global memory with cudaMallocManaged, except for inter-kernel global storage.
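
For context, that split might look roughly like the following sketch (names and sizes are illustrative, not the actual code):

```cpp
#include <cuda_runtime.h>

int main() {
    const size_t N = 1 << 20;

    // Inputs/outputs: managed memory, migrated on demand between host and GPUs.
    float *inputs = nullptr;
    cudaMallocManaged(&inputs, N * sizeof(float));

    // Inter-kernel scratch that never leaves the device: plain device memory.
    float *scratch = nullptr;
    cudaMalloc(&scratch, N * sizeof(float));

    // Optional: prefetch managed data to the current device up front,
    // avoiding demand page faults during the first kernel.
    int dev = 0;
    cudaGetDevice(&dev);
    cudaMemPrefetchAsync(inputs, N * sizeof(float), dev, 0);

    cudaFree(scratch);
    cudaFree(inputs);
    return 0;
}
```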

Are you saying that it is not possible on any current system to enable peer-to-peer over NVLink between more than 4 GPUs? I find it odd that merely trying to enable and use UVM between GPUs would drastically slow down the execution (according to nvprof) of a kernel that is writing to GPU global memory on a single GPU.

My code attempts to enable peer access by GPU 0 to the other 7 GPUs in the system. The first 4 pass cudaDeviceCanAccessPeer, but the last 3 fail, and this causes the code to run much slower than it does on a 4-GPU instance.
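
To make the topology concrete, a quick probe can print the full peer-access matrix, showing exactly which pairs can map each other's memory. A minimal sketch:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    printf("P2P capability matrix (%d GPUs):\n", n);
    for (int src = 0; src < n; ++src) {
        for (int dst = 0; dst < n; ++dst) {
            int can = 0;
            if (src != dst)
                cudaDeviceCanAccessPeer(&can, src, dst);  // 1 if src can map dst
            printf(" %d", can);
        }
        printf("\n");
    }
    return 0;
}
```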
