AWS p3.16xlarge EC2 Instance

GPU instance with 64 vCPUs, 488 GiB memory, and 8 NVIDIA V100 GPUs with 128 GB total GPU memory. Highest GPU density in the P3 family for large-scale parallel processing.

Pricing of p3.16xlarge

Pricing for this instance is coming soon.

| Pricing Model | Price (USD) | % Discount vs On Demand |
|---|---|---|
| On Demand | N/A | N/A |
| Spot | N/A | N/A |
| 1 Yr Reserved | N/A | N/A |
| 3 Yr Reserved | N/A | N/A |
Spot Pricing Details for p3.16xlarge

Here are the latest spot prices for this instance across this region:

| Availability Zone | Current Spot Price (USD) |
|---|---|

Frequency of Interruptions: n/a

Frequency of interruption represents the rate at which Spot has reclaimed capacity during the trailing month. It is reported in ranges of <5%, 5-10%, 10-15%, 15-20%, and >20%.

Last Updated On: December 17, 2024
Compute features of p3.16xlarge

Feature | Specification

Storage features of p3.16xlarge

Feature | Specification

Networking features of p3.16xlarge

Feature | Specification

Operating Systems Supported by p3.16xlarge

Operating System | Supported

Security features of p3.16xlarge

Feature | Supported

General Information about p3.16xlarge

Feature | Specification
Benchmark Test Results for p3.16xlarge

CPU Encryption Speed Benchmarks

Cloud Mercato tested CPU performance using a range of encryption speed tests:

| Encryption Algorithm | Speed (1024 Block Size, 3 threads) |
|---|---|
| AES-128 CBC | N/A |
| AES-256 CBC | N/A |
| MD5 | N/A |
| SHA256 | N/A |
| SHA512 | N/A |
I/O Performance

Cloud Mercato tested the I/O performance of this instance using a 100 GB General Purpose SSD. Below are the results:

|  | Read | Write |
|---|---|---|
| Max | N/A | N/A |
| Average | N/A | N/A |
| Deviation | N/A | N/A |
| Min | N/A | N/A |

I/O rate testing is conducted with local and block storage attached to the instance. Cloud Mercato uses the well-known open-source tool FIO. To express IOPS, the following parameters are used: 4K block size, random access, no filesystem (except for write access with the root volume), and avoidance of cache and buffer.

Community Insights for p3.16xlarge
AI-summarized insights

he repeated the test against the largest of the P3 instances, the p3.16xlarge.

19-03-2025
benchmarking


Thanks for your detailed suggestions txbob. I certainly don’t know what’s involved in eliminating the zero copy fallback just because non-NVLinked GPUs exist in the system. It just seems like it should be possible without changes to the HW when those links are not even requested enabled. I will investigate nccl and other solutions.

2018-08-02 00:00:00
benchmarking

I have run my code configured to use only 4 GPUs on the p3.16xlarge instance which runs very fast on the p3.8xlarge instance. The result is the same glacial performance as before.

2018-08-02 00:00:00
benchmarking, development

According to this: [CUDA Pro Tip: Control GPU Visibility with CUDA_VISIBLE_DEVICES | NVIDIA Technical Blog](https://devblogs.nvidia.com/cuda-pro-tip-control-gpu-visibility-cuda_visible_devices/) you are correct that CUDA_VISIBLE_DEVICES will enable me to run at full speed on 4 of the 8 GPUs. However, I have already verified that my code runs fast on 4 GPUs. Thanks for that suggestion. What I need is for NVidia/AWS to provide a solution that allows me to utilize UVM and Peer-to-Peer at full speed on an 8 GPU system. Any suggestion on how to get this fixed?

2018-08-02 00:00:00
benchmarking, development
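The approach discussed in this post can be sketched in a few lines. The snippet below is a hypothetical illustration, not code from the thread: it enables peer access only between GPU pairs that report support via cudaDeviceCanAccessPeer, so no failing mapping requests are issued, and it honors CUDA_VISIBLE_DEVICES (for example CUDA_VISIBLE_DEVICES=0,1,2,3) when a process should be restricted to a subset of the 8 GPUs.

```cuda
// Hypothetical sketch (not from the thread): enable peer access only where the
// topology supports it, instead of requesting all seven mappings from GPU 0.
// Build with: nvcc peer_enable.cu -o peer_enable
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);  // respects CUDA_VISIBLE_DEVICES, e.g. "0,1,2,3"

    for (int src = 0; src < n; ++src) {
        cudaSetDevice(src);
        for (int dst = 0; dst < n; ++dst) {
            if (src == dst) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, src, dst);
            if (!canAccess) {
                printf("peer %d -> %d: unsupported, skipping\n", src, dst);
                continue;  // skip rather than issue a request that will fail
            }
            cudaError_t err = cudaDeviceEnablePeerAccess(dst, 0);
            printf("peer %d -> %d: %s\n", src, dst,
                   err == cudaSuccess ? "enabled" : cudaGetErrorString(err));
        }
    }
    return 0;
}
```

Running the same binary under CUDA_VISIBLE_DEVICES=0,1,2,3 limits it to four devices, which matches the 4-GPU configuration the poster reports running at full speed.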

I removed all the failed peer mapping requests, and all the cudaMallocManaged calls, but kernel execution time is still as slow as before. And when requesting peer mapping for GPU 0, it succeeds for GPUs 1-4, so does that mean there are 5 GPUs on that CPU node? I was able to map 1-3 (or 4) to GPU 0 and 5-7 to GPU 4.

2018-07-02 00:00:00
benchmarking

OK, thanks txbob. The kernel is running 900x slower. Apparently I just need to stop requesting unavailable peer mappings. But I will also stop using managed memory for inputs since that’s not even necessary. Seems odd that a request failure would cause a GPU wide default and the zero-copy memory is so drastically slower.

2018-07-02 00:00:00
memory_usage, benchmarking
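One way to act on that conclusion, offered here only as a hedged suggestion rather than the thread's own solution, is to keep inputs in explicit per-device allocations and move them with cudaMemcpyPeerAsync, which does not require a peer mapping to have been enabled; without one, the transfer is staged through host memory instead of leaving the data on the zero-copy fallback path described above. A minimal sketch, with sizes and device IDs chosen purely for illustration:

```cuda
// Hypothetical sketch (sizes and device IDs are arbitrary): explicit per-GPU
// allocations plus cudaMemcpyPeerAsync instead of relying on cudaMallocManaged
// pages being remotely accessible. cudaMemcpyPeerAsync works whether or not
// peer access was enabled; without it, the copy is staged through the host.
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 64 << 20;       // 64 MiB of example data
    void *buf0 = nullptr, *buf4 = nullptr;

    cudaSetDevice(0);
    cudaMalloc(&buf0, bytes);            // resident in GPU 0 global memory
    cudaSetDevice(4);
    cudaMalloc(&buf4, bytes);            // resident in GPU 4 global memory

    cudaSetDevice(0);
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Copy GPU 0 -> GPU 4 without requesting a peer mapping first.
    cudaMemcpyPeerAsync(buf4, 4, buf0, 0, bytes, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(buf0);                      // current device is still 0
    cudaSetDevice(4);
    cudaFree(buf4);
    return 0;
}
```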

I’ve allocated all GPU global memory with cudaMallocManaged except for inter kernel global storage.

2018-07-02 00:00:00
memory_usage, benchmarking

Are you saying that it is not possible with any current system to enable Peer-to-Peer over NVLink between more than 4 GPUs? I find it odd that merely trying to enable and use UVM between GPUs would drastically slow down the kernel execution (according to NVProf) which is writing to GPU global memory on a single GPU.

2018-07-02 00:00:00
memory_usage, benchmarking

My code attempts to enable peer access by GPU 0 to the other 7 GPUs in the system. The first 4 pass cudaDeviceCanAccessPeer, but the last 3 fail. This causes the code to run much slower than it does on a 4 GPU instance.

2018-06-02 00:00:00
benchmarking, development
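For diagnosing this kind of layout, a small peer-access matrix dump makes the topology visible at a glance. This is a hypothetical diagnostic sketch, not code from the thread; on a p3.16xlarge it should reproduce the pattern reported above, where GPU 0 can reach only some of the other seven GPUs directly.

```cuda
// Hypothetical diagnostic sketch: print which GPU pairs report peer access.
// Build with: nvcc peer_matrix.cu -o peer_matrix
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);

    printf("      ");
    for (int j = 0; j < n; ++j) printf("GPU%d ", j);
    printf("\n");

    for (int i = 0; i < n; ++i) {
        printf("GPU%d  ", i);
        for (int j = 0; j < n; ++j) {
            int can = 0;
            if (i != j) cudaDeviceCanAccessPeer(&can, i, j);
            printf("  %s  ", i == j ? "-" : (can ? "Y" : "."));
        }
        printf("\n");
    }
    return 0;
}
```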

Similar Instances to p3.16xlarge

Consider these:

Feedback

We value your input! If you have any feedback or suggestions about this p3.16xlarge instance information page, please let us know.
