By uav-jp, 10/07/2022

[Customer Case] NTTPC Communications, Inc.: Cost-Performance Optimization for a Pose Estimation Inference Service with Amazon EC2 Inf1 Instances

On October 14, 2021, we held the Amazon EC2 event "Making Full Use of Amazon EC2: Latest Lineup, Cost Performance Optimization, Customer Case Studies, and More." Fifteen years after the launch of the EC2 service in 2006, AWS now offers a wide variety of EC2 instance types, and to help customers optimize cost performance it provides instances equipped with AWS Graviton2, AWS's own Arm-based processor, and AWS Inferentia, its machine learning inference chip. At this event, CyberAgent, Inc. and NTTPC Communications, Inc. presented their customer case studies.

For a recap of the entire event, please refer to the seminar report blog, "Making Full Use of Amazon EC2: Latest Lineup, Cost Performance Optimization, Customer Case Studies, and More."

In this blog post, we highlight the presentation by NTTPC Communications, Inc. from that event, introducing their verification work comparing Amazon EC2 Inf1 instances against other EC2 instances with the aim of cost optimization for their pose estimation API service (AnyMotion).

Verifying Inf1 Inference Instances for a Pose Estimation API Service

Toshiki Takazawa, NTTPC Communications, Inc.

Presentation materials: [Slides]

NTTPC Communications provides a pose estimation API service and uses AWS for its infrastructure. In this session, they reported on a verification in which they focused on the Inf1 instance for running the inference model and compared it with GPU-equipped instances (Amazon EC2 P3 and Amazon EC2 G4 instances) and Amazon Elastic Inference.

Pose Estimation API Service: AnyMotion

AnyMotion is a service in which AI estimates the positions of human joints from videos and images captured with a camera and visualizes body movement through analysis of the skeletal data. In areas that have traditionally relied on human intuition and visual inspection, it enables reporting, such as coaching and posture checks, based on quantitative data. By providing the service as an API to companies that develop applications, the application side can evaluate and make judgments using the analysis results and use them to return reports and feedback to end users. Adoption is being considered in a variety of industries, such as healthcare, entertainment, and rehabilitation.

Challenges in Operating an AI Service

Reducing operating costs is an important consideration when running an AI service. Inference was expected to account for more than 50% of the total infrastructure cost, and, considering how the service would scale, this share was expected to become especially large. AWS offers a variety of instances suitable for inference, so a comparative verification was conducted against GPU (G4dn and P3) instances, CPU (Amazon EC2 C5) instances, and Amazon Elastic Inference, all of which could be expected to optimize cost performance in operation.


What Are Amazon EC2 Inf1 and AWS Inferentia?

The Amazon EC2 Inf1 instance is equipped with AWS Inferentia, a machine learning inference chip that AWS designed to accelerate machine learning inference workloads and optimize their cost. Compared with GPU instances, Inf1 instances deliver up to 2.3 times higher throughput at up to 70% lower cost per inference. Each AWS Inferentia chip contains four NeuronCores. A NeuronCore implements a high-performance systolic array matrix multiply engine, which greatly accelerates common deep learning operations such as convolutions and transformers. NeuronCores also have a large on-chip cache: by keeping intermediate results and the weight parameters needed for inference in the cache, external memory accesses are reduced, lowering latency and improving throughput.

AWS Inferentia, AWS's own purpose-built machine learning inference chip

AWS Neuron, the software development kit for AWS Inferentia, can be used natively from machine learning frameworks such as TensorFlow, PyTorch, and Apache MXNet. It consists of a compiler, a runtime, and profiling tools, and enables high-performance, low-latency inference. To run on an Inf1 instance, a trained model must first be compiled for the NeuronCores, but this typically requires changing only a few lines of code. Inf1 instances are also used as the infrastructure behind Amazon Alexa and Amazon Advertising, Amazon's advertising solutions, contributing to their cost-performance optimization.
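To illustrate what the "few lines of code" look like, below is a minimal sketch of compiling a trained TensorFlow model with the tensorflow-neuron SDK's 2.x tfn.trace API. The model path, input layout, and shapes are illustrative assumptions, not NTTPC's actual code.

```python
import tensorflow as tf
import tensorflow.neuron as tfn  # tensorflow-neuron, installed from the AWS Neuron SDK repositories

# Load the trained model (path is illustrative).
model = tf.keras.models.load_model('anymotion_pose_model/')

# Trace with a representative input so the Neuron compiler can
# compile the supported subgraph ahead of time; operations the
# compiler does not support remain on the CPU.
example_input = tf.random.uniform([1, 256, 192, 3], dtype=tf.float32)
model_neuron = tfn.trace(model, example_input)

# Save the compiled model for serving on an Inf1 instance.
model_neuron.save('anymotion_pose_model_neuron/')
```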

Verification Details

In NTTPC's verification, the TensorFlow model actually used in the AnyMotion service was tested. It is a custom pose estimation model that takes 256 x 192 input image data, estimates the positions of human joints, and returns their coordinates. For Inf1 instances, the trained model must be compiled in advance with the Neuron compiler; since some nodes in the custom model were found to be unsupported by the Neuron compiler, those nodes were executed on the CPU, and the verification proceeded on that basis.
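When part of the graph falls back to the CPU like this, a natural step is a sanity check that the compiled model still matches the original. Below is a hedged sketch of such a check; the model paths, tolerances, and dummy input are illustrative assumptions, not NTTPC's actual validation harness.

```python
import numpy as np
import tensorflow as tf

# Load the original and the Neuron-compiled models (paths are illustrative;
# per the tensorflow-neuron tutorials, the compiled SavedModel can be
# reloaded with the standard Keras loader).
original = tf.keras.models.load_model('anymotion_pose_model/')
compiled = tf.keras.models.load_model('anymotion_pose_model_neuron/')

# A dummy 256 x 192 RGB image standing in for real validation data.
image = np.random.rand(1, 256, 192, 3).astype(np.float32)

ref = original(image).numpy()
out = compiled(image).numpy()

# CPU-fallback nodes and Neuron numerics may differ slightly from the
# original graph, so compare within a tolerance rather than exactly.
assert np.allclose(ref, out, rtol=1e-2, atol=1e-3), "compiled model diverges"
print("max abs diff:", np.abs(ref - out).max())
```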

In machine learning inference, there are two distinct sets of requirements depending on the application's use case. One is the case where high real-time performance is needed and a given latency target must be met. The other, called batch inference, prioritizes efficiency and high throughput.
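To make these two requirements concrete, here is a minimal, illustrative sketch of how P50/P90 latency and throughput might be measured for a given batch size. The model path, iteration counts, and warm-up are assumptions for illustration, not NTTPC's actual measurement harness.

```python
import time
import numpy as np
import tensorflow as tf

def benchmark(model, batch_size, iterations=200, warmup=20):
    """Measure P50/P90 latency [ms] and throughput [images/sec] for one batch size."""
    batch = np.random.rand(batch_size, 256, 192, 3).astype(np.float32)

    for _ in range(warmup):            # warm up caches and lazy initialization
        model(batch)

    latencies = []
    for _ in range(iterations):
        start = time.perf_counter()
        model(batch)
        latencies.append((time.perf_counter() - start) * 1000.0)

    p50, p90 = np.percentile(latencies, [50, 90])
    throughput = batch_size * 1000.0 / np.mean(latencies)
    return p50, p90, throughput

model = tf.keras.models.load_model('anymotion_pose_model_neuron/')  # illustrative path
for bs in (1, 8):
    p50, p90, tput = benchmark(model, bs)
    print(f"batch={bs}: P50={p50:.2f} ms, P90={p90:.2f} ms, {tput:.1f} images/sec")
```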

The instances compared and the verification results are as follows.

| Type | Instance | Inference batch size | Throughput [images/sec] | Latency P50 [ms] | Latency P90 [ms] | Hourly price [USD] *2 | Cost performance [images/0.0001 USD] |
|------|----------|----------------------|-------------------------|------------------|------------------|-----------------------|---------------------------------------|
| Inf1 | inf1.xlarge | 1 | 517.80 | 7.38 | 7.68 | 0.308 | 604.05 |
| Inf1 | inf1.2xlarge | 1 | 568.97 | 7.46 | 7.80 | 0.489 | 418.87 |
| GPU | g4dn.xlarge | 1 | 113.46 | 10.06 | 10.32 | 0.71 | 57.53 |
| GPU | p3.2xlarge | 1 | 170.88 | 8.40 | 8.66 | 4.194 | 14.67 |
| Elastic Inference *1 | eia1.medium | 1 | 32.15 | 36.01 | 36.41 | 0.220 | 52.61 |
| Elastic Inference *1 | eia1.large | 1 | 65.76 | 19.17 | 19.46 | 0.450 | 52.61 |
| Elastic Inference *1 | eia1.xlarge | 1 | 85.89 | 15.51 | 15.86 | 0.890 | 34.74 |
| CPU | c5.2xlarge | 1 | 18.43 | 75.73 | 77.12 | 0.428 | 15.50 |

In the verification focused on real-time performance, conducted with the inference batch size fixed at 1, the Inf1 instances achieved latencies of 8 milliseconds or less. Their cost performance far outdistanced the other instances: compared with the G4dn instance, the Inf1 instance delivered 4.5 times the throughput with inference latency reduced by about 25%, cutting costs by roughly 90%.
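The cost-performance column in these tables can be reproduced from the throughput and hourly price columns. Assuming it is defined as images processed per 0.0001 USD, a quick check against the table rows looks like this:

```python
# Cost performance [images / 0.0001 USD]
#   = throughput [images/sec] * 3600 [sec/hour] / hourly price [USD/hour] * 0.0001
def cost_performance(throughput_img_per_sec, hourly_price_usd):
    images_per_usd = throughput_img_per_sec * 3600.0 / hourly_price_usd
    return images_per_usd * 0.0001

print(cost_performance(113.46, 0.71))   # g4dn.xlarge, batch 1 -> ~57.53 (matches table)
print(cost_performance(517.80, 0.308))  # inf1.xlarge, batch 1 -> ~605 (table: 604.05; rounding)
```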

| Type | Instance | Inference batch size | Throughput [images/sec] | Latency P50 [ms] | Latency P90 [ms] | Hourly price [USD] *2 | Cost performance [images/0.0001 USD] |
|------|----------|----------------------|-------------------------|------------------|------------------|-----------------------|---------------------------------------|
| Inf1 | inf1.xlarge | 8 | 1011.87 | 38.24 | 39.38 | 0.308 | 1182.71 |
| Inf1 | inf1.2xlarge | 8 | 1147.46 | 39.25 | 40.81 | 0.489 | 844.76 |
| GPU | g4dn.xlarge | 64 | 224.63 | 464.14 | 496.92 | 0.71 | 113.90 |
| GPU | p3.2xlarge | 64 | 736.92 | 264.85 | 272.09 | 4.194 | 63.25 |
| Elastic Inference *1 | eia1.medium | 8 | 67.79 | 148.56 | 150.70 | 0.220 | 110.93 |
| Elastic Inference *1 | eia1.large | 32 | 201.2 | 304.06 | 446.77 | 0.450 | 160.96 |
| Elastic Inference *1 | eia1.xlarge | 32 | 300.69 | 238.69 | 244.39 | 0.890 | 121.63 |
| CPU | c5.2xlarge | 16 | 20.59 | 868.74 | 896.24 | 0.428 | 17.32 |

In batch inference, which emphasizes throughput, hardware accelerators such as GPUs and Inferentia generally achieve higher utilization with larger batch sizes. On the GPU (G4dn and P3) instances, throughput peaked at a batch size of 64, whereas on the Inf1 instance, Inferentia utilization approached 100% and throughput peaked at a batch size of just 8. Compared with the G4dn instance, the Inf1 instance again achieved 4.6 times the throughput and a cost reduction of more than 90%.

For further details of NTTPC's verification presented at the event, please refer to their materials.

Summary

The requirements placed on machine learning workloads vary widely. Is the model a conventional machine learning model such as XGBoost or random forest, or a deep learning model that calls for a hardware accelerator? Is real-time performance required, or is batch processing acceptable? Is it acceptable for an instance to be interrupted? Do inference requests arrive constantly or only sporadically? Time-to-market, ease of development, and the development cost leading up to operation are also likely concerns. In this case, NTTPC, for whom inference processing accounts for a large share of operating costs, verified a range of instances centered on the Inf1 instance and confirmed the Inf1 instance's high cost performance.

* For Amazon EC2 options covering the various requirements of machine learning workloads, please refer to the "Amazon EC2 Options for Machine Learning Workloads" session from the same event.

Many AWS customers, such as Snap, Airbnb, and Sprinklr, use Amazon EC2 Inf1 instances to achieve high performance at low cost. We hope this event and blog post offer a useful hint for optimizing the cost performance of your inference workloads.

Finally, we would like to thank NTTPC Communications, Inc. for their cooperation with this event, and everyone who participated.

This blog post was written by Tokoyo of Annapurna Labs.
