MLPerf, MLCommons, and improving benchmarking

MLPerf Evolves Under MLCommons: Executive Director David Kanter Discusses Future Benchmarks and Expanding Datasets

By Stephanie Wright July 8, 2020


MLPerf is evolving

MLPerf, the independent ML benchmarking body, is evolving. Its Executive Director, David Kanter, was good enough to brief me on his role and on how MLPerf is folding into MLCommons, its new parent organization. MLCommons will not just manage MLPerf but go beyond it, for example by establishing datasets for use in MLPerf. David pointed me to his presentation at the recent Nvidia GTC in May, where he introduced MLPerf in the first half and then spoke about MLCommons and its plans in the second half. If you are new to the MLPerf benchmark, the extract below from my recent Landscape report on AI Hardware Accelerators is based on the first half of David’s presentation, available here: https://developer.nvidia.com/gtc/2020/video/s22099

AI hardware accelerators: benchmarks

AI benchmarks from independent bodies help the AI hardware industry make progress by showing users and consumers how products on the market compare with each other. Many factors go into a purchasing decision, such as compliance in regulated industries, the maturity of the software stack, power requirements, memory requirements, running costs, size of device, and more, but benchmarks have an important role. Their value is not just to consumers but also to manufacturers: they show how products perform on a level playing field and provide healthy competition that drives innovation.

The leading ML benchmark today is MLPerf, now managed by a new entity, MLCommons. Its aim is to measure full-system (software and hardware) performance, such as execution time and throughput, rather than to find the most accurate model for an application. There are industry competitions whose aim is to identify the best-performing AI model in terms of accuracy; that is not the purpose of MLPerf.

Currently MLPerf has three benchmark suites: one for training and two for inference (one for general systems and one for mobile phones). The approach to performance is to measure how long the work takes. For the training suite benchmarks there are three givens: dataset, model, and target quality. A model (e.g. ResNet) is trained on the dataset (e.g. ImageNet) until it achieves the target quality (e.g. accuracy of 75.9%). The performance metric here is time to train. Note that increasing throughput by reducing the model precision can increase the time to train to the target accuracy (if the model reaches it at all), because more training steps may be needed, thereby increasing the running cost to the user. This is why throughput is not used as the training benchmark metric.
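To make the time-to-train idea concrete, here is a minimal sketch in Python. It is not MLPerf code: the function name, the shape of the training history, and all the numbers are made up for illustration. It simply finds the first evaluation point at which a run reaches the target quality.

```python
from typing import Iterable, Optional, Tuple


def time_to_train(
    history: Iterable[Tuple[float, float]], target_quality: float
) -> Optional[float]:
    """Return the elapsed seconds at the first evaluation that meets the target.

    `history` is a sequence of (elapsed_seconds, quality) pairs, one per
    evaluation. Returns None if the run never reaches the target.
    """
    for elapsed_seconds, quality in history:
        if quality >= target_quality:
            return elapsed_seconds
    return None


# Hypothetical runs (illustrative numbers only): the reduced-precision run
# completes each epoch faster, but needs more epochs to reach the target.
full_precision = [(100.0, 0.60), (200.0, 0.72), (300.0, 0.761)]
reduced_precision = [
    (70.0, 0.55), (140.0, 0.68), (210.0, 0.73), (280.0, 0.75), (350.0, 0.760),
]

TARGET = 0.759  # the 75.9% example target quality quoted in the text

print(time_to_train(full_precision, TARGET))
print(time_to_train(reduced_precision, TARGET))
```

With these made-up numbers the reduced-precision run has the higher throughput (70 seconds per epoch against 100) yet the longer time to train (350 seconds against 300), which is the effect described above.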

MLPerf has two divisions for its training benchmarks: closed and open. The closed division specifies the model, making it the most suitable set of benchmarks for like-for-like comparisons. In the open division the runner of the benchmark can choose their own model.

For inference MLPerf measures the rate of inference. Again, there are three benchmark givens: an input (e.g. an image), the trained model (e.g. ResNet) and the required accuracy for the result (e.g. accuracy of 75.1%). Again, there are also two inference divisions, closed and open, for model specification.

Inferencing use cases vary in practice so MLPerf defines four:

Single stream: one image at a time streaming into the model (e.g. mobile phone augmented vision). MLPerf metric is latency of the response.
Multiple stream: multiple images being simultaneously presented on a regular cadence (e.g. multiple cameras in driving assistance). MLPerf metric is how many concurrent streams can be supported given a latency limit.
Server: in this scenario objects, single or multiple, appear at the input in random cadence (e.g. a cloud translation app). MLPerf metric is throughput: queries per second, given a latency limit.
Offline: batch processing where all the data is available at once (e.g. photo sorting app). MLPerf metric is just throughput: queries per second, with no time limit.
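The scenario metrics above can be sketched from a list of measured per-query latencies. This is a simplified illustration, not the MLPerf load generator: the helper names are hypothetical, and the percentile values are my assumption for the example (the MLPerf rules define the exact statistics used).

```python
import math
from typing import Sequence


def percentile_latency(latencies_s: Sequence[float], percentile: float) -> float:
    """Nearest-rank percentile of the measured latencies, in seconds."""
    ordered = sorted(latencies_s)
    rank = max(1, math.ceil(percentile * len(ordered) / 100.0))
    return ordered[rank - 1]


def meets_latency_limit(
    latencies_s: Sequence[float], limit_s: float, percentile: float
) -> bool:
    """Latency-bounded scenarios: is the chosen percentile within the limit?"""
    return percentile_latency(latencies_s, percentile) <= limit_s


def offline_throughput(num_queries: int, total_seconds: float) -> float:
    """Offline scenario: queries per second, with no latency limit."""
    return num_queries / total_seconds


# Illustrative latencies (seconds) for ten queries, one of them slow.
latencies = [0.010, 0.012, 0.011, 0.030, 0.013,
             0.012, 0.011, 0.010, 0.014, 0.012]

print(percentile_latency(latencies, 90))          # single-stream style metric
print(meets_latency_limit(latencies, 0.015, 90))  # server-style latency check
print(offline_throughput(1000, 4.0))              # offline-style metric
```

In the multiple stream and server scenarios, the reported figure is the largest number of streams or queries per second for which a check like `meets_latency_limit` still passes.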

One of the challenges in any benchmarking is defining the hardware system: an AI hardware manufacturer may supply a benchmark result based on a board with multiple AI chips. To allow comparison, MLPerf requires the number of chips to be specified in the submission. This allows a benchmark reader to normalize the results to one chip, although doing so assumes that performance scales linearly with the number of chips. For a small number of chips this is a reasonable assumption; if non-linearity appears at all, it will be at the other end of the scale, when running a very large number of chips in a scale-out scenario.
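The per-chip normalization is simple arithmetic, sketched below with hypothetical function names and made-up throughput figures. The second helper shows how a reader could check the linearity assumption when a single-chip result is also available.

```python
def per_chip_throughput(system_throughput: float, num_chips: int) -> float:
    """Normalize a system-level result to one chip, assuming linear scaling."""
    if num_chips <= 0:
        raise ValueError("num_chips must be positive")
    return system_throughput / num_chips


def scaling_efficiency(
    system_throughput: float, num_chips: int, single_chip_throughput: float
) -> float:
    """Ratio of measured throughput to ideal linear scaling (1.0 = linear)."""
    return system_throughput / (num_chips * single_chip_throughput)


# Illustrative numbers only: an 8-chip board that scales linearly, and a
# 64-chip scale-out system that falls short of linear scaling.
print(per_chip_throughput(8000.0, 8))
print(scaling_efficiency(8000.0, 8, 1000.0))
print(scaling_efficiency(56000.0, 64, 1000.0))
```

An efficiency well below 1.0, as in the last made-up case, is a sign that dividing by the chip count understates what a single chip can do.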

There are other AI benchmarks in operation:

AI-benchmarks: focused on smartphone tasks.
AI Matrix: initiated by Alibaba.
MLMark (an EEMBC benchmark): focused on embedded ML inference.

Improving MLPerf

In his GTC presentation David talks about how MLPerf is to be improved. In working with the sixteen vendors participating in the Kisaco Research reports on AI hardware accelerators, I soon saw the challenges of comparing AI chips side by side; I discuss that experience in another blog. It would help if MLPerf included power consumption and also provided a TOPS/Watt figure in ML inference mode. David mentions doing this, and I hope that a) this comes about, and b) the vendor participants in MLPerf supply the data.
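For readers unfamiliar with the TOPS/Watt figure, it is tera-operations per second divided by power draw in watts. The sketch below uses a hypothetical function name and made-up numbers, and it assumes both quantities are measured under the same inference workload.

```python
def tops_per_watt(ops_per_second: float, power_watts: float) -> float:
    """Tera-operations per second delivered for each watt consumed."""
    if power_watts <= 0:
        raise ValueError("power_watts must be positive")
    return (ops_per_second / 1e12) / power_watts


# Illustrative only: a chip sustaining 100 TOPS while drawing 50 W.
print(tops_per_watt(100e12, 50.0))
```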

Appendix

Author

Michael Azoff, Chief Analyst


Copyright notice and disclaimer

The contents of this product are protected by international copyright laws, database rights and other intellectual property rights. The owner of these rights is Kisaco Research Ltd., our affiliates or other third-party licensors. All product and company names and logos contained within or appearing on this product are the trademarks, service marks or trading names of their respective owners, including Kisaco Research Ltd. This product may not be copied, reproduced, distributed or transmitted in any form or by any means without the prior permission of Kisaco Research Ltd.

Whilst reasonable efforts have been made to ensure that the information and content of this product was correct as at the date of first publication, neither Kisaco Research Ltd. nor any person engaged or employed by Kisaco Research Ltd. accepts any liability for any errors, omissions or other inaccuracies. Readers should independently verify any facts and figures as no liability can be accepted in this regard - readers assume full responsibility and risk accordingly for their use of such information and content.

Any views and/or opinions expressed in this product by individual authors or contributors are their personal views and/or opinions and do not necessarily reflect the views and/or opinions of Kisaco Research Ltd.