AI Hardware

What we know about Apple’s on-device AI

Following Microsoft Build and Google I/O, Apple was under considerable pressure to show its on-device AI might at its Worldwide Developers Conference 2024. Judging by the demos, Apple has done a great job of integrating generative AI into the user experience across all its devices.

By Priya Khosla June 27, 2024

One of the most impressive aspects of the demonstrations was how much of the workload runs on the devices themselves. Apple has leveraged its state-of-the-art processors, as well as a slew of open research, to provide high-quality, low-latency AI capabilities on its phones and computers. Here is what we know about Apple’s on-device AI.

3-billion-parameter model

According to Apple’s Platforms State of the Union presentation and an accompanying blog post released on June 10, the on-device model has about 3 billion parameters. Apple does not explicitly say which base model it uses, but it recently released several open models, including the OpenELM family of language models, which includes a 3-billion-parameter version.

OpenELM has been optimized for resource-constrained devices. For example, its designers modified the underlying transformer architecture to improve the model’s quality without increasing the parameter count. The foundation model used in Apple devices might be a specialized version of OpenELM-3B.

OpenELM was trained on 1.8 trillion tokens of open datasets. According to the blog post, the new foundation model is trained on “licensed data, including data selected to enhance specific features, as well as publicly available data collected by our web crawler, AppleBot.”

What is this licensed data? Apple reportedly has a $25 million to $50 million deal with Shutterstock for images and a possible $50 million deal with major news and publishing organizations.

The model has been fine-tuned for instruction-following through reinforcement learning from human feedback (RLHF) and a “rejection sampling fine-tuning algorithm with teacher committee.” RLHF uses human-annotated data to model user preferences and train language models to follow instructions better. The technique became popular with the release of ChatGPT.
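The preference-modeling step of RLHF can be illustrated with the pairwise loss commonly used to train reward models. This is a minimal sketch, not Apple’s training code: the blog post does not say which loss Apple used, and the function name and the example scores are hypothetical.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise (Bradley-Terry) loss: -log(sigmoid(reward_chosen - reward_rejected)).

    The loss is small when the reward model scores the human-preferred
    response higher than the rejected one, and large when it does not.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow in exp().
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical reward-model scores for two responses to the same prompt.
print(preference_loss(2.0, 0.0))  # reward model agrees with the annotator: low loss
print(preference_loss(0.0, 2.0))  # reward model disagrees: high loss
```

A reward model trained this way is then used to score the language model’s outputs during the reinforcement learning stage.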

Rejection sampling generates multiple examples at each training step and uses the one that scores best to update the model. The Llama 2 team also used rejection sampling in fine-tuning their models. “Teacher committee” suggests that a larger and more capable model was used as a reference to evaluate the quality of the training examples generated to fine-tune the on-device model. Many researchers use frontier models such as GPT-4 and Claude 3 as teachers in these scenarios. It is not clear which models Apple used for sample evaluation.
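The selection step described above can be sketched as a best-of-n loop in which several teachers score each candidate. This is an illustration under stated assumptions, not Apple’s algorithm: averaging the teachers’ scores is one plausible reading of “committee,” and every name here (rejection_sample, committee_score, the toy generator and teachers) is hypothetical.

```python
import itertools
from statistics import mean
from typing import Callable, Sequence

Scorer = Callable[[str, str], float]  # (prompt, response) -> quality score


def committee_score(prompt: str, response: str, teachers: Sequence[Scorer]) -> float:
    """Average the scores given by every teacher (an assumed aggregation rule)."""
    return mean(teacher(prompt, response) for teacher in teachers)


def rejection_sample(prompt: str, generate: Callable[[str], str],
                     teachers: Sequence[Scorer], n: int = 4) -> str:
    """Draw n candidate responses and keep only the best-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda r: committee_score(prompt, r, teachers))


# Toy stand-ins for the student model and the teacher models.
_canned = itertools.cycle(["ok", "Sure, here is a summary.", "no"])


def toy_generate(prompt: str) -> str:
    return next(_canned)


def likes_detail(prompt: str, response: str) -> float:
    return float(len(response))


def likes_politeness(prompt: str, response: str) -> float:
    return 10.0 if response.startswith("Sure") else 0.0


best = rejection_sample("Summarize this email", toy_generate,
                        [likes_detail, likes_politeness], n=3)
print(best)  # the kept response would become a fine-tuning example
```

In a real pipeline the generator would be the model being fine-tuned and the scorers would be stronger models or reward models; the kept responses then serve as supervised training targets.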

Optimization

Apple has used several techniques to improve the capabilities of the models while keeping them resource-efficient.

According to the blog post, the foundation model uses “grouped query attention” (GQA), a technique developed by Google Research that speeds up inference without inflating memory and compute requirements. (OpenELM also uses GQA.)
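In GQA, several query heads share a single key/value head, so the key/value cache kept in memory during generation shrinks in proportion. The sketch below shows only that head sharing and the resulting cache arithmetic. The layer count, head counts and sequence length are made-up values, not Apple’s configuration, and the function names are hypothetical.

```python
def kv_head_for(query_head: int, n_query_heads: int, n_kv_heads: int) -> int:
    """Index of the shared key/value head that a given query head reads from."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV groups")
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Key/value cache size for one sequence (the factor 2 covers keys and values)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical configuration, chosen only for illustration.
LAYERS, QUERY_HEADS, KV_HEADS, HEAD_DIM, SEQ_LEN = 32, 24, 8, 128, 4096

mha = kv_cache_bytes(LAYERS, QUERY_HEADS, HEAD_DIM, SEQ_LEN)  # one KV head per query head
gqa = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN)     # 8 shared KV heads
print(f"multi-head: {mha / 2**20:.0f} MiB, grouped-query: {gqa / 2**20:.0f} MiB")
print([kv_head_for(q, QUERY_HEADS, KV_HEADS) for q in range(QUERY_HEADS)])
```

With these illustrative numbers, sharing each key/value head among three query heads cuts the cache to one third of its multi-head size, which matters on a phone where memory bandwidth is limited.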

According to the Apple blog, the model uses “palletization,” a technique that compresses the model’s weights by using look-up tables and indices to group similar model weights together. However, the presentation mentions “quantization,” which is another compression technique that reduces the number of bits per parameter.
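The difference between the two compression techniques is easier to see in code. The following is a toy sketch on a handful of weights, not Apple’s implementation: it assumes a simple one-dimensional k-means for the look-up table and plain uniform quantization, and the function names and bit widths are chosen only for illustration.

```python
from typing import List, Tuple


def _nearest(table: List[float], w: float) -> int:
    """Index of the table entry closest to weight w."""
    return min(range(len(table)), key=lambda j: abs(w - table[j]))


def palettize(weights: List[float], n_bits: int = 2,
              iters: int = 10) -> Tuple[List[float], List[int]]:
    """Cluster weights into 2**n_bits shared values; store a table plus indices."""
    k = 2 ** n_bits
    lo, hi = min(weights), max(weights)
    table = [lo + (hi - lo) * i / (k - 1) for i in range(k)]  # evenly spaced start
    for _ in range(iters):
        indices = [_nearest(table, w) for w in weights]
        for j in range(k):
            members = [w for w, i in zip(weights, indices) if i == j]
            if members:  # move each table entry to the mean of its cluster
                table[j] = sum(members) / len(members)
    return table, [_nearest(table, w) for w in weights]


def quantize(weights: List[float], n_bits: int = 4) -> Tuple[float, float, List[int]]:
    """Map weights onto 2**n_bits evenly spaced levels between min and max."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / (2 ** n_bits - 1) or 1.0  # avoid a zero scale
    return scale, lo, [round((w - lo) / scale) for w in weights]


weights = [-0.51, -0.49, 0.0, 0.01, 0.5, 0.52, 1.0, 0.99]

table, indices = palettize(weights, n_bits=2)
print("palettized:", [round(table[i], 3) for i in indices])

scale, lo, codes = quantize(weights, n_bits=4)
print("quantized: ", [round(lo + c * scale, 3) for c in codes])
```

Palettization lets the stored values sit wherever the weights cluster, while uniform quantization fixes them on an evenly spaced grid. Either way the saving comes from storing a few bits per weight: as a rough illustration, 3 billion weights at 16 bits each take about 6 GB, and at 4 bits each about 1.5 GB, before counting the small look-up tables or scale factors.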

Source: VentureBeat