Curated on August 12, 2024
Hugging Face, one of the world's largest AI communities, has unveiled a new inference-as-a-service offering that leverages NVIDIA NIM microservices running on NVIDIA DGX Cloud for greater efficiency and speed. The service is designed to help developers deploy leading large language models, such as the Llama 3 family and Mistral AI models, with up to 5x better token efficiency. It gives Hugging Face's 4-million-strong developer community a way to rapidly prototype, test, and deploy models with minimal infrastructure overhead.
Announced at the SIGGRAPH conference, the service complements the existing Train on DGX Cloud feature, giving developers a single environment for both training and inference. The serverless inference service, available through the Enterprise Hub, offers greater flexibility and optimized performance. Developers can reach these services from the 'Train' and 'Deploy' drop-down menus on Hugging Face model cards, so getting started takes only a few clicks. The integration delivers faster, more robust AI model results and production-ready applications without long-term infrastructure commitments.
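For developers who prefer to script against the hosted models rather than use the model-card menus, access typically goes through the Hugging Face client library. The snippet below is a minimal sketch using the huggingface_hub InferenceClient; the model ID, token placeholder, and generation parameters are illustrative assumptions, and the exact endpoint configuration for the DGX Cloud-backed service may differ.

```python
# Minimal sketch: querying a hosted Llama 3 model through the Hugging Face
# InferenceClient. Model ID, token, and parameters are illustrative only;
# the DGX Cloud-backed endpoint configuration may differ in practice.
from huggingface_hub import InferenceClient

client = InferenceClient(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # assumed model ID
    token="hf_xxx",  # your Hugging Face access token
)

response = client.chat_completion(
    messages=[{"role": "user", "content": "Summarize NIM microservices in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```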
NVIDIA NIM microservices are optimized for token throughput, improving the speed and efficiency of models served on NVIDIA DGX Cloud infrastructure. For instance, the 70-billion-parameter version of Llama 3 can achieve up to 5x higher throughput when served with NIM compared with a standard deployment on NVIDIA H100 Tensor Core GPU systems. This means developers on Hugging Face get state-of-the-art performance and scalable GPU resources at every stage of AI development. At SIGGRAPH, NVIDIA also presented new generative AI models for the OpenUSD framework and showcased more than 100 NIM microservices applicable across various industries.
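As a rough illustration of how a NIM microservice is consumed once it is running, the sketch below uses the OpenAI-compatible API that NIM containers typically expose; the base URL, API key handling, and model name are assumptions for illustration and are not taken from the announcement.

```python
# Sketch: querying a running Llama 3 70B NIM microservice through its
# OpenAI-compatible endpoint. The base_url, api_key, and model name are
# assumptions for illustration only.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

completion = client.chat.completions.create(
    model="meta/llama3-70b-instruct",  # assumed NIM model identifier
    messages=[{"role": "user", "content": "What does DGX Cloud provide?"}],
    max_tokens=128,
)
print(completion.choices[0].message.content)
```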