
Build a PyTorch ReLU Kernel with Hugging Face Kernels (CPU + Metal)
HuggingFace
Overview
This video introduces the Hugging Face Kernels library, a tool designed to simplify the building, packaging, and distribution of custom PyTorch kernels. It explains the library's purpose: to provide a standardized interface for kernel implementation and a seamless way for users to access pre-compiled kernels without complex installation processes. The presenter walks through the end-to-end workflow, demonstrating how to build a ReLU kernel for both CPU (ARM Neon) and Metal (Apple GPU) backends. The demonstration highlights the kernel builder's ability to handle cross-platform compatibility and integration with PyTorch's `torch.compile`. The video emphasizes the benefits of this approach, such as avoiding naming conflicts, enabling multiple kernel versions, and abstracting away build complexities like CMake errors for end-users. It concludes with a brief look at the documentation for further exploration.
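The flow described above — `pip install kernels`, then pull a pre-compiled kernel from the Hub at runtime — can be sketched as follows. This mirrors the example in the `kernels` library documentation; the repo id `kernels-community/activation` and the `gelu_fast` op come from that example rather than from this video's ReLU kernel, and the live call needs a supported accelerator plus network access, so it is wrapped in a guard.

```python
try:
    import torch
    from kernels import get_kernel

    # Download a pre-compiled kernel from the Hugging Face Hub; the library
    # selects the build variant matching the local torch/hardware combination.
    activation = get_kernel("kernels-community/activation")

    x = torch.randn((10, 10), dtype=torch.float16, device="cuda")
    y = torch.empty_like(x)
    activation.gelu_fast(y, x)  # kernel writes its result into y
    demo_ok = True
except Exception:
    demo_ok = False  # torch/kernels not installed, or no CUDA device available
```

No compilation happens on the user's machine: the artifact was built and uploaded ahead of time, which is what lets end-users skip CMake entirely.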
Chapters
- Overview of the Hugging Face Kernels library and its purpose.
- Demonstration of a simple workflow for building and using a custom kernel.
- Explanation of the library's role as a standardized interface for kernel development.
- Kernels are distributed as artifacts, not pip-installed packages, allowing for flexible versioning and deployment.
- Three main user groups: kernel users, kernel builders, and kernel explorers.
- Users can pip install the `kernels` package to download and use pre-built kernels.
- The `kernel_builder` tool helps package and standardize kernel development.
- A community repository on Hugging Face Hub hosts state-of-the-art pre-compiled kernels.
- Portability: Kernels can be stored on Hugging Face Hub, local disk, or private networks.
- Version Management: Easily manage multiple versions of the same kernel without naming conflicts.
- Cross-Platform Compatibility: Simplifies building kernels for different Python, PyTorch, and hardware versions (CPU, CUDA, Metal).
- Reduced Build Complexity: Aims to eliminate CMake errors for users and developers.
- The goal is to go from source code (e.g., C++, CUDA) to compiled artifacts on Hugging Face Hub.
- The `kernel_builder` handles compatibility across Python/PyTorch versions and hardware targets.
- Kernels registered as native PyTorch extensions are compatible with `torch.compile`.
- The `kernel_builder` uses Nix for isolated, deterministic builds with caching.
- Focus on building a ReLU kernel for CPU (ARM Neon) and Metal (Apple GPU).
- CPU implementation uses Neon intrinsics for acceleration.
- Metal implementation includes kernels for half and full float precision.
- PyTorch extension registration bridges C++/Metal/CUDA code with Python.
- The `build.yaml` file defines the kernel name, target backends (CPU, Metal), and dependencies.
- The `kernel_builder` generates the necessary build outputs.
- Leverages Hugging Face cache for pre-built artifacts, speeding up the build process.
- The build process generates compiled `.so` files and `ops` files for Python integration.
- Using `get_local_kernel` to load the built ReLU kernel from a local directory.
- The library automatically selects the correct kernel based on the runtime environment (PyTorch version, hardware).
- Demonstration of applying the ReLU kernel to a tensor on the Apple GPU.
- Verification that negative values are zeroed out and positive values remain unchanged.
- Recap of the Hugging Face Kernels library's benefits.
- Encouragement to explore the official documentation for more examples and features (e.g., layers).
- Invitation for questions and further contact.
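The local-load step from the walkthrough can be sketched with `get_local_kernel`, which loads a kernel straight from a `kernel_builder` output directory with no Hub round-trip. The build path, package name (`relu`), and op call below follow the video's example and are assumptions about the local layout; without a completed local build, the guard simply skips the demo.

```python
from pathlib import Path

try:
    import torch
    from kernels import get_local_kernel

    # Load the locally built kernel package (path/name are illustrative).
    ops = get_local_kernel(Path("./relu"), "relu")

    # "mps" targets the Apple GPU via Metal, as in the video's demo.
    x = torch.tensor([-2.0, -0.5, 0.0, 1.5], device="mps")
    print(ops.relu(x))  # negatives zeroed, positives unchanged
    loaded = True
except Exception:
    loaded = False  # kernels/torch missing, or no local build/Apple GPU present
```

Because the same loading call resolves the right artifact for the runtime environment, the identical script works against a CPU (Neon) or Metal build without changes.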
Key takeaways
- The Hugging Face Kernels library simplifies the creation and distribution of custom PyTorch kernels.
- Kernels are managed as portable artifacts, not traditional pip packages, enabling flexible versioning and deployment.
- The `kernel_builder` tool standardizes the development process and handles cross-platform compilation (CPU, Metal, CUDA).
- Using the library abstracts away complex build configurations (like CMake) for end-users.
- Kernels built with this system integrate seamlessly with PyTorch's `torch.compile`.
- The library automatically selects the appropriate pre-compiled kernel based on the user's hardware and software environment at runtime.
- This approach resolves potential naming conflicts when using multiple versions or types of the same kernel.
- The documentation provides valuable resources for users interested in leveraging or contributing to the kernels ecosystem.
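The verification step mentioned in the chapters — negative values zeroed out, positive values unchanged — can be checked against a pure-Python reference, independent of any compiled kernel:

```python
def relu_reference(values):
    # Reference ReLU: max(0, v) element-wise, used to sanity-check the
    # custom kernel's output on the same inputs.
    return [v if v > 0 else 0.0 for v in values]

out = relu_reference([-2.5, -0.1, 0.0, 1.5, 3.0])
# → [0.0, 0.0, 0.0, 1.5, 3.0]
```

Comparing the compiled kernel's output to this reference on a handful of values is exactly the kind of spot check the video performs on the Apple GPU.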