Hewlett Packard Enterprise (HPE) has announced its acquisition of Pachyderm, a startup that specializes in open-source software designed to automate reproducible machine learning (ML) pipelines for large-scale AI applications. The financial details of the deal have not been disclosed.
Reproducible training data pipelines are crucial for transparency and accuracy in ML projects, particularly where compliance and explainability are required. Recreating an ML model is difficult, however, because the large, intricate datasets behind it are hard to manage. Retraining a model consistently requires the same unaltered datasets it was originally trained on, yet data and its associated code often change between training and deployment. Pachyderm addresses this by automating the creation of reproducible ML pipelines through data lineage and versioning. The platform pairs a distributed, immutable file system with an execution layer designed to work with it.
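To make the idea of versioned, immutable data concrete, here is a minimal sketch in Python. It is not Pachyderm's API or implementation; the `ToyRepo` class and its methods are hypothetical names invented for illustration. It assumes a content-addressed store in which each commit records its parent, so any earlier dataset version can be read back unchanged.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest used as a content address."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class Commit:
    """An immutable snapshot: file paths mapped to content hashes."""
    commit_id: str
    parent: Optional[str]
    files: Tuple[Tuple[str, str], ...]  # sorted (path, content_hash) pairs


class ToyRepo:
    """Hypothetical in-memory repository, for illustration only."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}      # content hash -> bytes
        self.commits: Dict[str, Commit] = {}   # commit id -> Commit
        self.head: Optional[str] = None

    def commit(self, files: Dict[str, bytes]) -> str:
        """Store a new snapshot; existing blobs are never overwritten."""
        entries = []
        for path, data in sorted(files.items()):
            content_hash = digest(data)
            self.blobs.setdefault(content_hash, data)
            entries.append((path, content_hash))
        snapshot = tuple(entries)
        # The id depends on the parent and the contents, so history is tamper-evident.
        commit_id = digest(repr((self.head, snapshot)).encode())
        self.commits[commit_id] = Commit(commit_id, self.head, snapshot)
        self.head = commit_id
        return commit_id

    def read(self, commit_id: str, path: str) -> bytes:
        """Read a file exactly as it was at the given commit."""
        files = dict(self.commits[commit_id].files)
        return self.blobs[files[path]]

    def history(self) -> List[str]:
        """Walk parent links from the newest commit back to the first."""
        out = []
        current = self.head
        while current is not None:
            out.append(current)
            current = self.commits[current].parent
        return out
```

In this sketch, committing a changed `train.csv` creates a second commit rather than altering the first, so a model trained against the first commit can later be retrained on byte-identical data.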
As AI projects grow and involve more complex datasets, data scientists need reproducible AI solutions to optimize their ML efforts, manage infrastructure costs, and keep data reliable and secure at every stage of the AI process. Pachyderm’s technology enhances HPE’s AI solutions, helping accelerate AI initiatives in fields such as image, video, and text analysis, generative AI, and large language models.
HPE aims to expand its AI portfolio by integrating its supercomputing technologies with the HPE Machine Learning Development Environment. This ML software enables users to develop, iterate, and scale models from proof-of-concept to production. With the addition of Pachyderm’s reproducible AI capabilities, HPE plans to create a unified platform that will streamline the process of refining, preparing, tracking, and managing repeatable ML workflows within development and training environments.
This integration is expected to speed up the deployment of large-scale AI applications and provide several key benefits:
- Data lineage: Users will gain full visibility of where data originates and how it moves throughout the ML lifecycle, making it easier to trace errors back to their source.
- Data versioning: Tracking data changes over time shows users when data was created or modified, so they can make adjustments more efficiently.
- Efficient incremental data processing: Pachyderm automates incremental data processing, meaning only new or changed data needs to be processed when updating AI applications.
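The incremental processing described in the list above can be sketched as follows. This is a simplified illustration in Python, not Pachyderm's actual mechanism; `run_incremental` and its parameters are hypothetical. It assumes that results are cached by the content hash of each input, so an unchanged input is never processed twice.

```python
import hashlib
from typing import Callable, Dict, List, Tuple


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of an input's contents."""
    return hashlib.sha256(data).hexdigest()


def run_incremental(
    inputs: Dict[str, bytes],
    transform: Callable[[bytes], bytes],
    cache: Dict[str, bytes],
) -> Tuple[Dict[str, bytes], List[str]]:
    """Apply transform only to inputs whose contents are not yet cached.

    Returns the full set of outputs plus the list of paths that were
    actually processed during this run.
    """
    outputs: Dict[str, bytes] = {}
    processed: List[str] = []
    for path, data in sorted(inputs.items()):
        key = digest(data)
        if key not in cache:
            # New or changed content: do the work and remember the result.
            cache[key] = transform(data)
            processed.append(path)
        outputs[path] = cache[key]
    return outputs, processed
```

Calling `run_incremental` a second time with the same `cache`, after one file has changed and one has been added, processes only those two files and reuses the cached result for the rest. Because the cache is keyed on content rather than file name, two files with identical contents also share a single result.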
Use cases for these capabilities span industries such as transportation, manufacturing, life sciences, and defense, particularly in natural language processing, computer vision, and image and video processing. Lockheed Martin, for example, has integrated both Pachyderm’s software and HPE’s ML Development Environment into its AI Factory, a foundational ecosystem for AI development. This integration has helped Lockheed standardize its AI technologies, improving trust and performance for national security missions.
Pachyderm was founded in 2014 by Joe Doliner and Joey Zwicker as a containerized alternative to Hadoop. HPE had already invested in the company through its venture capital arm, Hewlett Packard Pathfinder, during a $28.1 million funding round in February 2022, which also saw investment from Microsoft’s M12 and Y Combinator.
This acquisition, which is expected to close later this month, is not subject to regulatory approval. In 2021, HPE also acquired Determined AI, another startup focused on building and scaling optimized ML models, further strengthening its position in the AI space.