Datastrato Origin Story

· 5 min read

Datastrato

Who are We? Why did we start Datastrato?

I am Junping Du, the Founder and CEO of Datastrato. My long-term open source contributions to the Apache Hadoop project enhanced its performance and adaptability within virtualized and cloud-based environments.

My cloud computing odyssey began in 2008 at VMware, where I was part of the vCloud Director engineering team. This was a pioneering time in cloud services, when AWS was the only option and offered only basic services like S3 and EC2, without advanced features we see today such as VPCs. Over the following decade, I saw application workloads shift from on-premises data centers to public cloud platforms.

In 2010, I turned my focus to data and pioneered VMware’s Hadoop initiatives at a time when Cloudera was just a year old and Hortonworks hadn't emerged yet. During this period, 'Big Data' was synonymous with 'Hadoop', and AWS EMR was still in beta. I developed the equivalent of EMR on VMware's platforms like vSphere, vCloud, and CloudFoundry, and contributed the Hadoop Virtualization Extensions (HADOOP-8468) to the Apache Hadoop community, which enable optimal Hadoop performance in cloud settings and are used in many organizations today.

My passion eventually led me to become a lead contributor to Hadoop at Hortonworks in 2014. There I spearheaded the development of Hadoop YARN, a critical infrastructure component of the Hadoop stack. Over the next decade, I observed an increasing trend of data moving from private to public cloud environments.

In 2018, on the cusp of Hortonworks' merger with Cloudera, I moved from Hortonworks to a top-tier global cloud vendor, where I was tasked with developing data warehouse and data lake services from the ground up. This period marked a significant intensification in global cloud market competition. To avoid vendor lock-in and improve their leverage in price negotiations, many companies began engaging multiple cloud vendors. While third-party reports indicate that this approach introduced some management complexities, it effectively mitigated the risks associated with reliance on a single cloud provider. The growing trend of mergers and acquisitions, along with other international business dynamics, further accelerated the adoption of multi-cloud strategies. Operating across multiple clouds has become an increasingly common reality for organizations dealing with both complexity and scale.

As a result, future data platforms should inherently embrace multi-cloud architecture. This perspective aligns with the ongoing transition of workloads and data from single-cloud environments to more diverse and integrated multi-cloud and hybrid cloud systems. Even today, there are no prominent and effective solutions, whether open source or commercial, for dealing with these environments.

Around the same period, I engaged in a thought-provoking dialogue on data and AI innovation with two long-standing colleagues and friends, Jerry Shao and Xun Liu. Jerry, a former colleague at Hortonworks, embarked on his big data journey early at Intel and later emerged as a prominent figure in the Apache Spark community, contributing as a Spark committer and a Project Management Committee (PMC) member at Hortonworks. He is renowned for initiating the Apache Uniffle project, the first stable open-source remote shuffle service, addressing a longstanding challenge in the Spark, MapReduce, and Flink workload ecosystem.

Xun, a longstanding collaborator in the Hadoop community, has worked extensively with me on various projects, including Apache YuniKorn, Submarine, and others. Today, YuniKorn stands as the de facto standard for deploying data services like Hadoop and Spark on Kubernetes (K8s), thanks to our collaborative efforts and contributions.

Drawing from our shared experiences and insights, we swiftly reached a consensus to embark on an exciting new venture focused on multi-cloud data initiatives. This decision was grounded in our extensive backgrounds and collective wisdom as seasoned data platform engineers.

Thus, we, three Apache members (comprising two Hadoop committers and one Spark committer), decided to leave our full-time jobs to delve into the uncharted territories of the data world. Our mission is to dismantle data silos across diverse infrastructures, whether they are multi-cloud, hybrid cloud, or others.

We've named our venture "Data Stratosphere," drawing an analogy to the stratosphere, a layer above the clouds where jets soar for smoother travel. Our goal is to offer a similar ease to 'data pilots', eliminating concerns about data isolation and the extra costs associated with multi-cloud adoption. At Datastrato, we aim to provide comprehensive management, governance, analytics, and acceleration for data, addressing the fundamental needs of enterprise data teams.

In our vision, Datastrato is more than a company; it's a commitment to smooth, cost-effective data management across diverse cloud infrastructures, guided by our extensive experience and dedication to innovation in the data world.

Apache®, Apache Hadoop®, Apache Hive™, Apache Iceberg™, Apache Kafka®, Apache Spark™, Apache Submarine™, Apache Thrift™ and Apache Zeppelin™ are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.