Bio
I am a software architect, open source leader, and entrepreneur who loves collaborating with others on open source projects. I started the Parquet project in collaboration with the Impala team at Cloudera while I was at Twitter. I chaired the project for many years at the Apache Software Foundation, and Parquet is now the de facto standard for data lakes. I later contributed to the creation of the Arrow project as a founding engineer at Dremio. Before that, I received my initiation in contributing to open source in the Apache Pig project, where I evolved from contributor to committer to PMC member and eventually chaired the project in 2013. More recently, I started the OpenLineage project while I was CTO and co-founder of Datakin, which was later acquired by Astronomer. OpenLineage came out of Marquez, the project we co-created at WeWork on the data platform team.
I blog at Sympathetic.Ink
Projects
- Apache Parquet: co-creator, PMC member, PMC chair 2015-2021
- Apache Arrow: co-creator and PMC member
- Apache Pig: PMC member, PMC chair 2013
- Apache Iceberg: PMC member
- OpenLineage: creator and project lead at the LF AI & Data Foundation
- Marquez: co-creator, former project lead at the LF AI & Data Foundation
- Brenus: a Java bytecode generation library
Podcasts
It Depends
- Celebrating 10 years of Apache Parquet with Julien Le Dem and Nong Li; Ep. 34 April 2023
DC_THURS
- Data Lineage w/ Julien Le Dem (Datakin)
Data Driven NYC
- Data Observability and Pipelines: OpenLineage and Marquez
Data Engineering Podcast
- Data Serialization Formats with Doug Cutting and Julien Le Dem
- Solving Data Lineage Tracking And Data Discovery At WeWork
- Unlocking The Power of Data Lineage In Your Platform with OpenLineage
The Analytics Engineering Roundup
Software Engineering Daily
- Understanding data lineage at scale with Julien Le Dem
- Columnar Data: Apache Arrow and Parquet with Julien Le Dem and Jacques Nadeau
Presentations
Over the years I have given a number of talks. You’ll find them in chronological order on the presentations page. You’ll also find a playlist of talk recordings on YouTube.
Nurturing Open Source communities
- Data Council 2024: Ten+ years of building open source standards.
- SBTB 2023: Ten years of building open source standards.
- Data Council 2023: Ten years of building open source standards: From Parquet to Arrow to OpenLineage
- Airflow Summit 2023: Nurturing an Open Source Community is Like Tending a Garden
- Subsurface 2023: Ten years of building open source standards: From Parquet to Arrow to OpenLineage
Open Data Lineage: OpenLineage, Marquez
- Data and AI Summit 2023: Cross-Platform Data Lineage with OpenLineage
- Berlin Buzzwords 2022: Cross-Platform Data Lineage with OpenLineage
- Data and AI Summit May 2021: Data lineage and observability with OpenLineage
- Data Driven January 2021: Data pipelines observability with OpenLineage and Marquez
- Subsurface 2020: Data Lineage and observability with Marquez
- OpenCore Summit: Observability for data pipelines with OpenLineage
Data Architecture
- IEEE Infrastructure 2020: Data Platform Architecture Principles
- Strata: From flat file to deconstructed database, The evolution and future of the Big Data Ecosystem (slides: Strata SF 2019, Strata NY 2018)
- Data Eng Conf April 2018: From flat file to deconstructed database, The evolution and future of the Big Data ecosystem.
Columnar formats: Parquet, Arrow
- Data Works Summit 2018: The columnar roadmap, Apache Parquet and Apache Arrow
- NABD Conference 2017: The future of column-oriented data processing with Arrow and Parquet
- Strata NY 2017: The columnar roadmap, Apache Parquet and Apache Arrow (video)
- Mulesoft March 2017: The future of column-oriented data processing with Arrow and Parquet
- Spark Summit 2017: Improving Python and Spark Performance and Interoperability with Apache Arrow
- Hadoop Summit 2017: The columnar roadmap, Apache Parquet and Apache Arrow
- Strata NY 2016: The future of column-oriented data processing with Arrow and Parquet
- Berlin Buzzwords 2016: Efficient Data formats for Analytics with Parquet and Arrow
- Strata London 2016: The future of column-oriented data processing with Arrow and Parquet
- Data Eng Conf NY November 2016: The future of column-oriented data processing with Arrow and Parquet
- Big Data Apps meetup Jan 2016: SQL-on-Everything with Apache Drill
- Hadoop Summit 2015: How to use Parquet as a basis for ETL and analytics
- Strata 2015: How to use Parquet as a basis for ETL and analytics
- HPTS 2015: If you have your own Columnar format, stop now and use Parquet
- Twitter Open House: Parquet, An open columnar file format for Hadoop
- Efficient Data Storage for Analytics with Apache Parquet 2.0
- Hadoop Summit 2013: Parquet, Columnar storage for the people
- Strata Hadoop World 2013: Parquet, Columnar storage for the people
- Drill meetup: Parquet Overview
Embedding Pig in scripting languages
- Pig meetup: Embedding Pig in scripting languages
- Hadoop Summit 2011: PIG Scripting, Making Pig Turing-complete through embedding in a scripting language