Get the latest file from Azure Data Lake in Databricks

3/19/20221 min read

There are many ways to orchestrate a data flow in the cloud.

One such option is to have an independent process pull data from source systems and land the latest batch of data in an Azure Data Lake as a single file. The next layer where you process the data can be handled in many ways. The most independent way to do this is to have the processing layer fetch the latest file from the Data Lake on its own. This ensures the processing layer is not dependent on a previous tool or service giving the file path to it, increasing fault tolerance. 

In Databricks, there is no built in function to get the latest file from a Data Lake. There are other libraries available that can provide such functions, but it is advisable to always use standardised libraries and code as far as possible.

 Below are 2 functions that can work together to go to a directory in an Azure Data Lake and return the full file path of the last modified file. This file can then be processed as normal in Databricks. 

View GitHub Gist

Social

All content on this website is either created by us or is used under aa free to use license.. We create the posts here to help the community as best as we can. It doesn't mean we are always correct, or the methods we show here are the best. We all change and learn as we grow, so if you see something you think we could have done better, please reach out! Let's share the knowledge and be kind to each other!

DIAN

GDG

FRANS