Automated Snowflake External Table on Delta

How to automatically create an external table on a Delta table in Snowflake to enable data sharing both internally and externally. Alternative solutions are also discussed.

SNOWFLAKE

Dian Germishuizen

7/25/2024 · 3 min read

In this article, I will show you how I automated the process of creating an external table in Snowflake that references a Delta table in cloud storage (Azure Data Lake in this instance, but the same code will work on AWS S3).

Without further ado, the code is shown below (mobile users need to click the button to see it on GitHub, sorry).

How does it work?

  • First, we declare a small army of variables to keep the body of the stored procedure clean.

  • Next, we check that the target directory in cloud storage actually exists. I always try to build checks and balances into my code so that I catch exception states as early as possible.

  • Next, we need to overcome a niggly limitation in Snowflake: every external table has a reserved column named “VALUE”, which contains each source record in full as a JSON object. The problem is that if your source data also has a field called “VALUE”, the external table becomes invalid because of the duplicate column. The easiest way around this is to rename our source data’s VALUE column, since we can’t do anything about the Snowflake reserved column.

  • Now, in order to create the external table, we need to provide a schema defining the columns and their data types.

  • Snowflake can do this automatically for parquet and other file types, but not Delta, another fun limitation.

  • To overcome this one, we use Snowflake’s INFER_SCHEMA function and tell a little white lie that our source data is parquet.

  • Then we take the information it returns and use that as our Delta table schema instead. This works because Delta stores its data as parquet files under the hood, so the column schema is the same.

  • Now that we have all our prerequisite information, we run the CREATE EXTERNAL TABLE command (a simplified sketch of these steps follows this list).
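To make the flow above concrete, here is a minimal, simplified sketch of the individual statements the procedure effectively runs. It is not the full stored procedure; the stage, file format, table and column names are all hypothetical placeholders.

```sql
-- Hypothetical names throughout: @delta_stage, my_parquet_format,
-- my_delta_ext and the column list are placeholders for illustration.

-- 1. Sanity check: make sure the target directory exists and contains files.
LIST @delta_stage/sales/;

-- 2. Infer the column schema. Delta's data files are parquet, so we point
--    INFER_SCHEMA at the directory and pretend the source is plain parquet.
CREATE FILE FORMAT IF NOT EXISTS my_parquet_format TYPE = PARQUET;

SELECT COLUMN_NAME, TYPE
FROM TABLE(
  INFER_SCHEMA(
    LOCATION    => '@delta_stage/sales/',
    FILE_FORMAT => 'my_parquet_format'
  )
);

-- 3. Create the external table with the inferred columns. The reserved
--    VALUE column holds each raw record, so the source column that used to
--    be called VALUE has been renamed upstream (here: source_value).
--    Path element names (e.g. "order_id") must match the parquet column case.
CREATE OR REPLACE EXTERNAL TABLE my_delta_ext (
  order_id     NUMBER        AS (VALUE:"order_id"::NUMBER),
  source_value VARCHAR       AS (VALUE:"source_value"::VARCHAR),
  order_ts     TIMESTAMP_NTZ AS (VALUE:"order_ts"::TIMESTAMP_NTZ)
)
LOCATION     = @delta_stage/sales/
FILE_FORMAT  = (TYPE = PARQUET)
TABLE_FORMAT = DELTA
AUTO_REFRESH = FALSE;  -- Delta external tables are refreshed manually

-- 4. Register the parquet files referenced by the current Delta snapshot.
ALTER EXTERNAL TABLE my_delta_ext REFRESH;
```

In a full automation, step 3 would typically be generated by looping over the COLUMN_NAME / TYPE pairs returned in step 2 and concatenating them into the CREATE EXTERNAL TABLE statement.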

Why would one need to do this?

There are various technical and architectural problems this approach can solve in your environment. My main reason for going down this route was as follows.

We were not able to import the source data directly into Snowflake; instead, we had to stage it externally so that other data consumers could access it without being given access to Snowflake directly. At the same time, internal users who do have access to Snowflake had to be able to access the data as well, without duplicating it into Snowflake’s proprietary storage.

The external table feature was thus an ideal choice here, since it allows you to ingest your source data, persist it only once in Delta tables directly in cloud storage, and consume it from multiple downstream avenues, such as:

  • Internal Snowflake users: who can query the data in place, without it being copied into Snowflake’s proprietary storage

  • External consumers: who can be given access to the data as it sits in the cloud storage container and read it with their own compute, saving you cost on Snowflake.

Now, the downside is that you will not get the same performance with external tables as you do with “managed” tables in Snowflake (see the Snowflake documentation on external tables for reference).

This is something you will need to test in your environment with your data volumes to see whether this is a deal breaker.

This is too complicated, what other options do I have?

Another option you can explore for exposing data externally as well as internally is to use the Snowflake data marketplace.

  • This allows you to share data securely with other Snowflake customers, but again, a limitation is that it requires both you and your data consumers to be in the Snowflake ecosystem (a minimal sketch of the underlying sharing commands follows).
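For context, the marketplace is built on Snowflake’s Secure Data Sharing mechanism, which at its simplest looks roughly like the sketch below; every name here (database, schema, table, share and consumer account) is a hypothetical placeholder.

```sql
-- Hypothetical names throughout: analytics_db.reporting.orders,
-- partner_share and the consumer account identifier.

-- Create a share and expose a database, schema and table through it.
CREATE SHARE partner_share;
GRANT USAGE ON DATABASE analytics_db TO SHARE partner_share;
GRANT USAGE ON SCHEMA analytics_db.reporting TO SHARE partner_share;
GRANT SELECT ON TABLE analytics_db.reporting.orders TO SHARE partner_share;

-- Add the consumer's Snowflake account (they must already be a Snowflake customer).
ALTER SHARE partner_share ADD ACCOUNTS = partner_org.partner_account;
```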

One last option I can think of is to scrap the data lake intermediate layer, host everything in Snowflake, and simply export the data to your data lake regularly or as data changes.

  • This will give you the performance benefits mentioned earlier and reduce the solution complexity. To share the data externally, you export it to cloud storage containers using the COPY INTO command (sketched after this list).

  • This mechanism in Snowflake is extremely fast and can be quite cost effective, since the time taken to export the data is short even for large datasets (hundreds of millions of rows took less than 5 minutes in my previous experience).
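The export itself is a single COPY INTO <location> statement. A minimal sketch, assuming an external stage has already been created against your cloud storage container (all names are hypothetical):

```sql
-- Hypothetical names: @export_stage is an external stage pointing at your
-- cloud storage container; analytics_db.reporting.orders is the source table.
COPY INTO @export_stage/exports/orders/
FROM analytics_db.reporting.orders
FILE_FORMAT = (TYPE = PARQUET)   -- parquet keeps the exported files compact and typed
OVERWRITE   = TRUE;              -- replace the previous export in this path
```

From there, external consumers read the exported files with their own tools, and you can schedule the statement with a Snowflake task if it needs to run regularly.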

Closing Thoughts

There are an infinite number of combinations for how to design Lakehouse architectures to meet various needs. What works exceptionally well for one might be useless to another.

We as data geeks need to keep our minds open to all possibilities; you never know what limitation you will face next or what you will need to do to overcome it.

If you have other ideas on how to solve this particular problem, let me know!