Managing files in a Fabric Lakehouse using the Python Azure SDK

In this blog post, you’ll see how to manage files in OneLake programmatically using the Python Azure SDK. Very little coding experience is required, and yet you can build some very useful automations this way. I started testing this approach while building a proof of concept. I was working on a virtual machine where database tables were exported to Parquet files on a daily basis, and these Parquet files needed to be moved to OneLake. However, at the time of writing, the on-premises data gateway for pipelines in Fabric was not yet fully supported. Instead of using a pipeline, I started thinking about uploading the files directly to OneLake from the machine and then continuing in Fabric for further processing. The process for doing this is described below.

Authentication and authorization configuration in Azure and Fabric

Step 1: Create an app registration (service principal) in Azure and add a client secret to it. Write down the tenant_id, client_id, and client_secret. A detailed guide on how to create the service principal can be found here:
https://learn.microsoft.com/en-us/entra/identity-platform/quickstart-register-app

Step 2: Add the service principal as a Contributor to your Fabric workspace.

Manage OneLake from other devices using Python

Establish the connection to OneLake

Step 1: First, create a JSON file with the Service Principal credentials. I put this file in the config folder of my project:

{
    "tenant_id": "<tenant_id>",
    "client_id": "<client_id>",
    "client_secret": "<client_secret>"
}

Step 2: Next, import the dependencies. Make sure they’re installed in the Python environment you’ll be using (the PyPI packages are azure-storage-file-datalake and azure-identity). From the Azure SDK we import the package to manage the data lake (OneLake), and we also import the package for authenticating with the Service Principal using a client secret. Lastly, the built-in json module is imported to read the config file created in step 1.

from azure.storage.filedatalake import DataLakeServiceClient
from azure.identity import ClientSecretCredential

import json

Step 3: Make a Credential object for the Service Principal.

config = json.load(open("config/service_principal.json"))
credential = ClientSecretCredential(
    tenant_id=config.get('tenant_id'),
    client_id=config.get('client_id'),
    client_secret=config.get('client_secret')
)

Step 4: Put the names of the workspace, the lakehouse, and the files directory in variables so they can easily be reused.

workspace = '<Name of the fabric workspace>'
lakehouse = '<Name of the lakehouse in the fabric workspace>'
files_directory = '<Name of the folder under files in the fabric lakehouse>'

Step 5: Create a DataLakeServiceClient object. This object sits at the OneLake level. Next, use it to create a FileSystemClient object, which sits at the workspace level. Once we have this, the preparation is done and we can start doing stuff. 😊

service_client = DataLakeServiceClient(account_url="https://onelake.dfs.fabric.microsoft.com/", credential=credential)
file_system_client = service_client.get_file_system_client(file_system=workspace)
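
As a quick sanity check that the connection and permissions are in order, you could list the items at the workspace root. This is a minimal sketch (not part of my original setup); each Fabric item, such as the lakehouse, shows up as a folder:

# Optional sanity check: list the items at the workspace root.
# Each Fabric item (e.g. '<lakehouse name>.Lakehouse') appears as a folder.
for item in file_system_client.get_paths(recursive=False):
    print(item.name, '(directory)' if item.is_directory else '(file)')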

Playtime!

Below are some examples of what we can do with our FileSystemClient object.

Example 1: List all the files and folders starting from a specific path in OneLake

paths = file_system_client.get_paths(path=f'{lakehouse}.Lakehouse/Files/{files_directory}')
for path in paths:
    print(path.name)

Example 2: Create a new (sub)folder on OneLake

new_subdirectory_name = 'test'
directory_client = file_system_client.create_directory(f'{lakehouse}.Lakehouse/Files/{files_directory}/{new_subdirectory_name}')

Example 3: Upload a file to OneLake

vm_file_path = r'C:\test\onelake\vm_test.csv'
onelake_filename = 'onelake_test.csv'
directory_client = file_system_client.get_directory_client(f'{lakehouse}.Lakehouse/Files/{files_directory}/test')
file_client = directory_client.get_file_client(onelake_filename)
with open(file=vm_file_path, mode="rb") as data:
    file_client.upload_data(data, overwrite=True)

Example 4: Download a file from OneLake

onelake_filename = 'onelake_test.csv'
vm_file_path = r'C:\test\onelake\download_onelake_test.csv'
directory_client = file_system_client.get_directory_client(f'{lakehouse}.Lakehouse/Files/{files_directory}/test')
file_client = directory_client.get_file_client(onelake_filename)
with open(file=vm_file_path, mode="wb") as local_file:
    download = file_client.download_file()
    local_file.write(download.readall())

Example 5: Append to a CSV file on OneLake

onelake_filename = 'onelake_test.csv'
text_to_be_appended_to_file = b'append this text!'
directory_client = file_system_client.get_directory_client(f'{lakehouse}.Lakehouse/Files/{files_directory}/test')
file_client = directory_client.get_file_client(onelake_filename)
# Append after the current end of the file, then flush to commit the new length
file_size = file_client.get_file_properties().size
file_client.append_data(text_to_be_appended_to_file, offset=file_size, length=len(text_to_be_appended_to_file))
file_client.flush_data(file_size + len(text_to_be_appended_to_file))

Example 6: Delete a file from OneLake

onelake_filename = 'onelake_test.csv'
directory_client = file_system_client.get_directory_client(f'{lakehouse}.Lakehouse/Files/{files_directory}/test')
file_client = directory_client.get_file_client(onelake_filename)
file_client.delete_file()

Example 7: Delete a directory from OneLake

directory_client = file_system_client.get_directory_client(f'{lakehouse}.Lakehouse/Files/{files_directory}/test')
directory_client.delete_directory()

The data is in OneLake. What’s next?

Let’s go back to the setup of my proof of concept. From my virtual machine, I uploaded the exported tables to a folder on OneLake named ‘Exports’, with each table in its own subfolder. It looked like this:

Files/
    Exports/
        Table1/*.parquet
        Table2/*.parquet
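
The upload itself can be done with the same pattern as Example 3. Here is a minimal sketch of how such a daily upload loop could look; the local export path is just an illustration, and it assumes the table folders already exist under Files/Exports (they can be created with create_directory as in Example 2):

from pathlib import Path

# Illustrative local export folder with one subfolder per table,
# mirroring the Files/Exports layout shown above.
local_export_root = Path(r'C:\exports')

for parquet_file in local_export_root.glob('*/*.parquet'):
    table_name = parquet_file.parent.name  # e.g. 'Table1'
    directory_client = file_system_client.get_directory_client(
        f'{lakehouse}.Lakehouse/Files/Exports/{table_name}')
    file_client = directory_client.get_file_client(parquet_file.name)
    with open(parquet_file, mode='rb') as data:
        file_client.upload_data(data, overwrite=True)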

To expose the data in Lakehouse Tables, I made a very simple pipeline. It starts by building a list of all the exported tables by looking at the Exports folder. Next, a ForEach loop runs a Copy activity for each table to load its data into a Lakehouse table. In my setup, the tables are overwritten on every load.

The structure of the pipeline is shown below:

Figure 1: Fabric Pipeline Configuration
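
By the way, if you prefer scripting over a pipeline activity for that first step, the list of exported tables can also be built with the same FileSystemClient. A minimal sketch, assuming the Exports layout shown above:

# Build the list of exported tables by listing the top-level
# folders under Files/Exports (non-recursive).
paths = file_system_client.get_paths(path=f'{lakehouse}.Lakehouse/Files/Exports', recursive=False)
exported_tables = [p.name.rsplit('/', 1)[-1] for p in paths if p.is_directory]
print(exported_tables)  # e.g. ['Table1', 'Table2']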

Conclusion

As you can see, it’s very easy to manage files on OneLake with Python, especially if you’ve used the Python Azure SDK for data lakes before. OneLake can be approached in so many ways that there’s a lot of room for creativity when designing your architecture. I would definitely recommend playing around with it. Have fun!