Managing files in a Fabric Lakehouse using the Python Azure SDK
In this blog post, you’ll see how to manage files in OneLake programmatically using the Python Azure SDK. Very little coding experience is required, and yet you can build some very interesting automations this way. I started testing this approach while building a proof of concept. I was working on a virtual machine where database tables were exported to Parquet files on a daily basis, and these Parquet files needed to be moved to OneLake. However, at the time of writing, the on-premises data gateway for pipelines in Fabric was not yet fully supported. Instead of using a pipeline, I started thinking about uploading the files directly to OneLake from the machine and then continuing in Fabric for further processing. The process for doing this is described below.
- Authentication and authorization configuration in Azure and Fabric
- Manage OneLake from other devices using Python
- The data is in OneLake. What’s next?
- Conclusion
Authentication and authorization configuration in Azure and Fabric
Step 1: Create an app registration (service principal) in Azure and create a client secret.
Write down the tenant_id, client_id and client_secret. A detailed guide on how to create the service principal can be found here:
https://learn.microsoft.com/en-us/entra/identity-platform/quickstart-register-app
Step 2: Add the service principal as contributor to your Fabric workspace.
Manage OneLake from other devices using Python
Establish the connection to OneLake
Step 1: Create a JSON file with the Service Principal credentials. I put this file in the config folder of my project.
{
    "tenant_id": "<tenant_id>",
    "client_id": "<client_id>",
    "client_secret": "<client_secret>"
}
Step 2: Next, import the dependencies. Make sure the azure-storage-file-datalake and azure-identity packages are installed in the Python environment you’ll be using. From the Azure SDK we import the package to manage the data lake (OneLake), and we also import the package for authenticating with the Service Principal using a client secret. Lastly, the json package is imported to read the config file created in step 1.
from azure.storage.filedatalake import DataLakeServiceClient
from azure.identity import ClientSecretCredential

import json
Step 3: Make a Credential object for the Service Principal.
config = json.load(open("config/service_principal.json"))
credential = ClientSecretCredential(
    tenant_id=config.get('tenant_id'),
    client_id=config.get('client_id'),
    client_secret=config.get('client_secret')
)
Step 4: Put the names of the workspace, the lakehouse and the files folder in variables so they can be easily reused.
workspace = '<Name of the fabric workspace>'
lakehouse = '<Name of the lakehouse in the fabric workspace>'
files_directory = '<Name of the folder under Files in the fabric lakehouse>'
Step 5: Create a DataLakeServiceClient object. This is an object at the OneLake level. Next, use this object to create a FileSystemClient object. The FileSystemClient is at the workspace level. Once we have this, the preparation is done. Now we can start doing stuff. 😊
service_client = DataLakeServiceClient(account_url="https://onelake.dfs.fabric.microsoft.com/", credential=credential)
file_system_client = service_client.get_file_system_client(file_system=workspace)
Playtime!
Below are some examples of what we can do with our FileSystemClient object.
Example 1: List all the folders starting from a specific path in OneLake
paths = file_system_client.get_paths(path=f'{lakehouse}.Lakehouse/Files/{files_directory}')
for path in paths:
    print(path.name)
Example 2: Create a new (sub)folder on OneLake
new_subdirectory_name = 'test'
directory_client = file_system_client.create_directory(f'{lakehouse}.Lakehouse/Files/{files_directory}/{new_subdirectory_name}')
Example 3: Upload a file to OneLake
vm_file_path = r'C:\test\onelake\vm_test.csv'
onelake_filename = 'onelake_test.csv'
directory_client = file_system_client.get_directory_client(f'{lakehouse}.Lakehouse/Files/{files_directory}/test')
file_client = directory_client.get_file_client(onelake_filename)
with open(file=vm_file_path, mode="rb") as data:
    file_client.upload_data(data, overwrite=True)
Example 4: Download a file from OneLake
onelake_filename = 'onelake_test.csv'
vm_file_path = r'C:\test\onelake\download_onelake_test.csv'
directory_client = file_system_client.get_directory_client(f'{lakehouse}.Lakehouse/Files/{files_directory}/test')
file_client = directory_client.get_file_client(onelake_filename)
with open(file=vm_file_path, mode="wb") as local_file:
    download = file_client.download_file()
    local_file.write(download.readall())
Example 5: Append to a CSV file on OneLake
onelake_filename = 'onelake_test.csv'
text_to_be_appended_to_file = b'append this text!'
directory_client = file_system_client.get_directory_client(f'{lakehouse}.Lakehouse/Files/{files_directory}/test')
file_client = directory_client.get_file_client(onelake_filename)
# Append at the current end of the file, then flush to commit the new data
file_size = file_client.get_file_properties().size
file_client.append_data(text_to_be_appended_to_file, offset=file_size, length=len(text_to_be_appended_to_file))
file_client.flush_data(file_size + len(text_to_be_appended_to_file))
Example 6: Delete a file from OneLake
onelake_filename = 'onelake_test.csv'
directory_client = file_system_client.get_directory_client(f'{lakehouse}.Lakehouse/Files/{files_directory}/test')
file_client = directory_client.get_file_client(onelake_filename)
file_client.delete_file()
Example 7: Delete a directory from OneLake
directory_client = file_system_client.get_directory_client(f'{lakehouse}.Lakehouse/Files/{files_directory}/test')
directory_client.delete_directory()
The data is in OneLake. What’s next?
Let’s go back to the setup of my proof of concept. On my virtual machine, I uploaded the exported tables to a folder on OneLake named ‘Exports’, with a subfolder per table (a sketch of such an upload loop follows the folder layout below). The structure looked like this:
Files/
    Exports/
        Table1/*.parquet
        Table2/*.parquet
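As an illustration, here is a minimal sketch of what that daily upload loop could look like, reusing the file_system_client from the steps above and the upload pattern from Example 3. The local export root C:\exports and its per-table subfolders are assumptions made for this sketch, not part of the original setup.

from pathlib import Path

# Hypothetical local folder where the database exports land; one subfolder per table
local_export_root = Path(r'C:\exports')

for table_folder in local_export_root.iterdir():
    if not table_folder.is_dir():
        continue
    # Create (or reuse) the matching table folder under Files/Exports in the lakehouse
    directory_client = file_system_client.create_directory(f'{lakehouse}.Lakehouse/Files/Exports/{table_folder.name}')
    for parquet_file in table_folder.glob('*.parquet'):
        file_client = directory_client.get_file_client(parquet_file.name)
        with open(parquet_file, mode='rb') as data:
            file_client.upload_data(data, overwrite=True)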
To expose the data as Lakehouse tables, I made a very simple pipeline. It starts by creating a list of all the exported tables by looking at the Exports folder. Next, a ForEach loop runs a Copy activity for each table to copy the data into a Lakehouse table. In my setup, the tables are overwritten on every load.
The structure of the pipeline is shown below:
Conclusion
As you can see, it’s very easy to manage files on OneLake with Python, especially if you have used the Python Azure SDK for data lakes before. OneLake can be accessed in countless ways, so there is a lot of room for creativity when designing your architectures. I would definitely recommend playing around with it. Have fun!