Saturday, April 24, 2021

Azure Data Lake Storage Gen2 Query Acceleration

Ahsan Siddique

Introduction

 
Recently, I was upgrading the architecture of an IoT application built on IoT Hub and Event Hub. I had been dumping the devices' telemetry data into an on-premises Postgres database, but the data was growing rapidly day by day, so I decided to move all of my old data to Azure Data Lake Storage Gen2. However, I still needed that old data in my application for certain scenarios. While searching for the best way to access specific data from the data lake, I found an optimal solution for querying specific records from my files.
 

Azure Data Lake Storage Query Acceleration

 
Query acceleration is used to retrieve a subset of data from your storage account. It supports CSV- and JSON-formatted data as input to each request, and it enables applications and analytics frameworks to dramatically optimize data processing by retrieving only the data they require to perform a given operation.
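For example, suppose a blob contains one JSON object per line of telemetry. Rather than downloading the whole blob, you send a SQL-like statement and the service returns only the matching rows. The query dialect always selects FROM BlobStorage; the predicate below is the same one used in the code later in this article:

  SELECT * FROM BlobStorage WHERE measuringpointid = 547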
 

Data flow

 
The following diagram illustrates how a typical application uses query acceleration to process data.
 
 
Prerequisites
  1. Create a general-purpose v2 storage account
  2. Enable query acceleration
  3. Azure Storage .NET SDK
Road Map
  • Azure Login
  • Enable Query Acceleration
  • Create an Azure Resource Group
  • Create Azure Data Lake Storage Gen2 Account
  • Create Azure Data Lake Container
  • Upload JSON File
  • Create a new Console App in Visual Studio 2019
  • Add References to NuGet Packages
  • Query Acceleration Main Logic

Step 1 - Login Using Azure Portal OR Azure CLI

 
Log in to the Azure portal or through the Azure CLI; I will show you both ways. Open your command prompt and use the following command to log in to Azure. Make sure you have already installed the Azure CLI on your machine.
  az login
 
After logging in, you will see all of your active subscriptions in the output, and you need to set the subscription for the current context. To do so, use the following command. I have multiple subscriptions and will use my Azure Pass sponsorship for this context.
  az account set --subscription "Visual Studio Enterprise Subscription"   # use your subscription name
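If you are not sure of the exact subscription name, you can list all of your subscriptions first with a standard Azure CLI command:

  az account list --output table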
 

Step 2 - Enable Query Acceleration

  az feature register --namespace Microsoft.Storage --name BlobQuery
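Feature registration can take a few minutes to propagate. You can check its state, and once it shows as Registered, refresh the Microsoft.Storage resource provider; both of these are standard Azure CLI commands:

  az feature show --namespace Microsoft.Storage --name BlobQuery
  az provider register --namespace Microsoft.Storage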

Step 3 - Create an Azure Resource Group Using Portal and CLI

 
As you know, we are already logged in through the CLI. Now we will create a resource group for the Azure Data Lake Storage Gen2 account; we will keep all of our resources in this resource group. Use the following command to create it.
  az group create --name "datalake-rg" --location "centralus"
 

Step 4 - Create Azure Data Lake Storage Gen2 Account

 
Click Create a resource, search for "storage account", and choose Storage account from the results. You can also create the storage account with the Azure CLI using the following command.
  az storage account create --name <STORAGEACCOUNTNAME> \
      --resource-group <RESOURCEGROUPNAME> \
      --location eastus --sku Standard_LRS \
      --kind StorageV2 --hierarchical-namespace true
 
Choose your subscription, resource group, and storage account name, and click Next: Networking >.
 
 
On the Networking tab, the defaults are already selected; you may want different options, but for this demo keep the defaults and hit Next: Data protection >.
 
 
On the Data protection tab, the settings are optional; for this demo, also keep the defaults and hit Next: Advanced >.
 
 
On the Advanced tab, keep the defaults as well, but set the Data Lake Storage Gen2 hierarchical namespace option to Enabled, then hit Next: Tags > and finally Review + create.
 
 
After the deployment completes, open the newly created Azure Data Lake Storage Gen2 account.
 
 

Step 5 - Create Azure Data Lake Container

 
Select the Containers option from the left navigation, click the + Container button, and create a new container.
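If you prefer to stay in the CLI, the container can also be created with the standard command below; the container name "telemetry" is just an example:

  az storage container create --account-name <STORAGEACCOUNTNAME> --name telemetry --auth-mode login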
 
 

Step 6 - Upload Json File

 
Open the newly created container and upload a JSON file into it. I have also attached a sample JSON file at the top of this article; you can download it and use it for your learning.
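The upload can likewise be done from the CLI. This is a minimal example, and the file name sample.json is an assumption; use the path of your own JSON file:

  az storage blob upload --account-name <STORAGEACCOUNTNAME> --container-name telemetry --name sample.json --file sample.json --auth-mode login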
 

Step 7 - Create a new Console App in Visual Studio 2019

 
Open Visual Studio, create a new project, and choose Console App (.NET Core) with C#.
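If you prefer the command line, the same project can be scaffolded with the dotnet CLI; the project name here matches the namespace used in the code later:

  dotnet new console -n AzureDataLake.QueryAcceleration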
 
 

Step 8 - Add References to NuGet Packages

 
First of all, add references to the Azure.Storage.Blobs and Newtonsoft.Json packages using the NuGet Package Manager, as shown below.
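Alternatively, you can add both packages from the project directory with the dotnet CLI:

  dotnet add package Azure.Storage.Blobs
  dotnet add package Newtonsoft.Json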
 
 

Step 9 - Query Acceleration Main Logic

 
Open the Program class and replace its contents with the following code, keeping an appropriate namespace. Grab the connection string from your storage account and substitute it into the code, and use your own container name and blob prefix as well.
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;
using Azure.Storage.Blobs.Specialized;
using Newtonsoft.Json;
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Threading.Tasks;

namespace AzureDataLake.QueryAcceleration
{
    class Program
    {
        static void Main(string[] args)
        {
            MainAsync().Wait();
        }

        private static async Task MainAsync()
        {
            // Replace with the connection string of your own storage account.
            var connectionString = "<YOUR-STORAGE-CONNECTION-STRING>";
            var blobServiceClient = new BlobServiceClient(connectionString);
            var containerClient = blobServiceClient.GetBlobContainerClient("ContainerName");

            // Enumerate the blobs in the container that start with the given prefix.
            await foreach (var blobItem in containerClient.GetBlobsAsync(BlobTraits.Metadata, BlobStates.None, "File Prefix"))
            {
                var blobClient = containerClient.GetBlockBlobClient(blobItem.Name);

                // Both the input blob and the query output are JSON.
                var options = new BlobQueryOptions
                {
                    InputTextConfiguration = new BlobQueryJsonTextOptions(),
                    OutputTextConfiguration = new BlobQueryJsonTextOptions()
                };

                // The filter runs inside the storage service; only matching rows are returned.
                var result = await blobClient.QueryAsync(@"SELECT * FROM BlobStorage WHERE measuringpointid = 547", options);

                var jsonString = await new StreamReader(result.Value.Content).ReadToEndAsync();

                Console.WriteLine(jsonString);
                Console.ReadLine();
            }
        }
    }
}
Run your console app and you will see the matching JSON records printed in the console.
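Since Newtonsoft.Json is already referenced, you can go one step further and parse each returned line instead of printing the raw text. This is a minimal sketch that assumes the query result contains one JSON object per line; it would replace the StreamReader block inside the loop above:

  using (var reader = new StreamReader(result.Value.Content))
  {
      string line;
      while ((line = await reader.ReadLineAsync()) != null)
      {
          // Parse the line with Newtonsoft.Json.Linq and pick out a field.
          var record = Newtonsoft.Json.Linq.JObject.Parse(line);
          Console.WriteLine(record["measuringpointid"]);
      }
  }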