JP Voogt: A Data Enthusiast's Ramblings

Learning TSQL for Analytics

Introduction

I have been asked to design a course outline for teaching aspiring data scientists T-SQL. For the last 5 years I have been leading a team of Data Specialists whose bread and butter is T-SQL, but most of them did not start out as specialists; they were hired with little or no previous data experience. So this should be easy, right? I have helped at least 10 people in the last 3 years become proficient in T-SQL. So where does one start? Well, I think there are many excellent paid and free courses out there in the wild. One, for instance, is an excellent video series by Kendra Little which you can find here.

But let's get back to the problem statement: how would one structure a course outline for the aspiring data scientist, who would probably spend most of their time in R/Python or even Scala? With the expansion of Massively Parallel Processing environments like Azure Synapse, Redshift and many more for SQL, and of Spark and Databricks environments for data analysis in languages like R/Python or Scala, the need for super efficient code is no longer that important. Don't get me wrong, efficiency is still important, but it can be added later.

Outline

Tools of the Trade

We need to decide if we want to store our data in our own environment, where we are in full control of security, or move the data to the cloud, where we do not have to focus on infrastructure and can concentrate on just the data and our data pipeline. We also need to have a look at what tools we want to use for exploring our data.

  1. Cloud vs On-Premises
    1. Cloud Options
      1. Creating an Azure Account and an Azure SQL Database
    2. On-Premises Options
      1. Installing SQL Server Express/Developer Edition
  2. Tools of the Trade (IDE)
    1. SQL Server Management Studio
    2. Azure Data Studio
    3. Visual Studio Code
    4. Spark SQL in Databricks
  3. How to start writing code
    1. Queries
    2. Notebooks

Introduction to the basics of the T-SQL language

Starting off with the basics of the T-SQL language: how to filter the data down to what we want to see. We will also look at what a database object is. A short example query pulling these pieces together follows the outline below.

  1. Database Objects
    1. Database -> Schemas -> Tables -> Columns -> Rows -> Record
  2. Working with the data
    1. Selecting and limiting
      1. SELECT
      2. TOP/LIMIT
    2. Filtering Data
      1. WHERE
    3. Inline Cleaning and Modifying of our dataset
      1. UDFs and Built-in Functions
    4. Grouping
      1. GROUP BY
      2. Aggregations (SUM/MIN/MAX/AVG)
    5. Ordering
      1. ORDER BY
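
A minimal sketch of how these pieces fit together, using a hypothetical dbo.Sales table (the table and column names are made up for illustration):

-- Top 10 regions by total sales since 2020, against a hypothetical dbo.Sales table
SELECT TOP (10)
       Region,
       SUM(Amount) AS TotalAmount,
       AVG(Amount) AS AvgAmount
FROM   dbo.Sales
WHERE  SaleDate >= '2020-01-01'
GROUP BY Region
ORDER BY TotalAmount DESC;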

Working with multiple datasets and changing data

In this module we will be working with multiple datasets and even make changes to the data that will persist. A small example query is shown after the outline below.

  1. Working with multiple datasets
    1. Joining datasets together
      1. LEFT/RIGHT/FULL JOIN
    2. SET operators
      1. UNION/INTERSECT/EXCEPT (MINUS in some other dialects)
  2. Building basic Processes
    1. Stored Procedures
    2. Views
  3. Data Modifications
    1. Adding new rows
      1. INSERT
      2. IMPORT
    2. Removing rows
      1. DELETE
    3. Changing records
      1. UPDATE
  4. Analytical Functions
    1. Window Functions
    2. Pivot and Unpivot data
    3. Analytical Aggregations
    4. Common Table Expressions (CTE)
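
To give a feel for where this module ends up, here is a small sketch that joins two hypothetical tables, wraps the result in a CTE and ranks rows with a window function (dbo.Customers and dbo.Orders are made-up names for illustration):

-- Rank each customer's orders by value, most expensive first
WITH CustomerOrders AS
(
    SELECT c.CustomerName,
           o.OrderID,
           o.OrderAmount
    FROM   dbo.Customers AS c
    LEFT JOIN dbo.Orders AS o
           ON o.CustomerID = c.CustomerID
)
SELECT CustomerName,
       OrderID,
       OrderAmount,
       ROW_NUMBER() OVER (PARTITION BY CustomerName ORDER BY OrderAmount DESC) AS OrderRank
FROM   CustomerOrders;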

Disclaimer: Content is accurate at the time of publication, however updates and new additions happen frequently which could change the accuracy or relevance. Please keep this in mind when using my content as guidelines. Please always test in a testing or development environment, I do not accept any liability for damages caused by this content.

If you liked this post, you can share it with your followers or follow me on Twitter!

Create SQL Code Snippets in Azure Data Studio

Introduction

I recently decided to make the move from SSMS to Azure Data Studio due to the fun additional features it brings to the table. I enjoy Visual Studio Code, and my understanding is that Azure Data Studio is built on the same codebase. It is the small things that hooked me, like Zen Mode, Source Control Integration and, of course, Notebooks. I will be writing a post on Notebooks soon.

So today I want to show you how to add your own code snippets. I use INFORMATION_SCHEMA quite often to find tables or columns in our databases, so that is the code snippet we will be creating today.

Prerequisites

  1. Have Azure Data Studio Installed

Creating our first Snippet

  1. Open Azure Data Studio :)
  2. Press “CTRL + SHIFT + P” to open the Command Palette
  3. Type “User Snippets” into the Command Palette and press “ENTER”
  4. Now type “SQL” and press “ENTER”
  5. This will open the “sql.json” file, and this is where you will add the code below

        "Find Object": {
            "prefix": "sqlInfo",
            "body": [
                "SELECT * FROM INFORMATION_SCHEMA.${1|COLUMN,TABLE|}S WHERE ${1|COLUMN,TABLE|}_NAME LIKE '%${2:Search}%'"
            ],
            "description": "Find Object Information"
        }
    
  6. Save the “sql.json” file and close it
  7. Now we finally get to use our newly created code snippet. Open an empty SQL script
  8. Type in the prefix we defined above, “sqlInfo”, and press “TAB”
  9. Now select Table or Column from the list we defined in our code above: choose what you are looking for, press “TAB” to accept, and press “TAB” again to move to the search criteria
  10. Type the table or column name you are looking for, then press “TAB” again to move out of the snippet context (the expanded result is shown below)
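
Once the snippet has been expanded and filled in, you end up with a query along these lines (COLUMN and the search term “Customer” are just example choices):

SELECT * FROM INFORMATION_SCHEMA.COLUMNS WHERE COLUMN_NAME LIKE '%Customer%'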

Official MS Docs for Snippets


Reading SAS files into Azure Databricks

Introduction

Working in an organization where we use so many different technologies, one of my biggest frustrations is working with SAS files (*.sas7bdat). These are relatively easy to read into SQL Server using the SAS ODBC Driver, but the majority of our workloads happen in either Azure Databricks or Azure Synapse, so I was on the lookout for a new and better way to get my data into Azure Databricks.

We can also achieve this with pandas.read_sas; however, I wanted to leverage the power of MPP (Massively Parallel Processing). Eventually I found an awesome package created by Forest Fang, but it took some fiddling to get it working for me.

Note: you have to mount your storage for this to work, for both the pandas and the Spark package approaches. Below, in the links, I have instructions on how to mount Azure Storage.

Prerequisites

  1. Azure Subscription
  2. Azure Databricks
  3. An Azure Storage Account mounted; you can follow this article

Installing the package

When in Azure Databricks, go to your clusters

Next go to “Install New”

Now go to “Search Packages”

Type “spark-sas7bdat” in the search bar and select the package by saurfang

Click the “Install” button

Once the installation is complete, please restart your cluster

Code

Now the fun part starts: actually reading our .sas7bdat files into DataFrames. Yes, you are reading this correctly, it really is that simple.

df = spark.read. \
      format("com.github.saurfang.sas.spark"). \
      load(<Enter the Full Path to your SAS file here>)

To save time, as I get hundreds of these files at a time, I wrote the below script to loop through my mounted storage point and import all of the files into a staging area on my Azure Synapse server.

# Loop over every SAS file in the mounted storage location
for x in dbutils.fs.ls("/mnt/<ENTER MOUNT POINT HERE>"):
  print(x.path)
  print(x.name.replace(".sas7bdat",""))
  # Read the SAS file into a Spark DataFrame
  df = spark.read. \
        format("com.github.saurfang.sas.spark"). \
        load(x.path)

  # Write the DataFrame to a Synapse staging table named after the file
  df.write \
    .format("com.databricks.spark.sqldw") \
    .option("url", jdbc) \
    .option("forwardSparkAzureStorageCredentials", "true") \
    .option("dbTable", x.name.replace(".sas7bdat","")) \
    .option("tempDir", blob) \
    .mode("overwrite") \
    .save()

For the df.write in the above code to work, you will have to configure a few things for “forwardSparkAzureStorageCredentials”; as a quick workaround, you can use the below to set the keys in the current session only. You will also require a temporary storage location for Azure Synapse to read the data from, as behind the scenes it uses the COPY statement to import the data into Azure Synapse.

For the recommended method, please go through this article

spark.conf.set("fs.azure.account.key.<Storage Account Name>.blob.core.windows.net", "<Storage Key>")
  
  
jdbc = "jdbc:sqlserver://<SynapseDBServer>.database.windows.net:1433;database=<DATABASE NAME>;user=<USERNAME>@<DATABASE NAME>;password=<PASSWORD>;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;"
blob = "wasbs://<CONTAINER NAME>@<STORAGE ACCOUNT NAME>.blob.core.windows.net/<DIR>"

Final Words

Please do not save passwords and keys in your Notebooks.


How to Mount Azure Storage to Azure Databricks

Introduction

I found multiple articles on how to mount an Azure Storage account in Azure Databricks, but most of them referred to using Azure Key Vault, which I do not want to set up at this time. So I decided to continue my quest, as I required the storage account and container to be mounted for me to read the *.sas7bdat files into a DataFrame. I also could not find clear instructions on how to use the Databricks CLI in Azure Databricks.

Prerequisites

  1. Azure Subscription
  2. Azure Storage Account with a Container ready
  3. Azure Databricks

Code

First we will need to generate a SAS Token, which we can achieve by following this guide by June Castillote here.

I prefer to create a new Scala Notebook for the next part, so that I can save it and remember how I achieved this in the future. With the below script we start off by creating all the required variables for the actual mounting function. For a more complete guide, see the link under External Links below for the full article by Gauri Mahajan.

val containerName = "<Enter Container Name Here>"
val storageAccountName = "<Enter Storage Account Here>"
val sas = "<Enter SAS Key Here>"
val config = "fs.azure.sas." + containerName+ "." + storageAccountName + ".blob.core.windows.net"

Once all the parameters are set, we can go ahead and mount the storage container in Azure Databricks. Remember to change the mount name in the below script.

dbutils.fs.mount(
  source = "wasbs://" + containerName + "@" + storageAccountName + ".blob.core.windows.net/",
  mountPoint = "/mnt/<Enter Unique Mount Name Here>/",
  extraConfigs = Map(config -> sas))

Data Exploration

Once we have our Storage Account mounted, we can start exploring the data in these mounts. First off, let's look at all of our mapped mounts.

display(dbutils.fs.mounts())

If you want to have a look at what files are in your Storage Account, you can use the below script.

dbutils.fs.ls("/mnt/<Enter Unique Mount Name Here>")

And if for any reason you want to unmount the Storage Account you can use the below script.

dbutils.fs.unmount("/mnt/<Enter Unique Mount Name Here>/")

Comprehensive Guide


Azure Synapse Query Monitoring

Introduction

I have gotten so used to Adam Machanic's sp_whoisactive that when we migrated our workloads to Azure Synapse I felt completely lost. So over time I started to compile my own little queries to inspect what is/was going on. It is far from perfect, but for me it was a great start. Please feel free to make changes to the project on GitHub.

Installation

You should be able to just open and run the sp_synapse_queries_combined.sql script; this will “deploy” 2 Stored Procedures:

* sp_synapse_queries
* sp_synapse_queries_deepdive

I wanted to combine both into 1, but unfortunately Azure Synapse does not yet allow us to create Stored Procedure parameters with default values.

Clone or Download Here

Permissions Required

GRANT VIEW DATABASE STATE
GRANT EXECUTE ON sp_synapse_queries
GRANT EXECUTE ON sp_synapse_queries_deepdive

How to Use

So let's get started: all you need to do is run the below SQL command.

EXEC dbo.sp_synapse_queries;

Ok great, what now? Well, now I explain (this will be shortened due to me not enjoying documentation): you will get the below output. The first result set is all the queries in the queue (YES, THAT MEANS IT IS NOT DOING ANYTHING), and the second result set is all the queries that are actually running at the moment.

You can also read Microsoft Docs on this, as some of the code came from there.

Column Name: JP's Description
session_id: Unique session ID for a query
request_id: Unique query ID within a session
login_name: Who is running this session
status: Is the query running or in a queue
running_time: How long, in seconds, the query has been running
time_in_queue: How long the query was in a queue before it started executing
submit_time: When you pressed F5
start_time: When it actually started
end_compile_time: When it finished creating the execution plan
end_time: When your query finished
label: Query Label
command: SQL command that is being executed
blocking_session_id: If populated, something is blocking your query
app_name: Application the session is coming from
resource_class: Resource Class, you can read up on this
deepdive: Magic statement we will be explaining

Yes, I knew you would not be able to stop thinking about the magic command, so what is it? Well, essentially it is just another stored procedure, sp_synapse_queries_deepdive, prepopulated with the basic deep-dive parameters.

EXEC dbo.sp_synapse_queries_deepdive @request_id = 'QID10741526', @distributions = 0, @tempdb = 0

If you run it as is, you will get 2 more result sets as below.

  1. The Query Steps, a nice way of looking at what step your query is busy with
  2. Waits, what your query has locks on and what it could be waiting on

If you want to dive even deeper, you can set the @distributions parameter to 1, this will then produce the below.

  1. The Query Steps on all distribution Nodes
  2. The Data Movement steps your query is busy with

Now, when you are feeling adventurous, turn on the @tempdb parameter and you will see what your query is doing to TempDB on Synapse.
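
As a quick reference, a deep-dive call with both options switched on would look like this (reusing the placeholder request ID from the earlier example):

EXEC dbo.sp_synapse_queries_deepdive @request_id = 'QID10741526', @distributions = 1, @tempdb = 1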

GitHub Project
