I am creating this blog for some of the advanced SQL concepts which are generally not covered in normal blogs. So, if you are looking for some basic SQL tutorial, you can try other blogs like w3school and then come here for other concepts.
Windows Functions(aka Analytical SQL functions) –
A windows function runs a calculation on set of rows that are related to current row. A window can be defined using the over clause. The over clause can be used with –
- Partition By – Using this we can define the partition to form group of rows.
- Order By – Using this we can order the rows within a partition.
SELECT dept_name, emp_no, salary, avg(salary) OVER (PARTITION BY dept_name) FROM emp_salary;
–The above sql will calculate the average of salary by deptname. So, all rows with same deptname will have same avg salary.
The result will be similar to what you can expect from a group by clause with aggregate functions, but the major difference will be the number of rows. Group by will aggregate and reduce the number of rows, whereas the number of rows are not changed in the Partition by with Over clause.
The rows or range functions can also be used to determine the subset of rows within a partition. We need to specify the start and the end point with the rows and range functions.
-- specifies the entire result set from the partition
BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
-- specifies 5 rows, starting 4 rows prior to the current row through the current row from the partition
BETWEEN 4 PRECEDING AND CURRENT ROW
-- specifies all of the rows from the current row to the end of the partition
BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
-- specifies all of the rows from the start of the partition through the current row
UNBOUNDED PRECEDING
There may be use cases when you want to use data from preceding or following rows. In that case you can use LAG and LEAD. Just specify LAG(column_name, nth row)
Select employee_id, department_id, salary, lead(salary, 1,0) over (partition by department_id order by salary nulls last) next_salary, lag(salary, 1,0) over (partition by department_id order by salary nulls last) previous_salary from employees order by department_id;
You can use other aggregates similar to average with the OVER clause.
SELECT Sales_Order_ID, Product_ID, Order_Qty
,SUM(Order_Qty) OVER(PARTITION BY Sales_Order_ID) AS Total
,AVG(Order_Qty) OVER(PARTITION BY Sales_Order_ID) AS “Avg”
,COUNT(Order_Qty) OVER(PARTITION BY Sales_Order_ID) AS “Count”
,MIN(Order_Qty) OVER(PARTITION BY Sales_Order_ID) AS “Min”
,MAX(Order_Qty) OVER(PARTITION BY Sales_Order_ID) AS “Max”
FROM Sales.Sales_Order;
Along with the aggregations, you can also use Row number, Rank and dense rank functions. ROW_NUMBER() shows the rows of each row starting from 1. If the partition by is used then the row number will start again from 1 in that specific window. Then we have RANK which gives a rank based of the variable used to order. If the value is same for two rows, then the RANK would be the same whereas the ROW_NUMBER would be different. And finally DENSE_RANK which is slightly different than the RANK such that when rank repeats, the next rows are still counted and then the next rank would be a non consecutive number(example – 1,1,3). Whereas the dense rank would be consecutive (example – 1,2,3).
Sequences –
A Sequence is a user defined object which automatically generates unique numbers. It is shareable among different users and is mostly used to create primary key value. We can use sequences anywhere to generate unique numbers. We can increase the efficiency of using the Sequence by caching them in memory.
CREATE SEQUENCE [schema.]sequence_name
[ AS datatype ]
[ START WITH value ]
[ INCREMENT BY value ]
[ MINVALUE value | NO MINVALUE ]
[ MAXVALUE value | NO MAXVALUE ]
[ CYCLE | NO CYCLE ]
[ CACHE value | NO CACHE ];
Once a sequence is created, you can use it to call the values using the currval and nextval. Whenever you call the currval, it will provide you the current value of that sequence. And once the nextval is called it will provide you the new value and the current value will again be set to that.
Indexes –
Indexes are schema objects for speeding the retrieval of rows by using pointers. So, this is used to improve the performance of the queries. Because indexes have pointers it directly points to a specific row and does not need to go through all the rows. Index is made on a table, so if a table is dropped, the index on that table is also dropped. Indexes are used when we refer to the indexed column in the where clause.
CREATE [UNIQUE|BITMAP] INDEX index_name ON table_name (column_name1, ….);
Function Based Indexes – If we have a function referring to those columns, then the indexes do not run and a full table scan occurs. In these conditions we use the functions based indexes. These functions can be SQL or user defined functions
materialized views –
We use the materialized views for the below reasons –
- To reduce the complexity of the query.
- To improve the performance of the query.
- To limit the access to the tables.
Properties of a materialized view –
- It is like a table where the data is stored and can be accessed without running a query to the base tables like normal views.
- It removes the complex joins and stores the data in the materialized views and hence improves the performance.
- Since this table has stored data, it needs to be refreshed on timely basis which can be with a frequency or on demand.
CREATE MATERIALIZED VIEW mv_test2
WITH (distribution = hash(i_category_id), FOR_APPEND)
AS
SELECT MAX(i.i_rec_start_date) as max_i_rec_start_date, MIN(i.i_rec_end_date) as min_i_rec_end_date, i.i_item_sk, i.i_item_id, i.i_category_id
FROM syntheticworkload.item i
GROUP BY i.i_item_sk, i.i_item_id, i.i_category_id
Exists –
The exists operator is used to check if the record exists in the sub query. If the record exists then the sub query returns true.
SELECT column_name(s)
FROM table_name
WHERE EXISTS
(SELECT column_name FROM table_name WHERE condition);
with clause(Common Table expression) –
When we have a reference to same query block multiple times, it is better to use the with clause to improve the performance and readability. When we use a costly query more than once, then we should replace that with the with clause. The with clause saves the data in a temporary table space where the data is used from instead of querying from base tables.
WITH query_name1 AS (
SELECT column_names from table_name
)
, query_name2 AS (
SELECT column_names
FROM query_name1
…
)
SELECT …