Indexing

 Indexing

Introduction to Indexing:
TC for below queries:
select * from students; O(n)
select * from students where id = 100; O(n) without any indexing.


Data get stored in disks.

select * from students where id = 500; 

This data will be stored in memory over some location in some blocks.


In this case we need to search all the blocks and when data is found then we will send it to CPU through RAM.

If we anyway know the memory location for data then we don't need to search it and don't need to read all blocks of disk as reading disk is very time consuming task.


Latency Comparison Numbers (~2012)
----------------------------------
L1 cache reference 0.5 ns
Branch mispredict 5 ns
L2 cache reference 7 ns 14x L1 cache
Mutex lock/unlock 25 ns
Main memory reference 100 ns 20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy 3,000 ns 3 us
Send 1K bytes over 1 Gbps network 10,000 ns 10 us
Read 4K randomly from SSD* 150,000 ns 150 us ~1GB/sec SSD
Read 1 MB sequentially from memory 250,000 ns 250 us
Round trip within same datacenter 500,000 ns 500 us
Read 1 MB sequentially from SSD* 1,000,000 ns 1,000 us 1 ms ~1GB/sec SSD, 4X memory
Disk seek 10,000,000 ns 10,000 us 10 ms 20x datacenter roundtrip
Read 1 MB sequentially from disk 20,000,000 ns 20,000 us 20 ms 80x memory, 20X SSD
Send packet CA->Netherlands->CA 150,000,000 ns 150,000 us 150 ms
Notes
-----
1 ns = 10^-9 seconds
1 us = 10^-6 seconds = 1,000 ns
1 ms = 10^-3 seconds = 1,000 us = 1,000,000 ns
Credit
------
By Jeff Dean: http://research.google.com/people/jeff/
Originally by Peter Norvig: http://norvig.com/21-days.html#answers
Contributions
-------------
'Humanized' comparison: https://gist.github.com/hellerbarde/2843375
Visual comparison chart: http://i.imgur.com/k0t1e.png
Interactive Prezi version: https://prezi.com/pdkvgys-r0y6/latency-numbers-for-programmers-web-development/latency.txt


Now if we need to store this data by using any data structure for storing the memory location and data. So we can use Hash Map.

If we use this data structure then TC will reduce from O(n) to O(1).
We can store indexes using hashmap which takes only O(1) TC to search information in table.

select * from student where psp = 500;

After indexing psp column TC will be become O(1)
So as you can see that psp is not unique hence hashmap is not a good idea hence we need to use the hashmap concept and need to store data in array where key will be same for different users but there data will be stored in array for uniqueness.


This method will help is collision of data is not too high it means if psp 80 is obtained by too many people then in block or hash key we will have less memory address to search. But if this collision increase a lot then this method will become ineffective.

Some properties of Indexing and limitations with respect to data structures.
  • Indexes makes query lightning fast.
  • Using indexing we can increase performance of our DB
  • You can understand the power of indexing with this comparison that if a table is cofortable to read 1 million records in seconds then by using indexing on column of that table will increase speed to read 1 billion records in same time atleast and it can go beyond as per efficient usage.

Range Queries:

select * from students where psp between 40.2 and 60.5;

1. Can hashmap work on range based queries?
Storing of data in hashmap is not sorted by default as per hashmap nature.
So due to this nature if we use hashmap to store data and then ask to search as per range query then we need to search entire hashmap and memories associated them as a result TC will become O(n).

To resolve this issue we need to have a data structure which will store data in sorted order and in the form of key value pairs so that we can have advantage on searching O(n).
Data structure is know as Tree Map. So here we will use nodes to store key value pair and each node will link to another nodes and will sort the data.

So Tree Map (self balanced in nature): It is based on Balanced Binary Search Tree.


So by using this data structure we have TC as O(log n) where height of tree is n. So it means n will be total nodes and log n will be height of it.




So now can perform range based search in this type of data structure. So now for traversal we need to use inorder traversal.
As binary tree have nodes count as two for each parent node as max childs hence it's height is always calculated as log n with base 2.

So in this situation as you can see that we need to traversal the range for searching data with respect to height of data structure hence to reduce the height of data structure we need to use B+ tree concept where each parent node can have more than 2 child nodes as per requirement.


B trees means Balanced Tree and B+ Trees means balanced binary+ tree.


Some characteristics of B+ Trees:
  • It can have more than two child nodes for each parent nodea and even parent node can be part of multiple nodes more than 2.
  • Data always get stored in leaf nodes(end nodes).
  • It balances data in such a way that leaf nodes stay on same level(height).
  • Leaf nodes are also interconnected as linked list. Advantage is that you don't need to check parent node always to reach to other leaf nodes.
  • This is the most efficient way of reading the data.
  • This is the actual data structure which is used by SQL
  • Since we can have multiple children for every node, the overall height of B+ Trees < Tree Maps.
  • TC = O(H) < Tree map.

Notes:
By default indexing are created on primary key due to uniqueness but if you want you can have it on custom columns as well.

  • Clustered Indexing - Indexing on primary key.
  • Custom index on any other column: Non-clustered Indexing.

Pros and Cons of indexing:


Cons:
  • Extra space is needed
  • Write/update/delete operations will be slower. As we need to keep syncing data over memory and even in index metadata or list.
  • DBA are the specialized persons which will be used for creating indexing as they have database knowledge and know how a DBMS system is working.

Cardinality means - No. of unique values.
Selectivity- No. of unique values / no. of values.


As you can see in above scenario what we cannot index drinks as unique value is only 2 and records are in billion as a result if we index this column and a cluster of memory will be associated to each key as tea and coffee will be the only key. So indexing in this case will degrade the performance as you will force cpu to search for all the memories even though it is not needed due to wrong column indexing.


Indexing on multiple columns.

We can have indexing on n number of columns as per requirement.
This multi column can be a non unique column.

use sakila;

desc actor; -- It will provide complete detail about the table and indexes used in it.

show indexes from actor; -- It will show the indexed columns and its cardinality.

desc film;

select * from film
where title = 'AIRPORT POLLOCK'; -- We have index on title column hence we are having index advantage on it.

-- to check it we can use the keyword named explain

explain select * from film
where title = 'AIRPORT POLLOCK'; -- this keyword will tell that if a query is using index functionality for the column or not.
-- # id, select_type, table, partitions, type, possible_keys, key, key_len, ref, rows, filtered, Extra
-- '1', 'SIMPLE', 'film', NULL, 'ref', 'idx_title', 'idx_title', '514', 'const', '1', '100.00', NULL



explain select * from film
where description = 'A Fanciful Documentary of a Frisbee And a Lumberjack who must Chase a Monkey in A Shark Tank';
-- # id, select_type, table, partitions, type, possible_keys, key, key_len, ref, rows, filtered, Extra
-- '1', 'SIMPLE', 'film', NULL, 'ALL', NULL, NULL, NULL, NULL, '1000', '10.00', 'Using where'

-- summary as this column description is not using index hence need to check entire rows and if result matched then returned.
-- but in case of title no search is made just index information is used to fetch the data.
select * from film;

-- keyword MUL in desc means that a column is used for indexing and it has duplicate data in it.
-- keyword PRI means that column is primary key and have unique data and used for indexing.

explain analyze select * from customer
where last_name = 'DAVIS';
-- -> Index lookup on customer using idx_last_name (last_name='DAVIS')  (cost=1.1 rows=1) (actual time=0.647..0.651 rows=1 loops=1)
-- even though column will have duplicate information still index functionality will be used to give optimized result.
 
explain analyze select * from customer
where first_name = 'MARGARET';
-- -> Filter: (customer.first_name = 'MARGARET')  (cost=61.1 rows=59.9) (actual time=0.0771..0.654 rows=1 loops=1)
-- -> Table scan on customer  (cost=61.1 rows=599) (actual time=0.0705..0.582 rows=599 loops=1)
 
-- select multiple columns for data filteration. Search based on multiple columns

explain analyze select * from customer
where first_name = 'JANE' and last_name = 'DELVALLE';
-- -> Filter: (customer.first_name = 'JANE')  (cost=0.52 rows=0.2) (actual time=0.0711..0.0736 rows=1 loops=1)
 --    -> Index lookup on customer using idx_last_name (last_name='DELVALLE')  (cost=0.52 rows=2) (actual time=0.0655..0.0697 rows=2 loops=1)

explain select * from customer
where first_name = 'JANE' and last_name = 'DELVALLE';
-- # id, select_type, table, partitions, type, possible_keys, key, key_len, ref, rows, filtered, Extra
-- '1', 'SIMPLE', 'customer', NULL, 'ref', 'idx_last_name', 'idx_last_name', '182', 'const', '2', '10.00', 'Using where'

explain analyze select * from customer
where last_name = 'DELVALLE' and first_name = 'JANE';
-- -> Filter: (customer.first_name = 'JANE')  (cost=0.52 rows=0.2) (actual time=0.0483..0.0497 rows=1 loops=1)
--  -> Index lookup on customer using idx_last_name (last_name='DELVALLE')  (cost=0.52 rows=2) (actual time=0.0446..0.047 rows=2 loops=1)

select count(*), last_name from customer 
group by last_name
order by count(*) desc;

insert into customer values (600, 1 ,'Jane', 'DELVALLE', 'JANE.DELVALLE@sakilacustomer.org', 604, 1, '2006-02-14 22:04:37', '2006-02-15 04:57:20');

-- We have index on customer_id and on last_name and let us try to find a customer which has last name as DELVALLE and customer is as 600;
explain select * from customer
where customer_id = 600 and last_name = 'DELVALLE';
-- # id, select_type, table, partitions, type, possible_keys, key, key_len, ref, rows, filtered, Extra
-- '1', 'SIMPLE', 'customer', NULL, 'const', 'PRIMARY,idx_last_name', 'PRIMARY', '2', 'const', '1', '100.00', NULL

explain analyze select * from customer
where customer_id = 600 and last_name = 'DELVALLE';
-- -> Rows fetched before execution  (cost=0..0 rows=1) (actual time=100e-6..200e-6 rows=1 loops=1)
-- here customer_id will be used for indexing as it is a primary key and will have unique data hence it will be preferred.

Indexing on String:

Now suppose you have stored email id of a person and you use this email id for lot of queries so you though to create index on this column as it will be unique as well.

So there are different ways of doing it.

select * from details where email_id = 'puneetkumar.singh@gmail.com';

Different approaches for storing this email id in index.

  • Store entire email_id in index.
    • In this case if we have records in billions and more then space consumed by this index will be used due to string size and records count. This huge size may go to GBs and TBs.
  • Creating index on partial string data which will be unique.
    • In this case you can have duplicate of values as it will become unique once entire string will be created.
    • Example: abc@gmail.com and abc@scaler.com so here both are unique but abc is common but this abs will not be too many in number where scaler.com and gmail.com can be in millions as they are extention of emails and that will be common and as domain specific.
    • So abc count will be less then these domain names of email and will not be in millions.
    • In this case abc will be stored in multiple memory locations but in searchable range and can permit query to be run optimized manner.
    • Here will have considerable advantage on saving the tons of space.
    • So basically we usually create indexing on first 4 to 6 characters of any string as it itself make the indexing efficient and help to save lots of space.

-- syntax to create an index and to drop an index.
 
 -- create index idx_col1_col2
 -- on table_name (col1, col2);
 
 -- drop index idx_name on table_name;
 
 desc film;
 
 explain analyze select *
 from film
 where title = 'AGENT TRUMAN';
 -- > Index lookup on film using idx_title (title='AGENT TRUMAN')  (cost=0.35 rows=1) (actual time=0.0546..0.0611 rows=1 loops=1)
 
 -- dropping full text indexed title from table film.
 
 drop index idx_title on film;
 
explain analyze select *
from film
where title = 'AGENT TRUMAN';
-- > Filter: (film.title = 'AGENT TRUMAN')  (cost=103 rows=100) (actual time=0.0475..1.73 rows=1 loops=1)
     -- > Table scan on film  (cost=103 rows=1000) (actual time=0.0313..1.61 rows=1000 loops=1)
-- As we can see TC increased and cost also increased significantly with idx on 1 record was checked and fetched
-- but now entire table is searched.

-- re-creating index but this time by using only first 5 characters of title columns
create index idx_title on film(title(5));
 explain analyze select *
 from film
 where title = 'AGENT TRUMAN';
 -- > Filter: (film.title = 'AGENT TRUMAN')  (cost=0.35 rows=1) (actual time=0.0493..0.0555 rows=1 loops=1)
     -- > Index lookup on film using idx_title (title='AGENT TRUMAN')  (cost=0.35 rows=1) (actual time=0.0467..0.0528 rows=1 loops=1)
-- as you can see now we once again achieved the cost efficience similar or matching to full string indexing.
-- but now we are saving the space and created space efficient index on string so it a great advantage.
-- Hence proved that firt 4 to 6 character of any string is enough for indexing.     

-- trying to create indexes on 1 character, 2 character, 3 character , 4 , 5 and 6 and checking where perfomance
-- stabilizes and cost get saturated.
 drop index idx_title on film;
 create index idx_title on film(title(1));
 explain analyze select *
 from film
 where title = 'AGENT TRUMAN';
-- > Filter: (film.title = 'AGENT TRUMAN')  (cost=13.6 rows=46) (actual time=0.048..0.126 rows=1 loops=1)
     -- > Index lookup on film using idx_title (title='AGENT TRUMAN')  (cost=13.6 rows=46) (actual time=0.031..0.119 rows=46 loops=1)
 
 drop index idx_title on film;
 create index idx_title on film(title(2));
 explain analyze select *
 from film
 where title = 'AGENT TRUMAN';
   -- > Filter: (film.title = 'AGENT TRUMAN')  (cost=0.35 rows=1) (actual time=0.0291..0.0328 rows=1 loops=1)
     -- > Index lookup on film using idx_title (title='AGENT TRUMAN')  (cost=0.35 rows=1) (actual time=0.0277..0.0313 rows=1 loops=1)

 drop index idx_title on film;
 create index idx_title on film(title(10));
 explain analyze select *
 from film
 where title = 'AGENT TRUMAN';
 -- > Filter: (film.title = 'AGENT TRUMAN')  (cost=0.35 rows=1) (actual time=0.0552..0.0612 rows=1 loops=1)
     -- > Index lookup on film using idx_title (title='AGENT TRUMAN')  (cost=0.35 rows=1) (actual time=0.0523..0.0581 rows=1 loops=1)
 
 
 drop index idx_title on film;
 create index idx_title on film(title(6));
 explain analyze select *
 from film
 where title = 'AGENT TRUMAN';
 -- > Filter: (film.title = 'AGENT TRUMAN')  (cost=0.35 rows=1) (actual time=0.0393..0.0439 rows=1 loops=1)
     -- > Index lookup on film using idx_title (title='AGENT TRUMAN')  (cost=0.35 rows=1) (actual time=0.0375..0.042 rows=1 loops=1)
 -- hence proved that 4 to 6 characters are sufficient.

Comments

Popular posts from this blog

Aggregate Queries and Views

Schema Design

Transactions in SQL