Saturday, 18 January 2014

What is Fill Factor?

                                                       Fill factor is the value that determines the percentage of space on each leaf-level page to be filled with data. In SQL Server, the smallest unit of storage is the page, which is 8KB in size. Each page can store one or more rows, depending on the size of the rows. The default value of the Fill Factor is 100, which is the same as the value 0. The default Fill Factor (100 or 0) lets SQL Server fill the leaf-level pages of an index with the maximum number of rows that fit, so there is little or no empty space left on the page when the Fill Factor is 100.
                                                    If a page is completely full and new data must be inserted into it, a “page split” event happens to accommodate the new data: SQL Server splits the page into two pages, dividing the data roughly in half, so one completely filled page becomes two half-filled pages. Once the split is over, the new data is inserted in its logical place. This page split process is expensive in terms of resources: there is a lock on the page for a brief period, and more storage space is used to hold the same amount of data. The usual argument is that this does not matter because storage is cheap nowadays, so let us talk about Fill Factor in terms of IO in the next paragraph.
                                  
                It is absolutely true that storage is getting cheaper every day. Now let us assume that there is Table 1, which contains data worth 1000 pages, and all these pages are 100% full. If we run a query to retrieve all the data, SQL Server has to read all 1000 pages. If the pages were only 50% full, Table 1 would need 2000 pages, which means SQL Server has to read twice the amount of data from disk, leading to higher usage of memory, CPU and IO bandwidth.
                                The previous paragraph gives the impression that a Fill Factor of 100 is the best option, since there is no empty space in the pages and IO is always optimal. That is true only if we never insert, update or delete anything in the table. As soon as we insert new data, page splits happen and some pages end up only 50% full. For an OLTP system that is continuously modified, choosing the right Fill Factor has always been a challenge. Additionally, note that Fill Factor only applies when the index is originally created or rebuilt; it does not apply subsequently when page splits happen. As discussed earlier, page splits increase IO as well as storage space.
                         So, a high Fill Factor on a high-transaction server implies more page splits. A lower Fill Factor can help us reduce the number of page splits, because new data can be accommodated in a page's free space right away, without much difficulty and without further splits. We can measure page splits by watching the performance monitor counter “SQLServer:Access Methods: Page Splits/sec”.
 Additionally, you can measure the same using the following system query.
SELECT cntr_value
FROM master.dbo.sysperfinfo
WHERE counter_name = 'Page Splits/sec'
AND object_name LIKE '%Access Methods%'
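
On newer builds you can read the same counter from the sys.dm_os_performance_counters DMV instead of the legacy sysperfinfo view; a minimal sketch (on a named instance the object_name prefix differs, so verify it on your server):

SELECT object_name, counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Page Splits/sec'
AND object_name LIKE '%Access Methods%'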

Fill Factor can be checked at the server level as well as at the table/index level. Below we have the scripts for both.
Here is the script to check the Fill Factor at the server level:
SELECT *
FROM sys.configurations
WHERE name = 'fill factor (%)'
And, here is the script to check the Fill Factor at the table/index level:
USE YourDatabaseName;
SELECT OBJECT_NAME(object_id) AS Name, type_desc, fill_factor
FROM sys.indexes
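
To actually set a Fill Factor, you specify it when creating or rebuilding an index. A minimal sketch, with hypothetical table, column and index names and an illustrative value of 80:

CREATE INDEX IX_MyTable_Col1 ON MyTable (Col1) WITH (FILLFACTOR = 80)

-- Or change it on an existing index by rebuilding it:
ALTER INDEX IX_MyTable_Col1 ON MyTable REBUILD WITH (FILLFACTOR = 80)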

Friday, 17 January 2014

How Does SQL Server Store Data?

                        This is one of the important topics for a database developer. We all know data is stored in tables in SQL Server, and all of us are interested to know a little more about it: how does SQL Server store data in a table? I will try to explain it in simple words.
                      Microsoft SQL Server databases are stored on disk in two files: a data file and a log file.
The data file has the extension .MDF (the primary data file) and the log file has the extension .LDF (the transaction log file). The next question is: what kind of data is stored in the .MDF and .LDF files?

What the .MDF file stores :
                                     The SQL MDF file contains the database objects: tables, stored procedures, user accounts and so on. The MDF file is "attached" to the database and is created automatically when the administrator creates a new database. The file continues to grow each time a user creates a new record in the tables. The user tables are the main storage objects in the MDF file, and these objects can hold millions of records for the company.

                                      
MDF files can grow to several megabytes, and databases with millions of rows can grow into gigabytes of information. For this reason, database administrators who maintain large enterprise servers should ensure that the hard drive has enough storage space to hold the growing MDF files. This is especially important when the hard drive houses several databases for the company; with each MDF file growing to several gigabytes, hard drive space is quickly consumed. The database administrator can also copy the MDF file to an external media device such as an external hard drive, USB flash drive, CD or DVD as a simple backup, although regular SQL Server backups are the safer option.

What the .LDF file stores :
                                    LDF is the file extension for the transaction log file used by Microsoft SQL Server. LDF files contain logging information for all transactions completed by the server. They time-stamp every transaction against the SQL Server database, allowing the database to be easily recoverable in the case of data loss.
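
You can list the data and log files of the current database with a simple catalog query; a minimal sketch (size is reported in 8KB pages):

SELECT name, type_desc, physical_name, size
FROM sys.database_files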

For example :
           I am creating a table named MyMemory. Now I am going to explain how its data will be stored internally.


 Query : 
              Create table MyMemory (ID int identity(1,1),Name varchar(100))

 Insert :
        Insert into MyMemory (Name) values  ('Shrikant')
        Insert into MyMemory (Name) values  ('Sai')
        Insert into MyMemory (Name) values  ('Santosh')
        Insert into MyMemory (Name) values  ('deelip')
Now, this data is stored in the table. It will look like:
      
ID     Name
1      Shrikant
2      Sai
3      Santosh
4      deelip

         We are interested in knowing how it is stored internally. We have DBCC commands which will help us see how the data is stored internally.

Data Files Are Broken Up Into 8KB Pages 

           These pages are the smallest unit of storage both in memory and on disk.  When we write the very first row into a table, SQL Server allocates an 8KB page to store that row – and maybe a few more rows, depending on the size of our data.  In our MyMemory example, each of our rows is small, so we can cram a bunch of them onto a page.  If we had bigger rows, even one row might take up multiple pages.  For example, if you added a VARCHAR(MAX) field and stuffed it with data, it would span multiple pages.
Each page is dedicated to just one table.  If we add several different small tables, they’ll each be stored on their own pages, even if they’re really small tables.
DBCC IND('Shri_Test_17012014', 'MyMemory', -1);
 -->Database name - Shri_Test_17012014
--> Table Name - MyMemory

            The result of this command tells us how many pages the table has occupied for storing data. The table above has occupied 2 pages, i.e. 16KB (note that one of these is an IAM allocation-tracking page rather than a pure data page).
 
 PagePID     ObjectID        Table Name [OBJECT_NAME(object_id)]
 611164      1280827725      MyMemory
 611163      1280827725      MyMemory
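
To peek inside one of those pages we can use the undocumented DBCC PAGE command; a sketch, assuming the PagePID values above live in file 1 – take the file id from your own DBCC IND output:

DBCC TRACEON(3604)  -- route DBCC output to the client instead of the error log
DBCC PAGE('Shri_Test_17012014', 1, 611164, 3)  -- database, file id, page id, detail level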

How the Data File and Log File are Accessed :

                      Data files are a jumbled mess of stuff.  You've got tables and pages all over the place, and your users are making unpredictable changes all over the place.  SQL Server's access to data files tends to be random, and it is a combination of both reads and writes.  The more memory your server has, the fewer data file reads happen – SQL Server caches the data pages in memory and works off that cache rather than reading the disk over and over.  This is why we often suggest stuffing your SQL Server with as much memory as you can afford; it's cheaper than buying good storage.
                    Log files are written to sequentially, start to finish.  SQL Server doesn’t jump around – it just makes a little to-do list and keeps right on going.  Eventually when it reaches the end of the log file, it’ll either circle back around to the beginning and start again, or it’ll add additional space at the end and keep on writing.  Either way, though, we’re talking about sequential writes.  It’s not that we never read the log file – we do, like when we perform transaction log backups.  However, these are the exception rather than the norm.
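
Those occasional log reads happen, for example, during a transaction log backup; a minimal sketch, assuming the database is in the FULL recovery model and using a hypothetical backup path:

BACKUP LOG Shri_Test_17012014
TO DISK = 'C:\Backup\Shri_Test_17012014.trn'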

Tuesday, 10 December 2013

Difference Between Index Scan and Index Seek.

Index Scan :
                   Index Scan scans each and every record in the index. A table scan is where the table is
processed row by row from beginning to end. If the index is a clustered index, then an
index scan is really a table scan. Since a scan touches every row in the table whether or
not it qualifies, the cost is proportional to the total number of rows in the table.
Hence, a scan is an efficient strategy only if the table is small.

Index Seek :
                   Since a seek only touches rows that qualify and pages that contain these qualifying
 rows, the cost is proportional to the number of qualifying rows and pages rather than to
 the total number of rows in the table.

               Before we go over the concepts of scan and seek, we need to understand what SQL Server does before applying any index to a query. When any query is run, SQL Server has to determine whether a particular index can be applied to that particular query or not. SQL Server uses the search predicates to make that decision right before applying indexes to any given query.

Predicate is an expression that evaluates to TRUE, FALSE, or UNKNOWN. Predicates are used in the search condition of WHERE clauses and HAVING clauses, the join conditions of FROM clauses, and other constructs where a Boolean value is required.

Suppose,
TableX has five columns : Col1,Col2,Col3,Col4,Col5
Index1 on TableX contains two columns : Col2,Col3
Query1 on TableX retrieves two columns : Col1,Col3

          Now when Query1 is run on TableX, SQL Server looks at the columns it references (Col1, Col3) to figure out whether it can use Index1. As Col1 and Col3 of Query1 are not the same as Col2 and Col3 of Index1, there is a good chance that Query1 will not use Index1, and this scenario will end up in a table scan. If Query1 had been able to use Index1, it could have resulted in an index seek.
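
The difference is easy to observe in the actual execution plan; a minimal sketch using the hypothetical TableX and Index1 from above:

CREATE TABLE TableX (Col1 int, Col2 int, Col3 int, Col4 int, Col5 int)
CREATE INDEX Index1 ON TableX (Col2, Col3)

-- Col2 is the leading key of Index1, so this query can use an index seek
SELECT Col2, Col3 FROM TableX WHERE Col2 = 10

-- Col1 is not in Index1, so this query ends up scanning
SELECT Col1, Col3 FROM TableX WHERE Col1 = 10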

Friday, 6 December 2013

What is BigData?

                                         Big data is a buzzword, or catch-phrase, used to describe a massive volume of both structured and unstructured data that is so large that it's difficult to process using traditional database and software techniques.
While the term may seem to reference the volume of data, that isn't always the case. The term big data – especially when used by vendors – may refer to the technology (including the tools and processes) an organization requires to handle large amounts of data and its storage facilities.
The term big data is believed to have originated with Web search companies who had to query very large distributed aggregations of loosely-structured data.

An Example of Big Data

An example of big data might be petabytes (1,024 terabytes) or exabytes (1,024 petabytes) of data consisting of billions to trillions of records of millions of people -- all from different sources (e.g. Web, sales, customer contact center, social media, mobile data and so on). The data is typically loosely structured data that is often incomplete and inaccessible.
When dealing with larger datasets, organizations face difficulties in being able to create, manipulate, and manage big data. Big data is particularly a problem in business analytics because standard tools and procedures are not designed to search and analyze massive datasets.


What is Hadoop?

                                                                Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation.
Hadoop makes it possible to run applications on systems with thousands of nodes involving thousands of terabytes. Its distributed file system facilitates rapid data transfer rates among nodes and allows the system to continue operating uninterrupted in case of a node failure. This approach lowers the risk of catastrophic system failure, even if a significant number of nodes become inoperative.
Hadoop was inspired by Google's MapReduce, a software framework in which an application is broken down into numerous small parts. Any of these parts (also called fragments or blocks) can be run on any node in the cluster. Doug Cutting, Hadoop's creator, named the framework after his child's stuffed toy elephant. The current Apache Hadoop ecosystem consists of the Hadoop kernel, MapReduce, the Hadoop distributed file system (HDFS) and a number of related projects such as Apache Hive, HBase and Zookeeper.
The Hadoop framework is used by major players including Google, Yahoo and IBM, largely for applications involving search engines and advertising. The preferred operating systems are Windows and Linux, but Hadoop can also work with BSD and OS X.

Thursday, 5 December 2013

Getting number from alphanumeric string in SQL Server

 Suppose ,
            We have a string containing numbers, characters and special characters, and the requirement is that we want only the numbers – no characters, special characters, etc. I created a function which returns only the digits.

Create function dbo.fnOnlyNumbers(@Value varchar(255))
returns varchar(255)
as
begin
    -- Remove non-digit characters one at a time until only digits remain
    while PATINDEX('%[^0-9]%', @Value) > 0
        set @Value = STUFF(@Value, PATINDEX('%[^0-9]%', @Value), 1, '')
    -- Return NULL when no digits were left
    return NULLIF(@Value, '')
end


How to run :
select dbo.fnOnlyNumbers('sss1$2>>34 ')
Output : 1234

Note : If the input contains no digits (or is NULL), the function returns a NULL value.




Wednesday, 20 November 2013

Understanding Lead and Lag Functions in SQL Server 2012

 Lead :
          The LEAD function accesses data from a subsequent row (lead) in the same result set.
 Lag   :
          The LAG function accesses data from a previous row (lag) in the same result set.

   Go through following script.

Create table #sk_lead_leg(
[id] [int] IDENTITY(1,1) , 
[Department] [nchar](100) NOT NULL,
[Code] [int] NOT NULL
)
go
insert into #sk_lead_leg(Department ,Code )
values 
('X',100),
('Y',200),
('Z',300)
 go
 select * from #sk_lead_leg order by id asc
 go

Go
SELECT id, Department, Code,
LEAD(Code, 1) OVER (ORDER BY Code) AS LeadValue,
LAG(Code, 1) OVER (ORDER BY Code) AS LagValue
FROM #sk_lead_leg
Go
  We can see the output:

id  Department  Code  LeadValue  LagValue
1   X           100   200        NULL
2   Y           200   300        100
3   Z           300   NULL       200

Lead : the LeadValue of the last row is NULL because there is no subsequent row to read from.
Lag : the LagValue of the first row is NULL because there is no previous row to read from.