AlwaysON Secondary database going to “Not Synchronizing/ Suspect” State!

In this blog post I will share an issue we had with a database which is configured with AlwaysON. Before proceeding any further, the environment which we’ve got is:

Each node has Windows Server 2008R2(With all the service packs and hot fixes recommended for AlwaysON)
Running on top of VMware VShpere 5.1
SQL Server 2012(SP1) Enterprise Edition
RAM: 10 GB (8 GB assigned to SQL Server).
2 VCPU’s.
Availability Mode- Synchronous Commit

Issue: Daily around 5 AM, the secondary database is going to “Not Synchronizing/Suspect” state and until we fix this the T-Log on primary grows and all that normal jazz once the AlwaysON databases get out of Sync…(See below)

1
So, what’s happening?
The App team is performing data load daily around 4.30 AM. Okay…So what’s bad about that? They are loading ~30 Million Records daily, in a single transaction. Oops!!!…
From SQL Server error logs, we see the below message:
Message

AlwaysOn Availability Groups data movement for database ‘Test_DB’ has been suspended for the following reason: “system” (Source ID 2; Source string: ‘SUSPEND_FROM_REDO‘). To resume data movement on the database, you will need to resume the database manually. For information about how to resume an availability database, see SQL Server Books Online.
This message is always accommodated with another message(Shown below):
Message
The instance of the SQL Server Database Engine cannot obtain a LOCK resource at this time. Rerun your statement when there are fewer active users. Ask the database administrator to check the lock and memory configuration for this instance, or to check for long-running transactions.

Ummm…This doesn’t looks good. If you are thinking, what Locks have to do with AlwaysON secondaries, let me tell you this. With Database Mirroring and AlwaysON Rollbacks/Redo thread will also take a lock on the secondary side to avoid any other transaction to interrupt REDO process, thus guaranteeing consistency. If for some reason SQL Server is not able to acquire locks for redo thread it won’t synchronize the database starting that point. (It’s by design).

In our case what’s happening was SQL Server was running out of memory and was not able to acquire any further locks(Remember, each lock structure in SQL Server will need certain amount of memory).Basically, it says “Since I wasn’t able to acquire a lock during the REDO, I don’t know what else happened at that time and I can’t guarantee the database to be consistent. So…am not going to synchronize from this point and I will suspend the data movement and also take the database to Suspect state”).

From AlwaysON standpoint, Suspending Synchronization when the REDO thread encounters any error is by design and is done on purpose by SQL Server.

To avoid this, all they(App team) have to do is optimize their load process to better manage lock acquisition.(We are not being granted any more memory on these boxes unfortunately).

Bottom Line: Avoid huge transactions on tiny SQL Servers. Try to split the transactions into multiple chunks especially when dealing with millions/billions of rows.  That helps in general many ways, not just in this particular scenario.

Have a safe and happy long weekend guys!

 

Advertisements

Denali Databases Status in Object Explorer

Howdyyy…Hope all is well folks! It’s been a long time since I wrote something….Was quite busy with some crazy stuff going around!! Anyways this would be a another small/Quick Blog post regarding a small enhancement msft offered in DENALI SSMS.

Okay, let’s jump into the content! I’ve created few SUSPECT databases on one of my SQL Instances and let me show you how DENALI CTP3 SSMS is showing in it’s Object Explorer and how SQL Server 20008R2 SSMS shows in it’s Object Explorer.

SQL Server 2008R2 SSMS:

SQL Server DENALI CTP3 SSMS:

As you can see, in DENALI the Database(s) are appended with (Recovery Pending) status which is more self explanatory where as in 2008R2 SSMS, the Databases are shown as nothing! Even though this is not great enhancement, it’s good to see this in our OE 🙂

How to Create a Suspect Database???….

Yes, You heard it right! How to Create a SUSPECT Database. You might be thinking, why would someone make a Database Suspect! I got you! Well, This post is not to teach you how to make your Production Database as “SUSPECT” 😉  I thought of sharing this Info, to fight a PANIC situation where you see a SUSPECT Database in your environment.  Create a SUSPECT Database in your play area and practice how to recover from that situation, get familiarize with a SUSPECT DB and play with it and try to recover from it far before a real Disaster hits your Prod Database(I hope not).

In this post, I’m not going to show how to recover from “SUSPECT DB”, Instead just to simulate a scenario where your DB goes into SUSPECT State. Let’s begin!!!…

Couple of months ago, I’ve written a small blog post(see here) on how to simulate 824 Error using Hex Editor. I’ll be using the same tool to achieve our goal here in this post.

Simple Method:

Step 1: Create a Dummy Database to play with.

Step 2 : Shutdown your SQL Server DB Engine Service.

Step 3: Open your LDF file with Hex Editor and edit it manually with few Bogus entries.

Step 4: Start your SQL Instance, you should be seeing your Database marked as Suspect by SQL Server.

Alternate, more practical method:

Step 1: I’ve created a dummy database to play with “Suspect_tst”.

Step2: I’ve started an explicit transaction(but didn’t committed or Rolled Back) and left it as an Open transaction and did a Checkpoint.

Step3: Force Shut down your SQL Server.

Step4: Open your ldf file in Hex Editor and choose a random row and fill them with some random bogus values same as in method 1.

Step5: Start your SQL Server, You will be seeing a Suspect Database 🙂

You can see the below Screenshot what i meant in Steps 1 and 2.

Now, as of Step 3, all you’ve to do is just open a new query and type “SHUTDOWN  WITH NOWAIT;“(Never ever run this on Production, Trust me, you’ll immediately loose your job:D)

for Step 5, as you can see in the below screenshot, I’ve opened ldf file and filled in 2nd Row with all 1’s.

Save and now Start your SQL Server Instance! Once SQL has been started, query your DB Status as shown below.

Tadaaaaa! We did it….:)

Now, you know how to simulate this most disastrous situation, play with this database for recovery, loosing as less amount data as you can. Cheers!