AlwaysOn Secondary Database Going to “Not Synchronizing/Suspect” State!

In this blog post I will share an issue we had with a database that is part of an AlwaysOn Availability Group. Before proceeding any further, here is our environment:

Each node runs Windows Server 2008 R2 (with all the service packs and hotfixes recommended for AlwaysOn)
Running on top of VMware vSphere 5.1
SQL Server 2012 (SP1) Enterprise Edition
RAM: 10 GB (8 GB assigned to SQL Server)
2 vCPUs
Availability Mode: Synchronous Commit

Issue: Every day around 5 AM, the secondary database goes to the “Not Synchronizing/Suspect” state. Until we fix it, the T-Log on the primary keeps growing, along with all the usual jazz that comes once AlwaysOn databases get out of sync… (see below)

So, what’s happening?
The app team runs a data load every day around 4:30 AM. Okay… so what’s bad about that? They are loading ~30 million records, in a single transaction. Oops!!!…
From the SQL Server error logs, we see the message below:
Message

AlwaysOn Availability Groups data movement for database ‘Test_DB’ has been suspended for the following reason: “system” (Source ID 2; Source string: ‘SUSPEND_FROM_REDO’). To resume data movement on the database, you will need to resume the database manually. For information about how to resume an availability database, see SQL Server Books Online.
This message is always accompanied by another message (shown below):
Message
The instance of the SQL Server Database Engine cannot obtain a LOCK resource at this time. Rerun your statement when there are fewer active users. Ask the database administrator to check the lock and memory configuration for this instance, or to check for long-running transactions.

Ummm… this doesn’t look good. If you are wondering what locks have to do with AlwaysOn secondaries, let me tell you this. With both Database Mirroring and AlwaysOn, the rollback/redo thread also takes locks on the secondary side so that no other transaction can interfere with the REDO process, thus guaranteeing consistency. If for some reason SQL Server is not able to acquire locks for the redo thread, it stops synchronizing the database from that point on. (It’s by design.)

In a few cases you might be able to just resume data movement manually and things should be back to normal. In our case, though, SQL Server was running out of memory and could not acquire any more locks (remember, each lock structure in SQL Server needs a certain amount of memory). Basically, it says: “Since I wasn’t able to acquire a lock during the REDO, I don’t know what else happened at that time and I can’t guarantee the database is consistent. So… I am not going to synchronize from this point. I will suspend the data movement and also take the database to the Suspect state.”
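To get a feel for why lock memory matters here, below is a rough back-of-the-envelope calculation. It assumes about 96 bytes per lock structure (the figure Microsoft's documentation for the `locks` server option gives) and, as a deliberately pessimistic assumption of mine, one lock held per loaded row with no lock escalation. The real numbers on the redo side will differ; this only shows the order of magnitude.

```python
# Rough estimate of lock memory for one huge transaction.
# Assumptions (not measurements from our servers):
#   - about 96 bytes per lock structure
#   - worst case: one lock held per row, no lock escalation
BYTES_PER_LOCK = 96


def lock_memory_gib(row_count, bytes_per_lock=BYTES_PER_LOCK):
    """Return the approximate lock memory in GiB for row_count held locks."""
    return row_count * bytes_per_lock / (1024 ** 3)


if __name__ == "__main__":
    rows = 30_000_000  # daily load size mentioned in this post
    print(f"{rows:,} locks ~ {lock_memory_gib(rows):.2f} GiB")
```

Under those assumptions, 30 million locks come to roughly 2.7 GiB, which is a big bite out of the 8 GB assigned to SQL Server on these boxes.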

From an AlwaysOn standpoint, suspending synchronization when the REDO thread encounters any error is by design; SQL Server does it on purpose.
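As the error message says, a suspended database has to be resumed manually. The T-SQL for that is `ALTER DATABASE ... SET HADR RESUME`, run on the replica where the database is suspended. The small helper below is only a sketch that builds that statement for a given database name; the function names are mine, and it does not connect to any server.

```python
def quote_name(identifier):
    """Bracket-quote a SQL Server identifier, doubling any closing bracket."""
    return "[" + identifier.replace("]", "]]") + "]"


def resume_statement(database):
    """Build the T-SQL that resumes suspended data movement for one database."""
    return f"ALTER DATABASE {quote_name(database)} SET HADR RESUME;"


if __name__ == "__main__":
    # Test_DB is the database name from the error message above.
    print(resume_statement("Test_DB"))
```

Keep in mind that resuming only helps if the underlying cause (here, lock memory) is gone; otherwise the next big load will suspend the database again.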

To avoid this, all the app team has to do is optimize their load process to better manage lock acquisition. (Unfortunately, we are not being granted any more memory on these boxes.)

Bottom line: avoid huge transactions on tiny SQL Servers. Try to split the transactions into multiple chunks, especially when dealing with millions/billions of rows. That helps in many ways in general, not just in this particular scenario.
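Here is a minimal sketch of what “split into chunks” can look like from the loading side. It is not the app team's actual load process: it uses Python's built-in `sqlite3` purely as a stand-in so the example runs anywhere, and the `sales` table, its columns and the batch size are made up for illustration. Against SQL Server you would use your own driver and table, but the idea is the same: commit after every batch instead of once at the end.

```python
import sqlite3
from itertools import islice


def chunked(rows, size):
    """Yield lists of at most `size` items from any iterable."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def load_in_batches(conn, rows, batch_size=10_000):
    """Insert rows batch by batch, committing after each batch.

    Returns the number of commits issued.
    """
    commits = 0
    for batch in chunked(rows, batch_size):
        conn.executemany("INSERT INTO sales (id, amount) VALUES (?, ?)", batch)
        conn.commit()  # each batch becomes its own, much smaller transaction
        commits += 1
    return commits


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # stand-in for the real database
    conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")
    rows = ((i, i * 1.5) for i in range(25_000))
    print(load_in_batches(conn, rows), "commits")
    print(conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0], "rows")
```

Smaller transactions hold fewer locks at any one time and let the log be reused between batches, which is exactly what a memory-starved secondary needs.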

Have a safe and happy long weekend guys!

 


SQL Server AlwaysOn Availability Groups Terminology….!

With AlwaysOn being introduced in SQL Server 2012, there are a lot of new terms which we need to get used to as we support SQL Server 2012. Well, just getting used to those terms is not enough… we have to understand the terminology. In this short post let me define what I’ve understood so far of these new fancy terms.

Availability Groups (AG) :- A group of databases which move together from one instance to another. Each instance can have multiple AGs, and each AG can contain multiple databases.

Listener :- A virtual entity (a virtual network name that clients connect to) which moves around to the current primary server for a given AG.

Replicas :- All the SQL Server instances involved in your AG are considered replicas. Even the current primary server is treated as a replica, not just the secondaries…! So, replicas are not just secondary SQL instances. They are differentiated as the Primary Replica and the Secondary Replicas.
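If it helps to see these three terms side by side, here is a toy model of them. It is only my mental picture written down as Python classes, not anything SQL Server exposes; the names `SalesAG`, `NodeA`, `NodeB` and `NodeC` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    instance: str            # SQL Server instance hosting a copy of the AG databases
    role: str = "SECONDARY"  # "PRIMARY" or "SECONDARY"


@dataclass
class AvailabilityGroup:
    name: str
    databases: list = field(default_factory=list)
    replicas: list = field(default_factory=list)

    def primary(self):
        """Return the replica currently in the primary role."""
        return next(r for r in self.replicas if r.role == "PRIMARY")

    def listener_target(self):
        """The listener always leads to whichever instance is primary."""
        return self.primary().instance

    def fail_over(self, new_primary):
        """Swap roles; every database in the AG moves together."""
        if new_primary not in [r.instance for r in self.replicas]:
            raise ValueError(f"{new_primary} is not a replica of {self.name}")
        for r in self.replicas:
            r.role = "PRIMARY" if r.instance == new_primary else "SECONDARY"


if __name__ == "__main__":
    ag = AvailabilityGroup(
        "SalesAG",
        databases=["Sales_1", "Sales_2", "Sales_3"],
        replicas=[Replica("NodeA", "PRIMARY"), Replica("NodeB"), Replica("NodeC")],
    )
    print(ag.listener_target())
    ag.fail_over("NodeB")
    print(ag.listener_target())
```

The point of the sketch: the primary is just one of the replicas, and applications that connect through the listener never need to know which one it currently is.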

Note: AGs run on top of Windows Clustering. So is it a new clustering?? Nope! Same old Windows Clustering, but with a flavor of no Shared Storage. FYI, SQL Instances in my lab are 3 Standalones which are built on top of a 3 Node Windows Cluster with no shared Storage!

Note: your SQL Server instances can be clustered as well, which adds more complexity but is needed by some customers for their own business needs. Microsoft calls this scenario an AlwaysOn Failover Cluster Instance (FCI).

An AlwaysOn Failover Cluster Instance is not the same as an AlwaysOn Availability Group!

How to add a new Database to an existing AlwaysOn Availability Group?

In this post, let us see how to add a new database to an existing AlwaysOn Availability Group. Let’s assume we have a sales AG already in place and the application team has asked us (the DBA team) to add one more database to this group, as they need all the databases to be available together. For this example let’s assume the new database name is “Sales_Q1”. Of course, you would have a more realistic name in your real-world deployments!

Please see my current Availability Group status below:

At a glance, you can see 3 databases (Sales_1, Sales_2 and Sales_3) currently participating in my AG. You can also see that only 2 nodes out of 3 are powered up (see the SreeSQLDR status as Down) and that the primary replica is SreeSQLA.

Now, let me create a new database named “Sales_Q1”. See below.

As you can see, I’ve just created a new database (Simple recovery model) which is not yet added to my AG. Now, let’s try to add this database and see how SQL Server behaves by default.

Step 1: Right-click on Availability Databases and select Add Database.

Step 2:

As you can see, the wizard is smart enough to recognize that your database needs the Full recovery model to participate in your AG.

Step 3: I changed the recovery model to Full and refreshed the wizard. Now… see below for what it says.

Cool…now it says you need a Full backup to be taken.

Step 4: Now I took a full backup and placed it on a share that all the nodes have access to. Now it says you are good to go, as you can see below 🙂

Step 5: Now it’s time to choose how you want to start the initial synchronization. I’ve selected Full and provided the share (where I placed my backup) that all the nodes participating in my AG have access to.

Step 6: Time to join all the other nodes by connecting to them. Notice that I can’t proceed any further (the Next button is greyed out).

As you can see in the above screenshot, I’m not allowed to proceed any further. (Remember my first screenshot? I didn’t power up my DR node…) I just wanted to show you how this wizard behaves if any one of the nodes participating in the AG is down. Once I powered up my DR machine, I was able to connect to that instance and proceed to Next.

Step 7: It validated all my configurations and did its magic behind the scenes 🙂

Note: You can script out the entire process from the Summary screen.
Finally, see below for how it looks once I added the new DB to our AG.

You can see Sales_Q1 has been added to the AG successfully. Perfect! Hope this helps.
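For those who prefer scripts over wizards, the steps above correspond roughly to the T-SQL that the sketch below prints. This is not the script the wizard itself generates, just my own approximation of the same flow (Full recovery model, full and log backups to a share, add the database on the primary, restore WITH NORECOVERY and join on each secondary). The AG name `SalesAG` and the share path are hypothetical; substitute your own.

```python
def q(name):
    """Bracket-quote a SQL Server identifier."""
    return "[" + name.replace("]", "]]") + "]"


def lit(text):
    """Format a value as an N'...' string literal, doubling single quotes."""
    return "N'" + text.replace("'", "''") + "'"


def add_database_script(ag, database, share):
    """Return the T-SQL to run on the primary and on each secondary replica."""
    full = lit(f"{share}\\{database}.bak")  # full backup file on the share
    log = lit(f"{share}\\{database}.trn")   # log backup file on the share
    primary = [
        f"ALTER DATABASE {q(database)} SET RECOVERY FULL;",
        f"BACKUP DATABASE {q(database)} TO DISK = {full};",
        f"BACKUP LOG {q(database)} TO DISK = {log};",
        f"ALTER AVAILABILITY GROUP {q(ag)} ADD DATABASE {q(database)};",
    ]
    secondary = [
        f"RESTORE DATABASE {q(database)} FROM DISK = {full} WITH NORECOVERY;",
        f"RESTORE LOG {q(database)} FROM DISK = {log} WITH NORECOVERY;",
        f"ALTER DATABASE {q(database)} SET HADR AVAILABILITY GROUP = {q(ag)};",
    ]
    return {"primary": primary, "secondary": secondary}


if __name__ == "__main__":
    # SalesAG and the share path are placeholders, not names from my lab.
    script = add_database_script("SalesAG", "Sales_Q1", r"\\fileshare\agbackups")
    for side, statements in script.items():
        print(f"-- run on the {side} replica")
        print("\n".join(statements))
```

The secondary part has to be run on every secondary replica, which is also why the wizard refused to continue while my DR node was down.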