AlwaysON Secondary database going to “Not Synchronizing/ Suspect” State!

Published by

sreekanth bandarla

August 29, 2014

In this post I’ll share an issue we had with a database configured with AlwaysON. Before going any further, here is our environment:

Each node runs Windows Server 2008 R2 (with all the service packs and hotfixes recommended for AlwaysON)
Running on top of VMware vSphere 5.1
SQL Server 2012 (SP1) Enterprise Edition
RAM: 10 GB (8 GB assigned to SQL Server)
2 vCPUs
Availability Mode: Synchronous Commit

Issue: Every day around 5 AM, the secondary database goes into the “Not Synchronizing/Suspect” state. Until we fix it, the T-Log on the primary keeps growing, along with all the usual trouble that follows once AlwaysON databases get out of sync.

So, what’s happening?
The App team performs a data load daily around 4:30 AM. Okay… so what’s bad about that? They load ~30 million records daily, in a single transaction. Oops!
From the SQL Server error logs, we see the message below:
Message
AlwaysOn Availability Groups data movement for database ‘Test_DB’ has been suspended for the following reason: “system” (Source ID 2; Source string: ‘SUSPEND_FROM_REDO’). To resume data movement on the database, you will need to resume the database manually. For information about how to resume an availability database, see SQL Server Books Online.
This message is always accompanied by another message (shown below):
Message
The instance of the SQL Server Database Engine cannot obtain a LOCK resource at this time. Rerun your statement when there are fewer active users. Ask the database administrator to check the lock and memory configuration for this instance, or to check for long-running transactions.

Ummm… this doesn’t look good. If you are wondering what locks have to do with AlwaysON secondaries, here is the connection. With both Database Mirroring and AlwaysON, the Rollback/REDO thread also takes locks on the secondary side so that no other transaction can interrupt the REDO process, thus guaranteeing consistency. If for some reason SQL Server is not able to acquire locks for the REDO thread, it stops synchronizing the database from that point on. (It’s by design.)

In our case, SQL Server was running out of memory and could not acquire any more locks (remember, each lock structure in SQL Server needs a certain amount of memory). In a few cases you might be able to simply resume data movement manually and things should go back to normal. Basically, SQL Server is saying: “Since I wasn’t able to acquire a lock during the REDO, I don’t know what else happened at that time and I can’t guarantee the database is consistent. So I am not going to synchronize from this point; I will suspend the data movement and also take the database to the Suspect state.”
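
To get a feel for the scale, here is a rough back-of-the-envelope sketch in Python. Everything in it beyond the 30 million row count is an assumption rather than something measured on our servers: the figure of roughly 96 bytes per lock is the approximate number Microsoft's documentation gives for the locks server option (actual sizes vary by version and platform), and the sketch assumes the worst case of one lock per row with no lock escalation. The function name is purely illustrative.

```python
# Rough estimate of lock memory for a single huge transaction.
# ASSUMPTIONS (not from the original incident): ~96 bytes per lock structure,
# one lock held per row, and no lock escalation.
BYTES_PER_LOCK = 96


def estimate_lock_memory_gib(row_count, bytes_per_lock=BYTES_PER_LOCK):
    """Return the approximate lock memory in GiB for row_count locks."""
    return row_count * bytes_per_lock / (1024 ** 3)


rows = 30_000_000  # ~30 million rows loaded in one transaction
print(f"{estimate_lock_memory_gib(rows):.2f} GiB of lock memory")
```

Under those assumptions the estimate comes to a bit under 3 GiB, which is a large share of the 8 GB assigned to SQL Server on these boxes. Treat it as an order-of-magnitude illustration only.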

From an AlwaysON standpoint, suspending synchronization when the REDO thread encounters an error is by design; SQL Server does it on purpose.

To avoid this, all the App team has to do is optimize their load process to better manage lock acquisition. (Unfortunately, we are not being granted any more memory on these boxes.)

Bottom line: Avoid huge transactions on tiny SQL Servers. Try to split the work into multiple smaller transactions, especially when dealing with millions/billions of rows. That helps in many ways in general, not just in this particular scenario.
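
As an illustration of the chunking idea, here is a minimal Python sketch that commits after every batch instead of once at the end. It is not the App team's actual load process: it runs against an in-memory SQLite database purely as a stand-in so the example is self-contained, and the table name `load_target`, the helper `load_in_batches`, and the batch size are all hypothetical.

```python
import sqlite3


def load_in_batches(conn, rows, batch_size=10_000):
    """Insert rows into load_target, committing after every batch.

    Returns the number of commits (i.e. transactions) used.
    """
    commits = 0
    cur = conn.cursor()
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        cur.executemany(
            "INSERT INTO load_target (id, payload) VALUES (?, ?)", batch
        )
        conn.commit()  # end this small transaction before starting the next
        commits += 1
    return commits


# Demo against an in-memory SQLite database (a stand-in, not SQL Server).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE load_target (id INTEGER PRIMARY KEY, payload TEXT)")
conn.commit()

rows = [(i, f"row-{i}") for i in range(25_000)]
commits = load_in_batches(conn, rows, batch_size=10_000)
total = conn.execute("SELECT COUNT(*) FROM load_target").fetchone()[0]
print(commits, total)
```

The trade-off is that the load is no longer all-or-nothing: if it fails partway through, the earlier batches stay committed, so the process needs to be restartable. The right batch size depends on the server and has to be found by testing.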

Have a safe and happy long weekend guys!


4 responses to “AlwaysON Secondary database going to “Not Synchronizing/ Suspect” State!”

Diana

November 27, 2014

I have the exact same issue with the database. How do you put the database back online, out of the suspect state, and back into the availability group?

sreekanth bandarla

December 18, 2014

I was able to set things back to normal by just resuming data movement. It failed on a few occasions, where I had to re-initialize from scratch. But hey, I used a differential backup instead of a full backup to resync the database, which was much quicker.
But once I identified the root cause, I suggested the app team change how they load data into the tables, and we never had this issue again 🙂

1. Ernesto Solis Tello
  
  September 5, 2016
  
  I had the same problem in my database. The solution was:
  1. Cleanup Unit Log
  2. In Availability Groups -> Availability Databases: right-click the database and click Resume Data Movement
  
  1. Ganesh
    
    January 28, 2019
    
    How to Cleanup Unit Log

MSSQLTREK