SQL Server 2019 – VLDB struck in “In Recovery”!

Getting straight to the point, Chances are very high that you will run into the issue that we are going to see in this blog post when you try to restore a database larger than 30 TB running on SQL Server 2019 CU1.

Environment:

SQL Server 2019 CU1
DB1 – 6 TB.
DB2 – 30+TB.

Behavior:

Restoring 6 TB database – Everything goes normal as it should.
Restoring 30 TB database – Database stays in “In Recovery” state indefinitely. Infamous “Non yielding scheduler” stack dump gets created when the recovery process enters roll forward(REDO) phase. To give you an idea, look at below log entries.

Message

Recovery of database ‘VLDB’ (8) is 0% complete (approximately 138 seconds remain). Phase 2 of 3.
Recovery of database ‘VLDB’ (8) is 22% complete (approximately 7 seconds remain). Phase 2 of 3.

***Unable to get thread context for spid 0
* *******************************************************************************
** BEGIN STACK DUMP:
* MM/DD/YY HH:MM:SS spid 4368
** Non-yielding Scheduler
** *******************************************************************************
Stack Signature for the dump is 0x00000000000001BC
External dump process return code 0x20000001.
External dump process returned no errors.
Process 0:0:0 (0x110c) Worker 0x000001B39D42C160 appears to be non-yielding on Scheduler 1. Thread creation time: 13234458222710.
Approx Thread CPU Used: kernel 0 ms, user 70375 ms. Process Utilization 16%. System Idle 79%. Interval: 70579 ms.
Process 0:0:0 (0x110c) Worker 0x000001B39D42C160 appears to be non-yielding on Scheduler 1. Thread creation time: 13234458222710.
Process 0:0:0 (0x110c) Worker 0x000001B39D42C160 appears to be non-yielding on Scheduler 1. Thread creation time: 13234458222710.
Process 0:0:0 (0x110c) Worker 0x000001B39D42C160 appears to be non-yielding on Scheduler 1. Thread creation time: 13234458222710.
Process 0:0:0 (0x110c) Worker 0x000001B39D42C160 appears to be non-yielding on Scheduler 1. Thread creation time: 13234458222710.
Process 0:0:0 (0x110c) Worker 0x000001B39D42C160 appears to be non-yielding on Scheduler 1. Thread creation time: 13234458222710.

Note: I noticed this exact behavior consistently on multiple servers irrespective of what method we use to restore this DB.
Native restore – Fails
Attach/Detach – Fails
Restored using EMC DDBoost – Fails
Restore virtual database by presenting virtual data copies using third party tools such as Delphix – Fails.

Fix:
Drop the database which got struck in “In recovery” and apply 2019 CU4 or above.

I am not sure at what size SQL 2019 CU1 starts to panic, but it was happy with 6TB DB and caused misery for 30 TB DB. BTW…At the time of writing this blog post, we have CU5 available which I strongly recommend if you are planning to patch.

 

Happy Monday!