In my last post Parallel Task Scheduling (1) – Jobs, I descibed the simplest way to schedule tasks to run in parallel. It’s inefficient. Most importantly, if we have 1000 + independent tasks to schedule and each of which needs few minutes to run, it would be very possible that we end up having few hundreds jobs existing in msdb at a time before they are removed. DBAs will complain about the distractive and intermittent job creations, and also, we may not want few hundreds tasks running at same time to create CPU contention on the server with the SQL Server engine.
To overcome the issues descibed above, why not to schedule limit number of jobs to run the same code which can load different tasks from a task list. Each job is scheduled to run every 10 seconds. In this case, there would be no frequent job creation and deletion overtime and max degree of parallelism is same as number of jobs created so that we have a predictable performance in such scheduling system. I have seen such structures in many companies. It works well in many circumstances.
I give each part of this process different names which are borrowed from SQLOS
- Scheduler: A logical component that issues workers, loads tasks, and run tasks. Obviously that’s SQL Agent in thi case. So there is no design consideration for this component
- Worker: An independent thread issued by scheduler to run tasks. This is SQL job in this case. There is only implementation consideration in which you need to decide number of workers(jobs)
- Queue: A table created in an user database lists the tasks to be running
- Queue Reader: It’s a peice of code that reads one pending task from the queue, loads it to worker, and then run it
- It must read a task that has not been run (pending) and loaded by other queue reader from other workers – The same task should never be loaded to different works.
- After a task is finished, the next task in the queue should be loaded immediately rather than waiting for arrival of next schedule after 10 seconds (minimum interval of SQL Server job scheduling is 10 seconds)
So, under this umbrella, the coding effort would be implementation of Queue and Queue Reader. Now, let’s create queue table and execution log table
create table Tasks ( TaskID int identity(1,1) primary key, TaskBody varchar(max) not null, Status varchar(20) not null default('Pending') -- can be Pending, Running, and Finished ) go exec sp_tableoption 'dbo.Tasks', 'large value types out of row', 1 /* This option removes TaskBody field from rows. Enabling this can effectively reduce the size of the row hance reduce the size of the b-tree when you have massive amount scheduled tasks and finished tasks. optionally, you can remove tasks from the queue once it's completed */ go create index IX_Status on Tasks(Status) -- this index is optional as well since cardinality of the status is very low go Create table Logs ( LogID bigint identity(1,1) primary key, TaskID int not null, TaskBody varchar(max), SPID int not null default(@@SPID), StartTime datetime not null default(getdate()), EndTime datetime ) go
Then create a wrap-up procedure to execute a task and hide the execution logging logic and error handling logics.
create procedure ExecuteTask(@TaskID int, @TaskBody nvarchar(max)) as begin declare @LogID bigint insert into Logs(TaskID, TaskBody) values(@TaskID, @TaskBody) select @LogID = scope_identity() begin try exec(@TaskBody ) end try begin catch -- your error handling logics end catch update Logs set EndTime = getdate() where LogID = @LogID end
Now, it’s time to design Queue Reader logic which will be put in the “Command field” of a job step
create procedure TaskScheduler ( @WaitingTime int = 30 -- when queue is empty, it waits for 30 second by default before quitting. ) as begin declare @StartTime datetime, @TaskID int, @TaskBody nvarchar(max) select @StartTime = getdate() while(1=1) begin begin transaction update top (1) t set @TaskID = TaskID, @TaskBody = TaskBody, Status = 'Running' from Tasks t with (readpast) where Status = 'Pending' if @@rowcount = 1 begin exec ExecuteTask @TaskID, @TaskBody delete from Tasks where TaskID = @TaskID select @StartTime = getdate() end else waitfor delay '00:00:00:100' -- wait for 100 milliseconds commit -- we can commit anyways here if datediff(second, @StartTime, getdate()) > @WaitingTime break; -- if there is no task in the queue in last @WaitingTime seconds end end
In the line 16 above, I put readpast hints there. While update, a record can only be accessed by session at the same time. For instance, if the first record in the queue table is being read by the update statement by a session, other sessions will be locked and not be able to access it until the transaction is finished by the first session. With readpast hint, in this case, the second session will skip the first record and reach the second record without blocking. Before transaction is complete, the record locked by the session gets removed. This guarentees that one record is read by only one session once. This is the key of queue reader implementation under this architecture.
You may worry about putting waitfor command in the transaction causing blocking issue due to unreleased locks holding by the transaction. That’s correct. But the fact is that it only wait when there is no record in the queue table where in this case only IX lock is applied to the table in the duration of the transaction. IX lock will not block data reading and modification over the table. It will cause block when there is a lock escalation or schema modification.
Now create another wrap-up procedure then we can start testing
create procedure ExecAsync (@SQL varchar(max)) as begin insert into Tasks(TaskBody) values(@SQL) end
Now open 2 windows and run
open another window and run
exec ExecAsync 'Waitfor delay ''00:00:10''' go 10
Wait for a shor time and then run following statement
select * from Logs with (nolock) select * from tasks with (nolock)
LogID TaskID TaskBody SPID StartTime EndTime ------ ----------- ------------------------ ----------- ----------------------- ----------------------- 1 1 Waitfor delay '00:00:10' 60 2012-01-04 15:49:46.630 2012-01-04 15:49:56.630 2 2 Waitfor delay '00:00:10' 54 2012-01-04 15:49:46.647 2012-01-04 15:49:56.647 3 3 Waitfor delay '00:00:10' 54 2012-01-04 15:49:57.290 2012-01-04 15:50:07.290 4 4 Waitfor delay '00:00:10' 60 2012-01-04 15:49:57.290 2012-01-04 15:50:07.293 5 5 Waitfor delay '00:00:10' 54 2012-01-04 15:50:07.293 2012-01-04 15:50:17.293 6 6 Waitfor delay '00:00:10' 60 2012-01-04 15:50:07.293 2012-01-04 15:50:17.293 7 7 Waitfor delay '00:00:10' 60 2012-01-04 15:50:17.297 NULL 8 8 Waitfor delay '00:00:10' 54 2012-01-04 15:50:17.297 NULL (8 row(s) affected) TaskID TaskBody Status ----------- ------------------------ -------------------- 7 Waitfor delay '00:00:10' Running 8 Waitfor delay '00:00:10' Running 9 Waitfor delay '00:00:10' Pending 10 Waitfor delay '00:00:10' Pending (4 row(s) affected)
Now you can see task executions are distributed to 2 sessions. What about the synchronization? You have the options to check the task completion in the Tasks table or Log table.