Parallel Task Scheduling (2) – Jobs

In my last post Parallel Task Scheduling (1) – Jobs, I descibed the simplest way to schedule tasks to run in parallel. It’s inefficient. Most importantly, if we have 1000 + independent tasks to schedule and each of which needs few minutes to run, it would be very possible that we end up having few hundreds jobs existing in msdb at a time before they are removed. DBAs will complain about the distractive and intermittent job creations, and also, we may not want few hundreds tasks running at same time to create CPU contention on the server with the SQL Server engine.

To overcome the issues descibed above, why not to schedule limit number of jobs to run the same code which can load different tasks from a task list. Each job is scheduled to run every 10 seconds. In this case, there would be no frequent job creation and deletion overtime and max degree of parallelism is same as number of jobs created so that we have a predictable performance in such scheduling system. I have seen such structures in many companies. It works well in many circumstances.

I give each part of this process different names which are borrowed from SQLOS

Scheduler: A logical component that issues workers, loads tasks, and run tasks. Obviously that’s SQL Agent in thi case. So there is no design consideration for this component
Worker: An independent thread issued by scheduler to run tasks. This is SQL job in this case. There is only implementation consideration in which you need to decide number of workers(jobs)
Queue: A table created in an user database lists the tasks to be running
Queue Reader: It’s a peice of code that reads one pending task from the queue, loads it to worker, and then run it
- It must read a task that has not been run (pending) and loaded by other queue reader from other workers – The same task should never be loaded to different works.
- After a task is finished, the next task in the queue should be loaded immediately rather than waiting for arrival of next schedule after 10 seconds (minimum interval of SQL Server job scheduling is 10 seconds)

So, under this umbrella, the coding effort would be implementation of Queue and Queue Reader. Now, let’s create queue table and execution log table

create table Tasks
(
	TaskID int identity(1,1) primary key, 
	TaskBody varchar(max) not null,
	Status varchar(20) not null default('Pending') -- can be Pending, Running, and Finished
)
go
exec sp_tableoption 'dbo.Tasks', 'large value types out of row', 1
/* 
This option removes TaskBody field from rows.
Enabling this can effectively reduce the size of the row hance reduce the size of the b-tree when you have massive amount scheduled tasks and finished tasks.
optionally, you can remove tasks from the queue once it's completed
 */
go
create index IX_Status on Tasks(Status) 
-- this index is optional as well since cardinality of the status is very low
go
Create table Logs
(
	LogID bigint identity(1,1) primary key,
	TaskID int not null,
	TaskBody varchar(max),
	SPID int not null default(@@SPID),
	StartTime datetime not null default(getdate()),
	EndTime datetime
)
go

Then create a wrap-up procedure to execute a task and hide the execution logging logic and error handling logics.

create procedure ExecuteTask(@TaskID int, @TaskBody nvarchar(max))
as
begin
	declare @LogID bigint
	insert into Logs(TaskID, TaskBody)
		values(@TaskID, @TaskBody)
	select @LogID = scope_identity()
	begin try
		exec(@TaskBody )
	end try
	begin catch
		-- your error handling logics
	end catch
	update Logs set EndTime = getdate() where LogID = @LogID
end

Now, it’s time to design Queue Reader logic which will be put in the “Command field” of a job step

create procedure TaskScheduler 
(
	@WaitingTime int = 30 -- when queue is empty, it waits for 30 second by default before quitting.
)
as
begin
	declare @StartTime datetime, @TaskID int, @TaskBody nvarchar(max)
	select @StartTime = getdate()
	while(1=1)
	begin
		begin transaction
		update top (1) t
			set @TaskID = TaskID,
				@TaskBody = TaskBody,
				Status = 'Running'
		from Tasks t with (readpast)
		where Status = 'Pending'
		if @@rowcount = 1
		begin
			exec ExecuteTask @TaskID, @TaskBody
			delete from Tasks where TaskID = @TaskID
			select @StartTime = getdate()
		end
		else
			waitfor delay '00:00:00:100' -- wait for 100 milliseconds
		commit -- we can commit anyways here
		if datediff(second, @StartTime, getdate()) > @WaitingTime
			break; -- if there is no task in the queue in last @WaitingTime seconds
	end
end

In the line 16 above, I put readpast hints there. While update, a record can only be accessed by session at the same time. For instance, if the first record in the queue table is being read by the update statement by a session, other sessions will be locked and not be able to access it until the transaction is finished by the first session. With readpast hint, in this case, the second session will skip the first record and reach the second record without blocking. Before transaction is complete, the record locked by the session gets removed. This guarentees that one record is read by only one session once. This is the key of queue reader implementation under this architecture.
You may worry about putting waitfor command in the transaction causing blocking issue due to unreleased locks holding by the transaction. That’s correct. But the fact is that it only wait when there is no record in the queue table where in this case only IX lock is applied to the table in the duration of the transaction. IX lock will not block data reading and modification over the table. It will cause block when there is a lock escalation or schema modification.
Now create another wrap-up procedure then we can start testing

create procedure ExecAsync (@SQL varchar(max))
as
begin
	insert into Tasks(TaskBody) values(@SQL)
end

Now open 2 windows and run

exec TaskScheduler

open another window and run

exec ExecAsync 'Waitfor delay ''00:00:10'''
go 10

Wait for a shor time and then run following statement

select * from Logs with (nolock)
select * from tasks with (nolock)

Results are

LogID  TaskID      TaskBody                 SPID        StartTime               EndTime
------ ----------- ------------------------ ----------- ----------------------- -----------------------
1      1           Waitfor delay '00:00:10' 60          2012-01-04 15:49:46.630 2012-01-04 15:49:56.630
2      2           Waitfor delay '00:00:10' 54          2012-01-04 15:49:46.647 2012-01-04 15:49:56.647
3      3           Waitfor delay '00:00:10' 54          2012-01-04 15:49:57.290 2012-01-04 15:50:07.290
4      4           Waitfor delay '00:00:10' 60          2012-01-04 15:49:57.290 2012-01-04 15:50:07.293
5      5           Waitfor delay '00:00:10' 54          2012-01-04 15:50:07.293 2012-01-04 15:50:17.293
6      6           Waitfor delay '00:00:10' 60          2012-01-04 15:50:07.293 2012-01-04 15:50:17.293
7      7           Waitfor delay '00:00:10' 60          2012-01-04 15:50:17.297 NULL
8      8           Waitfor delay '00:00:10' 54          2012-01-04 15:50:17.297 NULL

(8 row(s) affected)

TaskID      TaskBody                 Status
----------- ------------------------ --------------------
7           Waitfor delay '00:00:10' Running
8           Waitfor delay '00:00:10' Running
9           Waitfor delay '00:00:10' Pending
10          Waitfor delay '00:00:10' Pending

(4 row(s) affected)

Now you can see task executions are distributed to 2 sessions. What about the synchronization? You have the options to check the task completion in the Tasks table or Log table.

Parallel Task Scheduling (2) – Jobs

You May Also Like

Leave a Reply Cancel reply