Determining number of nodes or cores available in an SGE Queue

Linux
Author: Vinh Nguyen

Published: June 16, 2011

To check the status of a queue in SGE, one can issue the command qstat -g c to get information such as the number of CPUs available and the current CPU and memory load. However, this information can be misleading when nodes are cross-listed in multiple queues (Q's): a Q can report X nodes as unused when, in reality, they are in use in a different Q. Consequently, a submitted parallel job asking for X cores can wait in limbo for quite some time, depending on the cluster's load. The following R script, sgeQload.R, uses some commands explained in the cheat sheet to output the number of cores really available:

#!/usr/bin/env Rscript

## This script shows me the number of cores available for each Q.
## Since many Q's on BDUC contain overlapping nodes, information from "qstat -g c" could be misleading and lead to submitted jobs that are waiting...
## This script utilizes R, qconf

### References
## http://moo.nac.uci.edu/~hjm/bduc/sge-quick-reference_v3_cheatsheet.pdf
## http://www.troubleshooters.com/codecorn/littperl/perlreg.htm

qstatgc <- system("qstat -g c", intern=TRUE)
qstatgc.list <- strsplit(qstatgc, split="\\s+", perl=TRUE)[c(-2, -3)] ## remove --- line and all.q
qstatgc.list[[1]] <- qstatgc.list[[1]][-1] ## CLUSTER QUEUE is one thing -> QUEUE
qstat <- t(sapply(qstatgc.list[-1], function(x) as.numeric(x[-1])))
colnames(qstat) <- qstatgc.list[[1]][-1]
rownames(qstat) <- sapply(qstatgc.list[-1], function(x) x[1])
qstat <- cbind(qstat, NCPU=NA, LOAD=NA, AVAILABLE=NA)


## Parse qhost once, outside the loop; its output is the same for every Q.
qhost <- system("qhost", intern=TRUE)[c(-2, -3)] ## remove --- line and the global row
qhost.matrix <- do.call(rbind, strsplit(qhost[-1], "\\s+", perl=TRUE))
colnames(qhost.matrix) <- strsplit(qhost[1], "\\s+", perl=TRUE)[[1]]

for(Q in rownames(qstat)){
  ## entries on the Q's hostlist line (host groups on BDUC)
  host.list <- strsplit(grep("hostlist", system(paste("qconf -sq", Q), intern=TRUE), value=TRUE), split="\\s+", perl=TRUE)[[1]][-1]
  host.vec <- NULL
  for(host in host.list){
    ## join the backslash-continued "qconf -shgrp" output into one line and keep what follows "hostlist"
    shgrp <- paste(system(paste("qconf -shgrp", host), intern=TRUE), collapse=" ")
    shgrp <- gsub("\\", "", shgrp, fixed=TRUE)
    members <- strsplit(shgrp, "hostlist", fixed=TRUE)[[1]][2]
    host.vec <- c(host.vec, strsplit(members, "\\s+", perl=TRUE)[[1]])
  }
  host.vec <- unique(host.vec)
  host.vec <- host.vec[host.vec != ""]
  host.vec <- gsub(".bduc", "", host.vec, fixed=TRUE) ## strip the domain suffix so names match qhost

  ## sum CPUs and load over the hosts that belong to this Q
  in.queue <- qhost.matrix[, "HOSTNAME"] %in% host.vec
  NCPU <- sum(as.numeric(qhost.matrix[in.queue, "NCPU"]))
  LOAD <- sum(as.numeric(qhost.matrix[in.queue, "LOAD"]))
  qstat[Q, "NCPU"] <- NCPU
  qstat[Q, "LOAD"] <- LOAD
  qstat[Q, "AVAILABLE"] <- NCPU-LOAD
}

qstat
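
The parsing step at the top of the script can be tried without access to a cluster. The following is a minimal sketch that runs the same split-and-convert logic on a canned sample; the queue names and numbers in sample_out are made up, the helper name parse_qstat_gc is hypothetical, and, like the script, it assumes that the second line is the dashed rule and the third line is all.q.

```r
## Made-up sample in the layout printed by "qstat -g c" (not real cluster output).
sample_out <- c(
  "CLUSTER QUEUE                   CQLOAD   USED    RES  AVAIL  TOTAL aoACDS  cdsuE",
  "--------------------------------------------------------------------------------",
  "all.q                             0.00      0      0      0      0      0      0",
  "long                              0.45     12      0     20     32      0      0",
  "short                             0.30      8      0     24     32      0      0"
)

## Hypothetical helper: turn the text lines into a numeric matrix, one row per Q.
parse_qstat_gc <- function(lines) {
  ## split on whitespace, then drop the dashed rule (line 2) and all.q (line 3)
  fields <- strsplit(lines, split = "\\s+", perl = TRUE)[c(-2, -3)]
  ## "CLUSTER QUEUE" splits into two tokens; drop both to keep the numeric column names
  header <- fields[[1]][-(1:2)]
  rows <- fields[-1]
  ## the first token of each row is the Q name, the rest are numbers
  out <- t(sapply(rows, function(x) as.numeric(x[-1])))
  dimnames(out) <- list(sapply(rows, `[`, 1), header)
  out
}

queues <- parse_qstat_gc(sample_out)
queues[, c("USED", "AVAIL", "TOTAL")]
```

On a live system, passing system("qstat -g c", intern=TRUE) instead of sample_out should give the same kind of matrix, provided the output has the layout assumed above.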

Note that this script is specific to the cluster I use and should be modified for other clusters; it does not work as-is on another cluster I have access to.
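
One way to make the host-expansion step easier to adapt is sketched below. It is only an illustration under assumptions, not a tested replacement: expand_hosts, resolve_group, and domain_suffix are hypothetical names, and the sketch assumes that host groups start with "@" while any other hostlist entry is a plain host name. The lookup is passed in as a function so the logic can be checked with fake data; on a cluster, resolve_group could wrap the qconf -shgrp call used in the script.

```r
## Hypothetical helper: expand hostlist entries into unique short host names.
##   entries       - tokens from a Q's hostlist line
##   resolve_group - function returning the member hosts of one host group
##   domain_suffix - suffix to strip from host names ("" leaves names unchanged)
expand_hosts <- function(entries, resolve_group, domain_suffix = "") {
  is_group <- startsWith(entries, "@")
  ## plain host names are kept as they are; groups are expanded through the callback
  hosts <- c(entries[!is_group],
             unlist(lapply(entries[is_group], resolve_group)))
  hosts <- hosts[hosts != ""]
  if (nzchar(domain_suffix)) {
    ## strip the suffix only where it really ends the name
    has_suffix <- endsWith(hosts, domain_suffix)
    hosts[has_suffix] <- substr(hosts[has_suffix], 1,
                                nchar(hosts[has_suffix]) - nchar(domain_suffix))
  }
  unique(hosts)
}

## Fake host groups for illustration; n2 is cross-listed in both groups.
fake_groups <- list("@lowmem" = c("n1.bduc", "n2.bduc"),
                    "@himem"  = c("n2.bduc", "n3.bduc"))
resolver <- function(g) fake_groups[[g]]

hosts <- expand_hosts(c("@lowmem", "@himem", "n9.bduc"), resolver,
                      domain_suffix = ".bduc")
hosts
```

Because n2 appears in both fake groups, it is counted once in the result, which is the same de-duplication the script relies on when summing NCPU and LOAD for a Q.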