- Notes (Informational and Non-Normative)
- Interrupt priorities
- Prioritized wait queues
The goal of the real-time working group is to improve the real-time capabilities of the Linux operating system to address the range of real-time requirements for consumer electronics devices.
Consumer electronics devices need to perform many functions interactively and smoothly to meet user-expectations. For example, streaming data needs to be presented to the user without interruptions that would degrade the user's experience. Likewise, a recording device needs to record streaming audio/video data without loss of data. Responses to interactive user commands need to be completed quickly enough that the interface seems responsive to the user.
As an illustration, consider a set-top box playing a movie. The flow of audio and video data through the system has real-time requirements. As each frame of data is accessed from a hard drive it is delivered to subsequent systems (such as decoders, decompressors, decryptors, etc.) which manipulate the data stream and prepare the data for output. The hardware controlling the television signal must receive each frame of data at a fixed frequency in order to present each image in sequence. If a process in the chain of handlers cannot run due to overload of the system or scheduling problems, a frame might be delayed and lost, or the delay might result in the appearance of “glitches” or jerkiness in the images presented to the user.
This specification is primarily concerned with the Linux kernel, but Linux systems often provide support for functional API's using a mixture of C-library and the Linux kernel system calls. This is especially true for many of POSIX API's, including those that are part of this specification. The appropriate support for the various POSIX functionality elements in this specification may or may not exist in the various C-libraries available for Linux.
Broadly speaking, this specification focuses on the following improvements to the Linux kernel; all designed to make Linux a better real-time operating system.
- Enhance support for fixed priority preemptive scheduling.
- From an operating system perspective, this implies that the operating system should allow the system designer/administrator to specify the priorities of different activities in the system. Additionally, the operating system should minimize priority inversion in the system.
- Enhance coverage of POSIX real-time API's supported by Linux
- POSIX API's, including various real-time API's have become de-facto standard in the operating system communities and are supported by many general purpose as well as specialized real-time operating systems. Good support for POSIX real-time API's provides a convenient, well-understood API for middleware and application developers.
The Real-Time Working Group recognizes that there is a strong community of developers who believe that the "correct way" to do real-time in Linux is to do it with a "real-time kernel" such as RTAI or RTLinux such that any "real-time task" is effectively prioritized above any Linux activity. This version of the specification has not focused on these "hybrid" techniques, but it does not prevent the use of these hybrid techniques either.
- Context switch:
the act of replacing the running task A by another task B. The state of task A is saved, and A is placed in the ready queue; the state of task B is restored, and B becomes a running task.
- Critical section:
- a brief interval during which a task accesses shared data. During this interval, no other tasks are allowed to access the same data. One way of ensuring this is to prevent preemption during the critical section. If the critical section defers preemption for a bounded interval, the resulting priority inversion is bounded.
- Deadline-monotonic priority setting:
- the task with the shortest relative deadline is assigned the highest priority.
- Deadline requirement:
- a real-time requirement that requires the response to be completed within a fixed time interval from the triggering event. A
relative deadline is the duration of the above mentioned interval; an absolute deadline is the moment in time at which the response must be completed.
- Fixed priority preemptive scheduling:
the task with the highest priority is always running. If a task with a higher priority becomes runnable, the running task will be preempted immediately. The priority of a task, process or Interrupt Service Routine (ISR) is explicitly determined at creation, or by an explicit set-priority command. No implicit priority changes by the scheduler are assumed. For an exception to this rule see Priority inheritance. Note: Fixed priority scheduling is typically designed to be used for a single coherent application.
- the time range of a certain interval. We can talk about the granularity of a timing requirement, of a non-interruptible code segment, etc.
- Hard deadline requirement:
- missing the deadline is considered an error.
- Hard real-time system:
- system with hard real-time requirements.
- Hybrid real-time system:
- system with both hard and soft real-time requirements. CE systems are expected to be of this kind. The challenge in designing a hybrid real-time system is to get good soft real-time performance while meeting the hard real-time requirements. This is typically achieved by using a technique called reservation.
- Interrupt latency:
- time passed between interrupt occurrence and activation of interrupt handler.
- Interrupt masking:
- Making certain interrupts invisible to the software.
- Interrupt response time (worst-case):
- (worst-case) time passed between interrupt occurrence and either completion of interrupt service routine (ISR) or wake up of dependent task.
- Jitter – absolute:
- deviation of the occurrence of an event (e.g. completion of frame) from expected occurrence.
- Jitter – relative:
- deviation of the interval between two successive occurrences of an event (e.g. completion of frame) from expected interval.
- Mutual exclusion:
- prevent multiple tasks or ISRs from accessing the same data concurrently. Mutual exclusion is used to protect the integrity of the data.
- a running thread or process can be temporarily suspended. The state of the thread or process (including e.g., program counter, and register values) is saved. Until the thread is resumed, it remains runnable (active, ready). When the process or thread is later resumed, the saved state is restored.
- Priority inheritance:
- if a high priority-task blocks for a critical section, a low-priority task that has holds the lock for the section gets a priority boost. It inherits the priority of the blocked task. This prevents the unbounded priority inversion that could occur if medium priority tasks would preempt the lower-priority task that holds the lock. Note that, with priority inheritance, the medium-priority tasks suffer priority inversion as well. But, and this is important in real-time systems, the priority inversion for all tasks is bounded (can be determined without knowing the exact run-time schedule)
- Priority inversion:
- the highest priority task is not running. There can be several reasons for priority inversion. One of them is the absence of full preemption. Priority inversion is one of the main reasons for deadlines being missed.
- Real-time requirement:
- a requirement on the completion time of a response, generally measured relative to the event that triggered the response.
- Real-time system:
- system with one or more real-time requirements.
- Response time (worst-case):
- (worst-case) time passed between event occurrence and completion of the response to that event. The event may be an interrupt. The response typically involves an interrupt handler and one or more synchronized tasks.
- Runnable task:
- (also ready/active task) a task that can run from a logical perspective, but is prevented from running physically.
- a synchronization primitive often used to achieve mutual exclusion.
- Soft deadline:
- missing deadlines is sometimes acceptable. Compared to hard deadlines, where there is no reason to consider the value of a late result, the value of a late result for a soft deadline is of interest. The value of the result may, for instance, decrease linearly after the deadline.
- Soft real-time requirement:
- soft deadline, or average-case response time requirement. Note that hard and soft real-time requirements are orthogonal to the temporal granularity that is required. Meeting a soft requirement in the microsecond domain may be more difficult than meeting a hard requirement in the milliseconds domain.
- Soft real-time system:
- system with soft real-time requirements
Even though Linux allows preemptive priority scheduling of processes, the Linux kernel itself is non-preemptible through version 2.4.x. A preemptible kernel allows a (low-priority) process to be preempted even while it is executing in the kernel.
Fixed priority preemptive scheduling for real-time tasks is considered a basic characteristic of real-time operating systems. This scheduling behavior is also specified in the POSIX real-time scheduling specifications. Linux has supported POSIX real-time scheduling classes (SCHED_FIFO and SCHED_RR) for a long time, but real-time tasks in Linux can still experience significant amounts of priority inversion. One major cause of this priority inversion has been that while Linux supports fixed priority preemptive scheduling, the Linux kernel itself has been non-preemptible up until version 2.4.x. Making the Linux kernel preemptible is a significant step towards enhancing the real-time capabilities of the Linux kernel.
During normal operation of the system the kernel shall be preemptible except under the circumstances indicated below.
- Normal operation of the system shall include the execution of the system after the boot process is complete and before shutdown is initiated.
- Preemption may be disabled when the kernel is running inside a critical section.
- Preemption may be disabled when the kernel is running in a non process context.
- Preemption may be disabled when the kernel is being interactively debugged.
Notes (Informational and Non-Normative)
While the specification takes steps towards improving the real-time characteristics, the user still needs to be aware of the potential for significant priority inversion due to the possibilities of preemption being disabled in the kernel under various circumstances. There are various techniques to further improve the kernel preemptibility.
One such technique is to divide (large) critical sections in the kernel into multiple (shorter) critical sections. Various kernel patches that do this are often termed as "low-latency" or "lock-breaking" patches. Several such patches have been available for 2.4 Linux kernel. Many of those patches were adapted and integrated into 2.6 Linux kernel.
Another technique is to enable preemption within some critical sections by protecting the critical section using a semaphore or mutex (ideally with priority inheritance support).
Support for preemptible Linux kernel is integrated into the CELF kernel.
Patches for 2.4 Linux kernels are generally also available from http://kpreempt.sourceforge.net
Support for Linux kernel preemption is also integrated in 2.6 Linux kernels.
The 2.4 Linux scheduler maintains the active queue as a list of processes sorted by priority. Inserting a process in this list is an O(n) operation, where n is the number of active (runnable) tasks. Since insertion in the active queue is a component of preemption time, this makes preemption time dependent on the number of active processes in the system. Schedulers with performance that is independent of the number of active processes have been available for years, and the standard Linux 2.6 scheduler meets that requirement.
Preemption time is an important factor in real-time performance, and directly effects real-time performance metrics such as interrupt response time. When the scheduler can take time that is proportional to the number of tasks in the system, then the interrupt response time is adversely effected when there are a large number of tasks in the system. While many CE devices may not benefit from a constant-time scheduler (given that they often have a limited number of tasks), it is expected that higher end CE-devices may very well have a significant number of tasks, where using a constant-time scheduler would be a necessary requirement to achieve good real-time behavior.
The Linux kernel source SHALL include a scheduler whose performance is independent of the number of tasks (i.e., O(1)) in the system.
The reference O(1) scheduling mechanism is O(1) in the number of active tasks. It is not independent of the number of priorities. It is also worth noting that a kernel that uses a "constant-time" scheduler may still have context switching performance that depends on the number of tasks. This specification only requires that the algorithmic performance of the operations on the active queue (such as wakeup, schedule, etc.) does not depend on the number of tasks in the system. Other factors such as maintenance of the TLB, maintenance of the cache, demand paging working sets, and OS hints may cause large dependencies on the number or complexity of active processes. These issues are of great interest for real-time systems with small scheduling granularity, but are outside the scope of this specification.
It is also important to note that a constant-time scheduler offers no benefits when the number of active tasks in the system is known to be small. Furthermore, it is likely that a constant-time scheduler has higher overheads.
An O(1) scheduler is integrated into the CELF kernel.
The scheduler in 2.6 Linux kernels is also O(1). It maintains an array of double-ended linked lists, with each array entry corresponding to a priority. It can, thus, enqueue and dequeue tasks without a search that depends on the number of tasks. The O(1) scheduler in the CELF kernel is a backport of the 2.6 Linux kernel's O(1) scheduler.
Interrupt service routines executing at hardware interrupt level (or hard-irq context, as it is called in Linux) are effectively prioritized above tasks in the system. Interrupt Threads allows an ISR to be executed within a kernel thread context.
Even with a preemptible kernel, preemption is disabled when the kernel is executing in interrupt (hard-irq) context. When the interrupt load is small, this does not have a significant effect on the latencies for real-time tasks.
However, for various reasons, it may not always be possible to keep the interrupt load small. While system implementors are encouraged to keep interrupt service routines brief, this is not always possible. For example, sometimes a device does not support DMA properly, and a driver may have to move a significant amount of data in hard interrupt context. Or, a device that is generating very frequent requests may cause numerous brief interrupt service routines to sum to an extended amount of processing in the hard interrupt context. Whatever the reason, heavy use of interrupt context can adversely affect latencies for real-time tasks.
Especially in a system like Linux, where "off the shelf" device drivers are a major attraction, it is important to have a mechanism that makes unmodified interrupt service routines execute in a thread context. With this capability, a system designer can prioritize the execution of interrupt service routines lower than critical real-time tasks.
- The kernel SHALL provide a configurable option to execute the interrupt service routine in a kernel thread. With this option enabled, a separate kernel thread SHALL be used to service each interrupt line (IRQ).
- The kernel SHALL provide a "flag" that can be used by a device-driver to specify that a particular interrupt service routine should not be executed in a thread context.
- The kernel MAY provide configuration time specification of thread priorities for threads supporting the IRQ-thread option for each IRQ. When no such specification is provided, the thread priorities will default to SCHED_FIFO scheduling class with priority set to one less than the maximum real-time scheduling priority.
- No modifications SHOULD be required to device drivers to execute their interrupt service routine in thread context.
Device drivers may implement parts of their interrupt service routines in kernel threads, either by using the workqueue abstraction in 2.6 Linux (and backported to CELF 2.4 kernel), or by directly using kernel threads. This facility adds the flexibility of executing ISRs of unmodified device drivers in kernel threads. Additionally, it allows the system designer (except when expressly disallowed by the device driver) to determine whether the interrupt service routine should execute in hard-irq context or in thread context.
It is worth noting that executing interrupt service threads in kernel threads adds little or no value unless the Linux kernel is preemptible. Therefore, it is recommended that this facility be used in conjunction with preemptible kernel. Also, in most cases, it would be desirable to also use the SoftIRQ Threads facility as well.
Kernel patches that implement Interrupt Threads on 2.6 Linux kernel have been posted to the CELF-RTWG mailing list. These patches are also available through the CELF-RTWG public wiki pages.
Linux uses a softirq mechanism to execute code in system context with interrupts enabled. This is often used by the Linux kernel as a way to execute functions that are not directly related to user tasks.
The soft IRQ facility runs its load in interrupt state until the load becomes too great, then it schedules it in thread context at a low priority. Neither heavy use of interrupt mode nor hard-to-predict demotion of interrupt-priority work to a low priority is good practice for a real-time OS.
- The kernel SHALL provide a configurable option to run the soft IRQ actions in one or more kernel threads.
The CELF kernel includes a mechanism to emulate Soft IRQ actions using a workqueue, which essentially executes the actions using a kernel thread.
Patches to execute Soft IRQ actions using kernel threads for 2.6 Linux kernel have been posted on the CELF-RTWG mailing list, and are also available on the CELF-RTWG public wiki site.
The passage of time is, almost by definition, of interest to real-time software. Specific examples include:
- Periodic threads with or without deadlines
- Watchdog timers
- Sporadic servers
- Other CPU-consumption budgeting.
- Deadlines for aperiodic processing
The POSIX specification is mature and generally accepted. It includes a set of time-related APIs that provide a strong basis to build software that needs these services. The POSIX timers functions are adopted for this specification without modification.
- The kernel MUST conform to the POSIX specification IEEE Std 1003.1-2001 for timers. This includes functions marked with the following margin codes:
- The TMR margin code for POSIX timers
- The CS margin code for clock selection
- The MON margin code for a monotonic clock
- The CPT margin code for process CPU time clocks is not required.
- The kernel timer support MAY include support for one or more clocks with high-resolution utilizing hardware counters or timers to achieve timer resolution with 100us or lower.
The list of required functions includes:
- clock_gettime, clock_settime, and clock_getres
- timer_create, timer_delete, timer_settime, timer_gettime, timer_getoverrun
- nanosleep, clock_nanosleep
Support for POSIX timers is included in the CELF kernel. This also includes support for high-resolution timers on selected architectures.
Support for POSIX timers is also included in 2.6 Linux kernels. A patch for high-resolution timers is available at http://high-res-timers.sourceforge.net.
Message passing is a convenient and powerful communication mechanism that is widely used by real-time applications.
Linux has supported the System V Message Passing API for a long time, but the System V Message Passing facility does not provide for prioritized delivery of messages. Other Linux IPC mechanisms, often inherited from Unix, such as pipes, named pipes, Unix domain sockets, etc. have the same problem.
The message passing API defined in the POSIX specification is simple and widely accepted and supports prioritized delivery of messages.
The kernel MUST conform to the POSIX specification IEEE Std 1003.1-2001 for Message Queues. This includes functions marked with the following margin codes:
- The MSG margin code for POSIX message queues functionality
- The MSG_TMO margin code for POSIX message queues functionality with timeout capability
The list of required functions includes:
- mq_open, mq_close, mq_unlink
- mq_send, mq_timedsend
- mq_receive, mq_timedreceive
- mq_setattr, mq_getattr
There is some code for POSIX message passing in the CELF kernel, but it is clearly not functional. It appears to be a stale snapshot of the POSIX message passing patches available from http://www-users.mat.uni.torun.pl/~wrona/posix_ipc. This project has since made significant enhancements to the code, and support is available in the form of patches for both 2.4 and 2.6 Linux kernels. As of 4/30/2004, the message queue support has been merged into the 2.6.6 kernel tree, and necessary library support has been merged into glibc 2.3.4 tree.
Priority inheritance protocol is the easiest-to-use standard priority inversion avoidance algorithm. It is widely supported by standard RTOSs, and priority inheritance is specified in POSIX.
Mutex locks in user state SHALL support priority inheritance protocol as specified in POSIX specification IEEE Std. 1003.1-2001 under the PRIO_INHERIT symbolic constant.
Priority inheritance protocol is supported by the Robust Fast Mutex project: http://developer.osdl.org/dev/robustmutexes/
This section represents some of the work that was actively discussed in the real-time working group, but is not being formally specified at this time. Nonetheless, these techniques are considered as useful extensions to the specifications and fall within the scope of this specification.
This section represents interrupts that are prioritized, not the schedulable interrupt services routines addressed in the IRQ threads and Soft IRQ threads sections. In its simplest form prioritized interrupts are an issue only for the kernel’s interrupt dispatching code, more specifically, on the code that operates the PIC (Programmable Interrupt Controller). It lets the system be configured with a priority order on interrupts. While an interrupt is being serviced, that interrupt and all lower-priority interrupts are masked, but higher priority interrupts may preempt the ISR. This gives more control to a real-time application than the alternatives: masking all interrupts while ISRs execute, or masking only the interrupt that is being serviced. Patches have been submitted to LKML and CELF-RTWG mailing list that implement Interrupt priorities on 2.6.
An alternate approach to dealing with interrupt priorities is provided by the ADEOS project (http://www.opersys.com/adeos). ADEOS supports the creation of an interrupt pipeline through the use of ADEOS domains. Interrupts for a lower-priority domain are not serviced unless all higher priority domains relinquish control of the CPU. ADEOS can be used to implement interrupt handlers that reside in a domain that is higher priority than Linux domain. This can be used, for e.g., to achieve very low interrupt latencies for specific devices. Also, ADEOS can be used to implement higher level facilities such as an RTAI kernel that can be used to provide "hybrid" real-time solutions.
Prioritized wait queues
A real-time system with FIFO wait queues cannot be protected from priority inversion, but it may not make sense to simply require that all wait queues become priority queues. Priority queues are expensive. The implementation choices are O(n) with simple implementation, O(log n) for a balanced tree or heap, or O(1) time for an implementation like the standard O(1) scheduler.
One possibility is to require that all queues in the kernel (and possibly in kernel mode), be ordered on priority of the queue entries (by default, on the priority of the task that enqueued the entry). This is aggressive, but it is the correct approach from a strict real-time viewpoint. Requiring priority queues only for blocking POSIX functions would probably require fewer queues to become priority queues, and it might be a simple way to specify "the important queues."