The kernel is the "core" of any computer system: it is the "software" which allows users to share computer resources.
The kernel can be thought of as the main software of the OS (Operating System), which may also include graphics management.
For example, under Linux (as in other Unix-like OSs), the X Window environment doesn't belong to the Linux Kernel, because it manages only graphical operations (it uses user mode I/O to access video card devices).
By contrast, Windows environments (Win9x, WinME, WinNT, Win2K, WinXP, and so on) are a mix between a graphical environment and kernel.
Many years ago, when computers were as big as a room, users ran their applications with much difficulty and, sometimes, their applications crashed the computer.
To avoid having applications that constantly crashed, newer OSs were designed with 2 different operative modes, Kernel Mode and User Mode:
               |          Applications           /|\
               |         ______________           |
               |         | User Mode  |           |
               |         ______________           |
               |               |                  |
Implementation |        _______ _______           |   Abstraction
    Detail     |        | Kernel Mode |           |
               |        _______________           |
               |               |                  |
               |               |                  |
               |               |                  |
              \|/          Hardware               |
Kernel Mode "prevents" User Mode applications from damaging the system or its features.
Modern microprocessors implement at least 2 different states in hardware. For example, on Intel processors 4 states determine the PL (Privilege Level): states 0, 1, 2 and 3 are available, with 0 used for Kernel Mode and, under Linux, 3 used for User Mode.
Unix-like OSs require only 2 privilege levels, and we will use that paradigm as our point of reference.
Once we understand that there are 2 different modes, we have to know when we switch from one to the other.
Typically, there are 2 points of switching: when a task calls a System Call, and when an IRQ arrives.
System calls are like special functions that manage OS routines which live in Kernel Mode.
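A system call is made when we need, for example, to access a device (as read() does) or to access kernel data structures (as getpid() does):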
                                 |                |
                         ------->| System Call i  | (Accessing Devices)
|                |       |       |  [sys_read()]  |
| ...            |       |       |                |
| system_call(i) |--------       |                |
|   [read()]     |               |                |
| ...            |               |                |
| system_call(j) |--------       |                |
|   [getpid()]   |       |       |                |
| ...            |       ------->| System Call j  | (Accessing kernel
|                |               | [sys_getpid()] |  data structures)
                                 |                |

    USER MODE                        KERNEL MODE

                  Unix System Calls Working
System calls are almost the only interface used by User Mode to talk to low level resources (hardware). The only exception is when a process uses the ''ioperm'' system call: in that case a device can be accessed directly by the User Mode process (IRQs cannot be used).
NOTE: Not every ''C'' function is a system call, only some of them.
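To make the note above concrete, here is a minimal sketch in Python (chosen only for brevity; the text itself has ''C'' in mind). It assumes a Linux system, where `os.pipe()`, `os.write()`, `os.read()`, `os.close()` and `os.getpid()` are thin wrappers that end up in the system calls of the same name, while a function such as `bytes.upper()` does all its work in User Mode. The helper names `roundtrip_through_kernel` and `pure_user_mode` are made up for this example.

```python
import os


def roundtrip_through_kernel(message: bytes) -> bytes:
    """Send bytes through a kernel pipe and read them back."""
    read_fd, write_fd = os.pipe()                  # pipe() system call
    try:
        os.write(write_fd, message)                # write() system call
        return os.read(read_fd, len(message))      # read() system call
    finally:
        os.close(read_fd)                          # close() system call
        os.close(write_fd)


def pure_user_mode(message: bytes) -> bytes:
    """No system call is needed: the work is done entirely in User Mode."""
    return message.upper()


if __name__ == "__main__":
    print("pid obtained from the kernel:", os.getpid())   # getpid() system call
    print(roundtrip_through_kernel(b"hello"))
    print(pure_user_mode(b"hello"))
```

Running the script under a tracer such as strace should show the pipe, write, read and close calls, but nothing for the `upper()` call.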
Below is a list of System Calls under Linux Kernel 2.4.17, from [ arch/i386/kernel/entry.S ]
        .long SYMBOL_NAME(sys_ni_syscall)       /* 0  -  old "setup()" system call*/
        .long SYMBOL_NAME(sys_exit)
        .long SYMBOL_NAME(sys_fork)
        .long SYMBOL_NAME(sys_read)
        .long SYMBOL_NAME(sys_write)
        .long SYMBOL_NAME(sys_open)             /* 5 */
        .long SYMBOL_NAME(sys_close)
        .long SYMBOL_NAME(sys_waitpid)
        .long SYMBOL_NAME(sys_creat)
        .long SYMBOL_NAME(sys_link)
        .long SYMBOL_NAME(sys_unlink)           /* 10 */
        .long SYMBOL_NAME(sys_execve)
        .long SYMBOL_NAME(sys_chdir)
        .long SYMBOL_NAME(sys_time)
        .long SYMBOL_NAME(sys_mknod)
        .long SYMBOL_NAME(sys_chmod)            /* 15 */
        .long SYMBOL_NAME(sys_lchown16)
        .long SYMBOL_NAME(sys_ni_syscall)       /* old break syscall holder */
        .long SYMBOL_NAME(sys_stat)
        .long SYMBOL_NAME(sys_lseek)
        .long SYMBOL_NAME(sys_getpid)           /* 20 */
        .long SYMBOL_NAME(sys_mount)
        .long SYMBOL_NAME(sys_oldumount)
        .long SYMBOL_NAME(sys_setuid16)
        .long SYMBOL_NAME(sys_getuid16)
        .long SYMBOL_NAME(sys_stime)            /* 25 */
        .long SYMBOL_NAME(sys_ptrace)
        .long SYMBOL_NAME(sys_alarm)
        .long SYMBOL_NAME(sys_fstat)
        .long SYMBOL_NAME(sys_pause)
        .long SYMBOL_NAME(sys_utime)            /* 30 */
        .long SYMBOL_NAME(sys_ni_syscall)       /* old stty syscall holder */
        .long SYMBOL_NAME(sys_ni_syscall)       /* old gtty syscall holder */
        .long SYMBOL_NAME(sys_access)
        .long SYMBOL_NAME(sys_nice)
        .long SYMBOL_NAME(sys_ni_syscall)       /* 35 */ /* old ftime syscall holder */
        .long SYMBOL_NAME(sys_sync)
        .long SYMBOL_NAME(sys_kill)
        .long SYMBOL_NAME(sys_rename)
        .long SYMBOL_NAME(sys_mkdir)
        .long SYMBOL_NAME(sys_rmdir)            /* 40 */
        .long SYMBOL_NAME(sys_dup)
        .long SYMBOL_NAME(sys_pipe)
        .long SYMBOL_NAME(sys_times)
        .long SYMBOL_NAME(sys_ni_syscall)       /* old prof syscall holder */
        .long SYMBOL_NAME(sys_brk)              /* 45 */
        .long SYMBOL_NAME(sys_setgid16)
        .long SYMBOL_NAME(sys_getgid16)
        .long SYMBOL_NAME(sys_signal)
        .long SYMBOL_NAME(sys_geteuid16)
        .long SYMBOL_NAME(sys_getegid16)        /* 50 */
        .long SYMBOL_NAME(sys_acct)
        .long SYMBOL_NAME(sys_umount)           /* recycled never used phys() */
        .long SYMBOL_NAME(sys_ni_syscall)       /* old lock syscall holder */
        .long SYMBOL_NAME(sys_ioctl)
        .long SYMBOL_NAME(sys_fcntl)            /* 55 */
        .long SYMBOL_NAME(sys_ni_syscall)       /* old mpx syscall holder */
        .long SYMBOL_NAME(sys_setpgid)
        .long SYMBOL_NAME(sys_ni_syscall)       /* old ulimit syscall holder */
        .long SYMBOL_NAME(sys_olduname)
        .long SYMBOL_NAME(sys_umask)            /* 60 */
        .long SYMBOL_NAME(sys_chroot)
        .long SYMBOL_NAME(sys_ustat)
        .long SYMBOL_NAME(sys_dup2)
        .long SYMBOL_NAME(sys_getppid)
        .long SYMBOL_NAME(sys_getpgrp)          /* 65 */
        .long SYMBOL_NAME(sys_setsid)
        .long SYMBOL_NAME(sys_sigaction)
        .long SYMBOL_NAME(sys_sgetmask)
        .long SYMBOL_NAME(sys_ssetmask)
        .long SYMBOL_NAME(sys_setreuid16)       /* 70 */
        .long SYMBOL_NAME(sys_setregid16)
        .long SYMBOL_NAME(sys_sigsuspend)
        .long SYMBOL_NAME(sys_sigpending)
        .long SYMBOL_NAME(sys_sethostname)
        .long SYMBOL_NAME(sys_setrlimit)        /* 75 */
        .long SYMBOL_NAME(sys_old_getrlimit)
        .long SYMBOL_NAME(sys_getrusage)
        .long SYMBOL_NAME(sys_gettimeofday)
        .long SYMBOL_NAME(sys_settimeofday)
        .long SYMBOL_NAME(sys_getgroups16)      /* 80 */
        .long SYMBOL_NAME(sys_setgroups16)
        .long SYMBOL_NAME(old_select)
        .long SYMBOL_NAME(sys_symlink)
        .long SYMBOL_NAME(sys_lstat)
        .long SYMBOL_NAME(sys_readlink)         /* 85 */
        .long SYMBOL_NAME(sys_uselib)
        .long SYMBOL_NAME(sys_swapon)
        .long SYMBOL_NAME(sys_reboot)
        .long SYMBOL_NAME(old_readdir)
        .long SYMBOL_NAME(old_mmap)             /* 90 */
        .long SYMBOL_NAME(sys_munmap)
        .long SYMBOL_NAME(sys_truncate)
        .long SYMBOL_NAME(sys_ftruncate)
        .long SYMBOL_NAME(sys_fchmod)
        .long SYMBOL_NAME(sys_fchown16)         /* 95 */
        .long SYMBOL_NAME(sys_getpriority)
        .long SYMBOL_NAME(sys_setpriority)
        .long SYMBOL_NAME(sys_ni_syscall)       /* old profil syscall holder */
        .long SYMBOL_NAME(sys_statfs)
        .long SYMBOL_NAME(sys_fstatfs)          /* 100 */
        .long SYMBOL_NAME(sys_ioperm)
        .long SYMBOL_NAME(sys_socketcall)
        .long SYMBOL_NAME(sys_syslog)
        .long SYMBOL_NAME(sys_setitimer)
        .long SYMBOL_NAME(sys_getitimer)        /* 105 */
        .long SYMBOL_NAME(sys_newstat)
        .long SYMBOL_NAME(sys_newlstat)
        .long SYMBOL_NAME(sys_newfstat)
        .long SYMBOL_NAME(sys_uname)
        .long SYMBOL_NAME(sys_iopl)             /* 110 */
        .long SYMBOL_NAME(sys_vhangup)
        .long SYMBOL_NAME(sys_ni_syscall)       /* old "idle" system call */
        .long SYMBOL_NAME(sys_vm86old)
        .long SYMBOL_NAME(sys_wait4)
        .long SYMBOL_NAME(sys_swapoff)          /* 115 */
        .long SYMBOL_NAME(sys_sysinfo)
        .long SYMBOL_NAME(sys_ipc)
        .long SYMBOL_NAME(sys_fsync)
        .long SYMBOL_NAME(sys_sigreturn)
        .long SYMBOL_NAME(sys_clone)            /* 120 */
        .long SYMBOL_NAME(sys_setdomainname)
        .long SYMBOL_NAME(sys_newuname)
        .long SYMBOL_NAME(sys_modify_ldt)
        .long SYMBOL_NAME(sys_adjtimex)
        .long SYMBOL_NAME(sys_mprotect)         /* 125 */
        .long SYMBOL_NAME(sys_sigprocmask)
        .long SYMBOL_NAME(sys_create_module)
        .long SYMBOL_NAME(sys_init_module)
        .long SYMBOL_NAME(sys_delete_module)
        .long SYMBOL_NAME(sys_get_kernel_syms)  /* 130 */
        .long SYMBOL_NAME(sys_quotactl)
        .long SYMBOL_NAME(sys_getpgid)
        .long SYMBOL_NAME(sys_fchdir)
        .long SYMBOL_NAME(sys_bdflush)
        .long SYMBOL_NAME(sys_sysfs)            /* 135 */
        .long SYMBOL_NAME(sys_personality)
        .long SYMBOL_NAME(sys_ni_syscall)       /* for afs_syscall */
        .long SYMBOL_NAME(sys_setfsuid16)
        .long SYMBOL_NAME(sys_setfsgid16)
        .long SYMBOL_NAME(sys_llseek)           /* 140 */
        .long SYMBOL_NAME(sys_getdents)
        .long SYMBOL_NAME(sys_select)
        .long SYMBOL_NAME(sys_flock)
        .long SYMBOL_NAME(sys_msync)
        .long SYMBOL_NAME(sys_readv)            /* 145 */
        .long SYMBOL_NAME(sys_writev)
        .long SYMBOL_NAME(sys_getsid)
        .long SYMBOL_NAME(sys_fdatasync)
        .long SYMBOL_NAME(sys_sysctl)
        .long SYMBOL_NAME(sys_mlock)            /* 150 */
        .long SYMBOL_NAME(sys_munlock)
        .long SYMBOL_NAME(sys_mlockall)
        .long SYMBOL_NAME(sys_munlockall)
        .long SYMBOL_NAME(sys_sched_setparam)
        .long SYMBOL_NAME(sys_sched_getparam)   /* 155 */
        .long SYMBOL_NAME(sys_sched_setscheduler)
        .long SYMBOL_NAME(sys_sched_getscheduler)
        .long SYMBOL_NAME(sys_sched_yield)
        .long SYMBOL_NAME(sys_sched_get_priority_max)
        .long SYMBOL_NAME(sys_sched_get_priority_min)  /* 160 */
        .long SYMBOL_NAME(sys_sched_rr_get_interval)
        .long SYMBOL_NAME(sys_nanosleep)
        .long SYMBOL_NAME(sys_mremap)
        .long SYMBOL_NAME(sys_setresuid16)
        .long SYMBOL_NAME(sys_getresuid16)      /* 165 */
        .long SYMBOL_NAME(sys_vm86)
        .long SYMBOL_NAME(sys_query_module)
        .long SYMBOL_NAME(sys_poll)
        .long SYMBOL_NAME(sys_nfsservctl)
        .long SYMBOL_NAME(sys_setresgid16)      /* 170 */
        .long SYMBOL_NAME(sys_getresgid16)
        .long SYMBOL_NAME(sys_prctl)
        .long SYMBOL_NAME(sys_rt_sigreturn)
        .long SYMBOL_NAME(sys_rt_sigaction)
        .long SYMBOL_NAME(sys_rt_sigprocmask)   /* 175 */
        .long SYMBOL_NAME(sys_rt_sigpending)
        .long SYMBOL_NAME(sys_rt_sigtimedwait)
        .long SYMBOL_NAME(sys_rt_sigqueueinfo)
        .long SYMBOL_NAME(sys_rt_sigsuspend)
        .long SYMBOL_NAME(sys_pread)            /* 180 */
        .long SYMBOL_NAME(sys_pwrite)
        .long SYMBOL_NAME(sys_chown16)
        .long SYMBOL_NAME(sys_getcwd)
        .long SYMBOL_NAME(sys_capget)
        .long SYMBOL_NAME(sys_capset)           /* 185 */
        .long SYMBOL_NAME(sys_sigaltstack)
        .long SYMBOL_NAME(sys_sendfile)
        .long SYMBOL_NAME(sys_ni_syscall)       /* streams1 */
        .long SYMBOL_NAME(sys_ni_syscall)       /* streams2 */
        .long SYMBOL_NAME(sys_vfork)            /* 190 */
        .long SYMBOL_NAME(sys_getrlimit)
        .long SYMBOL_NAME(sys_mmap2)
        .long SYMBOL_NAME(sys_truncate64)
        .long SYMBOL_NAME(sys_ftruncate64)
        .long SYMBOL_NAME(sys_stat64)           /* 195 */
        .long SYMBOL_NAME(sys_lstat64)
        .long SYMBOL_NAME(sys_fstat64)
        .long SYMBOL_NAME(sys_lchown)
        .long SYMBOL_NAME(sys_getuid)
        .long SYMBOL_NAME(sys_getgid)           /* 200 */
        .long SYMBOL_NAME(sys_geteuid)
        .long SYMBOL_NAME(sys_getegid)
        .long SYMBOL_NAME(sys_setreuid)
        .long SYMBOL_NAME(sys_setregid)
        .long SYMBOL_NAME(sys_getgroups)        /* 205 */
        .long SYMBOL_NAME(sys_setgroups)
        .long SYMBOL_NAME(sys_fchown)
        .long SYMBOL_NAME(sys_setresuid)
        .long SYMBOL_NAME(sys_getresuid)
        .long SYMBOL_NAME(sys_setresgid)        /* 210 */
        .long SYMBOL_NAME(sys_getresgid)
        .long SYMBOL_NAME(sys_chown)
        .long SYMBOL_NAME(sys_setuid)
        .long SYMBOL_NAME(sys_setgid)
        .long SYMBOL_NAME(sys_setfsuid)         /* 215 */
        .long SYMBOL_NAME(sys_setfsgid)
        .long SYMBOL_NAME(sys_pivot_root)
        .long SYMBOL_NAME(sys_mincore)
        .long SYMBOL_NAME(sys_madvise)
        .long SYMBOL_NAME(sys_getdents64)       /* 220 */
        .long SYMBOL_NAME(sys_fcntl64)
        .long SYMBOL_NAME(sys_ni_syscall)       /* reserved for TUX */
        .long SYMBOL_NAME(sys_ni_syscall)       /* Reserved for Security */
        .long SYMBOL_NAME(sys_gettid)
        .long SYMBOL_NAME(sys_readahead)        /* 225 */
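The listing is essentially an array of function pointers indexed by system call number. The toy model below, in Python, sketches how such a table could be used for dispatch; it is not the kernel's code. The names `sys_call_table`, `system_call` and `TABLE_SIZE` are chosen for this example, only slots 20 and 64 (sys_getpid and sys_getppid, going by the comments in the listing) are filled in, and returning -ENOSYS for a bad number is an assumption modelled on the sys_ni_syscall placeholder.

```python
import errno
import os

TABLE_SIZE = 226  # the listing runs from entry 0 to entry 225


def sys_ni_syscall(*args):
    """Placeholder for a slot with no implementation."""
    return -errno.ENOSYS


def sys_getpid():
    return os.getpid()


def sys_getppid():
    return os.getppid()


# Every slot starts as "not implemented"; then a few real entries are filled in.
sys_call_table = [sys_ni_syscall] * TABLE_SIZE
sys_call_table[20] = sys_getpid    # entry 20 in the listing
sys_call_table[64] = sys_getppid   # entry 64 in the listing


def system_call(nr, *args):
    """Look up handler number `nr` in the table and call it."""
    if not 0 <= nr < len(sys_call_table):
        return -errno.ENOSYS       # number out of range: no such system call
    return sys_call_table[nr](*args)


if __name__ == "__main__":
    print("system_call(20) ->", system_call(20))
    print("system_call(0)  ->", system_call(0))
```

The point of the table is that User Mode only ever passes a number; which kernel routine runs is decided entirely on the Kernel Mode side.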
When an IRQ arrives, the task that is running is interrupted so that the IRQ Handler can run.
After the IRQ is handled, control returns exactly to the point of interruption, as if nothing had happened.
               Running Task
             |-----------|          (3)
NORMAL       |   |       | [break execution] IRQ Handler
EXECUTION (1)|   |       |     ------------->|---------|
             |  \|/      |     |             |  does   |
 IRQ (2)---->|    ..     |----->             |  some   |
             |   |       |<-----             |  work   |
BACK TO      |   |       |     |             | ..(4).  |
NORMAL    (6)|  \|/      |     <-------------|_________|
EXECUTION    |___________|  [return to code]
                                    (5)
               USER MODE                     KERNEL MODE

      User->Kernel Mode Transition caused by IRQ event
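The numbered steps below refer to the sequence of events in the diagram above:

1. The task is executing normally in User Mode.
2. An IRQ arrives while the task is running.
3. The task's execution is broken off and the IRQ Handler is called.
4. The IRQ Handler does its work in Kernel Mode.
5. Control returns to the interrupted code.
6. The task goes back to normal execution.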
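The sequence in the diagram can be mimicked with a small simulation. This is only a sketch in Python: a "task" is a list of step names, the "program counter" is a list index, and `run_with_irq` and its parameters are invented for the example. Real hardware saves far more context than a single index.

```python
def run_with_irq(program, irq_at, handler):
    """Run `program` (a list of step names), servicing one IRQ just before step `irq_at`."""
    log = []
    pc = 0                                      # "program counter" of the task
    while pc < len(program):
        if pc == irq_at:                        # (2) the IRQ arrives
            saved_pc = pc                       # (3) break execution, save the context
            log.append("KERNEL: " + handler())  # (4) the handler does some work
            pc = saved_pc                       # (5) return to the interrupted point
            irq_at = None                       # this IRQ has been serviced
        log.append("USER: " + program[pc])      # (1) and (6) normal execution
        pc += 1
    return log


if __name__ == "__main__":
    for line in run_with_irq(["step a", "step b", "step c"], 1, lambda: "timer handled"):
        print(line)
```

The task's own steps appear in the log in their original order; the handler's work is simply slotted in between them, which is what "like nothing happened" means from the task's point of view.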
Of special interest is the Timer IRQ, which arrives every TIMER ms and, among other things, drives the timeslice-based multitasking described below.
The key point of modern OSs is the "Task". A Task is an application running in memory that shares all resources (including CPU and Memory) with other Tasks.
This "resource sharing" is managed by the "Multitasking Mechanism". The Multitasking Mechanism switches from one task to another after a "timeslice" time. Users have the "illusion" that they own all resources. We can also imagine a single user scenario, where a user can have the "illusion" of running many tasks at the same time.
To implement this multitasking, each task has a "state" variable, which can be READY (ready for execution) or BLOCKED (waiting for a resource).
The task state is managed by the task's presence in the corresponding list: the READY list or the BLOCKED list.
The movement from one task to another is called ''Task Switching''. Many computers have a hardware instruction which automatically performs this operation. Task Switching occurs in the following cases:

1. After the Timeslice ends.
2. When a Task has to wait for a resource (such as a device). *

* We schedule another task to prevent "Busy Waiting", which occurs when we wait for a device instead of performing other work.
Task Switching is managed by the "Schedule" entity.
Timer    |           |
 IRQ     |           |                            Schedule
  |      |           |                     ________________________
  |----->|   Task 1  |<------------------>|(1)Chooses a Ready Task |
  |      |           |                    |(2)Task Switching       |
  |      |___________|                    |________________________|
  |      |           |                               /|\
  |      |           |                                |
  |      |           |                                |
  |----->|   Task 2  |<-------------------------------|
  |      |           |                                |
  |      |___________|                                |
  .      .     .     .                                .
  .      .     .     .                                .
  |      |           |                                |
  ------>|   Task N  |<--------------------------------
         |           |
         |___________|

            Task Switching based on TimeSlice
A typical Timeslice for Linux is about 10 ms.
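The timeslice mechanism can be sketched as a plain round-robin loop. This is a deliberately simplified model in Python and not the actual Linux scheduler, which also takes priorities into account; `round_robin`, `TIMESLICE_MS` and the CPU times in the example are assumptions made for illustration.

```python
from collections import deque

TIMESLICE_MS = 10  # the value quoted in the text


def round_robin(tasks, timeslice=TIMESLICE_MS):
    """tasks: dict of task name -> CPU time still needed (ms).

    Returns the order of execution as a list of (name, ms_run) pairs.
    """
    ready = deque(tasks.items())            # the READY list, in arrival order
    timeline = []
    while ready:
        name, remaining = ready.popleft()   # (1) choose a Ready task
        ran = min(timeslice, remaining)     # it runs until the timer IRQ or until done
        timeline.append((name, ran))        # (2) task switching happens here
        if remaining > ran:
            ready.append((name, remaining - ran))   # back to the end of the READY list
    return timeline


if __name__ == "__main__":
    for name, ms in round_robin({"Task 1": 25, "Task 2": 10, "Task 3": 15}):
        print(f"{name} runs for {ms} ms")
```

No task ever runs for more than one timeslice in a row while others are ready, which is what gives each user the "illusion" of owning the machine.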
|           |
|           |  Resource    _____________________________
|  Task 1   |----------->|(1) Enqueue Resource request |
|           |   Access   |(2) Mark Task as blocked     |
|           |            |(3) Choose a Ready Task      |
|___________|            |(4) Task Switching           |
                         |_____________________________|
                                       |
|           |                          |
|           |                          |
|  Task 2   |<-------------------------
|           |
|           |
|___________|

     Task Switching based on Waiting for a Resource
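The four steps in the box can be modelled with a READY list and a set of per-resource wait queues. The class below is a hypothetical sketch in Python (the names `Scheduler`, `wait_for` and `resource_ready` are invented here), and it assumes there is always at least one Ready task to switch to, which a real kernel guarantees with an idle task.

```python
from collections import deque


class Scheduler:
    def __init__(self, task_names):
        self.current = task_names[0]           # the task that owns the CPU
        self.ready = deque(task_names[1:])     # the READY list
        self.blocked = {}                      # resource name -> queue of BLOCKED tasks

    def wait_for(self, resource):
        """The current task needs a resource that is not available yet."""
        queue = self.blocked.setdefault(resource, deque())
        queue.append(self.current)             # (1) enqueue request, (2) mark as blocked
        self.current = self.ready.popleft()    # (3) choose a Ready task, (4) switch to it
        return self.current

    def resource_ready(self, resource):
        """The resource became available: move the first waiter back to READY."""
        task = self.blocked[resource].popleft()
        self.ready.append(task)
        return task


if __name__ == "__main__":
    sched = Scheduler(["Task 1", "Task 2"])
    print("now running:", sched.wait_for("disk"))
    print("woken up:", sched.resource_ready("disk"))
```

While Task 1 sits in the wait queue it consumes no CPU at all, which is exactly how busy waiting is avoided.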
Until now we have looked at the so-called Monolithic OS, but there is also another kind of OS: the ''Microkernel''.
A Microkernel OS uses Tasks not only for user mode processes, but also for the kernel's own managers, such as a Floppy-Task, HDD-Task, Net-Task and so on. Some examples are Amoeba and Mach.
PROS:
CONS:
My personal opinion is that Microkernels are a good didactic example (like Minix) but they are not ''optimal'', so not really suitable. Linux uses a few Tasks, called "Kernel Threads", to implement a little microkernel structure (like kswapd, which reclaims memory pages, writing them out to mass storage when needed). In this case there are no problems with performance because swapping is a very slow job.
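The ISO-OSI standard describes a network architecture with the following levels:

1. Physical level
2. Data-link level
3. Network level
4. Transport level
5. Session level
6. Presentation level
7. Application level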
The first 2 levels listed above are often implemented in hardware. The remaining levels are implemented in software (or in firmware for routers).
Many protocols are used by an OS: one of these is TCP/IP (the most important one, living on levels 3 and 4).
The kernel knows nothing about the first 2 levels of ISO-OSI apart from addresses.
In RX it:
       frames     packets     sockets
NIC ---------> Kernel ----------> Application
                  |
                  |   packets
                  --------------> Forward
                - RX -
In TX stage it:
             sockets    packets    frames
Application ---------> Kernel ----------> NIC
                         /|\
             packets      |
Forward -------------------
                - TX -
Segmentation is the first method to solve memory allocation problems: it allows you to compile source code without caring where the application will be placed in memory. As a matter of fact, this feature helps application developers to work independently of the OS and also of the hardware.
|       Stack        |
|         |          |
|        \|/         |
|        Free        |
|        /|\         |     Segment <---> Process
|         |          |
|        Heap        |
| Data uninitialized |
|  Data initialized  |
|        Code        |
|____________________|

       Segment
We can say that a segment is the logical entity of an application, or the image of the application in memory.
When programming, we don't care where our data is put in memory; we only care about the offset inside our segment (our application).
Traditionally, a Segment is assigned to each Process and vice versa. In Linux this is not true: Linux uses only 4 segments for both the Kernel and all Processes.
                                 ____________________
                          ----->|                    |----->
                          | IN  |     Segment A      | OUT
 ____________________     |     |____________________|
|                    |____|     |                    |
|     Segment B      |          |     Segment B      |
|                    |____      |                    |
|____________________|    |     |____________________|
                          |     |     Segment C      |
                          |     |____________________|
                          ----->|     Segment D      |----->
                            IN  |____________________| OUT

                     Segmentation problem
In the diagram above, we want processes A and D to exit and process B to enter. As we can see, there is enough total space for B, but we cannot split it into 2 pieces, so we CANNOT load it (out of memory).
The reason this problem occurs is that pure segments are contiguous areas (because they are logical areas) and cannot be split.
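A few lines of arithmetic show the problem. The sketch below (Python) uses a first-fit search over a list of free areas; the sizes are made up for illustration, and `first_fit` is a name chosen for this example rather than anything from the kernel.

```python
def first_fit(holes, size):
    """holes: list of (start, length) free areas.

    Return the start address of the first hole big enough for `size`,
    or None if no single contiguous hole can hold it.
    """
    for start, length in holes:
        if length >= size:
            return start
    return None


# Hypothetical sizes in KB: A and D have exited, leaving two separate holes.
HOLES = [(0, 30), (100, 30)]
SEGMENT_B = 50

if __name__ == "__main__":
    total_free = sum(length for _, length in HOLES)
    print("total free:", total_free, "KB, needed:", SEGMENT_B, "KB")
    print("contiguous placement:", first_fit(HOLES, SEGMENT_B))
```

Here 60 KB are free in total and only 50 KB are needed, yet no placement is found, because neither hole alone is large enough.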
 ____________________
|       Page 1       |
|____________________|
|       Page 2       |
|____________________|
|         ..         |     Segment <---> Process
|____________________|
|       Page n       |
|____________________|
|                    |
|____________________|
|                    |
|____________________|

       Segment
Pagination splits memory into "n" pieces, each one with a fixed length.
A process may be loaded in one or more Pages. When its memory is freed, all of its pages are freed (compare with the Segmentation Problem above).
Pagination is also used for another important purpose, "Swapping". If a page is not present in physical memory, accessing it generates an EXCEPTION, which makes the Kernel fetch the page from storage memory. This mechanism allows the OS to load more applications than physical memory alone would allow.
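The swapping mechanism can be illustrated with a toy model in which a Python exception plays the role of the hardware exception. Everything here is an assumption made for the sketch: the class and method names are invented, page contents are just strings, and the victim page is chosen in simple FIFO order, which is not how Linux picks pages.

```python
class PageNotPresent(Exception):
    """Stands in for the exception raised when a page is not in physical memory."""


class Memory:
    def __init__(self, frames):
        self.frames = frames     # how many pages fit in physical memory
        self.resident = {}       # page number -> contents, kept in load order
        self.storage = {}        # pages that have been swapped out
        self.faults = 0

    def _access(self, page):
        if page not in self.resident:
            raise PageNotPresent(page)
        return self.resident[page]

    def _swap_in(self, page):
        if len(self.resident) >= self.frames:
            victim = next(iter(self.resident))               # oldest loaded page (FIFO)
            self.storage[victim] = self.resident.pop(victim)  # write it out to storage
        self.resident[page] = self.storage.pop(page, f"page-{page}")

    def read(self, page):
        try:
            return self._access(page)
        except PageNotPresent:       # the "kernel" handles the exception...
            self.faults += 1
            self._swap_in(page)
            return self._access(page)  # ...and the access is retried


if __name__ == "__main__":
    mem = Memory(frames=2)
    for p in (0, 1, 2, 0):
        print(p, mem.read(p), "faults so far:", mem.faults)
```

With only 2 frames the program still works with 3 distinct pages; it simply pays for the extra pages with exceptions and traffic to storage.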
        ____________________
Page X |     Process Y      |
       |____________________|
       |                    |
       |       WASTE        |
       |       SPACE        |
       |____________________|

       Pagination Problem
In the diagram above, we can see what is wrong with the pagination policy: when Process Y is loaded into Page X, ALL the memory space of the Page is allocated, so the remaining space at the end of the Page is wasted.
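The amount of wasted space is easy to compute. The sketch below (Python) assumes a page size of 4096 bytes, a common value on i386 that is used here only as an example; the function names are made up.

```python
PAGE_SIZE = 4096  # bytes; assumed for the example


def pages_needed(process_size, page_size=PAGE_SIZE):
    """Number of whole pages required to hold `process_size` bytes."""
    return -(-process_size // page_size)    # ceiling division


def wasted_bytes(process_size, page_size=PAGE_SIZE):
    """Bytes allocated but unused at the end of the last page."""
    return pages_needed(process_size, page_size) * page_size - process_size


if __name__ == "__main__":
    for size in (1, 4096, 10000):
        print(size, "bytes ->", pages_needed(size), "page(s),",
              wasted_bytes(size), "bytes wasted")
```

The waste is always less than one page per allocation, so smaller pages waste less but require more pages to be tracked.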
How can we solve the segmentation and pagination problems? By using both policies together.
                               |         ..         |
                               |____________________|
                         ----->|       Page 1       |
                         |     |____________________|
                         |     |         ..         |
 ____________________    |     |____________________|
|                    |   |---->|       Page 2       |
|     Segment X      |---|     |____________________|
|                    |   |     |         ..         |
|____________________|   |     |____________________|
                         |     |         ..         |
                         |     |____________________|
                         |---->|       Page 3       |
                               |____________________|
                               |         ..         |
Process X, identified by Segment X, is split into 3 pieces and each one is loaded in a page.
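The mapping in the diagram can be written down as a small table from the pieces of the segment to whatever page frames happen to be free. This Python sketch is illustrative only: `load_segment`, the frame numbers and the 4 KB page size are assumptions, not kernel interfaces.

```python
def load_segment(free_frames, segment_size, page_size):
    """Map a segment onto any free frames, contiguous or not.

    Returns {piece index: frame number}, or None if there are too few frames.
    """
    needed = -(-segment_size // page_size)      # ceiling division
    if needed > len(free_frames):
        return None
    return {piece: free_frames[piece] for piece in range(needed)}


if __name__ == "__main__":
    scattered = [1, 5, 9]                       # free frames, far apart in memory
    print(load_segment(scattered, 12 * 1024, 4 * 1024))
```

The three pieces end up in frames 1, 5 and 9: the segment stays one logical unit for the process even though it is no longer one contiguous area in physical memory.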
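With this combined scheme we no longer have the segmentation problem, because memory is allocated and freed per page and a process need not be contiguous; and the pagination problem shrinks to the space wasted in the last page only.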
                          |         |           |         |
                          |         |   Offset2 |  Value  |
                          |         |        /|\|         |
                  Offset1 |         |-----    | |         |
                      /|\ |         |    |    | |         |
                       |  |         |    |   \|/|         |
                       |  |         |    ------>|         |
                      \|/ |         |           |         |
 Base Paging Address ---->|         |           |         |
                          | ....... |           | ....... |
                          |         |           |         |

                     Hierarchical Paging
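The two-step lookup in the diagram can be sketched as follows. The 10/10/12 bit split is the one used by classic i386 two-level paging with 4 KB pages; the dictionaries standing in for the tables, the function names and the addresses are assumptions made for this Python example.

```python
PAGE_SHIFT = 12    # 4 KB pages: the low 12 bits are the offset inside the page
INDEX_BITS = 10    # 1024 entries in each table
INDEX_MASK = (1 << INDEX_BITS) - 1


def split(vaddr):
    """Split a 32-bit address into (Offset1, Offset2, offset inside the page)."""
    offset1 = (vaddr >> (PAGE_SHIFT + INDEX_BITS)) & INDEX_MASK
    offset2 = (vaddr >> PAGE_SHIFT) & INDEX_MASK
    in_page = vaddr & ((1 << PAGE_SHIFT) - 1)
    return offset1, offset2, in_page


def translate(first_level, vaddr):
    """Walk both levels, starting from the table at the Base Paging Address."""
    offset1, offset2, in_page = split(vaddr)
    second_level = first_level[offset1]     # Offset1 selects a second-level table
    frame_base = second_level[offset2]      # Offset2 selects the Value
    return frame_base + in_page


if __name__ == "__main__":
    tables = {1: {3: 0x7000}}               # sparse toy tables
    print(hex(translate(tables, 0x00403025)))
```

Only the second-level tables that are actually in use need to exist, which is the main saving of the hierarchical scheme over one flat table.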