Tuesday, August 14, 2012

What is a process and thread?

Hi all,

After a long gap I am writing on my professional blog again. I have read an interesting article on threads and processes which I thought would be worth sharing with you. I really like the example given at the end, which explains what a process and a thread are in layman's terms.


Processes and threads


When we are managing a system and a particular user's activity, our focus is commonly a process. It is at the process level that we normally monitor user activity. However, the operating system doesn't schedule processes to run anymore. The operating system schedules threads to run. For some people the difference is too subtle to care about; for others it is earth-shattering. From a programmer's perspective, the idea of a threaded application is mind-bogglingly different from our traditional view of how applications run. Traditionally, an application will have a startup process that creates a series of individual processes to manage the various tasks that the application will undertake. Individual processes have some knowledge of the overall resources of the application only if the startup process opened all necessary files and allocated all the necessary shared memory segments and any other shared resources. Individual processes communicate with each other using some form of inter-process communication (IPC) mechanism, e.g., shared memory, semaphores, or message queues. Each process has its own address space, its own scheduling priority, its own little universe.

The problem with this idea is that creating a process in its own little universe is an expensive series of routines for the operating system to undertake. There's a whole new address space to create and manage; there's a whole new set of memory-related structures to create and manage. We then need to locate and execute the code that constitutes this unique, individual program. All the information relating to the shared objects set up by the startup process needs to be copied to each individual process when it is created. As you can see, there's lots of work to do in order to create new processes. Next, we consider why an application creates multiple processes in the first place.
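To make the "own little universe" idea concrete, here is a minimal sketch in Python. It assumes a Unix-like system where os.fork is available; the settings dictionary and its checkpoint key are made up for illustration and are not from the article. The child process changes its own copy of the data, and the only way the parent learns about it is through an IPC channel, here a pipe.

```python
import os

# Hypothetical application state set up by the "startup process".
settings = {"checkpoint": 0}

# A pipe is one simple IPC channel: the child writes, the parent reads.
read_fd, write_fd = os.pipe()

pid = os.fork()  # Unix only: the child gets its own copy of the address space
if pid == 0:
    # Child process: this update touches only the child's private copy.
    os.close(read_fd)
    settings["checkpoint"] = 42
    os.write(write_fd, str(settings["checkpoint"]).encode())
    os.close(write_fd)
    os._exit(0)  # leave immediately so the child never runs the parent's code

# Parent process: read what the child sent over the pipe, then reap the child.
os.close(write_fd)
message = os.read(read_fd, 64).decode()
os.close(read_fd)
os.waitpid(pid, 0)

child_value = int(message)
parent_value = settings["checkpoint"]
print("child saw", child_value, "parent still has", parent_value)
```

The parent's dictionary is untouched even though the child updated "the same" variable, which is exactly why separate processes need explicit communication to share results.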



An application is made up of multiple processes because an application has several individual tasks to accomplish in order to get the job done; there's reading and writing to the database, synchronizing checkpoint files, updating the GUI, and a myriad of other stuff to do. At this point, we ask ourselves a question. Do all these component tasks interface with similar objects, e.g., open files, chunks of data stored in memory, and so on? The answer commonly is YES! Wouldn't it be helpful if we could create some form of pseudo-process whereby the operating system doesn't have as much work to do in order to carve out an entirely new process? And whereby the operating system could create a distinct entity that performed an individual task but was in some way linked to all the other related tasks? An entity that shared the same address space as all other subtasks, allowing it access to all the same data structures as the main get-the-job-done task. An entity that could be scheduled independently of other tasks (we can refresh the GUI while a write to the database is happening), as long as it didn't need any other resources. An entity that allows an application to have parallel, independent tasks running concurrently. If we could create such an entity, surely the overall throughput of the application would be improved? The answer is that, with careful programming, such an entity does exist, and that entity is a thread. Multithreaded applications are more natural than distinct, separate processes that are individual, standalone, and need to use expensive means of communication in order to synchronize their activities. The application can be considered the overall task of getting the job done, with individual threads being considered as individual subtasks. Multithreaded applications gain concurrency among independent threads by subdividing the overall task into smaller manageable jobs that can be performed independently of each other.
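As a rough illustration of subtasks sharing one address space, here is a small Python sketch. The subtask names and the numbers they sum are invented for the example. Each thread writes its result straight into a dictionary owned by the process, with no copying and no IPC. Note that in the standard CPython interpreter the global interpreter lock makes CPU-bound threads take turns rather than run truly in parallel; the point here is only the shared memory.

```python
import threading

# Shared state owned by the process; every thread sees the same object.
results = {}

def run_subtask(name, numbers):
    # Each subtask writes to its own key, so no two threads update the same item.
    results[name] = sum(numbers)

# Hypothetical subtasks standing in for database, checkpoint, and GUI work.
subtasks = {
    "database": range(1, 101),
    "checkpoint": range(1, 11),
    "gui": range(1, 4),
}

threads = [
    threading.Thread(target=run_subtask, args=(name, numbers))
    for name, numbers in subtasks.items()
]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for every subtask to finish

print(results)
```

Because each thread touches a different key, this sketch gets away without any locking; as soon as two threads update the same item, synchronization becomes necessary.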
A single-threaded application must do one task and then wait for some external event to occur before proceeding with the same or the next task in a sequence. Multithreaded applications offer parallelism if we can segregate individual tasks to work on separate parts of the problem, all while sharing the same underlying address space created by the initial process. Sharing an address space gives access to all the data structures created by the initial process without having to copy all the structural information, as we have to between individual processes. If we are utilizing a 64-bit address space, it is highly unlikely that an individual thread will run out of space to create its own independent data structures, should it need them. It sounds as if it was remarkable that we survived without threads. I wouldn't go that far, but the improvements in overall throughput can be remarkable when a single-threaded application is transformed into a multithreaded application. This in itself is a non-trivial task. Large portions of the application will need to be rewritten and possibly redesigned in order to transform the program logic from a single thread of execution into distinct and separate branches of execution. Do we have distinct and separate tasks within the application that can be running concurrently with other independent tasks? Do these tasks ever update the same data items? A consequence of multiple threads sharing the same address space is that synchronizing activities between individual threads becomes crucial. There is a possibility that individual threads are working on the same block of process-private data, making changes independently of each other. This is not possible where individual processes have their own independent private data segments. Multithreaded applications need to exhibit a property known as thread safety.
Thread safety means that functions within an application can be run concurrently and that any updates to shared data objects are synchronized. One common technique that threads use to synchronize their activities is a form of simple locking. When one thread is going to update a data item, it needs exclusive access to that data item. The locking strategy is known as locking a mutex. Mutex stands for MUTual EXclusion. A mutex is a simple binary lock. Being binary, the lock is either open or closed. If it is open, the data item can be locked and then updated by the thread. If another thread comes along to update the data item, it will find the mutex closed (locked). That thread will need to wait until the mutex is unlocked (open), at which point it can lock the mutex itself and know it has exclusive access to the data item. As you can see, even this simple explanation is getting quite involved. Rewriting a single-threaded application to be multithreaded needs lots of experience and detailed knowledge of the pitfalls of multithreaded programming. If you are interested in taking this further, I strongly suggest that you get your hands on the excellent book Threadtime: The Multithreaded Programming Guide by Scott J. Norton and Mark D. DiPasquale.
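The mutex idea described above can be sketched in a few lines of Python, where threading.Lock plays the role of the binary lock. The counter and the add_many function are illustrative names, not something from the article.

```python
import threading

counter = 0                      # the shared data item
counter_lock = threading.Lock()  # the mutex: either open or closed

def add_many(times):
    global counter
    for _ in range(times):
        # Lock the mutex; any other thread arriving here waits until it is open again.
        with counter_lock:
            counter += 1         # exclusive access while the lock is held

threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # prints 40000
```

Without the lock, the read-modify-write in counter += 1 could interleave between threads and lose updates; with it, every increment is applied exactly once.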



One useful thing about having a multithreaded kernel is that you don't need to use this feature if you don't want to. You can simply take your existing single-threaded applications and run them directly on a multi-threaded kernel. Each process will simply consist of a single thread. It might not be making the best use of the parallel features of the underlying architecture, but at least you don't need to hire a team of mutex-wielding programmers.



The application may consist of a single process, which is the visible face of the application. As administrators, we can still manage the visible application. Internally, the single process will create a new thread for each individual task that it needs to perform. Because of the thread model used, each user-level thread corresponds to a kernel thread; because the kernel can see these individual threads, the kernel can schedule these individual tasks independently of each other (a thread visible to the kernel is known as a bound thread). This offers internal concurrency in the application, with individual tasks doing their own thing as quickly as they can and being scheduled by the kernel as often as they want to run. Tasks that are interrelated need to synchronize themselves using some form of primitive inter-task locking strategy, such as the mutexes mentioned above. This is the job of application programmers, not administrators. The application programmer also needs to understand how signals behave, because we send signals to processes. Does that signal get sent to all threads? The answer is "it depends." A common solution used by application programmers is to create a signal-handling thread. This thread receives the signal while all other threads mask signals. The signal-handling thread can then coordinate sending signals to individual threads (using calls such as pthread_kill). This is all internal to the process and of little direct concern to us. As far as administering this application goes, we manage the process: we can send the process signals, we can increase its priority, we can STOP it, and we can kill it. We are managing the whole set of tasks through the process, while internally each individual thread of execution is being scheduled and managed by the kernel.
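Here is a minimal sketch of the signal-handling-thread pattern in Python, assuming a Unix-like system and that the code runs in the program's main thread. The choice of SIGUSR1 and the handled list are assumptions made for the example. The signal is blocked before any thread is started so that every thread inherits the mask, and one dedicated thread then waits for it with sigwait.

```python
import os
import signal
import threading

handled = []  # signals seen by the signal-handling thread

# Block SIGUSR1 in the main thread. Threads started afterwards inherit this
# mask, so no thread is interrupted by the signal asynchronously.
signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGUSR1})

def signal_handler_thread():
    # sigwait() suspends this thread until one of the listed signals is pending,
    # then consumes it and returns its number.
    signum = signal.sigwait({signal.SIGUSR1})
    handled.append(signum)

handler = threading.Thread(target=signal_handler_thread)
handler.start()

# Send a signal to the whole process, as an administrator would with kill.
os.kill(os.getpid(), signal.SIGUSR1)
handler.join()

print("signal-handling thread received", signal.Signals(handled[0]).name)
```

From the outside we only ever signalled the process; which thread dealt with it was decided entirely inside the application. A fuller version could forward work to specific threads with signal.pthread_kill.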



A process is a "container" for a whole set of instructions that carry out the overall task of the program. A thread is an independently scheduled subtask within the program. It is an independent flow of control within the process with its own register context, program counter, and thread-local data but sharing the host process's address space, making access to related data structures simpler.
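The split between thread-local data and the shared address space can be shown with a short Python sketch. The names shared_log, private, and worker are invented for the example. threading.local gives each thread its own private attributes, while the ordinary list is visible to all threads in the process.

```python
import threading

shared_log = []              # lives in the process's address space, visible to all threads
log_lock = threading.Lock()  # protects updates to the shared list
private = threading.local()  # each thread gets its own independent attributes

def worker(name):
    private.name = name      # thread-local: other threads never see this value
    with log_lock:
        shared_log.append(private.name)

threads = [threading.Thread(target=worker, args=(f"thread-{i}",)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The main thread never set private.name, so it has no such attribute.
main_has_name = hasattr(private, "name")
print(sorted(shared_log), main_has_name)
```

All three workers contributed to the same list, yet the value each one stored in private stayed invisible to the main thread and to the other workers.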



An analogy I often use is a beehive. From the outside, it is a single entity whose purpose is to produce honey. The beehive is the application, and, hence, the beehive can be thought of as the process; it has a job to do. Each individual bee has a unique and distinct role that needs to be performed. Individual bees are individual threads within the process/beehive. Some bees coordinate their activities with miraculous precision but completely independently of the external world. The end product is produced with amazing efficiency, far more effectively than if we subdivided the task of producing honey between independent hives. Imagine the situation: every now and then, the individual hives would meet up to exchange information and make sure the project was still on track, and then they would go back to doing their own little part of the job of making honey. Honey-by-committee wouldn't work. The beehive is the process, and the bees are the threads: amazing internal efficiencies when programmed correctly, while retaining important external simplicity. We as information-gatherers (honey-monsters) will interface with the application/process (beehive) in order to extract information (honey) from the system. There's no point in going to individual bees and trying to extract honey from them; it's the end product that we are interested in, not how we got there.


