Automating things using the Linux command line

This page is about doing arbitrary computations efficiently with Linux. The focus is on batch processing lots of jobs in parallel.


If you see potential for automation but are unable to automate things yourself, feel free to ask IT. Researchers have spent three days on work that could have been done in a minute. You might want to have a look at LinuxCrashcourse first.

Information about the packages installed in the institute and the commands they provide, including highly specialized neuro-analysis tools, can be found here.

The page is divided into sections. Feel free to skip anything you already know.

In the examples below, a line starting with a prompt like user@host > shows what you type into a terminal (here: echo output); the lines following it show what the terminal displays in response.


Terminology


XFCE (Session Manager)
It provides a graphical session including a desktop and a panel with a start menu and open windows. You'll need it only to start a Terminal Emulator for this tutorial.

Terminal, more precisely: XFCE Terminal Emulator (Terminal Emulator)
It runs non-graphical applications like shells. The Terminal Emulator is responsible for sending your key presses to those applications and for displaying their output in a useful way. Things the Terminal Emulator is responsible for and which the running application has no idea of:
  • The text font
  • The actual colors being used
  • Copy and paste (the application cannot distinguish between pasted text and typed text)
  • Remembering the output and providing a scrollbar for looking at past output
The non-graphical application run directly by the Terminal is usually a "shell".

Bash (Shell)
The shell interprets commands typed by a user and runs the actual programs that do e.g. calculations. The commands can be put into shell scripts for easier modification and repeated invocation. Bash is the default shell within the institute's network; you'll most likely never have contact with another one. Shells are programming languages optimized for interactive use and for running the actual programs that you want.

Shell script
A program made of commands a shell can interpret. This includes simple commands, shell-specific commands and control structures (conditions, loops, ...). Shell scripts are simple text files which can be created and changed easily.

Prompt
A prompt is a (usually small) piece of text displayed by the shell to tell you that it's OK to enter commands now. It usually looks like username@hostname> . When there's no prompt, the shell is currently running a program and can't accept commands at that time.

To use a shell: open the start menu and select Terminal Emulator. The shell started automatically is Bash. From now on, the name Bash will be used instead of "shell", since it's the only shell used in the institute.



"Hello World" in Bash


The "Hello World" program in Bash looks like this:

user@host> echo Hello World
Hello World

Explanation:
  • echo is the same as "print" or "write" in other languages. It takes arguments and shows them, with little or no actual processing being done.
    • It's handy to test expressions or to show the content of variables (See section Variables).
    • Strictly speaking, echo is built into Bash, but it also exists as a standalone application at /bin/echo. Most commands you will use are standalone applications like this - there are thousands of them available even on a basic Linux system.
  • The command is given two strings: "Hello" and "World". It's not a single string "Hello World", as one might expect.
    • Bash splits up command arguments on white space.
    • This means: if you need to send a white space to a command as part of an argument, you'll have to use quotes - read more about that in the Quoting section.
    • Another effect: you can use an arbitrary number of white spaces between "Hello" and "World" - the output will be the same (see the example after this list).
  • There are only a few pure Bash commands, but every program installed as a software package becomes available as a command here. In a shell, most commands are just regular software binaries. For example, the firefox command will start a web browser.
    • Examples:
      • Yes, echo is part of a software package. However, it is guaranteed to exist on every UNIX system - including your Internet router at home and your cell phone.
      • Matlab (Command: matlab )
      • FSL (Command example: fsl_reg )
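
You can see the white-space splitting in action yourself: no matter how many spaces you put between the two words, echo receives exactly two arguments and prints them separated by a single space.

user@host> echo Hello        World
Hello World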




How to know which arguments to give to which command


Commands are little programs that usually include detailed documentation. At some point you'll know which program to use, but not yet how to tell it the specific mode of operation you need.

Example: We need a list of files ordered by their last modification time - one file per line.
  • A Google search for "list files linux" will tell us that ls is the command of choice here.
  • Most Linux commands do have a short help which is displayed by typing [command] --help (more likely) or [command] -help .
  • Most Linux commands go even further and provide a separate help page via man [command]
    • The man command opens an interactive text reader
    • Use cursor keys and pgup/pgdown to scroll
    • Use q to quit
    • Use /[text] (slash followed by some text) to search for text
  • You'll quickly find the necessary options: -1 , -t and depending on if you want ascending or descending order: -r
  • Solution: ls -1 -t -r
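
With three hypothetical files in the current directory, the result could look like this (oldest file first; your own file list will of course differ):

user@host > ls -1 -t -r
older_scan.nii
notes.txt
newest_results.csv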

Hints:
  • Options given to commands usually look like --someoption (long option) or -o (short option) and their order doesn't matter ( -a -o equals -o -a ). However, this is just a convention most programmers of Linux software abide by. Some commands may behave differently.
  • USE GOOGLE. It's amazing how much specific knowledge is accessible there.
    • Try to be as general as possible in your search query. This is the way to get knowledge that doesn't help you instantly but gives you an idea of how to approach the problem.
    • Example: You want to reverse 10000 MPEG1 video files. Doing this by clicking in Final Cut Pro would be time consuming. A command line solution is better.
      • Bad google query: "How to reverse mpeg1 video files"
      • Good google query: "Linux command line reverse video"
      • You'll get some command lines as one of the first results explaining how to use ffmpeg and sox to do that.
      • Side note: ffmpeg (alias avconv) is an extremely powerful video processing software. sox is an extremely powerful audio processing tool. Both are installed on every workstation in the institute.
      • In this case: make sure you understand how video processing works. If in doubt, ask your department's technician or an IT staff member.
      • In general: make sure you understand the solution. If in doubt, ask your department's technician or an IT staff member.




Two commands at once


Two commands can be run in two ways:
  • Putting them on separate lines (Command 1, Enter, Command 2, Enter)
  • Separating them by ;

Examples:

user@host > echo Hello
Hello
user@host > echo World
World

user@host > echo Hello;echo World
Hello
World

Explanation:
  • Both examples are semantically equivalent. However, the prompt display is different. In a shell script, there would be no prompt and the output would look exactly the same.




Variables


Variables are especially valuable when running the same calculations for multiple test subjects. Variables are little containers you can put something in. Later the variable is referenced, and a computation is done with the current content of the variable. Example:

user@host> subject=GT4K

user@host> preprocess $subject

user@host> process $subject

user@host> postprocess $subject

Explanation:
  • There are no commands called preprocess/process/postprocess by default available in the institute. You'd have to write those commands yourself. That's actually easier than it sounds.
  • After the variable has been assigned the value GT4K, using the syntax $subject later will make the shell replace this variable reference with the variable's content before the actual command is run.
  • The variable is only valid in one shell. No collisions will occur if you use the same variable name in different shells, i.e. in different Terminal windows.
  • There's a catch: in this example, the subject ID cannot contain white spaces or other non-alphanumeric characters (see the Quoting section).
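
You can always check what is currently stored in a variable with echo:

user@host> subject=GT4K
user@host> echo "$subject"
GT4K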




Quoting


If you need to send white spaces or other special (non-alphanumeric) characters to a command, you'll have to use quotes. There are various types of quotes - you will most likely only need two. Here are some examples:

user@host > process_subject "GT4K Session 2"

user@host > process_subject 'GT4K Session 100$'

user@host > subject='GT4K Session 100$'; process_subject "$subject"

Explanation:
  • Example one uses double quotes. You'll need those most of the time. Double quotes prevent Bash from splitting a given string into multiple arguments. Without double quotes, process_subject would get 3 useless arguments instead of a single correct one.
  • Example two uses single quotes, which also cover most of the special characters, e.g. $ signs.
  • Example three shows a problem with variables you might run into. Since a variable may contain special characters, you have to be careful when using it. This example shows the safe way - putting the variable reference in double quotes. This works even if the variable contains special characters which double quotes could not handle if typed literally.
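
You can see the difference between the two quote types with echo. $HOME is a variable that is always set and contains the path of your home directory (the exact path in the output depends on your account):

user@host > echo "Home: $HOME"
Home: /home/user
user@host > echo 'Home: $HOME'
Home: $HOME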

Here's one more thing you might have to struggle with:

user@host > myvariable=test

user@host > echo $myvariable_sometext

Explanation:
  • In this example, Bash tries to use a variable called "$myvariable_sometext" and not "$myvariable" plus some literal text "_sometext". This is because "_" can be used as part of a variable name. The command will just print an empty line.
  • user@host > echo ${myvariable}_sometext will handle this correctly.
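
With myvariable set to test as above, the two commands produce the following - note the empty line printed by the first one:

user@host > echo $myvariable_sometext

user@host > echo ${myvariable}_sometext
test_sometext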




Using filenames


Linux allows all characters in file names, with two exceptions: / (the path separator) and ASCII 0x00 (which you'll most likely never see anyway). In a lot of cases, commands have to be given filenames as arguments, which means: quoting has to be used. The example from the Quoting section applies:

user@host > process_subject "GT4K Session 2"

Explanation:
  • The process_subject command will see a single argument: "GT4K Session 2" including the white spaces. It can in turn successfully open the file with this name.
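
You can try this yourself with standard commands; the owner, group and date in the ls output will of course differ on your system:

user@host > touch "GT4K Session 2"
user@host > ls -l "GT4K Session 2"
-rw-r--r-- 1 user users 0 Aug  5 10:00 GT4K Session 2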




Globber Expressions


Unix shells are the most powerful tools for selecting groups of files. The idea in most cases is to quickly select a group of files (all files with a .txt extension, all files starting with the letter a, ...) and pass them to a command as arguments. Simple example:

user@host > ls -l abc*

Explanation:
  • abc* is called a globber expression. An expression like this is replaced by the filenames matching it.
  • abc* means "All files and directories in the current directory whose names begin with abc followed by an arbitrary number of characters".
  • The ls command receives all files and directories within the current directory starting with abc as separate arguments and lists them.
    • The option -l formats the output as a detailed list.
    • The ls command will never see the globber expression. It does not know...
      • ... that there is such a thing as a globber expression
      • ... how to interpret one
      • ... that the filenames were produced by an expression and not typed manually.
  • Globber expressions can be more complex, but that is beyond the scope of this manual.
  • Certain special characters like * , ? , [ and ] make a string a globber expression. If you want to use them as literal argument for commands, quoting must be used.
  • These expressions are a good example of the division of labor in Linux. A command doesn't need to search for files itself or interpret * - this is done by the shell. There are thousands of commands that rely on filenames being given as arguments; none of them needs to know how to look for files. This saves precious development time for you too, if you decide to write scripts working with files.
    • This division of labor is a key difference between a DOS/Windows command line and a Linux shell.
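
You can watch the expansion happen by using echo. Assuming the current directory contains the files abc1.txt and abc2.txt (just an example), an unquoted expression is expanded while a quoted one is passed on literally:

user@host > echo abc*
abc1.txt abc2.txt
user@host > echo 'abc*'
abc*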

Now something cooler:

user@host > rm {01..52}_\[2014\]*/[aA]*/somefile.txt

In this case, the expression will find
  • objects (since we're using rm without -r, we're most likely looking for files - not directories) called somefile.txt ...
  • ... which are located in a directory starting with the letter a (minor or capital) ...
  • ... which is in turn a subdirectory of a directory whose name starts with a two-digit number less than or equal to 52, continues with the literal string _[2014] and optionally ends with an arbitrary string of characters ...
  • ... which resides in the current directory

For beginners it's sometimes difficult to understand such expressions and even more difficult to see where they are useful. This particular expression was used to remove temporary files from a directory tree containing weekly performance statistics of computers. It took about 10 seconds to come up with, because the directory structure was designed with such expressions in mind.

Hint: IT uses the Linux command line all the time. If you need a complex expression or an expression doesn't behave as expected, don't hesitate to ask.



Loops


Automation usually means doing something simple repeatedly, and a loop is the tool of choice for that. To use loops, we'll need all the things we've learned in the previous sections. We usually want to process multiple data objects using the same action: a for loop sequentially assigns different values to a given variable and runs commands that take this variable as an argument. Example:

user@host > for subject in {001..003};do echo $subject;done
001
002
003

Explanation:
  • {001..003} is a word generator. It acts as if you typed 001 002 003 in its place. OK, not much typing is saved in this example. Now imagine replacing 003 by 999.
  • The word do has to be there.
  • The word done marks the end of the loop. Only commands between do and done are running within the loop.
  • We didn't use double-quotes around $subject. This is usually a bad idea. It works in this case since the word generator only makes numeric strings.

Usually, there is already a set of data. Sometimes, the files are not just numbers or there are numbers missing. In this case, a word generator doesn't help - actual filenames are needed. Example:

user@host > for subject in *;do process_subject "$subject";done

Explanation:
  • The process_subject command is called once for each file and for each directory in the directory you're currently in.
  • You should try to keep your subjects' base directory free of other files, because they would otherwise be caught by the * expression as well.
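
If the subject directories share a common prefix, you can also restrict the loop to them with a globber expression (the prefix GT is just an example):

user@host > for subject in GT*;do process_subject "$subject";done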




Writing a command of your own


Usually processing data involves multiple steps (commands) per dataset (e.g. test subject). There are multiple reasons to place these commands in a script:
  • You can keep all the necessary processing commands together in one place.
  • You can write comments explaining your thoughts when writing the script. Lines starting with # are comments.
  • RevisionControl can be used to track changes in the command set being used for your data over time.
  • A set of scripts can be kept per study/project.

Example script:

#!/bin/bash
#
# This is the process_subject script
#
# It will do the following:
# ...

subject="$1"

echo "Test subject $subject is being processed"
# Some preprocessing
process_step1 "$subject"
# The most important part:
process_step2 "$subject"

Hints:
  • Lines starting with # are comments.
    • You should write a little comment header on top of your script to describe the purpose of it.
    • It's good practice to document your intentions in a comment line above each command. Try to describe why, not what (globber expressions are an exception).
    • The first line of a script is not a comment but a hint telling the operating system which programming language the script is written in (the "shebang"). For a Bash script this line must always be #!/bin/bash.
  • There's a scripts directory in every AFS study storage block by default. Save the script in this directory to keep it close to your data. The path will then look like this:
    /afs/cbs.mpg.de/projects/afs000_afstest/scripts/process_subject .
  • $1 is the first parameter given to the command. Calling process_subject GT4K causes $1 - and therefore the variable $subject - to be GT4K, which in turn passes this ID on to the process_step1 and process_step2 commands.
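  • Depending on how you created the script file, you may have to mark it as executable once - otherwise it can't be run as a command:
    user@host > chmod +x /afs/cbs.mpg.de/projects/afs000_afstest/scripts/process_subject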

Commands are looked for in certain locations on the computer. Your study's scripts directory is not among them, because you might have multiple such directories and searching them all would make Bash very slow. For now you'll have to write the full command path to call it:

user@host > /afs/cbs.mpg.de/projects/afs000_afstest/scripts/process_subject GT4K

There are two ways to make the command's name shorter. The advantage of one way is the disadvantage of the other:
  • See Expanding the search path
  • and Using a central command location.




Expanding the search path


The command a switches the current shell into the environment of a specific study storage block and adds its scripts folder to the command search path. Example:

user@host> a afs000
I: AFS study environment (Project id 'afs000')
I: Type 'envlist' to show all environment components incl. versions.
I: Type 'exit' to leave the environment
afsproject=afs000_user@host > process_subject GT4K

Advantage:
  • The change only affects a single Bash instance. You can work with multiple studies and multiple sets of scripts in multiple windows.




Using a central command location


There is one location solely under your control which you can use to shorten the command: the ~/bin directory in your home folder. You can either store a script directly there or use a technique called symlinking (basically making a given file appear in multiple places at the same time):

user@host > mkdir ~/bin;cd ~/bin

user@host > ln -s /afs/cbs.mpg.de/projects/afs000_afstest/scripts/process_subject process_subject

From now on, you can type

user@host > process_subject GT4K
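
You can check that Bash now finds the shortened command with type. The path in the output depends on where your home directory is; if the command isn't found right away, try logging out and back in once - depending on the setup, ~/bin may only be picked up at login:

user@host > type process_subject
process_subject is /home/user/bin/process_subject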

Advantage:
  • No preparation is necessary before working with a single study. The shortened command is available as soon as you log into your Linux session.




Parallel Loops


Today's processors have multiple independent processing units (CPU cores) that can do things in parallel. Most of the institute's workstations have at least 4 of them. Bash can take advantage of this. However, to e.g. run 1000 jobs but only 4 of them at any given time, it's easier to use a helper program called fparallel. The original name of this shell tool is parallel; fparallel is a version adapted for the institute.

user@host > fparallel -c 'process_subject {arg}' -P 4 abc* --args

Explanation:
  • This tool will call a command called process_subject once for every file/directory starting with the string abc in the current directory.
  • This will even work, if there are hundreds of thousands of files starting with abc in that directory.
  • At most 4 process_subject instances will run at the same time (option -P 4). Whenever a job finishes, another argument (in this case another file starting with abc ) is taken from the job queue and processed.
  • Letting fparallel run more instances than your number of CPU cores might overload your computer.
    • Keep in mind that you might need processing power for your graphical session. Reserving one CPU core for that might be a good idea.
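
If you're not sure how many CPU cores a machine has, the standard command nproc will tell you (the number will of course differ between machines):

user@host > nproc
8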

If you need more resources have a look here: ComputeClusterSlurm.



Pipelines and text processing


Unix operating systems (like the one on your workstation) make it very easy to process text-based data. A lot of commands in a basic Linux setup do just that - process text. Typically, you will combine several such commands. The simplest way to try one out is to type text on the command line and watch the output.

This example converts all lowercase letters to uppercase. Everything you type (here: r followed by Enter) is printed back in uppercase; press Ctrl+D to end the input:

user@host > tr '[a-z]' '[A-Z]'
r
R

To process text files instead, a text file can be connected to a command's input channel like this:

user@host > tr '[a-z]' '[A-Z]' < text_to_be_uppercased.txt

This example will however just print the results on your terminal. If you need the results in a file, you need to connect a file to the output interface of the command:

user@host > tr '[a-z]' '[A-Z]' < text_to_be_uppercased.txt > uppercased_text.txt

The next step is to use multiple processing commands and avoid intermediate files by connecting the output channel of one command to the input channel of the next. This is called a pipeline.

Example

  • You've got a text file containing ZIP Codes of Germany.
    • I downloaded it from http://download.geonames.org/export/zip - it's called DE.txt
    • Every line looks like this: DE 04103 Leipzig Sachsen ... (country code, ZIP code, city name, federal state and other fields, separated by tabs). There are >16000 lines.
  • You need every name of a city with a 4 as second digit of its ZIP Code

This is what we'll do:
  • We only need the fields 2 (ZIP Code) and 3 (city name) and throw away the rest: cut -d$'\t' -f2-3
  • Then we apply a filter to only get ZIP Codes with 4 as second digit: grep '^.4'
  • Since we're only interested in the city names, the ZIP Codes are thrown away: cut -d$'\t' -f2
    • Remember: The city name was field 3 in the beginning but field 1 was thrown away - it's field 2 now and will be field 1 after this step
  • There are a lot of city names given multiple times now. We're going to remove the duplicates. However, the command for that requires that all duplicates are in adjacent lines - we have to sort first:
    • To sort all the city names: sort
    • To remove the duplicates: uniq

These commands are now combined into a command pipeline which means, the output of one command is the input of the next:

user@host > cut -d$'\t' -f2-3 <DE.txt | grep '^.4' | cut -d$'\t' -f2 | sort | uniq > targetfile.txt

Hints:
  • The pipeline only connects STDOUT to STDIN. All error messages are displayed in the terminal, independent of the command's position in the pipeline.
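
A useful habit when building such a pipeline is to construct it stage by stage and inspect a few lines of intermediate output with head before adding the next stage (the lines shown here are just an illustration):

user@host > cut -d$'\t' -f2-3 <DE.txt | grep '^.4' | head -n 3
04103	Leipzig
04105	Leipzig
04107	Leipzig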

Pros and Cons

Con: Creating the pipeline in the example can be more complicated than defining a filter in Excel.

Pro: Now imagine 100,000,000 data sets. The example would still work, but Excel would have hit its 2^20 lines-per-sheet limit about 99 million lines earlier. Since pipelines "stream" data, the data set doesn't have to fit into the computer's main memory (*).

Con: You have to understand what you're doing and check your results for plausibility!

Pro: Pipelines can be put into scripts, which is perfect documentation of the processing steps. Try to document the clicks done in a spreadsheet application - or try to repeat them 100 times with subtle changes between the runs.

(*) The sort command actually has to store all data going through it in main memory - otherwise it couldn't sort data reliably. However, since the data passing through this command is already reduced by the filter stages before it, the impact is minimal.



Doing something with every line of a text file


If you want to do a simple text operation with every line of a text file, a while loop comes in handy.

Example: Add single quotes around the whole line (three different methods):
  • user@host > cat file | ( while IFS= read -r line;do echo "'${line}'";done ) > newfile
  • user@host > cat file | awk "{ print \"'\" \$0 \"'\" }" > newfile
  • user@host > cat file | sed "s/\(.*\)/'\1'/" > newfile
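
The same while-read pattern is also useful for running a command once per line, e.g. if a text file contains one subject ID per line (subjects.txt is just a hypothetical file name):

user@host > while read -r subject;do process_subject "$subject";done < subjects.txt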



