Automating things using the Linux command line
This page is about doing arbitrary computations efficiently with Linux. The focus is on batch processing lots of jobs in parallel.
If you see potential for automation but are unable to automate things yourself, feel free to ask IT. Some researchers have spent three days on work that could have been done in a minute. You might want to have a look at LinuxCrashcourse first.
Find information about the packages installed in the institute and their available commands here, including highly specialized neuro-analysis tools.
The page is divided into sections. Feel free to skip anything you already know.
What is displayed to you in a terminal is written as this, and what you type into a terminal is written as that, for example: user@host > echo output - here the prompt is displayed to you and the echo command is what you type.
Terminology
Terminology | Program Type | What it does
XFCE | Session Manager | It provides a graphical session including a desktop and a panel with a start menu and open windows. You'll need this only to start a Terminal Emulator for this tutorial.
Terminal (more precisely: XFCE Terminal Emulator) | Terminal Emulator | It runs non-graphical applications like shells. The Terminal Emulator is responsible for sending your key presses to those applications and displaying their output in a useful way. Things the Terminal Emulator is responsible for and which the running application has no idea of: the text font; the actual colors being used; copy and paste (the application cannot distinguish between pasted text and typed text); remembering the output and providing a scrollbar for looking at past output. The non-graphical application run directly by the Terminal is usually a "shell".
Bash | Shell | The shell interprets commands typed by a user and runs the actual programs that do e.g. calculations. The commands can be put into shell scripts for easier modification and repeated invocation. Bash is the default shell within the institute's network; you'll most likely never have contact with another one. Shells are programming languages optimized for interactive use and for running the actual programs that you want.
Shell script | | Programs made of commands a shell can interpret. This includes simple commands, shell-specific commands and control structures (conditions, loops, ...). Shell scripts are simple text files which can be created and changed easily.
Prompt | | A prompt is a (usually) small piece of text displayed by the shell to tell you that it's OK to enter commands now. It usually looks like username@hostname>. When there's no prompt, the shell is currently running a program and can't accept commands at that time.
To use a shell: open the start menu and select Terminal Emulator. The shell that is started automatically is Bash. From now on, the name Bash will be used instead of "shell", since it's the only shell being used in the institute.
"Hello World" in Bash
The "Hello World" program in Bash looks like this:
user@host> echo Hello World
Hello World
Explanation:
- echo is the same as "print" or "write" in other languages. It takes arguments and shows them - with little or no actual processing being done.
- It's handy for testing expressions or showing the content of variables (see the Variables section).
- echo is not a Bash command per se but an application being run by Bash. The application's path is /bin/echo. There are thousands of such commands available even on a basic Linux system.
- The command is given two strings: "Hello" and "World". It's not a single string "Hello World", as one might expect.
- Bash splits up command arguments on white space.
- This means: to send white space to a command as part of an argument, you need quoting - read more about that in the Quoting section.
- Other effects: you can use an arbitrary number of white spaces between "Hello" and "World" - the output will be the same.
- There are only a few pure Bash commands, but every program installed as a software package becomes available as a command here. In a shell, most commands are just regular software binaries. For example, the firefox command will start a web browser.
- Examples:
  - Yes, echo is part of a software package. However, it is guaranteed to exist on every UNIX system - including your Internet router at home and your cell phone.
  - Matlab (command: matlab)
  - FSL (command example: fsl_reg)
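You can check the white-space behaviour yourself - a tiny experiment (any number of spaces between the two words gives the same result):
user@host > echo Hello         World
Hello World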
How to know which arguments to give to which command
Commands are little programs that usually include detailed documentation. At some point you'll know which program to use, but not how to tell the program the specific mode of operation you need.
Example: We need a list of files ordered by their last modification time - one file per line.
- A Google search for "list files linux" will tell us that ls is the command of choice here.
- Most Linux commands have a short help which is displayed by typing [command] --help (more likely) or [command] -help.
- Most Linux commands go even further and provide a separate manual page via man [command].
- The man command opens an interactive text reader:
  - Use the cursor keys and PgUp/PgDown to scroll
  - Use q to quit
  - Use /[text] (slash followed by some text) to search for text
- You'll quickly find the necessary options: -1, -t and, depending on whether you want ascending or descending order, -r.
- Solution: ls -1 -t -r
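As a quick plausibility check, here is a sketch of what the solution might print - the file names are made up for illustration; your own directory's contents will appear instead, oldest first:
user@host > ls -1 -t -r
2013_pilot_data.txt
2014_subject_list.txt
notes_from_today.txt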
Hints:
- Options given to commands usually look like --someoption (long option) or -o (short option) and are position-independent (-a -o equals -o -a). However, this is just a convention most programmers of Linux software abide by. Some commands may behave differently.
- USE GOOGLE. It's amazing how much specific knowledge is accessible there.
- Try to be as general as possible in your search query. This is the way to get knowledge that doesn't help you instantly but gives you an idea of how to approach the problem.
- Example: you want to reverse 10000 MPEG1 video files. Doing this by clicking in Final Cut Pro would be time consuming. A command line solution is better.
- Bad Google query: "How to reverse mpeg1 video files"
- Good Google query: "Linux command line reverse video"
- You'll get some command lines as one of the first results explaining how to use ffmpeg and sox to do that (a sketch follows this list).
- Side note: ffmpeg (alias avconv) is an extremely powerful video processing software. sox is an extremely powerful audio processing tool. Both are installed on every workstation in the institute.
- In this case: make sure you understand how video processing works. If in doubt, ask your department's technician or an IT staff member.
- In general: make sure you understand the solution. If in doubt, ask your department's technician or an IT staff member.
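For illustration only, a hedged sketch of what such a found command line could look like - it assumes a reasonably recent ffmpeg whose reverse/areverse filters are available, the file names are placeholders, and the whole reversed clip has to fit into memory:
user@host > ffmpeg -i input.mpg -vf reverse -af areverse reversed.mpg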
Two commands at once
Two commands can be run in two ways:
- Putting them on separate lines (Command 1, Enter, Command 2, Enter)
- Separating them by ;
Examples:
user@host > echo Hello
Hello
user@host > echo World
World
user@host > echo Hello;echo World
Hello
World
Explanation:
- Both examples are semantically equivalent. However, the prompt display is different. In a shell script, there would be no prompt and the output would look exactly the same.
Variables
Variables are valuable when processing multiple test subjects. Variables are little containers you can put something in. Later, the variable is referenced and a computation is done with the current content of the variable. Example:
user@host> subject=GT4K
user@host> preprocess $subject
user@host> process $subject
user@host> postprocess $subject
Explanation:
- There are no commands called preprocess/process/postprocess available by default in the institute. You'd have to write those commands yourself. That's actually easier than it sounds.
- After the variable has been assigned the value GT4K, using the syntax $subject later will make the shell replace this variable reference with the variable's content before the actual command is run.
- The variable is only valid in one shell. No collisions will occur if you use the same variable name in different shells, i.e. in different terminal windows.
- There's a catch: in this example, the subject ID cannot contain white space or any characters other than alphanumeric ones.
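Since preprocess/process/postprocess don't exist yet, here is a minimal runnable variant of the same idea using echo, so you can try assigning and referencing a variable yourself:
user@host> subject=GT4K
user@host> echo Now processing subject $subject
Now processing subject GT4K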
Quoting
If you need to send white spaces or special characters (other than alphanumerical ones) to a command, you'll have to use quotes. There are various types of quotes - you will most likely only need two. Here are some examples:
user@host > process_subject "GT4K Session 2"
user@host > process_subject 'GT4K Session 100$'
user@host > subject='GT4K Session 100$'; process_subject "$subject"
Explanation:
- Example one uses double quotes. You'll need those most of the time. Double quotes prevent Bash from splitting a given string into multiple arguments. Without double quotes, process_subject would get 3 useless arguments instead of a single correct one.
- Example two uses single quotes, which cover most of the special characters, e.g. $ signs.
- Example three shows a problem with variables you might run into. Since a variable may contain special characters, you have to be careful when using it. This example shows the safe way - using double quotes around the variable reference. This will work even with special characters in the variable's content which a double quote could not handle directly.
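To see the difference yourself, you can replace process_subject (which is not a real command here) with printf, which prints each argument it receives on a separate line:
user@host > subject='GT4K Session 2'
user@host > printf '[%s]\n' $subject
[GT4K]
[Session]
[2]
user@host > printf '[%s]\n' "$subject"
[GT4K Session 2]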
Here's one more thing you might have to struggle with:
user@host > myvariable=test
user@host > echo $myvariable_sometext
Explanation:
- In this example, Bash tries to use a variable called "$myvariable_sometext" and not "$myvariable" plus some literal text "_sometext". This is because "_" can be used as part of a variable name. The command will just print an empty line.
- user@host > echo ${myvariable}_sometext will handle this correctly.
Using filenames
Linux allows you to use all characters in file names with two exceptions: / (which is the path separator) and ASCII 0x00 (which you'll most likely never see anyway). In a lot of cases, commands have to be given filenames as arguments, which means: quoting has to be used. The example from the Quoting section applies:
user@host > process_subject "GT4K Session 2"
Explanation:
- The process_subject command will see a single argument, "GT4K Session 2", including the white spaces. It can in turn successfully open the file with this name.
Globber Expressions
Unix shells are the most powerful tools for selecting groups of files. The idea in most cases is to quickly select a group of files (all files with a .txt extension, all files starting with the letter a, ...) and pass them to a command as arguments.
Simple example:
user@host > ls -l abc*
Explanation:
- abc* is called a globber expression. Expressions like these are replaced by the filenames matching them.
- abc* means "all files and directories in the current directory whose names begin with abc, followed by an arbitrary number of characters".
- The ls command will be given all files and directories within the current directory starting with abc as separate arguments.
- The option -l formats the output as a list.
- The ls command will never see the globber expression. It does not know...
  - ... that there is such a thing as a globber expression
  - ... how to interpret one
  - ... that the filenames were presented using an expression and not via manual typing.
- Globber expressions can be more complex, but that is beyond the scope of this manual.
- Certain special characters like *, ?, [ and ] make a string a globber expression. If you want to use them as literal arguments for commands, quoting must be used.
- These expressions are a good example of the sharing of labor in Linux. A command doesn't need to search for files itself or interpret * - this is done by the shell. There are thousands of commands that rely on filenames being given as arguments. None of them needs to know how to look for files. This saves precious development time for you too - if you decide to write scripts working with files.
- This sharing of labor is a key difference between a DOS/Windows command line and a Linux shell.
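You can see what the shell actually hands to a command by putting echo in front of the expression - echo simply prints the arguments it receives. The matching names shown here are made up; your own files will appear instead:
user@host > echo abc*
abc_results.txt abc_subject_01 abcd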
Now something cooler:
user@host > rm {01..52}_\[2014\]*/[aA]*/somefile.txt
In this case, the expression will find
- objects (since we're using rm without -r, we're most likely looking for files - not directories) called somefile.txt ...
- ... which are located in a directory starting with the letter a (lower or upper case) ...
- ... which is in turn a subdirectory of a directory whose name starts with a two-digit number less than or equal to 52, continues with the literal string _[2014] and optionally ends with an arbitrary string of characters ...
- ... which resides in the current directory.
For beginners it's sometimes difficult to understand such expressions and even more difficult to see useful applications for them. I used this particular expression to remove temporary files from a directory tree containing weekly performance statistics of computers. It took ~10 seconds to come up with because the directory structure was designed with such expressions in mind.
Hint: IT uses the Linux command line all the time. If you need a complex expression or an expression doesn't behave as expected, don't hesitate to ask.
Loops
Automation usually means doing something simple repeatedly. A loop is the tool of choice for that. To use loops, we'll need all the things we've learned in the previous sections. We usually want to process multiple data objects using the same action. A loop sequentially assigns different values to a given variable and runs programs that usually take this variable as an argument. Example:
user@host > for subject in {001..003};do echo $subject;done
001
002
003
Explanation:
- {001..003} is a word generator. It acts as if you had typed 001 002 003 in its place. OK, not much typing is saved in this example. Now imagine replacing 003 with 999.
- The word do has to be there.
- The word done marks the end of the loop. Only commands between do and done are run within the loop.
- We didn't use double quotes around $subject. This is usually a bad idea. It works in this case since the word generator only produces numeric strings.
Usually, there is already a set of data. Sometimes, the files are not just numbers or there are numbers missing. In this case, a word generator doesn't help - actual filenames are needed. Example:
user@host > for subject in *;do process_subject "$subject";done
Explanation:
- The process_subject command is called once for each file and each directory in the directory you're currently in.
- You should try to keep your subjects' base directory free of other files because they would otherwise be caught by the * expression as well.
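If each subject lives in its own directory, a trailing slash in the globber expression is a handy way to match only directories and skip stray files - a sketch with made-up directory names and echo standing in for your own script:
user@host > for subject in */;do echo "Would process directory: $subject";done
Would process directory: GT4K/
Would process directory: HV2M/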
Writing a command of your own
Usually processing data involves multiple steps (commands) per dataset (e.g. test subject). There are multiple reasons to place these commands in a script:
- You can keep all the necessary processing commands together in one place.
- You can write comments explaining your thoughts when writing the script. Lines starting with # are comments.
- RevisionControl can be used to track changes in the command set being used for your data over time.
- A set of scripts can be kept per study/project.
Example script:
#!/bin/bash
#
# This is the process_subject script
#
# It will do the following:
# ...
subject="$1"
echo "Test subject $subject is being processed"
# Some preprocessing
process_step1 "$subject"
# The most important part:
process_step2 "$subject"
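Assuming you saved this text as a file called process_subject, you'll most likely have to mark it as executable once before it can be run as a command:
user@host > chmod +x process_subject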
Hints:
- Lines starting with # are comments.
- You should write a little comment header at the top of your script to describe its purpose.
- It's good practice to document your intentions in a comment line above each command you run. Try to describe why, not what (globber expressions are an exception).
- The first line of a script is not a comment but a hint for the operating system about which programming language the script is written in (the shebang). For a Bash script, the line must always be #!/bin/bash.
- There's a scripts directory in every AFS study storage block by default. Save the script in this directory to keep it close to your data. The path will then look like this: /afs/cbs.mpg.de/projects/afs000_afstest/scripts/process_subject.
- $1 is the first parameter given to the command. Calling process_subject GT4K would cause $1 to be GT4K, which would cause the variable $subject to be GT4K, which would send this ID to the process_step1 and process_step2 commands.
Commands are looked for in certain locations on the computer. Your study's scripts directory is not among them because you might have several of them and this would make Bash very slow. For now you'll have to write the full path to call the command:
user@host > /afs/cbs.mpg.de/projects/afs000_afstest/scripts/process_subject GT4K
There are two ways to make the command's name shorter. The advantage of one way is the disadvantage of the other:
- See Expanding the search path
- and Using a central command location.
Expanding the search path
The environment command a will change into a specific study storage block and include its scripts folder in the command search path. Example:
user@host> a afs000
I: AFS study environment (Project id 'afs000')
I: Type 'envlist' to show all environment components incl. versions.
I: Type 'exit' to leave the environment
afsproject=afs000_user@host > process_subject GT4K
Advantage:
- The change only affects a single Bash instance. You can work with multiple studies and multiple sets of scripts in multiple windows.
Using a central command location
There is one location solely under your control which you can use to shorten the command: the ~/bin directory in your home folder. You can either store a script directly there or use a technique called symlinking (basically making sure a given file exists at multiple places at the same time):
user@host > mkdir ~/bin;cd ~/bin
user@host > ln -s /afs/cbs.mpg.de/projects/afs000_afstest/scripts/process_subject process_subject
From now on, you can type
user@host > process_subject GT4K
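To check that the shell actually finds the shortened command, the built-in type command shows what a command name resolves to - the home directory path shown here is only illustrative:
user@host > type process_subject
process_subject is /home/user/bin/process_subject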
Advantage:
- No preparation is necessary before working with a single study. The shortened command is available as soon as you log into your Linux session.
Parallel Loops
Today's processors have multiple independent processing units (CPU cores) that can do things in parallel. Most of the institute's workstations have at least 4 of them. Bash can take advantage of this. However, to e.g. run 1000 jobs but only 4 of them at any given time, it's easier to use a helper program called fparallel. The original name of this shell tool is parallel; fparallel is a version of this tool adapted for the institute.
user@host > fparallel -c 'process_subject {arg}' -P 4 abc* --args
Explanation:
- This tool will call a command called process_subject once for every file/directory starting with the string abc in the current directory.
- This will even work if there are hundreds of thousands of files starting with abc in that directory.
- fparallel will run 4 process_subject instances at a time. Whenever a job finishes, another argument (in this case another file starting with abc) is taken from the job queue and processed.
- Running more instances in fparallel than your number of CPU cores might knock out your computer.
- Keep in mind that you might need processing power for your graphical session. Reserving one CPU core for that might be a good idea.
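To find out how many CPU cores your machine has (and thus a sensible value for -P), the standard nproc command prints the number of available processing units - the number shown is just an example:
user@host > nproc
8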
If you need more resources have a look here: ComputeClusterSlurm.
Pipelines and text processing
Unix operating systems (like the one on your workstation) make it very easy to process text-based data. A lot of commands in a basic Linux setup do just that - process text. Typically, you will combine several commands to process text-based data. In the simplest case, you type text on the command line and receive output directly.
This example converts all given lowercase letters to uppercase:
user@host > tr '[a-z]' '[A-Z]'
r
R
To process text files instead, a text file can be connected to a command's input channel like this:
user@host > tr '[a-z]' '[A-Z]' < text_to_be_uppercased.txt
This example will however just print the results on your terminal. If you need the results in a file, you need to connect a file to the output interface of the command:
user@host > tr '[a-z]' '[A-Z]' < text_to_be_uppercased.txt > uppercased_text.txt
The next step is to use multiple processing commands and avoid intermediate files by connecting the output channel of one command to the input channel of the next. This is called a pipeline.
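A minimal pipeline to illustrate the idea - the output of echo is fed straight into the tr command from above:
user@host > echo Hello World | tr '[a-z]' '[A-Z]'
HELLO WORLD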
Example
- You've got a text file containing ZIP codes of Germany.
- I downloaded it from http://download.geonames.org/export/zip - it's called DE.txt
- Every line looks like this: DE 04103 Leipzig Sachsen ... (country code, ZIP code, city name and other fields, separated by tabs). There are >16000 lines.
- You need the name of every city with a 4 as the second digit of its ZIP code.
This is what we'll do:
- We only need field 2 (ZIP code) and field 3 (city name) and throw away the rest: cut -d$'\t' -f2-3
- Then we apply a filter to only get ZIP codes with 4 as the second digit: grep '^.4'
- Since we're only interested in the city names, the ZIP codes are thrown away: cut -d$'\t' -f2
  - Remember: the city name was field 3 in the beginning, but field 1 was thrown away - it's field 2 now and will be field 1 after this step.
- There are a lot of city names appearing multiple times now. We're going to remove the duplicates. However, the command for that requires all duplicates to be on adjacent lines - we have to sort first:
  - To sort all the city names: sort
  - To remove the duplicates: uniq
These commands are now combined into a command pipeline, which means the output of one command is the input of the next:
user@host > cut -d$'\t' -f2-3 <DE.txt | grep '^.4' | cut -d$'\t' -f2 | sort | uniq > targetfile.txt
Hints:
- The pipeline only connects STDOUT to STDIN. All error messages are displayed in the terminal - independently of the command's position in the pipeline.
Pros and Cons
Con: Creating the pipeline in the example can be more complicated than defining a filter in Excel.
Pro: Now imagine 100000000 data sets. The example would still work, but Excel would have hit its 2^20 lines per sheet limit about 99 million lines ago. Since pipelines "stream" data, the data set doesn't have to fit into the computer's main memory (*).
Con: You have to understand what you're doing and check your results for plausibility!
Pro: Pipelines can be put into scripts which is the perfect documentation of processing steps. Try to document the clicks done in a spread sheet application. Or try to repeat it 100 times with subtle changes between the runs.
(*) The sort command actually has to store all data going through it in main memory - otherwise it couldn't sort data reliably. However, since the data passing through this command is already reduced by the filter stages before it, the impact is minimal.
Doing something with every line of a text file
If you want to do a simple text operation with every line of a text file, a while loop comes in handy.
Example: Add single quotes around the whole line (different methods):
- user@host > cat file | ( while read line;do echo "'${line}'";done ) > newfile
- user@host > cat file | awk "{ print \"'\" \$1 \"'\" }" > newfile
- user@host > cat file | sed "s/\(.*\)/'\1'/" > newfile