Saturday, February 23, 2019

regex - Use grep to find either of two strings without changing the order of the lines?




I'm sure this has been asked, but I can't find it, so my apologies for any redundancy.



I want to use grep or egrep to find every line that contains either ' P ' or ' CA ' and write those lines to a new file. I can easily do it for one string or the other using:



egrep ' CA ' all.pdb > CA.pdb


or



egrep ' P ' all.pdb > P.pdb



I'm new to regex, so I'm not sure of the syntax for "or".



Update:
The order of the output lines is important, i.e. I do not want the output grouped by which string each line matched. Here is an example of the first few lines of one file:



ATOM      1 N    THR U  27     -68.535  88.128 -17.857  1.00  0.00      1H5  N  
ATOM 2 HT1 THR U 27 -69.437 88.216 -17.434 0.00 0.00 1H5 H
ATOM 3 HT2 THR U 27 -68.270 87.165 -17.902 0.00 0.00 1H5 H
ATOM 4 HT3 THR U 27 -68.551 88.520 -18.777 0.00 0.00 1H5 H
ATOM 5 CA LYS B 122 -116.643 85.931-103.890 1.00 0.00 2H2B C
ATOM 6 P THY J 2 -73.656 70.884 -7.805 1.00 0.00 DNA2 P
ATOM 8 HB THR U 27 -68.543 88.566 -15.171 0.00 0.00 1H5 H
ATOM 9 CA LYS B 122 -116.643 85.931-103.890 1.00 0.00 2H2B C
ATOM 10 P THY J 2 -73.656 70.884 -7.805 1.00 0.00 DNA2 P
ATOM 11 HB THR U 27 -68.543 88.566 -15.171 0.00 0.00 1H5 H
ATOM 12 C SER D 2 -73.656 70.884 -7.805 1.00 0.00 DNA2 C
ATOM 13 OP1 SER D 2 -73.656 70.884 -7.805 1.00 0.00 DNA2 O



and I want the result file for this example to be:



ATOM      5 CA   LYS B 122    -116.643  85.931-103.890  1.00  0.00      2H2B C  
ATOM 6 P THY J 2 -73.656 70.884 -7.805 1.00 0.00 DNA2 P
ATOM 9 CA LYS B 122 -116.643 85.931-103.890 1.00 0.00 2H2B C
ATOM 10 P THY J 2 -73.656 70.884 -7.805 1.00 0.00 DNA2 P

Answer



You can use grep like this:




grep ' P \| CA ' file > new_file


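The | operator means "or". In grep's default (basic) regular expressions a bare | is a literal character, so we have to escape it as \| to tell grep that it has a special meaning. Note that \| in basic regular expressions is a GNU grep extension.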



You can avoid the escaping, and group the alternatives more compactly, by using extended regular expressions with grep -E:



grep -E ' (P|CA) ' file > new_file
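Since the question stresses output order, here is a minimal sketch that checks it on a few lines taken from the sample input. It also shows a third spelling, passing each string with its own -e option, which should select the same lines as the alternation. The file name sample.pdb and the variable names are placeholders of mine, not part of the question.

```shell
# Build a small sample file (hypothetical name) from lines in the question.
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.pdb" <<'EOF'
ATOM 4 HT3 THR U 27 -68.551 88.520 -18.777 0.00 0.00 1H5 H
ATOM 5 CA LYS B 122 -116.643 85.931-103.890 1.00 0.00 2H2B C
ATOM 6 P THY J 2 -73.656 70.884 -7.805 1.00 0.00 DNA2 P
ATOM 8 HB THR U 27 -68.543 88.566 -15.171 0.00 0.00 1H5 H
ATOM 9 CA LYS B 122 -116.643 85.931-103.890 1.00 0.00 2H2B C
EOF

# Each -e adds one pattern; a line is printed if it matches any of them.
# grep reads the file top to bottom and prints matches as it goes,
# so the output keeps the input order.
matched=$(grep -e ' P ' -e ' CA ' "$tmpdir/sample.pdb")
printf '%s\n' "$matched"

# Same selection with the alternation syntax, for comparison.
alternation=$(grep -E ' (P|CA) ' "$tmpdir/sample.pdb")

rm -rf "$tmpdir"
```

The printed lines are atoms 5, 6 and 9, in that order: the CA and P lines stay interleaved exactly as they were in the input.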



In general, I prefer the awk syntax, since it is clearer and easier to extend:



awk '/ P / || / CA /' file


Or, given your sample input, you can use awk to test the 3rd column directly, so that a line is printed only when that column is exactly CA or P:



$ awk '$3=="CA" || $3=="P"' file
ATOM 5 CA LYS B 122 -116.643 85.931-103.890 1.00 0.00 2H2B C
ATOM 6 P THY J 2 -73.656 70.884 -7.805 1.00 0.00 DNA2 P
ATOM 9 CA LYS B 122 -116.643 85.931-103.890 1.00 0.00 2H2B C
ATOM 10 P THY J 2 -73.656 70.884 -7.805 1.00 0.00 DNA2 P
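One reason the column check can be worth it: the pattern-based commands match ' P ' or ' CA ' anywhere in the line, not only in the atom-name column. The sketch below uses a made-up first line whose chain ID is P (invented for illustration, it is not in the question's data) to show the difference. It assumes the fields are separated by whitespace, as in the sample; the file and variable names are placeholders.

```shell
# Hypothetical three-line file; the first line has chain ID "P" in column 5.
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.pdb" <<'EOF'
ATOM 1 N THR P 27 -68.535 88.128 -17.857 1.00 0.00 1H5 N
ATOM 5 CA LYS B 122 -116.643 85.931-103.890 1.00 0.00 2H2B C
ATOM 6 P THY J 2 -73.656 70.884 -7.805 1.00 0.00 DNA2 P
EOF

# Pattern match: any " P " or " CA " in the line counts, so atom 1 slips in.
by_pattern=$(grep -E ' (P|CA) ' "$tmpdir/sample.pdb" | awk '{print $2}' | tr '\n' ' ')

# Field match: only the 3rd whitespace-separated column is compared.
by_field=$(awk '$3 == "CA" || $3 == "P"' "$tmpdir/sample.pdb" | awk '{print $2}' | tr '\n' ' ')

echo "pattern match, atom numbers: $by_pattern"
echo "field match, atom numbers: $by_field"

rm -rf "$tmpdir"
```

Here the pattern match reports atoms 1, 5 and 6, while the field match reports only 5 and 6.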


Test



$ cat file
hello P is here and CA also
but CA appears
nothing here
P CA
$ grep ' P \| CA ' file
hello P is here and CA also
but CA appears
$ grep -E ' (P|CA) ' file
hello P is here and CA also
but CA appears
$ awk '/ P / || / CA /' file
hello P is here and CA also
but CA appears
