Saturday 7 April 2018

How to grep for UTF-8 characters?

UTF-8 can represent letters and symbols (such as emoticons) from scripts other than Latin.
The page http://www.utf8-chartable.de/unicode-utf8-table.pl lists Unicode characters in table form, together with their UTF-8 byte values.

Let's say you want to grep for all lines in a file that contain Arabic characters, and you don't know Arabic and can't even read it.
An easy way is to use the UTF-8 byte representation of the Arabic characters, provided grep can understand it.

grep does offer this facility through the -P option.
From the grep man page:
-P, --perl-regexp
              Interpret  PATTERN  as a Perl regular expression.  This is highly experimental and grep -P may warn of unimplemented
              features.
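
As a rough sketch of what the -P route could look like: with Perl syntax the pattern itself can spell out bytes in hex, so no helper command is needed. This assumes a GNU grep built with PCRE support. The sample file and its contents are made up for the illustration, and the octal escapes \330\263 passed to printf are simply the UTF-8 bytes d8 b3 of the Arabic letter seen.

```shell
# Hypothetical sample file: one ASCII line and one line holding the
# Arabic letter seen, written as its UTF-8 bytes d8 b3 (octal 330 263).
sample=$(mktemp)
printf 'Test\n\330\263\n' > "$sample"

# In the C locale grep works on raw bytes. Every character in
# U+0600..U+06FF is a lead byte d8..db followed by a byte 80..bf.
arabic_lines=$(LC_ALL=C grep -c -P '[\xd8-\xdb][\x80-\xbf]' "$sample")
echo "lines with Arabic: $arabic_lines"
rm -f "$sample"
```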

But a better approach is to use the very useful printf command.
printf is a more capable replacement for the good old echo.
It offers a strong set of escaping and formatting features, far superior to anything echo can offer.
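
A small illustration of those escaping and formatting features, as a sketch rather than a full tour. The variable names are invented for the example. Octal escapes are used because they are part of POSIX printf; the \xHH hex form is also accepted by the bash builtin, but not by every shell.

```shell
# Octal escapes are understood by every POSIX printf; \330\263 are the
# UTF-8 bytes d8 b3, i.e. the Arabic letter seen (U+0633).
seen=$(printf '\330\263')

# The format string is kept apart from the data, so padding and
# zero-filling are explicit instead of depending on echo's flags.
line=$(printf '%-6s|%03d' "abc" 7)
echo "$line"    # prints: abc   |007

# od shows the bytes printf actually produced.
bytes=$(printf '%s' "$seen" | od -An -tx1 | tr -d ' \n')
echo "$bytes"   # prints: d8b3
```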

Below is how you can grep for the Arabic character range:

[sherif@thingol ~]$ LC_ALL=C grep "$(printf '[\xd8-\xdb][\x80-\xbf]')" mixed_text |cat
س
ي
ا
ر
ا
ت
م
م
ي
ز
ة
ف
ي
س
و
ق
ل
م
س
تع
م
ل
[sherif@thingol ~]$ LC_ALL=C grep -v "$(printf '[\xd8-\xdb][\x80-\xbf]')" mixed_text |cat
A
GH
a
rt

Test
[sherif@thingol ~]$
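
To try the idea without the mixed_text file, here is a self-contained sketch. The file contents are invented, and the pattern is built with octal escapes (330..333 is hex d8..db, 200..277 is hex 80..bf) so that it also works in shells whose printf lacks \x. The sample includes a non-Arabic accented letter to check that it is not reported as Arabic.

```shell
# Hypothetical sample: ASCII, Arabic (seen, d8 b3), and a Latin
# accented letter (e with acute, c3 a9) that must not count as Arabic.
sample=$(mktemp)
printf 'Test\n\330\263\ncaf\303\251\n' > "$sample"

# Two bracket expressions built from raw bytes: a lead byte d8..db
# followed by a continuation byte 80..bf.
pattern=$(printf '[\330-\333][\200-\277]')

# LC_ALL=C makes grep compare bytes rather than decoded characters.
arabic=$(LC_ALL=C grep -c "$pattern" "$sample")
other=$(LC_ALL=C grep -c -v "$pattern" "$sample")
echo "Arabic lines: $arabic, other lines: $other"
rm -f "$sample"
```

Matching the lead byte and the continuation byte separately matters here: a single byte range covering 80..db would also hit the continuation byte a9 of the accented letter.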



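Setting the locale to C for grep makes it compare raw bytes instead of decoded characters; displaying the Arabic output correctly is up to the terminal.
We use the printf command to emit the UTF-8 bytes of the Arabic block from their hex values in the Unicode table, where U+0600 encodes as d8 80 and U+06FF as db bf.
We could also use the Unicode code points directly: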

[sherif@thingol ~]$ printf "\u0644 \u0628 \n"
ل ب
[sherif@thingol ~]$
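
The link between a code point and its hex bytes can be worked out with shell arithmetic. The sketch below covers only the two-byte range U+0080..U+07FF, which is where the Arabic block lives; utf8_2byte is a hypothetical helper name, not a standard command.

```shell
# Print the two UTF-8 bytes (in hex) for a code point in the
# two-byte range U+0080..U+07FF.
utf8_2byte() {
    cp=$1
    lead=$(( 0xC0 | (cp >> 6) ))     # 110xxxxx: top 5 bits
    cont=$(( 0x80 | (cp & 0x3F) ))   # 10xxxxxx: low 6 bits
    printf '%02x %02x\n' "$lead" "$cont"
}

utf8_2byte 0x0600   # first code point of the Arabic block, prints: d8 80
utf8_2byte 0x06FF   # last code point of the Arabic block, prints: db bf
```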


This lets you do any kind of string manipulation on Unicode characters.

Enjoy playing with Unicode.

1 comment:

  1. Using grep -P as per: https://stackoverflow.com/questions/3001177/how-do-i-grep-for-all-non-ascii-characters

    grep -P -n "[\x80-\xFF]" myfile
    grep -P -n "[^\x00-\x7F]" myfile
    The second command excludes the entire ASCII range by putting ^ at the start of the bracket expression
