Sunday 27 October 2019

Regular experession matching in Postgres

Lately I ran across the need to extract user id text strings from a text colum in a database.
The text could be large and might contain multiple user id matches in the same row.

Thus I imported to the data into Postgres and started looking for Postgres functions that would help me use regular expressions to extract a text part of the column that matches the user id text.

I found a very useful Postgres function: REGEXP_MATCHES('input string', 'regexp pattern', 'flags').
This function works very similar to the sed Unix command, it applies the pattern against input string and allows the use of flag modifiers similar to the one used with sed.

Below is an example of a sample data:

The ids as you can see can be found in any part of the comment-text field.
A query using the  REGEXP_MATCHES function would help extract those Ids into a string arrary type defined in Postgres, the query looks like below:

select REGEXP_MATCHES("comment-text", '\[uidkey\:user[a-fA-F0-9]+\]','g') from comments where "comment-text" like '%[uidkey:%]%';

The output of the query looks like below:
As you can see the REGEXP_MATCHES function matched all 5 ids from the 2 rows that contain them and returned 5 array objects with 1 element each containing those ids.
To cast the array objects to standard text strings we use another Postgres function: ARRAY_TO_STRING(Array_object, 'array element separator', 'optional null element replacement'.

The final version of the query looks like below:

select array_to_string(REGEXP_MATCHES("comment-text", '\[uidkey\:user[a-fA-F0-9]+\]','g'),',')
from comments where "comment-text" like '%[uidkey:%]%';

The output of the query looks like below:
Thus, using those methods, we can extract text from text fields and build more complex queries based on that data.

Enjoy playing with your data !!






1 comment:

  1. Further to this, adding some casting would also allow us to change the string to an integer:

    select productname, productid from
    (select productname, cast (array_to_string(REGEXP_MATCHES(productmeta, '[0-9]+','g'),',') as integer) as productid
    from (
    select id , productname ,array_to_string(REGEXP_MATCHES(product_descriptor, 'meta name="+product.id"+>[0-9]+<','g'),',') as productmeta
    from productmetadata
    where descriptor like '%product.id%'
    order by id
    ) as round1 ) as round2
    where productid not in (select pid from products)

    Very useful in extracting data from large text fields and using it in doing Joins.

    ReplyDelete