Tutorial
Here is a demo to showcase the use of PyRegexBuilder.
Problem
Hidden in the text below is a Cyrillic letter that is preceded by exactly 2 Greek or Latin letters (no more, no less).
text = "ウثイا月εЗ人水جИ山γבИЖㄱتאEイЖבИBДE人ججイEㄷجاجㄷアЗاجاㄴACㄴDAエβエЖאג月水水ㄹجAㅁDبエα山הبتウИג月Aاイب山αثㄴاИδㄹ水ア人ㄹتεדβE山ㄴבㅁEאЙエㄷ山אباγجウㄹثㅁEЗИ日山イ日ثاㄷβBAㅁЖ水ㅁ山日水ㅁ人Иオבㄴγב月ت月اβアהγبβㄱبИㄱبオتㅁエ水αEتㄷAアדㄱبדDדㄱエㄹ水ثД山ㄱباイβاイ水δㄹЗㄹ月γB山ЗAアイαEИبДجЖεتИγㅁאاオㄷDЙアEεイㄹAثЖדD日山日δδDИCب月ЗבαتγウЖדبج日גBبהイㄱㅁ月月月ЖجBㄱגエבДجבㄱㅁㄱ人"
Which letter is it? Can you find it using a regular expression?
Solution
Let's see how we would go about coming up with a solution using PyRegexBuilder.
- Import the required modules.
    from pyregexbuilder import *
    import regex as re
- Create a character set that will match Greek or Latin letters.
    greek_or_latin = PosixClass("IsGreek").union(UnicodeClass("IsLatin"))
See the available character classes.
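To get a feel for which characters such a set is meant to cover, here is a minimal standard-library sketch that does not use PyRegexBuilder. It assumes that the first word of a character's Unicode name is a good enough stand-in for its script, which is only a heuristic. The helper names `script_of` and `is_greek_or_latin` are hypothetical.

```python
import unicodedata


def script_of(ch: str) -> str:
    """Return the first word of the character's Unicode name, e.g. 'GREEK'.

    This is a heuristic stand-in for the Unicode script property (assumption).
    """
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def is_greek_or_latin(ch: str) -> bool:
    """True if the character is a letter whose name starts with GREEK or LATIN."""
    return ch.isalpha() and script_of(ch) in {"GREEK", "LATIN"}


sample = "Aβ水Ж"
flags = [is_greek_or_latin(ch) for ch in sample]
print(list(zip(sample, flags)))
```

In this sample only `A` and `β` are classified as Greek or Latin; `水` and `Ж` are not.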
- Now let's start constructing the regular expression.
    greek_or_latin = PosixClass("IsGreek").union(UnicodeClass("IsLatin"))
    expression = (
        Regex(
            ...
        )
    )
- Create a lookbehind assertion that checks that exactly the 2 preceding characters are Greek or Latin letters: whatever comes before them must be a word boundary or a character that is neither Greek nor Latin.
    greek_or_latin = PosixClass("IsGreek").union(UnicodeClass("IsLatin"))
    expression = (
        Regex(
            PositiveLookbehind(
                ChoiceOf(Anchor.WORD_BOUNDARY, greek_or_latin.inverted),
                Repeat(greek_or_latin, count=2)
            )
        )
    )
See the available assertions and quantifiers.
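The "exactly 2" idea can also be sketched with Python's built-in `re` module. This is not what PyRegexBuilder generates; it is an alternative formulation under two assumptions: explicit character ranges stand in for the Greek, Latin, and Cyrillic classes, and, because the built-in `re` only accepts fixed-width lookbehinds, the check is split into one positive and one negative lookbehind. The helper `find` is hypothetical.

```python
import re

# Rough stand-in for "Greek or Latin letter" (assumption: basic ranges only).
GL = "A-Za-zΑ-Ωα-ω"

pattern = re.compile(
    rf"(?<=[{GL}]{{2}})"   # the 2 preceding characters are Greek or Latin
    rf"(?<![{GL}]{{3}})"   # but the 3 preceding characters are not all Greek or Latin
    r"[А-Яа-я]"            # the current character is a basic Cyrillic letter
)


def find(text: str):
    """Return the first matching Cyrillic letter, or None if there is none."""
    match = pattern.search(text)
    return match.group() if match else None


print(find("水AβЖ"))  # two Greek/Latin letters before Ж
print(find("AβγЖ"))   # three Greek/Latin letters before Ж
print(find("水βЖ"))   # only one Greek/Latin letter before Ж
```

Only the first string should yield a letter; the other two violate the "no more, no less" condition.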
- After looking behind, capture the current letter if it is part of the Cyrillic alphabet.
    greek_or_latin = PosixClass("IsGreek").union(UnicodeClass("IsLatin"))
    expression = (
        Regex(
            PositiveLookbehind(
                ...
            ),
            Capture(PosixClass("IsCyrillic"), name="character"),
        )
    )
Groups can be named or unnamed. See the available groups.
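As an illustration of how a named group is read back from a match, here is a small sketch using the built-in `re` module. It assumes that a capture named `character` ends up as an ordinary named group in the compiled pattern; the Cyrillic range is a simplified stand-in.

```python
import re

# A named group called "character" around a basic Cyrillic range (assumption).
pattern = re.compile(r"(?P<character>[А-Яа-я])")

match = pattern.search("水AβЖ")

by_name = match.group("character")   # access by group name
by_index = match.group(1)            # the same group, accessed by position
as_dict = match.groupdict()          # all named groups as a dictionary

print(by_name, by_index, as_dict)
```

Naming the group keeps the lookup readable even if more groups are added to the expression later.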
- Set the flags for the regular expression and compile.
(Here, these flags are not necessary; they are just an example.)
    greek_or_latin = PosixClass("IsGreek").union(UnicodeClass("IsLatin"))
    expression = (
        Regex(
            ...
        )
        .with_flags({"IGNORECASE": True})
        .with_global_flags({"VERSION1": True})
        .compile()
    )
See the available scoped and global flags.
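The difference between a scoped and a global flag can be sketched with the built-in `re` module. This assumes that a scoped flag behaves like an inline flag group such as `(?i:...)`, which applies to only part of the pattern, while a global flag applies to the whole pattern. Whether PyRegexBuilder emits exactly this form is not confirmed here.

```python
import re

# Scoped: only "ab" is case-insensitive; "C" must still be uppercase.
scoped_pattern = re.compile(r"(?i:ab)C")

# Global: the whole pattern is case-insensitive.
global_pattern = re.compile(r"abC", re.IGNORECASE)

scoped_upper = scoped_pattern.fullmatch("ABC") is not None
scoped_lower = scoped_pattern.fullmatch("ABc") is not None
global_lower = global_pattern.fullmatch("ABc") is not None

print(scoped_upper, scoped_lower, global_lower)
```

The scoped pattern rejects `ABc` because the flag does not reach the final `C`, whereas the global pattern accepts it.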
- Now you can use the regular expression. Here is the full code:
    from pyregexbuilder import *
    import regex as re

    text = "ウثイا月εЗ人水جИ山γבИЖㄱتאEイЖבИBДE人ججイEㄷجاجㄷアЗاجاㄴACㄴDAエβエЖאג月水水ㄹجAㅁDبエα山הبتウИג月Aاイب山αثㄴاИδㄹ水ア人ㄹتεדβE山ㄴבㅁEאЙエㄷ山אباγجウㄹثㅁEЗИ日山イ日ثاㄷβBAㅁЖ水ㅁ山日水ㅁ人Иオבㄴγב月ت月اβアהγبβㄱبИㄱبオتㅁエ水αEتㄷAアדㄱبדDדㄱエㄹ水ثД山ㄱباイβاイ水δㄹЗㄹ月γB山ЗAアイαEИبДجЖεتИγㅁאاオㄷDЙアEεイㄹAثЖדD日山日δδDИCب月ЗבαتγウЖדبج日גBبהイㄱㅁ月月月ЖجBㄱגエבДجבㄱㅁㄱ人"

    # Character set matching Greek or Latin letters.
    greek_or_latin = PosixClass("IsGreek").union(UnicodeClass("IsLatin"))

    expression = (
        Regex(
            # Look behind: a word boundary or a non-Greek/Latin character,
            # followed by exactly 2 Greek or Latin letters.
            PositiveLookbehind(
                ChoiceOf(Anchor.WORD_BOUNDARY, greek_or_latin.inverted),
                Repeat(
                    greek_or_latin,
                    count=2,
                )
            ),
            # Capture the Cyrillic letter in a group named "character".
            Capture(PosixClass("IsCyrillic"), name="character"),
        )
        .with_flags({"IGNORECASE": True})
        .with_global_flags({"VERSION1": True})
        .compile()
    )

    match = re.search(expression, text)
    print(match)
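To cross-check a regex result without any regex engine, the problem statement can also be sketched as a plain scan over the text. This is an independent check and not part of PyRegexBuilder. It assumes that the first word of a character's Unicode name approximates its script, and it treats the start of the text as an acceptable left edge. The function names are hypothetical.

```python
import unicodedata


def script_of(ch: str) -> str:
    """First word of the Unicode name, used as a rough script label (assumption)."""
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else ""


def find_hidden_letters(text: str):
    """Return (index, letter) for each Cyrillic letter preceded by exactly
    2 Greek or Latin letters."""
    hits = []
    for i, ch in enumerate(text):
        if script_of(ch) != "CYRILLIC":
            continue
        # Count consecutive Greek/Latin letters directly before position i,
        # looking at no more than 3 characters.
        run = 0
        for prev in reversed(text[max(0, i - 3):i]):
            if script_of(prev) in {"GREEK", "LATIN"}:
                run += 1
            else:
                break
        if run == 2:
            hits.append((i, ch))
    return hits


sample = "水AβЖγδεИ"
print(find_hidden_letters(sample))
```

Running `find_hidden_letters(text)` on the puzzle text and comparing the result with `match.group("character")` gives a quick sanity check of the compiled expression.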