RegEx, can someone explain, please

xiamenese
Posts: 513
Joined: 2006-12-08 00:46:44
Location: London or Exeter, UK

Post by xiamenese »

On the Scrivener forum, a user wanted to search their project simultaneously for the word "house" as well as "garden", "gardens", "gardener" etc, but without finding "household" etc. Obviously, the answer is to use RegEx, so I turned to NWP as the best way to do it, as I'm not fluent in RegEx. I created a sentence, "They had a huge house, with a household staff of 20, and 4 acres of garden managed by a team of ten gardeners."

Another poster suggested the RegEx (\bhouse\b|\bgarden\b), which would only find "garden", not the derivatives. I used PowerFind to set up "0 or more lower case characters" and then switched the search to PowerFind Pro to turn it into code, which gave me \p{Lower}*. So my final PowerFind Pro expression was (\bhouse\b|\bgarden\p{Lower}*\b). That performed the search perfectly on my sentence in both NWP and Scrivener. However…
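For anyone who wants to check the two expressions outside either app, here is a minimal sketch in Python. One assumption to flag loudly: Python's built-in re module does not understand \p{Lower}, so [a-z] stands in for it below, which is only equivalent for unaccented ASCII letters.

```python
import re

sentence = ("They had a huge house, with a household staff of 20, "
            "and 4 acres of garden managed by a team of ten gardeners.")

# The expression suggested by the other poster.
narrow = re.compile(r"(\bhouse\b|\bgarden\b)")

# Assumption: [a-z] is a stand-in for \p{Lower}, which Python's re lacks.
wide = re.compile(r"(\bhouse\b|\bgarden[a-z]*\b)")

narrow_hits = narrow.findall(sentence)
wide_hits = wide.findall(sentence)

print(narrow_hits)  # ['house', 'garden']
print(wide_hits)    # ['house', 'garden', 'gardeners']
```

Neither version picks up "household", because \b after "house" requires the word to end there.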

Wanting to know more about it, I looked up the relevant sections in the NWP manual which told me that:

\b is the code for "Backspace", yet in my expression the pairs act as boundaries, essentially equivalent to "Whole word". The manual lists \m as the beginning of a word, and from the examples the end of a word seems to be \M, though the latter is not explained anywhere.

In the manual, I couldn't find any reference to \p, though the {Lower} part is self-explanatory. The manual gives [[:lower:]] as the code for finding any lowercase alphabetical character.

So, could someone please explain why my RegEx works in both apps? Rewriting it to follow the NWP manual also works in NWP (of course!), but it doesn't work in Scrivener.

:)

Mark
martin
Official Nisus Person
Posts: 5119
Joined: 2002-07-11 17:14:10
Location: San Diego, CA

Post by martin »

Ahhh regex, it's a wonderful tool — true text processing magic — but also deep and arcane 🔮✨

The \b escape sequence actually stands for a word boundary. It's a zero-length match like "start of paragraph". The Nisus Writer user guide is incorrect in saying it matches a backspace, sorry! Perhaps that was true once, but no longer.

As for [[:lower:]] and \p{lower}, those are both still valid. The \p escape is the newer Unicode character class syntax. You can match other special types of characters like \p{Currency_Symbol}. A nice trick is that you can negate a class using a capital P instead, e.g. \P{Currency_Symbol} matches all characters that aren't currency symbols.
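To make the \p versus \P idea concrete, here is a rough sketch in Python. Python's built-in re module has no \p{...} support, so this is not the regex syntax itself. Instead it uses the standard unicodedata module to test for the Unicode general category "Sc", which is the category that Currency_Symbol names. The helper name is_currency_symbol and the sample text are made up for illustration.

```python
import unicodedata

def is_currency_symbol(ch: str) -> bool:
    # General category "Sc" is the one \p{Currency_Symbol} refers to.
    return unicodedata.category(ch) == "Sc"

text = "Tea: £3, $4 or €5"

# Roughly what \p{Currency_Symbol} would match, one character at a time.
inside = [ch for ch in text if is_currency_symbol(ch)]

# Roughly what \P{Currency_Symbol} would match: everything else.
outside = "".join(ch for ch in text if not is_currency_symbol(ch))

print(inside)   # ['£', '$', '€']
print(outside)  # Tea: 3, 4 or 5
```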

I hope that helps. Let me know if you have any more questions. Oh and you could do worse than to consult this website:

https://www.regular-expressions.info

It's not a bad way to introduce oneself to additional regex concepts.
adryan
Posts: 492
Joined: 2014-02-08 12:57:03
Location: Australia

Post by adryan »

G'day, Martin, Mark et al

I was just about to post this when I saw Martin's posting. I might as well post mine anyway, just in case it sheds a bit of extra light.

For a start, as I understand it, "\b" only represents a backspace when it is enclosed in square brackets to act as a wildcard character.

If it's not enclosed in square brackets, "\b" represents a word boundary. A word boundary is more a position than a character. It separates two strings, each consisting solely of either word characters or non-word characters and each being the longest run thereof. A word character in NWP is basically any alphanumeric character or the underscore character (although some non-printing characters are also included); note that the hyphen is treated as a non-word character. If you have "\b" (without the quotation marks) as the search term by itself, you can observe the cursor skip across alternating runs of word characters and non-word characters as you successively perform the Find Next operation.
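The "position, not character" idea can be seen by listing where \b matches. This is a small sketch in Python rather than NWP, on the assumption that Python's notion of a word character (alphanumerics plus underscore) is close enough to NWP's for this sample. The sample string is invented.

```python
import re

sample = "well-kept my_garden"

# Every match of \b is empty; only its position is of interest.
boundaries = [m.start() for m in re.finditer(r"\b", sample)]

print(boundaries)  # [0, 4, 5, 9, 10, 19]
```

The hyphen at index 4 produces a boundary on each side of it, while the underscore at index 12 produces none, so "my_garden" counts as a single run of word characters.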

Thus, "\bgarden" will not return the "garden" in "mygarden" because the "\b" wants to station itself here just after a string of non-word characters, and the succeeding (word) string does not commence with "garden". However, it will return the "garden" in "gardens" and "gardener".

On the other hand, "garden\b" will return the "garden" in "mygarden" because the "\b" wants to station itself here just before a string of non-word characters, and the preceding (word) string does end with "garden". But it won't return the "garden" in "gardens" or "gardener", for the now obvious reason.

So "\bgarden\b" will return only free-standing instances of "garden" (ie, where it constitutes the entire word).
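The three cases above can be checked with a few lines of Python, whose re module treats \b the same way for these plain ASCII words. The helper name hits is just for illustration.

```python
import re

words = ["garden", "gardens", "gardener", "mygarden"]

def hits(pattern, candidates):
    # Return the candidate words in which the pattern matches somewhere.
    return [w for w in candidates if re.search(pattern, w)]

leading = hits(r"\bgarden", words)    # boundary required before "garden"
trailing = hits(r"garden\b", words)   # boundary required after "garden"
both = hits(r"\bgarden\b", words)     # boundary required on both sides

print(leading)   # ['garden', 'gardens', 'gardener']
print(trailing)  # ['garden', 'mygarden']
print(both)      # ['garden']
```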

"\m" also represents a position, in this case immediately preceding a string consisting solely of word characters. Such a string does not necessarily represent a word in our everyday use of the term because the insertion point will halt just after a hyphen and what follows may not necessarily be a word you can find in your dictionary.

"\M" moves to a position immediately succeeding a string consisting solely of word characters, so it will halt just before a hyphen.

If you perform successive Find Next operations with each of these expressions as the sole search term, you will immediately see the differences from "\b". However, the effect of "\bgarden\b" and "\mgarden\M" should be the same.
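Python's re module has no "\m" or "\M", but their behaviour as described above can be approximated with lookarounds. This is only a sketch: the two stand-in patterns below are my assumption about how NWP behaves, based on the description in this thread, not NWP's actual implementation.

```python
import re

# Assumption: these lookarounds approximate NWP's \m and \M.
WORD_START = r"(?<!\w)(?=\w)"  # stand-in for \m: just before a run of word characters
WORD_END = r"(?<=\w)(?!\w)"    # stand-in for \M: just after a run of word characters

def positions(pattern, text):
    # Start index of every match, including empty (zero-length) matches.
    return [m.start() for m in re.finditer(pattern, text)]

sample = "a well-kept garden"
starts = positions(WORD_START, sample)
ends = positions(WORD_END, sample)
boundaries = positions(r"\b", sample)

text = "mygarden garden gardens"
with_b = positions(r"\bgarden\b", text)
with_m = positions(WORD_START + "garden" + WORD_END, text)

print(starts)      # [0, 2, 7, 12]
print(ends)        # [1, 6, 11, 18]
print(boundaries)  # [0, 1, 2, 6, 7, 11, 12, 18]
print(with_b, with_m)  # [9] [9]
```

Used alone, the two stand-ins each stop at half of the places "\b" does; wrapped around "garden", they find the same single free-standing word.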

It is not clear to me whether "\m" and "\M" are used like this in grep elsewhere than in NWP. (I think there is a variety of implementations of grep out there.)

Finally, when it comes to the letter case, I think that "\p{Lower}" is a more universal syntax, but it implies that a character class named "Lower" has been defined somewhere. In the current context, of course, the class would designate lowercase characters. I think the double-bracketed notation is NWP's own way of referring to these character classes, which is probably why it doesn't work in Scrivener (which I don't have). But the two forms seem to be equivalent in NWP. Note that the first form is displayed when you choose from the drop-down menu in the Find & Replace dialog box, but the Manual only mentions the second.

I hope I've got all this right — it seems consistent with what Martin has said — and that it helps to explain stuff.

The main trouble with these rabbit holes is that there never seems to be a barista in attendance. More searching required….

Cheers,
Adrian
MacBook Pro (M1 Pro, 2021)
macOS Monterey 12.6
Nisus Writer user since 1996
xiamenese
Posts: 513
Joined: 2006-12-08 00:46:44
Location: London or Exeter, UK

Post by xiamenese »

Hello Martin and Adrian,

I feel somewhat vindicated in starting this thread, in that, as Martin acknowledges, there is an error in the manual re \b. While waiting for any answers, I did some research myself. A fellow Scrivener user pointed out that Scrivener uses the stock RegEx engine supplied by the Mac, which uses the UTF-8 compatible ICU guidelines.

Having been told that, I then looked up ICU RegEx and made myself a "cheat-sheet" of it. ICU doesn't seem to list \m and \M, so, Adrian, that bears out your experience of those metacharacters and NWP. But thank you for explaining the difference between \b … \b and \m … \M.

Thank you also, Martin, for the link to Regular-Expressions.info. I have bookmarked it so I can spend time going over it and return to it when necessary. I have also bookmarked regex101.com, as it allows you to build, test and debug a regex you're creating. For someone who is just beginning to dip their toes into the RegEx water, though, it tells you at what point your expression fails, but not why. So the best tutor I've found is NWP: create your expression in PowerFind, which even with the current display problems on Monterey is more readable for the novice than a string of arcane metacharacters, then convert it to PowerFind Pro to see the full expression.

[Attachment: Screenshot 2022-03-12 at 18.45.42.png]

:D

Mark