How to Find All Non-ASCII Characters

raybythelake
Posts: 12
Joined: 2020-03-03 11:20:26

Post by raybythelake »

How do I find ALL non-ASCII characters in a text document I have opened in Nisus Writer Pro? I have a large text export from a database and I need to locate and remove all non-ASCII characters from it. Thanks for any guidance on how to do this.

Ray

martin
Official Nisus Person
Posts: 4693
Joined: 2002-07-11 17:14:10
Location: San Diego, CA

Re: How to Find All Non-ASCII Characters

Post by martin »

Hi Ray! Nisus Writer can definitely help you with this task. You can use PowerFind Pro (aka: regular expressions / regex) to find all the characters you don't want and replace them with nothing, effectively deleting them. The trick will be in identifying the characters you do want to keep.

The basic idea will be to use a negated character class to find all the characters you don't want. You do that with an expression like this:

[Image: find.png]

The expression in the find field will match all characters that are not inside the brackets (aside from the leading ^ character which does the negation). The expression [^abc] matches all characters except a, b, and c; that expression would find and replace d, e, f, etc.

You want to find all characters that are not ASCII. To do that we'll use a character range inside the negation. The pattern [^a-z] would match all characters that aren't a through z. But you want to replace all non-ASCII characters, so you can use the pattern [^\x00-\xFF]. Strictly speaking, ASCII ends at \x7F; the range \x00-\xFF also covers the Latin-1 characters (accented letters like é, for example), so those are kept as well. If you want ASCII only, use [^\x00-\x7F] instead. However, the pattern might also replace other characters you want to keep, like curly quotes. If you need to preserve additional characters, just add them to the end of the character class. For example, to delete all non-ASCII characters, but preserve curly quotes, you could use this find pattern:

[^\x00-\xFF“”‘’]

I hope that helps! Please let us know if you have any more questions.
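
For readers who want to try the negated character class outside Nisus Writer, here is a minimal sketch using Python's `re` module as a stand-in for PowerFind Pro. The constant names and the sample string are made up for illustration, and the two engines may not treat every escape identically, so take it as a demonstration of the idea rather than of Nisus Writer's exact behavior.

```python
import re

# [^abc] matches any single character except a, b, or c,
# so replacing matches with nothing leaves only a, b, and c.
print(re.sub(r"[^abc]", "", "a1b2c3d4"))

# The pattern from this thread: keep \x00-\xFF plus the four curly quotes,
# delete everything else. (Hypothetical constant name.)
KEEP_LATIN1_AND_QUOTES = re.compile("[^\\x00-\\xFF\u201C\u201D\u2018\u2019]")

# Strict ASCII variant: the ASCII range ends at \x7F.
NON_ASCII = re.compile(r"[^\x00-\x7F]")

# Sample text: curly quotes, an accented e, a euro sign, and a check mark.
sample = "\u201CCaf\u00E9\u201D costs 5\u20AC \u2713"

latin1_result = KEEP_LATIN1_AND_QUOTES.sub("", sample)
ascii_result = NON_ASCII.sub("", sample)
print(repr(latin1_result))
print(repr(ascii_result))
```

The first pattern keeps the accented letter and the curly quotes, while the strict variant removes both. That difference is worth checking against your own data before deleting anything.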

adryan
Posts: 324
Joined: 2014-02-08 12:57:03
Location: Australia

Re: How to Find All Non-ASCII Characters

Post by adryan »

G'day, raybythelake et al

A few comments that may be helpful here.

For a large range of technical Find/Replace expressions, see (in the Help menu) the Nisus Writer User Guide, pp. 470–477.

In a situation such as this, the Find expression could be quite complicated. It may be prudent to Find and Replace in stages, so you can check things are proceeding as you expect. If you are happy with the results of a stage, save the document and then work on a duplicate for the next stage. With a large document and a complicated project, you should of course be working with a duplicate in the first place.

I would advise against deleting anything without first checking what will be deleted. (The larger the document, the more complicated the operation and the less experienced you are, the more you ignore this admonition at your peril!) So don't just leave the Replace field blank and blithely hit one of the Replace buttons. Nor should you perform a Find operation and then blithely hit the Delete key to delete the selection. Have a good hard look before committing yourself.

A command that could be useful in this context is Edit > Select > Invert Selection.

As Martin has already intimated, the precise formulation of any Find expression(s) involved will depend on your particular requirements. To this end, careful perusal of the original document, paying particular attention to any peculiarities that may prove helpful or problematic, can help to expedite the whole process.

When it comes to ASCII characters, some are printable and some are not. So just be careful about exactly what you want. Tab characters originating from a database are probably going to be significant, and ill-considered deletion could result in — ahem! — untoward consequences, to put it mildly.
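
To make the point about printable versus non-printable ASCII concrete, here is a rough Python sketch (not a Nisus Writer macro). It assumes you want to keep printable ASCII plus tab, line feed, and carriage return, and delete everything else. The names `UNWANTED` and `clean_keep_tabs`, and the sample row, are hypothetical.

```python
import re

# Matches anything that is NOT tab (\x09), line feed (\x0A),
# carriage return (\x0D), or printable ASCII (\x20 through \x7E).
# That covers non-ASCII characters and the other ASCII control characters.
UNWANTED = re.compile(r"[^\x09\x0A\x0D\x20-\x7E]")

def clean_keep_tabs(text: str) -> str:
    """Delete unwanted characters while leaving field separators intact."""
    return UNWANTED.sub("", text)

# A made-up database row: tab-separated, with a stray NUL and a no-break space.
row = "id\tname\x00\tcity\u00A0\n"
cleaned_row = clean_keep_tabs(row)
print(repr(cleaned_row))
```

The tabs and the trailing newline survive, so the column structure of the export is preserved.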

Cheers,
Adrian
MacBook Pro (mid-2014)
macOS Mojave 10.14.6
Nisus Writer user since 1996

ScottinPollock
Posts: 35
Joined: 2017-09-11 08:16:47

Re: How to Find All Non-ASCII Characters

Post by ScottinPollock »

martin wrote:
2020-03-03 14:57:46
to delete all non-ASCII characters, but preserve curly quotes, you could use this find pattern:

[^\x00-\xFF“”‘’]
Just curious... why does this find Fi and st?

adryan
Posts: 324
Joined: 2014-02-08 12:57:03
Location: Australia

Re: How to Find All Non-ASCII Characters

Post by adryan »

G'day, ScottinPollock et al

I suspect that, in the font you are using, those character pairs are ligatures and not distinct ASCII characters.

Cheers,
Adrian
MacBook Pro (mid-2014)
macOS Mojave 10.14.6
Nisus Writer user since 1996

martin
Official Nisus Person
Posts: 4693
Joined: 2002-07-11 17:14:10
Location: San Diego, CA

Re: How to Find All Non-ASCII Characters

Post by martin »

adryan wrote:
2020-03-03 23:53:25
I suspect that, in the font you are using, those character pairs are ligatures and not distinct ASCII characters.
That's basically right, although the font is not a relevant factor. If an "fi" is matched in your text by the regex [^\x00-\xFF] it means that your text does not contain the characters "f" and "i" next to each other. Your text instead has a single ﬁ character (Unicode U+FB01). That's a single "pre-composed" ligature character, no matter what font you are using.

For more on this you might read this FAQ on the problems with ligatures. Generally speaking there are no issues in using display ligatures (as displayed by the applied font), but pre-composed ligature characters can cause a variety of unexpected problems.
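
The difference between a pre-composed ligature character and two ordinary letters can be shown with a short Python sketch. The use of NFKC normalization to split the ligature is an addition for illustration and is not something the thread relies on. The variable names are hypothetical.

```python
import unicodedata

ligature = "\uFB01st"   # U+FB01 (pre-composed ligature) followed by "st"
plain = "fist"          # four ordinary ASCII letters

# The ligature string is one character shorter, even if both look alike on screen.
print(len(ligature), len(plain))

# Unicode's own name for the first character of the ligature string.
ligature_name = unicodedata.name(ligature[0])
print(ligature_name)

# NFKC compatibility normalization replaces the ligature with "f" + "i".
normalized = unicodedata.normalize("NFKC", ligature)
print(normalized == plain)
```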

ScottinPollock
Posts: 35
Joined: 2017-09-11 08:16:47

Re: How to Find All Non-ASCII Characters

Post by ScottinPollock »

Thanks guys... but I don't think that is what is going on here. Please have a quick look at this quick demo.

martin
Official Nisus Person
Posts: 4693
Joined: 2002-07-11 17:14:10
Location: San Diego, CA

Re: How to Find All Non-ASCII Characters

Post by martin »

Thanks for the video. The fist is indeed fierce! 👊🏻🦁

However, what I mentioned above about pre-composed ligature characters may still be true. I don't recognize your active keyboard layout. It's conceivable that it's generating pre-composed ligature characters as you type. That seems like undesirable behavior to me, but it's certainly possible.

Could you please save your sample file after you type it out? If you can post it here (or email it to us privately) I can give you a more definitive answer.

ScottinPollock
Posts: 35
Joined: 2017-09-11 08:16:47

Re: How to Find All Non-ASCII Characters

Post by ScottinPollock »

Here you go.
[Attachment: FindFist.zip]
BTW, I am still on 2.1.10.

martin
Official Nisus Person
Posts: 4693
Joined: 2002-07-11 17:14:10
Location: San Diego, CA

Re: How to Find All Non-ASCII Characters

Post by martin »

Thanks for posting the file. There's nothing exotic about it after all. You have normal "f" and "i" characters, and not any pre-composed ligatures. If you ever need to check yourself, you can always select a character and use the menu Edit > Transform Text > Convert Character Codes > To Unicode Code Points.
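
Outside Nisus Writer, the same kind of check can be approximated in a few lines of Python. This is only a sketch of the idea of listing code points, not a description of how the menu command works, and the function name `code_points` is hypothetical.

```python
def code_points(text: str) -> list[str]:
    """Return the Unicode code point of each character in U+XXXX form."""
    return [f"U+{ord(ch):04X}" for ch in text]

# Two ordinary letters give two code points in the ASCII range.
print(code_points("fi"))

# A pre-composed ligature gives a single code point outside it.
print(code_points("\uFB01"))
```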
ScottinPollock wrote:
2020-03-05 01:46:12
BTW, I am still on 2.1.10.
Aha! That's the problem. The regex engine used by Nisus Writer Pro version 2 has a bug with negated character classes. We updated the regex engine for version 3, which fixed the problems.

If you need to work around the bug in version 2, you can turn off the "ignore case" search option. I think that avoids the bugs in the search engine, though I forget all the particulars, so please double-check to make sure it's matching what you expect.

ScottinPollock
Posts: 35
Joined: 2017-09-11 08:16:47

Re: How to Find All Non-ASCII Characters

Post by ScottinPollock »

martin wrote:
2020-03-05 07:42:37
If you need to work around the bug in version 2, you can turn off the "ignore case" search option. I think that avoids the bugs in the search engine, though I forget all the particulars, so please double-check to make sure it's matching what you expect.
Thanks Martin! This is not something I usually do in Nisus, I was just curious about the unexpected behavior.

raybythelake
Posts: 12
Joined: 2020-03-03 11:20:26

Re: How to Find All Non-ASCII Characters

Post by raybythelake »

Thanks everyone for your suggestions. Really appreciate the help. I worked my way through the find examples in the user guide, but got overwhelmed by the level of geek-ness required to follow what was being suggested. Appreciate the specific answers to my question that were offered in this thread!

Best to all,

Ray

raybythelake
Posts: 12
Joined: 2020-03-03 11:20:26

Re: How to Find All Non-ASCII Characters

Post by raybythelake »

Sorry, Martin. I'm not a technical person and was not able to follow your thorough explanation. I just want to REMOVE all non-ascii characters. So should I use the [^\x00-\xFF“”‘’] expression in the Regex PowerFind?

Ray

martin
Official Nisus Person
Posts: 4693
Joined: 2002-07-11 17:14:10
Location: San Diego, CA

Re: How to Find All Non-ASCII Characters

Post by martin »

raybythelake wrote:
2020-03-06 12:19:40
I just want to REMOVE all non-ascii characters. So should I use the [^\x00-\xFF“”‘’] expression in the Regex PowerFind?
That's probably a good place to start. So you want to:

1. Open the Find panel.
2. Set the "Using" popup button to "PowerFind Pro (regex)".
3. Paste the expression [^\x00-\xFF“”‘’] into the "find what" field.
4. Leave the "replace with" field blank.
5. Click the "Replace All" button.

Let us know how it goes, or if the results are not what you wanted.
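
For comparison, the steps above amount to a "replace every match with nothing" operation. A minimal Python sketch of that operation follows, using the same pattern and reporting how many characters were removed. The names `PATTERN` and `replace_all` are hypothetical, and Python's regex engine is only assumed to agree with PowerFind Pro for this simple pattern.

```python
import re

# Same expression as in step 3: everything outside \x00-\xFF
# except the four curly quotes.
PATTERN = re.compile("[^\\x00-\\xFF\u201C\u201D\u2018\u2019]")

def replace_all(text: str, replacement: str = "") -> tuple[str, int]:
    """Replace every match and return (new_text, number_of_replacements)."""
    return PATTERN.subn(replacement, text)

# Sample text with a euro sign and an arrow, plus curly quotes that should stay.
cleaned, count = replace_all("Price: 5\u20AC \u2192 \u201Cok\u201D")
print(repr(cleaned), count)
```

Comparing the replacement count with what you expected is a quick sanity check that the pattern did not match far more than intended.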

JBL
Posts: 169
Joined: 2003-04-25 14:33:59

Re: How to Find All Non-ASCII Characters

Post by JBL »

Before hitting Replace All, you might consider just hitting Find, then copying and pasting the result into a new, empty document. This would allow you to scan for any characters you are about to delete that you might want to keep. You could then add those characters after the curly quotes in the find expression.
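
The same "look before you delete" idea can be sketched in Python for a plain-text export: collect every character the pattern would remove, count each one, and review the list first. This uses the strict ASCII range \x00-\x7F, and `survey` is a hypothetical helper name.

```python
import re
import unicodedata
from collections import Counter

NON_ASCII = re.compile(r"[^\x00-\x7F]")

def survey(text: str) -> list[tuple[str, str, int]]:
    """List (code point, Unicode name, count) for each non-ASCII character found."""
    counts = Counter(NON_ASCII.findall(text))
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED"), n)
        for ch, n in counts.most_common()
    ]

# Sample text: two accented letters and one en dash.
report = survey("a\u00E9b\u00E9\u2013")
for code_point, name, n in report:
    print(code_point, name, n)
```

Any character in the report that you want to keep can then be added to the character class before running the real replacement.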
