Compare and Remove Duplicates — 2 Lists

Get help using and writing Nisus Writer Pro macros.
jb
Posts: 92
Joined: 2007-11-09 15:27:25

Post by jb »

Hi again,
I have two word lists (in polytonic Greek, but I suppose that doesn’t matter).
List A has 500 words. List B has 1500 words.

I want to delete from List B all words that appear in List A.
Or of course, place all words not in List A into a new document.

I’m pretty sure it’s close to trivial for a macro guru, but even after all these years I’m far from that.


Any thoughts?
adryan
Posts: 561
Joined: 2014-02-08 12:57:03
Location: Australia

Re: Compare and Remove Duplicates — 2 Lists

Post by adryan »

G'day, jb et al

No doubt someone will supply an elegant Macro to accomplish what you want, but other approaches are possible, among them use of the supplied "Compare Documents" Macro.

Assuming the two documents are constructed similarly, here's another (non-Macro) method you could try:–

(1) Duplicate the longer file.
(2) Copy the contents of the shorter file and paste them into your duplicate file.
(3) Edit > Transform Paragraphs > Sort Ascending (A-Z)
(4) In the Find & Replace dialog box and using PowerFind Pro, untick all the checkboxes and insert the following expression (without quotes) into the Find field: "(.+\n)\1"
(5) Have the Replace field empty.
(6) Hit the Replace All button.

This assumes you aren't working with actual Lists in the technical sense. At least, it won't work with Numbered or Lettered Lists. In such a situation, you could remove the List styling before doing the replacement, then reinstate it afterwards.
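For anyone who wants to see what steps (2) to (6) do to the text, here is a minimal sketch in Python rather than in Nisus itself. The function name and the sample words are made up for illustration. One change from the expression in step (4) is an assumption on my part: the pattern is anchored with `^` so that a match cannot begin partway through a line (otherwise "ab" followed by "b" could lose its tail).

```python
import re


def remove_paired_lines(list_a, list_b):
    """Combine two word lists, sort them, and delete adjacent identical pairs."""
    combined = sorted(list_a + list_b)
    # One word per line, every line terminated, as in a sorted document.
    text = "".join(word + "\n" for word in combined)
    # Same idea as "(.+\n)\1"; the leading ^ (with re.MULTILINE) is an
    # addition that keeps matches aligned to whole lines.
    text = re.sub(r"^(.+\n)\1", "", text, flags=re.MULTILINE)
    return text.splitlines()


list_a = ["logos", "kai"]
list_b = ["kai", "theos", "logos", "arche"]
print(remove_paired_lines(list_a, list_b))
```

With these sample lists, the words shared by both lists disappear and only the unshared ones remain.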

Cheers,
Adrian
MacBook Pro (M1 Pro, 2021)
macOS Ventura
Nisus Writer user since 1996
jb
Posts: 92
Joined: 2007-11-09 15:27:25

Re: Compare and Remove Duplicates — 2 Lists

Post by jb »

Thank you, Adrian!

I want to keep only the unique words in list A, but I can make this work by giving list B a distinct character style so that it’s easy to select and remove those words at the end of the process.

Probably to do it all in one simple go I’d need a macro since there are several steps.

And while I’m wishing :wink: it would also be nice if the macro could search list B for the words in list A, rather than looking for duplicates in the combined list. I abbreviated the first time around, but list B isn’t in fact a ‘list’ of one word per paragraph, although list A is. To use your PowerFind Pro expression I can convert ‘list’ B into a list, of course. Nisus makes that easy enough, so it’s not a serious matter either.

In any case, I can get by with this, so thanks again for your help!
phspaelti
Posts: 1313
Joined: 2007-02-07 00:58:12
Location: Japan

Re: Compare and Remove Duplicates — 2 Lists

Post by phspaelti »

Adrian's method is fine, and I use it all the time, but there are a number of caveats.
Strictly speaking, Adrian's method creates a combined list and then removes items that can be paired. If either list contains duplicates of its own, this could change the result. It will also leave the unique items from both lists, not only those from list B.
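To make the caveats concrete, here is a small Python sketch (not Nisus macro code). It models the sort-and-delete-pairs approach as "keep one copy of every word whose total count in the combined list is odd", which is my assumption about how adjacent pairs cancel, and compares it with a plain "in B but not in A" filter. Function names and sample words are hypothetical.

```python
from collections import Counter


def paired_removal(list_a, list_b):
    """Model of sort-and-delete-pairs: words with an odd total count survive once."""
    counts = Counter(list_a + list_b)
    return sorted(word for word, n in counts.items() if n % 2 == 1)


def only_in_b(list_a, list_b):
    """Words of B that do not occur in A, in B's order."""
    in_a = set(list_a)
    return [word for word in list_b if word not in in_a]


list_a = ["kai", "logos", "pneuma"]        # "pneuma" occurs only in A
list_b = ["kai", "kai", "theos", "logos"]  # "kai" occurs twice in B

print(paired_removal(list_a, list_b))  # ['kai', 'pneuma', 'theos']
print(only_in_b(list_a, list_b))       # ['theos']
```

In this sample, "kai" survives the pairing because it occurs three times in total, and "pneuma" survives although it was never in B.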

A good method to find/count the number of occurrences in a list is to use a Hash. Hashes are structures that use key/value pairs. Since keys have to be unique, you will be guaranteed that every item will occur only once. Typical code will look like this:

Code:

# Collect every word in the frontmost document
$doc = Document.active
$sels = $doc.text.find '\w+', 'Ea'
# Count the occurrences, using each word as a Hash key
$list = Hash.new
foreach $sel in $sels
  $list{$sel.substring} += 1
end
# The keys are the unique words
$uniqueList = $list.keys

Note that at the end of this procedure $uniqueList will be an array containing all the unique items found with the search expression (adjust the expression as necessary). However, the order of $uniqueList will be seemingly random, so sort it first if you need a particular order.
Meanwhile $list{$word} will give you the number of occurrences of $word in your document. If you don't really need the count you could use '=1' instead of '+=1' or keep some other useful information about the relevant words. For example with a little extra work you could keep the location of the first/last occurrence or a list of all occurrences, etc.
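As a rough illustration of the "little extra work", here is a Python sketch (not Nisus macro code) in which the value stored for each word is a list of the character offsets where it occurs. The count and the first and last occurrence then fall out of that list. The function name and sample text are made up.

```python
import re


def index_words(text):
    """Map each word to the list of character offsets where it occurs."""
    positions = {}
    for match in re.finditer(r"\w+", text):
        # Create the list on first sight of a word, then append the offset.
        positions.setdefault(match.group(), []).append(match.start())
    return positions


index = index_words("kai logos kai theos kai")
print(len(index["kai"]))                  # number of occurrences
print(index["kai"][0], index["kai"][-1])  # first and last offset
print(list(index))                        # the unique words
```

One difference from the macro: a Python dict returns its keys in first-seen order, so this sketch does not show the random-looking key order described above.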

We can now adapt this procedure to jb's problem.

Code:

# Build a Hash whose keys are the words in List A
$docA = Document.withDisplayName 'List_A.rtf' # Adjust this as appropriate
$sels = $docA.text.find '\w+', 'Ea'
$list = Hash.new
foreach $sel in $sels
  $list{$sel.substring} = 1
end

# Sort each word in List B by whether it also occurs in List A
$docB = Document.withDisplayName 'List_B.rtf' # Adjust this as appropriate
$sels = $docB.text.find '\w+', 'Ea'
$duplicateList = Array.new # collected, but not used further here
$notInAList = Array.new
foreach $sel in $sels
  $item = $sel.substring
  if $list.definesKey($item)
    $duplicateList.push $item
  else
    $notInAList.push $item
  end
end
# Put the words not in List A into a new document, one per line
Document.newWithText $notInAList.join("\n")

Notice that in this case I used arrays for the output. This means that:
  1. Multiple occurrences of words not in A will be listed multiple times
  2. The "not in A list" will have the words in the order they are found in B
If you prefer a unique list you could use a Hash for the "notInAList" instead. In that case write '$notInAList{$item} = 1' (or '+=1') as desired.
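The unique-list variant can be sketched as follows, in Python rather than the macro language, with a dict standing in for the Hash. The function name and sample words are hypothetical; the values are counts, which corresponds to the '+=1' choice.

```python
def unique_not_in_a(words_a, words_b):
    """Return {word: count} for the words of B that do not occur in A."""
    in_a = dict.fromkeys(words_a, 1)  # plays the role of $list
    not_in_a = {}
    for word in words_b:
        if word not in in_a:
            # Each word becomes a key only once; the value counts repeats.
            not_in_a[word] = not_in_a.get(word, 0) + 1
    return not_in_a


result = unique_not_in_a(["kai", "logos"], ["theos", "kai", "arche", "theos"])
print(result)
print("\n".join(result))  # one unique word per line
```

Here "theos" appears once in the output even though it occurs twice in B.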
Obviously all of these lists can be sorted or rearranged as desired. Also make sure to use find expressions that work for the case you are looking for.
Note that I have used ".substring" because that works fastest. If necessary, use ".subtext" instead to keep formatting, but in that case you would need to be a bit more careful: the Hash keys will not allow formatted strings (I believe).

Finally, it should be said that you could do all of this with arrays instead. Arrays have a command ".containsValue", which could be used to check whether an item occurs in the list. This would be much slower with long lists, but with 500 to 1,500 words it would hardly be noticeable, I think.
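The trade-off can be sketched in Python (not Nisus macro code), where a list lookup stands in for ".containsValue" and a set stands in for the Hash. Both function names are made up. The two give the same answer; the list version rescans List A for every word of B, which is why it slows down as the lists grow.

```python
def not_in_a_with_list(words_a, words_b):
    """Array-style check: each `in` scans words_a from the start."""
    return [word for word in words_b if word not in words_a]


def not_in_a_with_set(words_a, words_b):
    """Hash-style check: each lookup goes straight to the key."""
    lookup = set(words_a)
    return [word for word in words_b if word not in lookup]


words_a = ["kai", "logos"]
words_b = ["theos", "kai", "arche", "logos", "theos"]
print(not_in_a_with_list(words_a, words_b))
print(not_in_a_with_set(words_a, words_b))
```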
philip
jb
Posts: 92
Joined: 2007-11-09 15:27:25

Re: Compare and Remove Duplicates — 2 Lists

Post by jb »

Hi Philip,

This is terrific. Perfect in fact.
Thank you!


I understand some of the code. :?
B.Otter
Posts: 27
Joined: 2021-02-06 15:24:00

Re: Compare and Remove Duplicates — 2 Lists

Post by B.Otter »

Thank you, Philip, still going to school on your code.

Brad