
PDF Page Numbering Retention

Posted: 2019-02-15 15:09:43
by Revue Guru
The ability of Nisus Writer Pro to read/convert many (most?) PDF documents is a wonderful thing. But I have a particular problem ... which I hope has a solution within Nisus.

Some background: My book manuscript is all in Nisus RTF files, and – so far – exchanges with my publisher (who uses MS Word) have been great. However, the final proof copy (ready to go to print) will be sent to me as a PDF file with fixed page numbering. From that, I must prepare a formal Index. I am currently setting up the word list to create a standard Nisus book Index. There will be some small formatting issues, but so far so good.

Is there some "easy" way to have Nisus retain the format of the PDF file -- specifically the page numbering? Obviously, I can manually go through the entire manuscript and tweak the RTF pages so they match every page in the PDF ... but we're talking about an entire book of over 110,000 words (ouch). It would be nice if there were a way to avoid that.

Re: PDF Page Numbering Retention

Posted: 2019-02-15 17:44:24
by phspaelti
Let me get this straight, you are talking about a pdf that was generated by someone else (using Word or other) from your original file, and you want to know if there are any deviations in pagination?

There is obviously no built-in command in Nisus (or any other program) to synch pagination with a pdf. In terms of helping to do this manually, I could imagine writing a macro to check this, but it would really depend on getting the pagination information from the pdf. How to do that? On some pdfs it's possible to copy out the text. Paste this into a Nisus Document, then, assuming there are page numbers, it might be possible to generate an inventory of the text around the page breaks for each page and compare that with what is actually the case in the original Nisus document, and flag cases where the two deviate. This would not be a trivial exercise, and—as already mentioned—it would depend on whether the text can be copied out of the pdf. Another issue with this method is that pdf copied text does not always reflect the logical order of the pdf document. So the method might generate a lot of false positives, essentially still leaving you to check the whole document by eye.
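The comparison described above could be sketched roughly as follows. This is Python rather than a Nisus macro, and it assumes the hard part is already done: the text of each PDF page and of each page of the original document is available as a list of strings, one per page. The function names and the choice of six trailing words are purely illustrative.

```python
import re

def normalize(text):
    """Collapse whitespace runs so line-wrapping differences do not matter."""
    return re.sub(r"\s+", " ", text).strip()

def page_tail(page_text, n_words=6):
    """Return the last n_words words of a page as a single string."""
    words = normalize(page_text).split(" ")
    return " ".join(words[-n_words:])

def flag_deviations(pdf_pages, rtf_pages, n_words=6):
    """Return the 1-based page numbers whose closing words differ.

    pdf_pages and rtf_pages are assumed inputs: one string per page.
    A page that exists in only one of the two lists is also flagged.
    """
    flagged = []
    total = max(len(pdf_pages), len(rtf_pages))
    for index in range(total):
        if index >= len(pdf_pages) or index >= len(rtf_pages):
            flagged.append(index + 1)
            continue
        if page_tail(pdf_pages[index], n_words) != page_tail(rtf_pages[index], n_words):
            flagged.append(index + 1)
    return flagged
```

Any page number this returns would still need checking by eye, since text copied from a PDF may come out in a different order from the one it has on the page.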

But in the end, it really would seem to me that the deviations in pagination would be insignificant with respect to the precision necessary for an index.

Re: PDF Page Numbering Retention

Posted: 2019-02-15 19:40:34
by adryan
G’day, Revue Guru et al

Opening a PDF document in Nisus Writer doesn’t preserve the pagination of the former, as you have discovered. Many other formatting features are either lost or transmogrified.

Could your publisher not send you a Word version of the final manuscript, complete with pagination identical to that of the PDF document? You could then open this in Nisus Writer and get on with your indexing.

Otherwise, one might think of automating the process through AppleScript. However, the AppleScriptability of Preview is woeful. (Automator isn’t much help here, either.) It may be that some more powerful PDF editing application would help, but I have no experience of any.

I have looked at opening a PDF file in other applications (eg, Books, Safari, BBEdit), but we get no farther. As far as I can discern, BBEdit does reveal page breaks, but I cannot translate the PDFese into English in order to recognize where in a NWP document the page breaks need to be.

Given that you are at the indexing stage, no further changes to pagination should be contemplated; consequently any manual labor now should only have to be performed once.

So, if all else failed, the general procedure I would adopt is as follows:–

(1) Open the PDF file in NWP.
(2) Replace any existing page breaks with some recognizable nonsense that won’t occur elsewhere in the document (eg, “XYZXYZ”).
(3) In Preview, go to the end of each page of the PDF document in turn, select some terminal text fragment there and copy it. (It is the inability of AppleScript to perform this step that is the sticking point.)
(4) Paste it into the Find field of the NWP Find & Replace dialog box.
(5) Delete the trailing space and paragraph return that annoyingly appear in the Find expressions.
(6) In PowerFind Pro, set the Replace expression to “\0\f” (without the quotes, and that’s a zero).
(7) Find Next.
(8) Replace.
(9) When you’ve finished going through the whole document like this, delete all occurrences of the nonsense string.

That’s the general idea. You could incorporate most of these steps into a NWP Macro which in turn could be called by an AppleScript script you would invoke with a keyboard shortcut after each text fragment selection in the original PDF document.
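For anyone who would rather see the logic of steps (2) to (9) spelled out, here is a minimal sketch in Python. It is not a Nisus macro and not AppleScript. It assumes the converted document is available as a plain string, that a page break is a form feed character, and that each copied fragment is passed in by hand. All function names are hypothetical.

```python
import re

PLACEHOLDER = "XYZXYZ"  # the "recognizable nonsense" of step 2
FORM_FEED = "\f"        # assumed representation of a page break

def mark_existing_breaks(text):
    """Step 2: replace existing page breaks with the placeholder."""
    return text.replace(FORM_FEED, PLACEHOLDER)

def insert_break_after(text, fragment, start=0):
    """Steps 4 to 8: find the fragment at or after start and add a page break.

    Returns (new_text, position just after the inserted break),
    or (text, -1) if the fragment is empty or cannot be found.
    """
    cleaned = fragment.strip()  # step 5: drop trailing space and return
    if not cleaned:
        return text, -1
    # Let any run of whitespace in the document match a gap between words.
    pattern = r"\s+".join(re.escape(word) for word in cleaned.split())
    match = re.compile(pattern).search(text, start)
    if match is None:
        return text, -1
    end = match.end()
    return text[:end] + FORM_FEED + text[end:], end + 1

def remove_placeholders(text):
    """Step 9: delete every occurrence of the placeholder."""
    return text.replace(PLACEHOLDER, "")
```

Passing the returned position back in as start for the next fragment keeps the search moving forward, so a phrase that recurs earlier in the book is not matched by mistake.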

Tedious, I know. I’ve only elaborated the process in case no one else has a better idea. A slurp of your favorite beverage after every five pages might (or might not) help maintain the mindset.

But really, I think the publisher should be able to supply you with what you need. Are you sure it’s not too late to switch to a publisher who uses Nisus Writer?


Re: PDF Page Numbering Retention

Posted: 2019-02-15 20:40:59
by phspaelti
With all due respect, Adrian, your "cure" sounds worse than the disease.

I would have thought that Guru would want to index and generate the index from his rtf original, not try to painfully restore the pagination from the pdf sent to them by their editor. Surely the editor's pdf is just for reference. The concern is only that the pagination of the rtf original might deviate somewhat from the pdf.

Opening the pdf itself in Nisus will give you a semblance of the document you started with, but only barely. The only real guarantee that you have is that all the text that is on the same page of the pdf will be a sequential unit in the converted document. I tried an experiment opening a pdf that I had created myself with Nisus from my own 10-page document. Individual paragraphs were jumbled. Images, tables, and many style elements were gone. I had better success with copying and pasting from the pdf, but the text order on the page is of course the same. Trying to use this output to generate an index would be a nightmare. The one saving grace was that my document had page numbers at the bottom, which the pdf text faithfully preserved. So I was able to quickly locate the page "units" with a simple Find. But the running header text, which should have been the first text on every page, was often randomly interspersed somewhere. So Adrian's method for finding the pages in the output might not even work.

Re: PDF Page Numbering Retention

Posted: 2019-02-15 21:29:58
by adryan
G'day, Philip et al

There is no easy solution here. A lot depends on the formatting of the PDF document (eg, with respect to headers, footers, page numbers, location of images, etc).

In this situation, I myself would not merely try to fix the pagination on my own (original) RTF file. To do so would be to make the huge assumption that it remains a true representation of the PDF document returned to me by the publisher. Having dealt with publishers, I know they cannot be trusted: you have to check everything yourself at every stage.

In this case, the author’s RTF file has been transmuted into a Word document by someone else and then (via what intermediary steps, we are not privy to) transmuted again into a PDF document that the author is now expected to index. If I were doing the indexing, I would want to be absolutely certain, not only that the page numbers concurred, but also that the strings I included in the index actually occurred in the PDF document. Hence my preference for working on a converted version of the PDF file.

As far as paginating an RTF document derived from a PDF file is concerned, one could of course just scroll through the respective windows and try to eyeball the relevant spots in the RTF document where page breaks are to go. It was in an endeavor to spare myself the eye-blurring tedium of this that I devised the repetitious keystroke tedium.

I considered extracting all the page-terminating strings from the PDF document and placing them all in a single document which could be used as a source for a macro that inserted all the page breaks into the RTF document in one fell swoop. However, in the end I thought the page-by-page approach at least permitted a quick check each time that everything was progressing satisfactorily in such a crucial procedure, particularly if an image or a table were to be the last item on a page.
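The "one fell swoop" variant might look something like the sketch below. It is Python rather than a Nisus macro. It assumes the page-terminating strings have already been collected, one per page and in page order, and that a form feed stands in for a page break. The function name is hypothetical.

```python
def insert_all_breaks(text, fragments):
    """Insert a form feed after each page-terminating fragment, in order.

    fragments is an assumed input: one string per page, in page order.
    Returns (new_text, missing), where missing lists the 1-based page
    numbers whose fragment could not be found after the previous match.
    """
    pieces = []
    missing = []
    cursor = 0  # search only forward from the end of the previous match
    for page_number, fragment in enumerate(fragments, start=1):
        needle = fragment.strip()
        position = text.find(needle, cursor) if needle else -1
        if position == -1:
            missing.append(page_number)
            continue
        end = position + len(needle)
        pieces.append(text[cursor:end])
        pieces.append("\f")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), missing
```

The list of missing page numbers is what stands in for the page-by-page check: those are the pages, for instance ones ending in an image or a table, that would still need attention by hand.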

It should go without saying that, before proceeding with indexing, one should do a meticulous check of the PDF document because it might be the last chance offered by the publisher to do any alterations in the body of the manuscript.


Re: PDF Page Numbering Retention

Posted: 2019-02-16 13:03:28
by Hamid
Hello Revue Guru,
The workflow you describe puts an unnecessary burden upon you for indexing. If you send a final version of your Nisus RTF file with index topics already marked, you can avoid all the trouble. When the publishers import your RTF file to Word to create a PDF, your index markings will also be imported into Word. They can generate the index automatically from Word with a single command and save it in Word and then create a PDF which will match the Word document.