Toolchain: CONTENTdm compound PDFs

Open xing93111 opened this issue 6 years ago • 33 comments

On this page: https://github.com/MarcusBarnes/mik/wiki/Toolchain:-CONTENTdm-compound-PDFs I read that the CdmPhpDocuments class should be used for compound PDFs. However, when I run mik, I get the following error:

billg@lib10:/data/projects/arca$ ./mik/mik -c ./collections/AUebooks/config.ini
Commencing MIK.
PHP Fatal error:  Uncaught Error: Class 'mik\filegetters\CdmPhpDocuments' not found in /data/projects/arca/mik/mik:170
Stack trace:
#0 {main}
  thrown in /data/projects/arca/mik/mik on line 170

Then I looked in mik/src/filegetters and mik/src/writers and found a class named CdmPdfDocuments. I assumed the documentation contained a typo and changed the class name to CdmPdfDocuments. However, it still does not work: the output consists of corrupted PDFs.

This is the collection: http://digicon.athabascau.ca/cdm/landingpage/collection/AUebooks

The following is my configuration ini file:

; Trying out the compound thing

[CONFIG]
config_id = AUebooks
last_updated_on = "2018-10-11"
last_update_by = "hx"

[FETCHER]
class = Cdm
; The alias of the CONTENTdm collection.
alias = AUebooks
ws_url = "http://deck.cs.athabascau.ca/dmwebservices/index.php?q="
; 'record_key' should always be 'pointer' for CONTENTdm fetchers.
record_key = pointer
temp_directory = "/data/projects/arca/tmp"

[METADATA_PARSER]
class = mods\CdmToMods
alias = AUebooks
ws_url = "http://deck.cs.athabascau.ca/dmwebservices/index.php?q="
; Path to the csv file that contains the CONTENTdm to MODS mappings.
mapping_csv_path = '/data/projects/arca/collections/AUebooks/mapping.csv'
; Include the migrated from uri into your generated metadata (e.g., MODS)
include_migrated_from_uri = "http://digicon.athabascau.ca/cdm/ref/collection/"
repeatable_wrapper_elements[] = extension
repeatable_wrapper_elements[] = name
repeatable_wrapper_elements[] = subject
repeatable_wrapper_elements[] = identifier
repeatable_wrapper_elements[] = titleInfo
repeatable_wrapper_elements[] = title
repeatable_wrapper_elements[] = relatedItem
use_nicknames = true

[FILE_GETTER]
class = CdmPdfDocuments
alias = AUebooks
input_directories[] =
ws_url = "http://deck.cs.athabascau.ca/dmwebservices/index.php?q="
utils_url = "http://deck.cs.athabascau.ca/utils/"
temp_directory = "/data/projects/arca/tmp"

[WRITER]
class = CdmPdfDocuments
alias = AUebooks
output_directory = "/data/projects/arca/collections/AUebooks/output"
metadata_filename =
postwritehooks[] = "php extras/scripts/postwritehooks/move_packages_by_extension.php"
postwritehooks[] = "php extras/scripts/postwritehooks/validate_mods.php"
postwritehooks[] = "php extras/scripts/postwritehooks/object_timer.php"
postwritehooks[] = "php extras/scripts/shutdownhooks/delete_temp_files.php"
; Note: During testing we only generate MODS datastreams. In production, comment this line out.
; datastreams[] = MODS

[MANIPULATORS]
; filegettermanipulators[] = "CdmSingleFile|pdf"
; filegettermanipulators[] = "CdmCompound|Document-PDF"
fetchermanipulators[] = "CdmCompound|Document-PDF"
;metadatamanipulators[] = "FilterModsTopic|subject"
;metadatamanipulators[] = "AddContentdmData"
;metadatamanipulators[] = "AddUuidToMods"
;metadatamanipulators[] = "InsertXmlFromTemplate|null0|/Users/brandon/sfuvault/mik/manipulations/athabasca_manipulations/origininfo.xml"
;metadatamanipulators[] = "InsertXmlFromTemplate|null1|/Users/brandon/sfuvault/mik/manipulations/athabasca_manipulations/physicalDescription.xml"

[LOGGING]
path_to_log = "/data/projects/arca/tmp/mik.log"
path_to_manipulator_log = "/data/projects/arca/tmp/manipulator.log"

xing93111 avatar Oct 16 '18 21:10 xing93111

Thanks for submitting the issue, @xing93111.

Further detail: If MIK is run instead with the class CdmCompound, compound objects are generated with the directory structure of a Book, except each page is a PDF (instead of a TIFF). These PDFs are OK (not corrupt).

As far as we understand, the CdmPdfDocuments class is supposed to merge these page-level PDFs into a single aggregated PDF. The result is a corrupted PDF.

Is there anything wrong with the configuration? Or is there a flaw in the toolchain?

bondjimbond avatar Oct 17 '18 14:10 bondjimbond

I can't see anything wrong with the configuration. This particular toolchain relies on CONTENTdm's internal functionality to merge the PDF pages into a single document. It used to work fine. For example, the PDFs in https://ecuad.arcabc.ca/islandora/object/ecuad%3Acals were generated using it, with this .ini file: https://github.com/MarcusBarnes/mik/blob/master/extras/samples/calendars_config.ini That said, the filegetter has probably not been tested since the major code cleanup that happened after SFU used the toolchain.

The code that fetches the assembled PDF content is here. I suggest dumping the value of the URL generated here and then requesting it with curl to see whether the PDF it produces is corrupted.
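A quick way to judge the curl result is to look at the first bytes of the downloaded file: a real PDF starts with the signature `%PDF-`, while an XML or HTML response does not. The following is a minimal sketch of that check; `classify_payload` is a hypothetical helper name and is not part of MIK.

```python
def classify_payload(data: bytes) -> str:
    """Guess what a downloaded payload is from its leading bytes."""
    head = data.lstrip()[:64]
    # PDF files begin with the literal signature "%PDF-".
    if head.startswith(b"%PDF-"):
        return "pdf"
    lowered = head.lower()
    # An XML declaration suggests a structure file rather than PDF data.
    if lowered.startswith(b"<?xml"):
        return "xml"
    # An HTML doctype or root tag usually means an error page came back.
    if lowered.startswith(b"<!doctype html") or lowered.startswith(b"<html"):
        return "html"
    return "unknown"
```

For a file saved by curl, something like `classify_payload(open("out.pdf", "rb").read(64))` should return `"pdf"` if the server really returned a PDF (the filename `out.pdf` is only an example).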

mjordan avatar Oct 17 '18 15:10 mjordan

The configuration file here uses CdmPhpDocuments, but I don't see such a class in the MIK source code. Where can I find the file?

xing93111 avatar Oct 17 '18 15:10 xing93111

@xing93111, sorry, that config file was an early one and predates #223. The configuration should use CdmPdfDocuments in lines 22 and 29.

mjordan avatar Oct 17 '18 15:10 mjordan

... and I've just updated https://github.com/MarcusBarnes/mik/wiki/Toolchain:-CONTENTdm-compound-PDFs. Very sorry about that.

mjordan avatar Oct 17 '18 16:10 mjordan

I used a text editor to open the generated PDF file and found it is not a PDF at all but an XML file. For example, the following is the content of the generated PDF file related to this object: http://digicon.athabascau.ca/cdm/ref/collection/auarchives/id/499

<?xml version="1.0"?>
<cpd>
	<type>Document</type>
  <page>
    <pagetitle>Page 1</pagetitle>
    <pagefile>485.pdf</pagefile>
    <pageptr>484</pageptr>
  </page>
  <page>
    <pagetitle>Page 2</pagetitle>
    <pagefile>486.pdf</pagefile>
    <pageptr>485</pageptr>
  </page>
  <page>
    <pagetitle>Page 3</pagetitle>
    <pagefile>487.pdf</pagefile>
    <pageptr>486</pageptr>
  </page>
  <page>
    <pagetitle>Page 4</pagetitle>
    <pagefile>488.pdf</pagefile>
    <pageptr>487</pageptr>
  </page>
  <page>
    <pagetitle>Page 5</pagetitle>
    <pagefile>489.pdf</pagefile>
    <pageptr>488</pageptr>
  </page>
  <page>
    <pagetitle>Page 6</pagetitle>
    <pagefile>490.pdf</pagefile>
    <pageptr>489</pageptr>
  </page>
  <page>
    <pagetitle>Page 7</pagetitle>
    <pagefile>491.pdf</pagefile>
    <pageptr>490</pageptr>
  </page>
  <page>
    <pagetitle>Page 8</pagetitle>
    <pagefile>492.pdf</pagefile>
    <pageptr>491</pageptr>
  </page>
  <page>
    <pagetitle>Page 9</pagetitle>
    <pagefile>493.pdf</pagefile>
    <pageptr>492</pageptr>
  </page>
  <page>
    <pagetitle>Page 10</pagetitle>
    <pagefile>494.pdf</pagefile>
    <pageptr>493</pageptr>
  </page>
  <page>
    <pagetitle>Page 11</pagetitle>
    <pagefile>495.pdf</pagefile>
    <pageptr>494</pageptr>
  </page>
  <page>
    <pagetitle>Page 12</pagetitle>
    <pagefile>496.pdf</pagefile>
    <pageptr>495</pageptr>
  </page>
  <page>
    <pagetitle>Page 13</pagetitle>
    <pagefile>497.pdf</pagefile>
    <pageptr>496</pageptr>
  </page>
  <page>
    <pagetitle>Page 14</pagetitle>
    <pagefile>498.pdf</pagefile>
    <pageptr>497</pageptr>
  </page>
  <page>
    <pagetitle>Page 15</pagetitle>
    <pagefile>499.pdf</pagefile>
    <pageptr>498</pageptr>
  </page>
</cpd>
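The content above is a compound-object structure listing (page titles, page files and page pointers), not PDF data. As a rough sketch, the page entries can be read out with Python's standard library; `list_cpd_pages` is a hypothetical helper name, and the element names are taken from the XML shown above.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def list_cpd_pages(cpd_xml: str) -> List[Tuple[str, str, str]]:
    """Return (pagetitle, pagefile, pageptr) for each <page> in a <cpd> document."""
    root = ET.fromstring(cpd_xml)
    pages = []
    # Each direct <page> child describes one page-level object.
    for page in root.findall("page"):
        pages.append((
            page.findtext("pagetitle", default=""),
            page.findtext("pagefile", default=""),
            page.findtext("pageptr", default=""),
        ))
    return pages
```

Run against the file above, this would list 15 pages, starting with `("Page 1", "485.pdf", "484")`.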

xing93111 avatar Oct 18 '18 20:10 xing93111

We need to establish that CONTENTdm still supports the ability to join PDF pages into a single multipage PDF file (it may have changed since this code was written). To do that we need to create a request URL using the code below (from here):

            $get_file_url = $this->utilsUrl .'getdownloaditem/collection/'
                . $this->alias . '/id/' . $pointer . '/type/compoundobject/show/1/cpdtype/document-pdf/filename/'
                . $document_structure['page'][0]['pagefile'] . '/width/0/height/0/mapsto/pdf/filesize/0/title/'
                . urlencode($document_structure['page'][0]['pagetitle']);

and see if we get a PDF from the server. So that would look like:

http://yourcdmutilsurl/getdownloaditem/collection/auarchives/id/499/type/compoundobject/show/1/cpdtype/document-pdf/filename/485.pdf/width/0/height/0/mapsto/pdf/filesize/0/title/Page%201

If you use curl to get that URL, what does the resulting file look like?
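For anyone who wants to generate the test URL outside MIK, the PHP concatenation above can be mirrored in a few lines of Python. This is a sketch under assumptions: `build_compound_pdf_url` is a hypothetical name, and `utils_url` is expected to end in `utils/` with a trailing slash, as in the `[FILE_GETTER]` section of the .ini file. PHP's `urlencode()` encodes a space as `+`, so `quote_plus` is used here; the hand-written example URL uses `%20` instead.

```python
from urllib.parse import quote_plus


def build_compound_pdf_url(utils_url, alias, pointer, first_pagefile, first_pagetitle):
    """Assemble the getdownloaditem URL the same way the PHP snippet does."""
    return (
        utils_url
        + "getdownloaditem/collection/" + alias
        + "/id/" + str(pointer)
        + "/type/compoundobject/show/1/cpdtype/document-pdf/filename/"
        + first_pagefile
        + "/width/0/height/0/mapsto/pdf/filesize/0/title/"
        # quote_plus encodes spaces as "+", like PHP's urlencode().
        + quote_plus(first_pagetitle)
    )
```

Calling it with `"http://yourcdmutilsurl/"`, `"auarchives"`, `499`, `"485.pdf"` and `"Page 1"` gives the URL shown above, except that it ends in `Page+1`.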

mjordan avatar Oct 18 '18 21:10 mjordan

If you don't mind sharing your CONTENTdm API URL with me I can take a look.

mjordan avatar Oct 18 '18 21:10 mjordan

@bondjimbond has the URL but it requires a VPN connection. URL: http://deck.cs.athabascau.ca/dmwebservices/index.php?q=

xing93111 avatar Oct 18 '18 21:10 xing93111

@mjordan Here is the output:

billg@lib10:~$ curl http://digicon.athabascau.ca/getdownloaditem/collection/auarchives/id/499/type/compoundobject/show/1/cpdtype/document-pdf/filename/485.pdf/width/0/height/0/mapsto/pdf/filesize/0/title/Page%201
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" class="no-js">
<!-- CONTENTdm Version 6.8.0.412s/6.8.0.761w (c) OCLC 2011-2018. All Rights Reserved. //-->
<head>
  <meta name="robots" content="noindex,nofollow,noarchive" />
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
  
    
  <link rel="shortcut icon" type="image/x-icon" href="/ui/custom/default/collection/default/images/favicon.ico?version=1404943627" />

  	<title>CONTENTdm Title</title>
	
    
  <script type="text/javascript">
    var cdmHttps = 'off';
    var cdmInsecureWebsitePort = '';
    var cdmSecureWebsitePort = '';
  </script>
  
  <link rel="stylesheet" type="text/css" href="/ui/custom/default/collection/default/css/main.css?version=1529334550" />
  <link type="text/css" href="/utils/getstaticcontent/file/js~bt~jquery.bt.css/type/stylesheet" rel="stylesheet" />
  <link type="text/css" href="/utils/getstaticcontent/file/js~skins~tango~skin.css/type/stylesheet" rel="stylesheet" />
  <link type="text/css" href="/utils/getstaticcontent/file/js~skins~cdm~skin.css/version/1401946701/type/stylesheet" rel="stylesheet" />
  
    
       
  <style>
    .line_breaker, pre {
        white-space: pre;
        white-space: pre-wrap;
        white-space: pre-line;
        white-space: -pre-wrap;
        white-space: -o-pre-wrap;
        white-space: -moz-pre-wrap;
        white-space: -hp-pre-wrap;
        word-wrap: break-word;
    } 
  </style>    
  
  <!-- NEW JQUERY and UI -->
  <script type="text/javascript" src="/utils/getstaticcontent/file/js~jquery_1.7.2~jquery-1.7.2.js/type/javascript"></script>
  <script type="text/javascript" src="/utils/getstaticcontent/file/js~jquery_1.7.2~jquery-ui-1.8.20.js/type/javascript"></script>
  <script type="text/javascript" src="/utils/getstaticcontent/file/js~jquery-ui-togglebox.js/type/javascript"></script>
  <script type="text/javascript" src="/utils/getstaticcontent/file/js~jquery.hoverIntent.minified.js/type/javascript"></script>
  <script type="text/javascript" src="/utils/getstaticcontent/file/js~jquery.scrollTo-min.js/type/javascript"></script>
  <script type="text/javascript" src="/utils/getstaticcontent/file/js~default.js/version/1401946702/type/javascript"></script>
  <script type="text/javascript" src="/utils/getstaticcontent/file/js~modernizr-latest.js/type/javascript"></script>
  <!--[if lt IE 10]>
		<script type="text/javascript" src="/utils/getstaticcontent/file/js~cdmOldInternetExplorerChecker.js/type/javascript"></script>
	<![endif]-->
  
  <script type="text/javascript" src="/utils/getstaticcontent/file/js~bt~jquery.bt.min.js/type/javascript"></script>
  <script type="text/javascript" src="/utils/getstaticcontent/file/js~quickview.js/type/javascript"></script>

	<!--[if IE]>
		<script type="text/javascript" src="/ui/cdm/default/collection/default/js/excanvas.compiled.js"></script>
	<![endif]-->
	<!--[if IE 7]>
		<link href="/ui/cdm/default/collection/default/css/ie7.css" type="text/css" rel="stylesheet" />
	<![endif]-->
  
           
  
  <script type="text/javascript">
    (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
    (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
    m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
    })(window,document,'script','//www.google-analytics.com/analytics.js','ga');

    ga('create', 'UA-6471153-5');
    ga('send', 'pageview');
      </script>
  <script type="text/javascript" src="/ui/cdm/default/collection/default/js/cdm_ga.js"></script>
         
</head>


<body>
    


    
  <a name="top"></a>

<!-- HEADER -->	
<div id="headerWrapper" tabindex="1000">
    <p><img src="/ui/custom/default/collection/default/images/digiport_banner6.jpg" alt="" /></p>
    <span class="clear"></span>
	</div>

<!--  NAV_TOP -->
	<div id="nav_top">
		<div id="nav_top_left">
			<ul class="nav">
                  <li class="nav_li">
            <a tabindex="1001" id="nav_top_left_first_link" href="http://digiport.athabascau.ca"  >
              <div class="nav_top_left_text_container">Home</div>
            </a>
          </li>
                    <li class="nav_li">
            <a tabindex="1002"  href="/cdm/"  >
              <div class="nav_top_left_text_container">Browse</div>
            </a>
          </li>
                    <li class="nav_li">
            <a tabindex="1003"  href="http://digicon.athabascau.ca/cdm4/help.php"  >
              <div class="nav_top_left_text_container">Help</div>
            </a>
          </li>
                    <li class="nav_li">
            <a tabindex="1004"  href="http://digiport.athabascau.ca/copyright.html"  >
              <div class="nav_top_left_text_container">Copyright</div>
            </a>
          </li>
                    <li class="nav_li">
            <a tabindex="1005"  href="http://library.athabascau.ca"  >
              <div class="nav_top_left_text_container">Athabasca University Library</div>
            </a>
          </li>
                    <li class="nav_li">
            <a tabindex="1006"  href="http://digiport.athabascau.ca/"  >
              <div class="nav_top_left_text_container">Digitization Portal</div>
            </a>
          </li>
          					
			</ul>
		</div>

		<div id="nav_top_right">
			<ul class="nav">
							<li class="nav_li_right_1">
					<span class=""><!--<a href="javascript:session_check(fx);" id="debug_session_check">Session Check</a>&nbsp;-&nbsp;<a href="javascript:session_auth();" id="debug_session_auth">Session Auth</a>&nbsp;-&nbsp;<a href="javascript:session_deauth();" id="debug_session_de-auth">Session De-Auth</a>&nbsp;-&nbsp;-->
                                                
						  <span class="currentUser" id="currentUser"></span><a tabindex="1007" id="login_link" href="http://digicon.athabascau.ca/login/" data-analytics='{"category":"navigation","action":"click","label":"Log in link"}'>Log in</a>
                                                                                    </span>
				</li>

				<li class="nav_li_right_1 nav_top_right_divider">|</li>
                                    
				<li class="nav_li_right_1">
					<span class="icon_10 icon_nav_top_right ui-icon-help cdmHelpLink"></span><a tabindex="1008" class="cdmHelpLink" href="javascript:;" data-analytics='{"category":"navigation","action":"click","label":"Help link"}'><b>Help</b></a>
				</li>
                  <li class="nav_li_right_1 nav_top_right_divider">|</li>

          <li class="nav_li_right_1">
            <div id="nav_top_right_language_dd_link">
              <a tabindex="1009" href="javascript:;" id="nav_top_right_language_dd_link_text" data-analytics='{"category":"navigation","action":"open","label":"language selection menu"}'>
              English              </a><span class="icon_10 icon_nav_top_right ui-icon-triangle-1-s"></span>
            </div>
            <br />
            <div id="nav_top_right_language_dd_container">
              <div id="nav_top_right_language_dd_content">
                                  <div tabindex="1010" class="language_option cdm_selected_language" lang="en_US" data-analytics='{"category":"navigation","action":"click","label":"language: English"}'>English</div>
                                    <div tabindex="1011" class="language_option " lang="de" data-analytics='{"category":"navigation","action":"click","label":"language: Deutsch"}'>Deutsch</div>
                                    <div tabindex="1012" class="language_option " lang="es" data-analytics='{"category":"navigation","action":"click","label":"language: Español"}'>Español</div>
                                    <div tabindex="1013" class="language_option " lang="en_PIRATE" data-analytics='{"category":"navigation","action":"click","label":"language: Pirate English"}'>Pirate English</div>
                                    <div tabindex="1014" class="language_option " lang="ko" data-analytics='{"category":"navigation","action":"click","label":"language: 한국어 Korean"}'>한국어 Korean</div>
                                    <div tabindex="1015" class="language_option " lang="fr" data-analytics='{"category":"navigation","action":"click","label":"language: Français"}'>Français</div>
                                </div>
              <span class="clear"></span>
            </div>
            				</li>
			</ul>
		</div>
	</div>



	
<!-- BEGIN TOP CONTENT -->
	<div id="top_content">
		<div style="height:400px;width:500px;margin:0 auto;" valign="top">
  <div id="cdm_error" style="height:24px;width:500px;" class="float_left spacePad10 spaceMar30T ui-state-error ui-corner-all">
    <span class="icon_10 ui-icon-alert ui-icon-alert-cdmerror"></span>
    404: Page not found  </div>
</div>	</div>
<!-- END TOP CONTENT -->

<!-- FOOTER -->
  <span class="clear"></span>
  <div id="cdmFooterWrapper" class="spaceMar20T">
    <div id="nav_footer">
      <div id="nav_footer_left">
        <ul class="nav">
                      <li class="nav_footer_li"><a href="/cdm/">Home</a></li>
                              <li class="nav_footer_left_divider">|</li>
                              <li class="nav_footer_li"><a href="/cdm/about">About</a></li>
                              <li class="nav_footer_left_divider">|</li>
                              <li class="nav_footer_li"><a href="mailto:[email protected]">Contact us</a></li>
                      </ul>
      </div>
      <div id="nav_footer_right"><ul class="nav">
        <li class="nav_footer_li"><a href="http://www.contentdm.org/" data-analytics='{"category":"navigation","action":"click","label":"Powered by CONTENTdm&reg; link"}'>Powered by CONTENTdm&reg;</a></li></ul>
      </div>
      <br /><br />
    </div>
    <span class="clear"></span>
  </div>

    <div id="login_dialog" title="Login" dialog_name="login_dialog"></div>

  <span class="clear"></span>
	<div id="content_footer"></div>

  <!-- language fields -->
  <input type="hidden" id="cdm_language_and" value="and" />
  <input type="hidden" id="cdm_language_or" value="or" />
  <input type="hidden" id="cdm_language_in" value="in" />
  <input type="hidden" id="cdm_language_advancedsearch" value="Advanced Search" />
  <input type="hidden" id="cdm_language_closeadvancedsearch" value="Close Advanced Search" />
  <input type="hidden" id="cdm_language_allofthewords" value="All of the words" />
  <input type="hidden" id="cdm_language_anyofthewords" value="Any of the words" />
  <input type="hidden" id="cdm_language_noneofthewords" value="None of the words" />
  <input type="hidden" id="cdm_language_theexactphrase" value="The exact phrase" />
  <input type="hidden" id="cdm_language_allfields" value="All fields" />
  <input type="hidden" id="cdm_language_error_enterAWordOrPhrase" value="Enter a word or phrase" />
  <input type="hidden" id="cdm_language_addorremovecollections" value="Add or remove collections" />
  <input type="hidden" id="cdm_language_limitsearchtospecificcollections" value="Limit search to specific collections" />
  <input type="hidden" id="cdm_language_failedtoretrieveitem" value="Failed to retrieve the item." />
  <input type="hidden" id="cdm_language_therewasaproblemrefreshingtheimage" value="therewasaproblemrefreshingtheimage" />
  <input type="hidden" id="cdm_language_close" value="Close" />
  <input type="hidden" id="cdm_language_login" value="Log in" />
  <input type="hidden" id="cdm_language_logout" value="Log out" />
  <input type="hidden" id="cdm_language_username" value="User Name" />
  <input type="hidden" id="cdm_language_password" value="Password" />
  <input type="hidden" id="cdm_language_cancel" value="Cancel" />
  <input type="hidden" id="cdm_language_ok" value="OK" />
  <input type="hidden" id="cdm_language_authenticating" value="Authenticating" />
  <input type="hidden" id="cdm_language_loading" value="loading..." />
  <input type="hidden" id="cdm_language_allCollections" value="All Collections" />
  <input type="hidden" id="cdm_language_remove" value="remove" />
  <input type="hidden" id="cdm_language_plus" value="Plus" />
  <input type="hidden" id="cdm_language_more" value="more" />
  <input type="hidden" id="cdm_language_foundindocument" value="found in document" />
  <input type="hidden" id="cdm_language_for" value="for" />

  <input type="hidden" id="cdm_language_error_nousernameentered" value="Please enter a user name." />
  <input type="hidden" id="cdm_language_error_nopasswordentered" value="Please enter a password" />
  <input type="hidden" id="cdm_language_error_authenticationfailed" value="Authentication Failed\nThe user name and/or password is not recognized.\nPlease check the spelling and try again." />
  <!-- end language fields -->

 
  </body>
</html>

xing93111 avatar Oct 18 '18 22:10 xing93111

You need the 'utils' subdirectory. Try:

curl http://digicon.athabascau.ca/utils/getdownloaditem/collection/auarchives/id/499/type/compoundobject/show/1/cpdtype/document-pdf/filename/485.pdf/width/0/height/0/mapsto/pdf/filesize/0/title/Page%201

mjordan avatar Oct 18 '18 22:10 mjordan

This is the response:

billg@lib10:~$ curl http://digicon.athabascau.ca/utils/getdownloaditem/collection/auarchives/id/499/type/compoundobject/show/1/cpdtype/document-pdf/filename/485.pdf/width/0/height/0/mapsto/pdf/filesize/0/title/Page%201
<?xml version="1.0"?>
<cpd>
	<type>Document</type>
  <page>
    <pagetitle>Page 1</pagetitle>
    <pagefile>485.pdf</pagefile>
    <pageptr>484</pageptr>
  </page>
  <page>
    <pagetitle>Page 2</pagetitle>
    <pagefile>486.pdf</pagefile>
    <pageptr>485</pageptr>
  </page>
  <page>
    <pagetitle>Page 3</pagetitle>
    <pagefile>487.pdf</pagefile>
    <pageptr>486</pageptr>
  </page>
  <page>
    <pagetitle>Page 4</pagetitle>
    <pagefile>488.pdf</pagefile>
    <pageptr>487</pageptr>
  </page>
  <page>
    <pagetitle>Page 5</pagetitle>
    <pagefile>489.pdf</pagefile>
    <pageptr>488</pageptr>
  </page>
  <page>
    <pagetitle>Page 6</pagetitle>
    <pagefile>490.pdf</pagefile>
    <pageptr>489</pageptr>
  </page>
  <page>
    <pagetitle>Page 7</pagetitle>
    <pagefile>491.pdf</pagefile>
    <pageptr>490</pageptr>
  </page>
  <page>
    <pagetitle>Page 8</pagetitle>
    <pagefile>492.pdf</pagefile>
    <pageptr>491</pageptr>
  </page>
  <page>
    <pagetitle>Page 9</pagetitle>
    <pagefile>493.pdf</pagefile>
    <pageptr>492</pageptr>
  </page>
  <page>
    <pagetitle>Page 10</pagetitle>
    <pagefile>494.pdf</pagefile>
    <pageptr>493</pageptr>
  </page>
  <page>
    <pagetitle>Page 11</pagetitle>
    <pagefile>495.pdf</pagefile>
    <pageptr>494</pageptr>
  </page>
  <page>
    <pagetitle>Page 12</pagetitle>
    <pagefile>496.pdf</pagefile>
    <pageptr>495</pageptr>
  </page>
  <page>
    <pagetitle>Page 13</pagetitle>
    <pagefile>497.pdf</pagefile>
    <pageptr>496</pageptr>
  </page>
  <page>
    <pagetitle>Page 14</pagetitle>
    <pagefile>498.pdf</pagefile>
    <pageptr>497</pageptr>
  </page>
  <page>
    <pagetitle>Page 15</pagetitle>
    <pagefile>499.pdf</pagefile>
    <pageptr>498</pageptr>
  </page>
</cpd>

xing93111 avatar Oct 18 '18 22:10 xing93111

At http://digicon.athabascau.ca/cdm/ref/collection/auarchives/id/499, if I wanted to download the entire document as a single PDF, how would I do that? I don't see a link that will allow me to do that. Is there an admin option that turns off that feature, and if so, do you have it turned off?

mjordan avatar Oct 18 '18 22:10 mjordan

I don't see a button for downloading the entire compound object as a single PDF file, and I can't find an option in the backend to turn it on or off. However, this object has a download link: http://digicon.athabascau.ca/cdm/ref/collection/auriver/id/454. But I think it is a single object rather than a compound one.

xing93111 avatar Oct 19 '18 15:10 xing93111

Correct, that is a single-file object, not a compound.

mjordan avatar Oct 19 '18 15:10 mjordan

I think the manipulator has some problems. If I configure it like:

fetchermanipulators[] = "CdmCompound|Document-PDF"

it does not work. The output of MIK is:

Commencing MIK.
Filtering 2 records through the CdmCompound fetcher manipulator.
==========================================================================================> 100%
Creating 0 Islandora ingest packages. Please be patient.

It just filtered out the two records in the collection. Then, I changed the manipulator like this:

fetchermanipulators[] = "CdmCompound|Document"

because I found the object types are

65,compound,Document
586,compound,Document

It does work, but again I get corrupted PDF files because they are in fact XML files.

So I think the manipulators section on this page: https://github.com/MarcusBarnes/mik/wiki/Toolchain:-CONTENTdm-compound-PDFs should not be restricted to

fetchermanipulators[] = "CdmCompound|Document-PDF"
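The output above is consistent with a filter that compares the compound type string exactly, so that "Document-PDF" never matches objects typed "Document". The sketch below only illustrates that assumption with the two records from this collection; it is not MIK's actual CdmCompound code, and `filter_by_compound_type` is a hypothetical name.

```python
def filter_by_compound_type(records, wanted_type):
    """Keep only compound records whose type string equals wanted_type exactly."""
    return [
        record for record in records
        if record[1] == "compound" and record[2] == wanted_type
    ]


# (pointer, object kind, compound type) as reported for this collection.
records = [(65, "compound", "Document"), (586, "compound", "Document")]
```

Under this assumption, filtering `records` on "Document-PDF" leaves nothing to package, while filtering on "Document" keeps both objects.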

xing93111 avatar Oct 19 '18 16:10 xing93111

@xing93111 can you test compound PDF documents with MIK as it stood prior to #223 and the work that brought MIK in line with coding standards? Try commit 9c6b8c537f477fd82f20f3c6ba2563fcd30bd7f5. The compound PDF toolchain code at that commit is essentially how it stood when SFU migrated its compound PDFs (as far as the compound PDF document code is concerned, anyway). You will need to adjust your .ini file to use CdmPhpDocuments and not CdmPdfDocuments (which is what #223 fixed).

If this works for you, then there is a problem with the current MIK code that we need to fix; if it doesn't, then we need to confirm that your CONTENTdm can produce a single multipage PDF from single-page PDFs (which we have not done yet) and go from there.

@MarcusBarnes does this seem like a reasonable way of narrowing down the problem?

Does anyone know of another CONTENTdm instance that we can test against?

mjordan avatar Oct 25 '18 15:10 mjordan

@mjordan I don't see a class named CdmPhpDocuments on this page: https://github.com/MarcusBarnes/mik/tree/9c6b8c537f477fd82f20f3c6ba2563fcd30bd7f5/src/filegetters. I assume this is the commit you would like me to check out. If there is no such class, the command will definitely fail.

xing93111 avatar Oct 25 '18 21:10 xing93111

@xing93111 You're looking at the current code rather than the code from the earlier commit. In your MIK directory:

git checkout -b CdmPhpDocuments

Then you'll need to git reset --hard 9c6b8c5

This will take you to the earlier commit... Look in src/filegetters to see what the filename is.

bondjimbond avatar Oct 26 '18 14:10 bondjimbond

I still don't see the class. Here are my commands:

billg@lib10:/data4/test$ git clone https://github.com/MarcusBarnes/mik.git
Cloning into 'mik'...
remote: Enumerating objects: 18, done.
remote: Counting objects: 100% (18/18), done.
remote: Compressing objects: 100% (14/14), done.
remote: Total 5254 (delta 6), reused 10 (delta 4), pack-reused 5236
Receiving objects: 100% (5254/5254), 1.47 MiB | 0 bytes/s, done.
Resolving deltas: 100% (3468/3468), done.
Checking connectivity... done.
billg@lib10:/data4/test$ ls
mik
billg@lib10:/data4/test$ cd mik
billg@lib10:/data4/test/mik$ ls
composer.json  CONTRIBUTING.md  LICENSE  phpunit.xml.dist  src
composer.lock  extras           mik      README.md         tests
billg@lib10:/data4/test/mik$ git checkout -b CdmPhpDocuments
Switched to a new branch 'CdmPhpDocuments'
billg@lib10:/data4/test/mik$ 
billg@lib10:/data4/test/mik$ git reset --hard 9c6b8c5
HEAD is now at 9c6b8c5 Work on #397.
billg@lib10:/data4/test/mik$ ls
composer.json  composer.lock  CONTRIBUTING.md  extras  LICENSE  mik  README.md  src  tests
billg@lib10:/data4/test/mik$ cd src
billg@lib10:/data4/test/mik/src$ ls
config               fetchers                filemanipulators      metadataparsers
exceptions           filegettermanipulators  inputvalidators       utilities
fetchermanipulators  filegetters             metadatamanipulators  writers
billg@lib10:/data4/test/mik/src$ cd filegetters
billg@lib10:/data4/test/mik/src/filegetters$ ls
CdmBooks.php       CdmPdfDocuments.php  CsvCompound.php    FileGetter.php          OaipmhXpath.php
CdmCompound.php    CdmSingleFile.php    CsvNewspapers.php  OaipmhIslandoraObj.php
CdmNewspapers.php  CsvBooks.php         CsvSingleFile.php  OaipmhOjsPdf.php
billg@lib10:/data4/test/mik/src/filegetters$ 

Anything wrong?

xing93111 avatar Oct 26 '18 14:10 xing93111

I gave you the wrong commit hash. Try b6b8f0a280509cdae4ff11324c99ef14ffad8781, that puts the old filegetter back.

mjordan avatar Oct 26 '18 15:10 mjordan

@mjordan It seems the vendor folder is missing in this version of the code. Here is the output:

billg@lib10:/data4/projects/arca$ ./mik/mik -c ./collections/AUebooks/config.ini
PHP Warning:  require(vendor/autoload.php): failed to open stream: No such file or directory in /data4/projects/arca/mik/mik on line 10
PHP Fatal error:  require(): Failed opening required 'vendor/autoload.php' (include_path='.:/usr/share/php') in /data4/projects/arca/mik/mik on line 10
billg@lib10:/data4/projects/arca$ cd mik
billg@lib10:/data4/projects/arca/mik$ ls
composer.json  CONTRIBUTING.md  LICENSE  README_DEV.md  src
composer.lock  extras           mik      README.md      tests

xing93111 avatar Oct 26 '18 15:10 xing93111

When I check that commit out, vendor is still there. Did you try running composer update after you checked out b6b8f0a280509cdae4ff11324c99ef14ffad8781?

mjordan avatar Oct 26 '18 16:10 mjordan

It's also good to run composer dump-autoload so that any new classes are available via autoloading (after having run composer update to generate the vendor folder with any dependencies, etc.).

MarcusBarnes avatar Oct 26 '18 16:10 MarcusBarnes

@MarcusBarnes got the vendor folder, but now:

billg@lib10:/data4/projects/arca$ ./mik/mik -c ./collections/AUebooks/config.ini
PHP Fatal error:  Uncaught Error: Class 'Commando\Command' not found in /data4/projects/arca/mik/mik:20
Stack trace:
#0 {main}
  thrown in /data4/projects/arca/mik/mik on line 20

xing93111 avatar Oct 26 '18 16:10 xing93111

Do you still get that error after running composer dump-autoload?

mjordan avatar Oct 26 '18 16:10 mjordan

After running composer dump-autoload, I have the vendor folder, but I still get the error above: Commando\Command not found.

xing93111 avatar Oct 26 '18 16:10 xing93111

What do you see if you run ls vendor/nategood/commando/src/Commando/ from within the mik directory?

mjordan avatar Oct 26 '18 16:10 mjordan

@xing93111 Following up on @mjordan's comment, double-check whether Commando is listed in your composer.json file (it might have been added after the commit that we're working from). If it's not there, overwrite your existing composer.json file with a copy of the latest composer.json file and then run composer install, then composer dump-autoload.
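The composer.json check can also be scripted. This is a small sketch, assuming the package name is `nategood/commando` (inferred from the vendor path mentioned earlier) and that dependencies are declared under the standard `require` or `require-dev` keys; `requires_package` is a hypothetical helper name.

```python
import json


def requires_package(composer_json_text: str, package: str) -> bool:
    """Return True if the package is listed under require or require-dev."""
    data = json.loads(composer_json_text)
    for section in ("require", "require-dev"):
        # Each section maps package names to version constraints.
        if package in data.get(section, {}):
            return True
    return False
```

From within the mik directory, `requires_package(open("composer.json").read(), "nategood/commando")` returning `False` would suggest that the checked-out composer.json predates the dependency.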

MarcusBarnes avatar Oct 26 '18 16:10 MarcusBarnes

The command line works now @MarcusBarnes. However, it still outputs corrupted PDFs as I mentioned above: https://github.com/MarcusBarnes/mik/issues/492#issuecomment-431150182

xing93111 avatar Oct 26 '18 17:10 xing93111