support for XML data input & repeating fields

Open berteh opened this issue 9 years ago • 17 comments

request by @garydale:

.csv doesn't allow me to include all the data I'd like. My original XML file has repeated fields and I can't find a .csv converter that will keep them. The converter behind the .csv file I've included, for example, handles them as separate .csv files, which would mean a lot of cutting and pasting. Others take just the first instance of a repeated field, which is just as bad.

Don't suppose you could make it work with XML as well?

berteh avatar Sep 11 '15 22:09 berteh

I see 2 distinct parts in your questions:

  1. ability to handle repeated fields: CSV itself does allow them; your converter does not, and neither does ScribusGenerator so far. Could you not simply concatenate the values of your XML fields into a single field? Or, if they repeat only a few times, create a CSV with indexed field names, e.g. repeat_1, repeat_2, repeat_3, and then simply put the 3 fields, one after the other, where you want them in Scribus? Empty fields are supported just fine.

I personally usually handle the concatenation in Excel/OpenOffice with a formula, so I can add a separator (like ',' or 'and' or 'or'), to have a nice looking sentence... and keep it all in the csv (both repeated fields and the concatenated), but only use the concatenated in my template.

What I could (maybe) do is add a convention, e.g. %VAR_somename*%, that is replaced by the concatenated value of all fields starting with somename, something like a wildcard. But that would not solve your problem of lacking a suitable XML-to-CSV converter, and it seems like a lot of work to me for limited benefit (since office suites do that really easily).

  2. ability to input XML: that might be possible, but not easy either, as many syntax variants would be possible, and I don't want to have to manage 10+ different variants. If you have a single/simple solution let me know, and I'd consider it.

By the way: XML support would not solve on its own the problem of repeating fields: we'd need a convention to handle them (only the first, the last, always concatenate, always sum, multiply, or what...?). What would you propose?
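To make the question of conventions concrete, here is a minimal Python sketch of the wildcard idea from point 1 combined with the candidate rules listed above (first, last, concatenate, sum). It is not ScribusGenerator code: `combine` and `wildcard` are hypothetical helper names, and the row is assumed to be a plain dict as produced by a CSV reader.

```python
def combine(values, how="concat", sep=", "):
    """Reduce the values of a repeated field to one string (hypothetical helper)."""
    values = [v for v in values if v != ""]      # empty cells are ignored
    if not values:
        return ""
    if how == "first":
        return values[0]
    if how == "last":
        return values[-1]
    if how == "concat":
        return sep.join(values)
    if how == "sum":
        return str(sum(float(v) for v in values))
    raise ValueError(f"unknown convention: {how}")


def wildcard(row, prefix, how="concat"):
    """Collect every column whose name starts with prefix, in column order."""
    return combine([v for k, v in row.items() if k.startswith(prefix)], how)


row = {"name": "Ann", "repeat_1": "red", "repeat_2": "", "repeat_3": "blue"}
print(wildcard(row, "repeat_"))            # concatenated, empty cell skipped
print(wildcard(row, "repeat_", "first"))
```

Whichever rule is chosen, skipping empty cells before joining avoids stray separators when a record has fewer repeats than columns.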

Some "competing" XML input layouts that would all need to be accepted, to my mind, are:

<root>
    <fields>
        <field name="field_name" />
    </fields>
    <element>*
        <data value="" />*
    </element>
</root>

(quite verbose, but would support schema validation)

<root>
  <element_name field_name*="data" />*
</root>

(much more concise, but validation impossible)

<root>
  <element_name>*
    <field_name>data</field_name>*
  </element_name>
</root>

(still concise, but no attributes used)

And of course there are combinations of the 2nd and 3rd, and many others. You see my problem, I guess. Any ideas?
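For illustration only, the 2nd and 3rd layouts (attributes and child elements) can be read with the same few lines of Python using the standard `xml.etree.ElementTree` module. The function name `records_from_xml` and the sample strings are assumptions made up for this sketch; every field is kept as a list so that repeated elements are not lost.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def records_from_xml(text):
    """Return one dict per child of the root; each field maps to a list of values."""
    records = []
    for element in ET.fromstring(text):
        fields = defaultdict(list)
        for name, value in element.attrib.items():   # layout 2: attributes
            fields[name].append(value)
        for child in element:                        # layout 3: child elements
            fields[child.tag].append((child.text or "").strip())
        records.append(dict(fields))
    return records


# Hypothetical sample inputs, one per layout
ATTR_STYLE = '<root><club name="A" phone="1"/></root>'
CHILD_STYLE = ("<root><club><name>A</name>"
               "<phone>1</phone><phone>2</phone></club></root>")

print(records_from_xml(ATTR_STYLE))
print(records_from_xml(CHILD_STYLE))
```

The verbose 1st layout (separate field definitions and data elements) would need its own mapping step, which is exactly the kind of variant handling the comment above is wary of.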

berteh avatar Sep 11 '15 22:09 berteh

One option for XML input is maybe to rely on https://pypi.python.org/pypi/xmlutils, which has an XML-to-CSV conversion, as long as the XML is flat enough.

berteh avatar Sep 11 '15 22:09 berteh

On 11/09/15 06:31 PM, Berteh wrote:

> I see 2 distinct parts in your questions:
>
>   1. ability to handle repeated fields: CSV itself does allow them; your converter does not, and neither does ScribusGenerator so far. Could you not simply concatenate the values of your XML fields into a single field? Or, if they repeat only a few times, create a CSV with indexed field names, e.g. repeat_1, repeat_2, repeat_3, and then simply put the 3 fields, one after the other, where you want them in Scribus? Empty fields are supported just fine.

The "solution" from the various converters I've looked at is to concatenate structured data into a single field using separators, only take the first instance of a repeated field, or create additional, mostly blank, lines for the repeated fields. I haven't found one that would append numbers to field names to create new fields.

This would require the converter to do two passes to figure out the number of fields needed before doing the conversion.

> I personally usually handle the concatenation in Excel/OpenOffice with a formula, so I can add a separator (like ',' or 'and' or 'or'), to have a nice looking sentence... and keep it all in the csv (both repeated fields and the concatenated), but only use the concatenated in my template.

Which would require a method of separating them in Scribus.

> What I could (maybe) do is add a convention, e.g. %VAR_somename*%, that is replaced by the concatenated value of all fields starting with somename, something like a wildcard. But that would not solve your problem of lacking a suitable XML-to-CSV converter, and it seems like a lot of work to me for limited benefit (since office suites do that really easily).
>
>   2. ability to input XML: that might be possible, but not easy either, as many syntax variants would be possible, and I don't want to have to manage 10+ different variants. If you have a single/simple solution let me know, and I'd consider it.
>
> By the way: XML support would not solve on its own the problem of repeating fields: we'd need a convention to handle them (only the first, the last, always concatenate, always sum or multiply...?). What would you propose?

The simplest solution would be to insert additional instances of the element but that would require something like xsl to drive the generator.

Perhaps the easier solution would be to create an HTML/XHTML importer. The "get text" option claims to respect html tags but leaves me with an unformatted mass of text. It doesn't respect style sheets nor understand tables. It would be more work to use it than to use any other solution I've found - and certainly far more than the Generator you've developed.

> Some "competing" XML input layouts that would all need to be accepted, to my mind, are:
>
> [three XML layout examples, as in the comment above]
>
> The 1st is nice because it would allow schema validation, the 3rd because it's concise, and of course there are combinations of the 2nd and 3rd, and many others. You see my problem, I guess. Any ideas?

You'd probably have to require and interpret the schema at the start of the xml file (or a link to an external schema). Trying to infer a schema from the data would be difficult. In my case the schema is:

 <!ELEMENT club (clubname, clubnumber?, website?, charter?, region?, 

zone?, meetings?, den_, location_, mailing_, clubemail?, officers?)> ]>

garydale avatar Sep 12 '15 17:09 garydale

handling repeating fields

> The simplest solution would be to insert additional instances of the element but that would require something like xsl to drive the generator.

xsl is not an issue to me... just adding an import (and an extra requirement), we could deactivate that function in the absence of the library... no problem.

So you know: that problem interests me a lot. I think there's much potential there, like populating the rows of a table, or including a variable number of figures, or a variable number of articles, with their headline, teaser and text,... but I have no clue how to handle it in the Scribus template (as mentioned in #11). For instance: should the duplicate instance be shifted (down/left/at all)? Want to have a quick chat about it, so you can explain to me how you would see this working?

> This would require the converter to do two passes to figure out the number of fields needed before doing the conversion.

That's a perfectly valid option to me (XSL functions actually make it not too difficult), unless the size of the XML makes it a bad idea?
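The two-pass idea can be sketched without XSL, in plain Python with the standard library. This is only an illustration under stated assumptions: records are the direct children of the root, fields are their direct children (no nesting), and repeated fields get the indexed names suggested earlier (`mailing_1`, `mailing_2`, ...). The function name `xml_to_indexed_csv` is made up for this sketch.

```python
import csv
import io
import xml.etree.ElementTree as ET
from collections import Counter


def xml_to_indexed_csv(text):
    root = ET.fromstring(text)

    # Pass 1: find, for each field, the largest repeat count in any record.
    max_count = {}
    for record in root:
        for tag, n in Counter(child.tag for child in record).items():
            max_count[tag] = max(max_count.get(tag, 0), n)
    header = []
    for tag, n in max_count.items():
        header += [tag] if n == 1 else [f"{tag}_{i}" for i in range(1, n + 1)]

    # Pass 2: write one CSV row per record; missing repeats stay empty.
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=header, lineterminator="\n")
    writer.writeheader()
    for record in root:
        seen = Counter()
        row = {}
        for child in record:
            seen[child.tag] += 1
            repeated = max_count[child.tag] > 1
            key = f"{child.tag}_{seen[child.tag]}" if repeated else child.tag
            row[key] = (child.text or "").strip()
        writer.writerow(row)
    return out.getvalue()


# Hypothetical sample: the second club has fewer mailing entries than the first
SAMPLE = ("<clubs>"
          "<club><clubname>A</clubname><mailing>x</mailing><mailing>y</mailing></club>"
          "<club><clubname>B</clubname><mailing>z</mailing></club>"
          "</clubs>")
print(xml_to_indexed_csv(SAMPLE))
```

Because the whole tree is parsed into memory before either pass, this approach is only reasonable for XML files of moderate size, which is the concern raised above.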

FYI the "simple" converter I had a look at, from xmlutils, generates, for an XML close to the one you mentioned, a CSV with the following columns. It basically considers the first occurrence of the element as a reference definition (~schema), and requires all subsequent elements to have the same structure (i.e. it does not support a varying number of repeating elements across different elements).

clubname
    clubnumber
    website
    region
    location
    location
    mailing
    mailing
    mailing
            officers_officer_title
            officers_officer_address
            officers_officer_address
            officers_officer_telephone
            officers_officer_telephone
            officers_officer_telephone
        officers_officer
            officers_officer_title
            officers_officer_address
            officers_officer_address
            officers_officer_telephone
            officers_officer_telephone
            officers_officer_telephone
        officers_officer
    officers

To me enforcing the number of repeating elements (eg number of officers, number of telephones) to be always the same is not acceptable... so I'll keep on looking elsewhere.

berteh avatar Sep 13 '15 21:09 berteh

Yes, I've had reasons to do that myself: create a dummy first instance with every field, even if it is optional, and multiple occurrences of anything that could have multiples.

In the context of Scribus Generator, how would the Generator handle multiple fields with the same name? With the %VAR_s you don't have to go column by column, so even detecting multiple columns could be tricky.

Then there is your example above, with the repeating data elements having the same names within different columns so referring to a different XML parent element.

This is where something big and complicated like XSL comes in. You have control logic to help arrange the XML data, which then produces html/xhtml output.

At that point the simplest solution might be to get Scribus's Get Text function to import tables, etc. which it doesn't currently do. But then you'd need to get it to recognize CSS too.

Sticking within the Generator, it's not terribly difficult to give XML fields unique names in KATE, using a combo of escape sequences and regexs. Other editors that handle multiline regexs would probably do it even faster. Or I could learn Awk.

So then I can specify the decisions in the template file with the provision of auto-removing lines or fields that are empty. That still leaves possibly some extra garbage to clean up, like separator characters that aren't needed when there is nothing to separate.

For that, an XSL style "%IF_% " could be all that is needed.

garydale avatar Sep 13 '15 21:09 garydale

Found a csv converter that numbers fields for multiple instances of an xml element. The .csv file it creates is ugly but seems to have everything. It uses "/" to separate elements in a structured name, such as /officer/1/telephone/2 for the 3rd telephone of the second officer.
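The converter itself is not named in this thread, so the following Python sketch only imitates the naming scheme described above: path-style keys with zero-based indices. It is an assumption that only repeated siblings are numbered; `flatten` and the sample XML are invented for the illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def flatten(element, prefix="", out=None):
    """Map each leaf to a key like /officer/1/telephone/2 (zero-based indices)."""
    out = {} if out is None else out
    totals = Counter(child.tag for child in element)
    seen = Counter()
    for child in element:
        path = f"{prefix}/{child.tag}"
        if totals[child.tag] > 1:          # assumption: number repeated siblings only
            path += f"/{seen[child.tag]}"
            seen[child.tag] += 1
        if len(child):                     # element has children: recurse
            flatten(child, path, out)
        else:
            out[path] = (child.text or "").strip()
    return out


# Hypothetical sample with two officers; the second has three telephones
SAMPLE = ("<club><clubname>A</clubname>"
          "<officer><telephone>1</telephone></officer>"
          "<officer><telephone>7</telephone><telephone>8</telephone>"
          "<telephone>9</telephone></officer></club>")
flat = flatten(ET.fromstring(SAMPLE))
print(flat["/officer/1/telephone/2"])      # 3rd telephone of the second officer
```

To get one consistent CSV header across many records, the keys of all flattened records would still have to be merged, which is why such files end up with many empty columns.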

Unfortunately this leaves me with a new problem. I have to include every possible field. While not difficult, it leaves me with lots of blank lines where fields are empty in the .csv file.

Is there a way to conditionally omit empty fields?

garydale avatar Aug 31 '16 03:08 garydale

Yes indeed, empty fields are removed from the generated document. And so are text frames that are empty after all substitutions are done.

It's illustrated in the "clean output" documentation: https://github.com/berteh/ScribusGenerator/blob/master/README.md#clean-output

Kindly let me know how that works for you, and feel free to post an example + link to the converter in the wiki, as it may interest others!

B.

berteh avatar Aug 31 '16 05:08 berteh

Additionally you may need to do a quick find & replace on the generated CSV headers, to remove all "/" characters.

berteh avatar Aug 31 '16 05:08 berteh

Surprisingly no. The generator seems to have no problem with "/" in the variable names.

garydale avatar Aug 31 '16 22:08 garydale

Unfortunately they are not. Some are and some aren't. I'm getting lots of empty lines in my .sla output file.

On 31/08/16 01:54 AM, Berteh wrote:

> Yes indeed, empty fields are removed from the generated document. And so are text frames that are empty after all substitutions are done.

garydale avatar Aug 31 '16 22:08 garydale

Good to know about the "/" character, thanks. Could you please elaborate a little further on the empty variables that are not being removed, in this new issue: #56

berteh avatar Aug 31 '16 23:08 berteh

Hi @garydale

There's a new feature called "next-record" that now makes it possible to display multiple data records on a single page (or document). It's available to test in the branch https://github.com/berteh/ScribusGenerator/tree/next-record. Kindly let me know if that helps in your case, or if you see any better way to do it, or any other major improvement or bug. I'll add some documentation later, but any feedback is already welcome in the meantime. Please have a look at the example: https://github.com/berteh/ScribusGenerator/blob/next-record/example/Next_Record.sla

Berteh.

berteh avatar Mar 12 '17 23:03 berteh

Would it be possible to have the original "sub-records" at the source be output as separate rows, each repeating the values of the parent record?

For example:

1,Home,"Home Page",0,"",""
2,AboutUs,"About Us",0,"",""
3,Product1,"Product 1",1,"Hammer","Red"
4,Product1,"Product 1",1,"Hammer","Blue"
5,Product1,"Product 1",1,"Hammer","Green"

could mean:

1 - Home - Home Page - 0
1 - AboutUs - About Us Page - 0
1 - Product1 - Product 1 - 1 - Hammer

  • Red
  • Blue
  • Green

Where (RedBlueGreen) would be an array within the existing array to allow for depth storage as well.

This system could include rules for when to detect a new row within the current record, such as "when only one value has changed from the previous record", and allow it to be turned on or off.
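As a rough illustration of this idea, the Python sketch below folds consecutive rows that share the same parent columns into one record with a nested list. It uses a simpler rule than the one proposed above (identical parent columns rather than "only one value has changed"), ignores the first (id) column, and assumes the last column holds the sub-record value. `nest_rows` and its parameters are hypothetical names.

```python
import csv
import io
from itertools import groupby

DATA = '''1,Home,"Home Page",0,"",""
2,AboutUs,"About Us",0,"",""
3,Product1,"Product 1",1,"Hammer","Red"
4,Product1,"Product 1",1,"Hammer","Blue"
5,Product1,"Product 1",1,"Hammer","Green"
'''


def nest_rows(rows, parent_cols=slice(1, 5), child_col=5):
    """Fold consecutive rows with identical parent columns into one record."""
    records = []
    for parent, group in groupby(rows, key=lambda r: tuple(r[parent_cols])):
        children = [r[child_col] for r in group if r[child_col]]
        records.append({"parent": parent, "children": children})
    return records


records = nest_rows(list(csv.reader(io.StringIO(DATA))))
for record in records:
    print(record)
```

Note that `groupby` only merges adjacent rows, so the input has to be sorted (or already grouped) by the parent columns.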

That's just an idea, please ignore it if you have already solved the issue.

AaronDP avatar Apr 19 '17 05:04 AaronDP

Thanks @AaronDP for the idea. A few questions to dig into it further... and no, this particular issue is not even close to being solved.

  1. In the end I would like the "template" file (in Scribus) to be as user-friendly as possible. How would you then see this template for your example above?
  2. I think, if nested data becomes supported, I would prefer to use a proper structured data input syntax (JSON, for example, instead of CSV), rather than resort to the rules mechanism you suggest... optionally using a simple (non-nested) converter for backwards compatibility.
  3. Could you go a bit beyond your personal scenario and imagine other similar uses? What would change? Some users may want the sub-elements to display as a horizontal table (e.g. a timeline of Hammers). Some users may want to have sub-sub-elements (Hammers of 3 colors, all in multiple sizes and different materials): how to handle this (or not handle it)?
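To give point 2 something concrete to discuss, here is a minimal Python sketch of what a "simple (non-nested) converter" from JSON to flat, CSV-like field names could look like. Nothing here exists in ScribusGenerator: the function name `flatten_json`, the underscore-joined naming with one-based indices, and the sample record are all assumptions for illustration.

```python
import json


def flatten_json(value, path=()):
    """Turn nested dicts and lists into one flat dict with underscore-joined keys."""
    flat = {}
    if isinstance(value, dict):
        for key, item in value.items():
            flat.update(flatten_json(item, path + (str(key),)))
    elif isinstance(value, list):
        for index, item in enumerate(value, start=1):   # one-based, like repeat_1
            flat.update(flatten_json(item, path + (str(index),)))
    else:
        flat["_".join(path)] = value
    return flat


record = json.loads(
    '{"name": "Hammer", "colors": ['
    '{"name": "Red", "sizes": ["S", "L"]}, '
    '{"name": "Blue", "sizes": ["M"]}]}'
)
flat = flatten_json(record)
for key, value in flat.items():
    print(key, "=", value)
```

A flat mapping like this could feed the existing CSV-based workflow, but it does not answer the template question in point 1: the template would still need one placeholder per possible index.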

Just ideas again, not saying it will be done next week... but that could be fun, even more so if it's useful ;9

berteh avatar Apr 19 '17 21:04 berteh

ScribusGenerator (SG) could be made to connect to disparate data sources using a source priority list, a source definition list, and a source exclusion list. The result could be converted to well-formed XML and then (optionally, translated using a themed stylesheet and) passed to SG as a result set containing all of the fields required, or an error message explaining the problem.

The "distiller" function could: a) check for a "green light" on all data sources needed to supply the request; b) take a snapshot of all data within the same time period; c) translate it to a uniform format and supply it to fill the contents of the *VAR_*varname fields; d) report back with any error messages, but with an empty response on failure. It could also allow the user to specify credentials for some data source connections.

AaronDP avatar Apr 22 '17 06:04 AaronDP

Hi @AaronDP

Thanks for this last suggestion. I personally would prefer to use tools that already exist and do a good job at integrating data sources, rather than try to hack something else.

I would use data integration (ETL) tools such as https://skyvia.com/ or https://www.singer.io, and their many bridges with existing online services, file formats and databases; or a more manual spreadsheet import and filter like Power Queries in Excel or similar data sources in OpenOffice, including Spreadsheet, CSV, Mozilla Address Book and even any simple XML file for the latter.

... and when the data is clean (can be turned into a scheduled job): export to CSV and use a document generation tool, like ScribusGenerator ;)

If your data is really messy, then you may be more interested in data (integration and) cleaning tools, such as OpenRefine, which directly integrates data sources such as TSV, CSV, text files with custom separators or columns split by fixed width, XML, RDF triples, JSON, Google Spreadsheets and Google Fusion Tables, among others.

berteh avatar Apr 23 '17 10:04 berteh

Added this info to the page https://github.com/berteh/ScribusGenerator/wiki/Other-data-sources

berteh avatar Apr 23 '17 20:04 berteh