Add command `string-to-variable` to reuse incoming string as variable

Open TobiasNx opened this issue 1 year ago • 9 comments

At the moment we cannot use the incoming url-string after it is used in open-http.

A useful scenario would be scraping a website that does not provide its URL as metadata, where we want to quickly identify the source. Another would be error handling: when an error is caught later in the process, it could state the _id as the source of the error.

There could also be a more abstract approach, since this would also be useful for open-file, which could provide the file name as _id.

e.g.: https://metafacture.org/playground/?flux=%22https%3A//phet-dev.colorado.edu/html/build-an-atom/0.0.0-3/simple-text-only-test-page.html%22%0A%7C+open-http%28accept%3D%22application/xml%22%29%0A%7C+decode-html%0A%7C+fix%28%22copy_field%28%27_id%27%2C%27_id%27%29%22%29%0A%7C+encode-json%28prettyPrinting%3D%22true%22%29%0A%7C+print%0A%3B

Not sure where the value of _id comes from.

PS (17.9.24): I suggest introducing a command, string-to-variable, that would reuse the incoming string as a Java variable. That would be a generic approach, and the command could be put in front of the specific opener.

TobiasNx avatar May 23 '24 09:05 TobiasNx

_id is the internal record identifier which is set automatically by some decoder/handler modules and which can be set manually (based on some literal value) with the change-id Flux command.

It can not be set by input modules, because they don't know anything about records at that point. OTOH, the source location (URL, path) is not available anymore when the decoder receives the stream and there is (currently) no way to transport it out-of-band. Setting the ID to the source location would also mean that (potentially) multiple records would get the same ID, so it violates the uniqueness guarantee.

It might, however, be possible to save the URL in a variable which can then be used in the transformation. Maybe along the following lines:

default inputUrl = "https://phet-dev.colorado.edu/html/build-an-atom/0.0.0-3/simple-text-only-test-page.html";

inputUrl
| open-http(accept="application/xml")
| decode-html
| fix("set_field('_id', '$[inputUrl]')", *)
| change-id
| fix("copy_field('_id', '_id')")
| encode-json(prettyPrinting="true")
| print
;

blackwinter avatar May 23 '24 11:05 blackwinter

I would be fine with a variable that could be used in the FIX and the FLUX.

It would help in this scenario.

It would also be nice to use the variable in, e.g., logging contexts or in other scenarios as a variable in the FLUX, but this would be an additional feature.

TobiasNx avatar May 23 '24 12:05 TobiasNx

> I would be fine with a variable that could be used in the FIX and the FLUX.

So your initial use case is solved?

> It would also be nice to use the variable in, e.g., logging contexts or in other scenarios as a variable in the FLUX, but this would be an additional feature.

I'm not sure I understand this part. Do you mean that all variables should be included whenever anything is logged? And what other contexts are you referring to?

blackwinter avatar May 23 '24 13:05 blackwinter

> > I would be fine with a variable that could be used in the FIX and the FLUX.

> So your initial use case is solved?

I think if I could use the variable in the fix my use case would be solved yes. :)

> > It would also be nice to use the variable in, e.g., logging contexts or in other scenarios as a variable in the FLUX, but this would be an additional feature.

> I'm not sure I understand this part. Do you mean that all variables should be included whenever anything is logged? And what other contexts are you referring to?

One scenario where the variable would be handy is a configurable logging message that includes the variable in its output. Another is writing an output file named after a variable, if the file name were passed on as one. But these are additional features; what matters in the first place is having the variable available for FIX and for other FLUX commands.

TobiasNx avatar May 23 '24 15:05 TobiasNx

> I think if I could use the variable in the fix my use case would be solved yes. :)

But you can. Doesn't the proposed solution work for you?

blackwinter avatar May 23 '24 15:05 blackwinter

Ahh, now I see the specific aspect of your approach. I thought you were suggesting that the opener module would create the variable, but you were not.

Something like this:

sitemap
| oersi.SitemapReader(wait=input_wait, limit=input_limit, urlPattern=".*/course/.*")
| open-http(input-to-variable="inputUrl")
| decode-html
| fix("set_field('_id', '$[inputUrl]')", *)
| change-id
| fix("copy_field('_id', '_id')")
| encode-json(prettyPrinting="true")
| print
;

Instead you would define the variable beforehand.

This would not solve my use case, since you have to provide/configure the variable outside of the Flux workflow itself. In our scenario, the use case would be to read a sitemap via the sitemap reader in oersi, then open the HTML and fetch data. I do not know the data beforehand.

Perhaps another and more general solution would be a flux-module that sets the incoming string as variable.

sitemap
| oersi.SitemapReader(wait=input_wait, limit=input_limit, urlPattern=".*/course/.*")
| string-to-variable("inputUrl")
| open-http(header=user_agent_header)
| decode-html
| fix("set_field('_id', '$[inputUrl]')", *)
| change-id
| fix("copy_field('_id', '_id')")
| encode-json(prettyPrinting="true")
| print
;

TobiasNx avatar May 28 '24 08:05 TobiasNx

> Perhaps another and more general solution would be a flux-module that sets the incoming string as variable.
>
> sitemap
> | oersi.SitemapReader(wait=input_wait, limit=input_limit, urlPattern=".*/course/.*")
> | string-to-variable("inputUrl")
> | open-http(header=user_agent_header)
> | decode-html
> | fix("set_field('_id', '$[inputUrl]')", *)
> | change-id
> | fix("copy_field('_id', '_id')")
> | encode-json(prettyPrinting="true")
> | print
> ;

I suggest we go with this approach.

TobiasNx avatar Sep 17 '24 13:09 TobiasNx

I don't think your idea would work: you seem to propose setting a variable globally, i.e. one that could be accessed independently of the modules. This must ultimately break when threads are used. It would break even earlier, because the modules are stream-based: you cannot guarantee that the variable is not changed before the data associated with it has been processed by the downstream modules.
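The overwrite problem described above can be illustrated with a minimal Java sketch. It is not Metafacture code: `SharedVariableDemo`, `inputUrl`, and `run` are hypothetical names, and the sketch assumes that some stage buffers its input (or runs asynchronously), so that the upstream stage handles several strings before the downstream stage reads the variable.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a single global Flux variable; not Metafacture API.
class SharedVariableDemo {

    // The one shared "variable" that every module can read and write.
    static String inputUrl;

    // Simulates an upstream stage that sets the variable for each URL, and a
    // downstream stage that reads it only after all payloads were buffered.
    static List<String> run(final List<String> urls) {
        final List<String> buffered = new ArrayList<>();
        for (final String url : urls) {
            inputUrl = url;                      // "string-to-variable" step
            buffered.add("content of " + url);   // payload passed downstream
        }

        final List<String> tagged = new ArrayList<>();
        for (final String content : buffered) {
            // By now the variable holds the LAST url, whatever the payload is.
            tagged.add(content + " <- tagged with " + inputUrl);
        }
        return tagged;
    }
}
```

Calling `SharedVariableDemo.run(List.of("a", "b"))` tags both payloads with `b`, which is the mismatch a global variable would risk whenever the stages are not strictly in lockstep.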

A possible solution, if _id is really unique and can be accessed unambiguously throughout all modules, might be to use a global hash table in which your variables are associated with that _id.
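A rough Java sketch of such a lookup table follows. `SourceRegistry`, `register`, and `take` are hypothetical names, not existing Metafacture classes or methods, and the sketch leaves open which module would know both the record ID and the source location at registration time.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical registry that associates a record ID with its source location.
class SourceRegistry {

    // ConcurrentHashMap allows safe access from several threads.
    private final Map<String, String> sourceByRecordId = new ConcurrentHashMap<>();

    // Remembers where the record with the given ID came from.
    void register(final String recordId, final String source) {
        sourceByRecordId.put(recordId, source);
    }

    // Returns the source for the record ID and removes the entry, so the
    // table does not grow without bound; returns null if nothing is stored.
    String take(final String recordId) {
        return sourceByRecordId.remove(recordId);
    }
}
```

Because every lookup is keyed by the record ID, a downstream module would get the matching source even if other records were registered in the meantime.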

dr0i avatar Sep 19 '24 08:09 dr0i

Yes, the intent is to set a global variable that can be reused at a later stage, e.g. in a use case we had in oersi-marc:

Open a folder of files, manipulate them, and later reuse the file names from the incoming string.

"folderPath"
| open-dir
| string-to-variable
| open-file
| decode-json
| batch-reset("1")
| fix("copy_field('$[inputString]', 'fileName')")
| encode-json
| write("output/$[inputString]")
;

The other scenario comes from oersi: when using oersi.SitemapReader or a text file with multiple URLs, one cannot get the URL of each subpage or of each entry in the text file.

Maybe I am thinking about this in too simple a way. By the way, while writing this I realize that you are right: my solution would not be good enough and would not solve my scenario, since the incoming string in open-file is a relative path, not the file name...

So I go back to my old idea: open-file and open-http should provide the file name/file path or the URL as a variable for later use. But making this thread-safe seems to be difficult.
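For the relative-path problem specifically, the file name can be derived from the path with the standard Java NIO API. The following is only a sketch of that one step, with `FileNames` and `fileNameOf` as hypothetical helper names; it does not address how the value would be made available to later modules.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper; only shows how to get a file name from a path string.
class FileNames {

    // Returns the last path element, e.g. the file name of a relative path.
    // Falls back to an empty string for paths without a name element.
    static String fileNameOf(final String path) {
        final Path fileName = Paths.get(path).getFileName();
        return fileName == null ? "" : fileName.toString();
    }
}
```

For example, `FileNames.fileNameOf("input/records/a.json")` yields `a.json`.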

TobiasNx avatar Sep 19 '24 09:09 TobiasNx

The use case came up again in the context of https://github.com/hbz/eli-sa-mapping-onix3.0-to-marcXml

I have a folder with different files. I want to transform these files but keep the filenames for the newly generated files as described here: https://github.com/metafacture/metafacture-core/issues/533#issuecomment-2360466579

> Open a folder of files, manipulate them, and later reuse the file names from the incoming string.
>
> "folderPath"
> | open-dir
> | string-to-variable
> | open-file
> | decode-json
> | batch-reset("1")
> | fix("copy_field('$[inputString]', 'fileName')")
> | encode-json
> | write("output/$[inputString]")
> ;

TobiasNx avatar Feb 12 '25 16:02 TobiasNx