Add command `string-to-variable` to reuse incoming string as variable
At the moment we cannot reuse the incoming URL string after it has been consumed by open-http.
One useful scenario: we scrape a website that does not provide its URL as metadata, and we want to identify the source quickly. Another: when catching errors in a later process, the error message could state the _id as the source of the error.
There could also be a more abstract approach, since this would be useful for open-file as well, which could provide the file name as _id.
e.g.: https://metafacture.org/playground/?flux=%22https%3A//phet-dev.colorado.edu/html/build-an-atom/0.0.0-3/simple-text-only-test-page.html%22%0A%7C+open-http%28accept%3D%22application/xml%22%29%0A%7C+decode-html%0A%7C+fix%28%22copy_field%28%27_id%27%2C%27_id%27%29%22%29%0A%7C+encode-json%28prettyPrinting%3D%22true%22%29%0A%7C+print%0A%3B
Not sure where the value of _id comes from.
PS: 17.9.24:
I suggest introducing a command, string-to-variable, that would store the incoming string in a Java variable. That would be a generic approach, and the command could be placed in front of the specific opener.
_id is the internal record identifier which is set automatically by some decoder/handler modules and which can be set manually (based on some literal value) with the change-id Flux command.
It cannot be set by input modules, because they don't know anything about records at that point. OTOH, the source location (URL, path) is no longer available when the decoder receives the stream, and there is (currently) no way to transport it out-of-band. Setting the ID to the source location would also mean that (potentially) multiple records would get the same ID, which violates the uniqueness guarantee.
It might, however, be possible to save the URL in a variable which can then be used in the transformation. Maybe along the following lines:
default inputUrl = "https://phet-dev.colorado.edu/html/build-an-atom/0.0.0-3/simple-text-only-test-page.html";
inputUrl
| open-http(accept="application/xml")
| decode-html
| fix("set_field('_id', '$[inputUrl]')", *)
| change-id
| fix("copy_field('_id', '_id')")
| encode-json(prettyPrinting="true")
| print
;
I would be fine with a variable that could be used in the FIX and the FLUX.
It would help in this scenario.
It would also be nice to use the variable in, e.g., logging contexts or elsewhere in the FLUX, but that would be an additional feature.
> I would be fine with a variable that could be used in the FIX and the FLUX.
So your initial use case is solved?
> It would also be nice to use the variable in, e.g., logging contexts or elsewhere in the FLUX, but that would be an additional feature.
I'm not sure I understand this part. Do you mean that all variables should be included whenever anything is logged? And what other contexts are you referring to?
> > I would be fine with a variable that could be used in the FIX and the FLUX.
>
> So your initial use case is solved?
I think if I could use the variable in the fix my use case would be solved yes. :)
> > It would also be nice to use the variable in, e.g., logging contexts or elsewhere in the FLUX, but that would be an additional feature.
>
> I'm not sure I understand this part. Do you mean that all variables should be included whenever anything is logged? And what other contexts are you referring to?
One scenario where the variable would be handy: configuring the logging message and adding the variable to the output. Another: if the file name is passed on as a variable, I could use it to write a file with that name. But these are additional features; what matters first is having the variable available for FIX and for other FLUX commands.
> I think if I could use the variable in the fix my use case would be solved yes. :)
But you can. Doesn't the proposed solution work for you?
Ahh, now I see the specific aspect of your approach. I thought you were suggesting that the opener module would create the variable, but you were not.
Something like this:
sitemap
| oersi.SitemapReader(wait=input_wait, limit=input_limit, urlPattern=".*/course/.*")
| open-http(input-to-variable="inputUrl")
| decode-html
| fix("set_field('_id', '$[inputUrl]')", *)
| change-id
| fix("copy_field('_id', '_id')")
| encode-json(prettyPrinting="true")
| print
;
Instead, you would define the variable beforehand.
This would not solve my use case, since the variable has to be provided/configured outside of the Flux workflow itself. In our scenario, the use case would be to read a sitemap via the sitemap reader in oersi, then open the HTML and fetch data. I do not know the data beforehand.
Perhaps another, more general solution would be a Flux module that sets the incoming string as a variable.
sitemap
| oersi.SitemapReader(wait=input_wait, limit=input_limit, urlPattern=".*/course/.*")
| string-to-variable("inputUrl")
| open-http(header=user_agent_header)
| decode-html
| fix("set_field('_id', '$[inputUrl]')", *)
| change-id
| fix("copy_field('_id', '_id')")
| encode-json(prettyPrinting="true")
| print
;
> Perhaps another and more general solution would be a flux-module that sets the incoming string as variable.
>
> sitemap
> | oersi.SitemapReader(wait=input_wait, limit=input_limit, urlPattern=".*/course/.*")
> | string-to-variable("inputUrl")
> | open-http(header=user_agent_header)
> | decode-html
> | fix("set_field('_id', '$[inputUrl]')", *)
> | change-id
> | fix("copy_field('_id', '_id')")
> | encode-json(prettyPrinting="true")
> | print
> ;
I suggest we go with this approach.
I don't think your idea would work: you seem to propose setting a variable globally, i.e. one that could be accessed independently of the modules. This must break, ultimately, when using threads. It would break even earlier, because the modules are stream-based, and you cannot guarantee that the variable has not changed before the data associated with it has been processed by the downstream modules.
A possible solution might be, if _id really is unique and can be accessed unambiguously throughout all modules, to make use of a global hash table in which your variables are associated with that _id.
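To make the concern and the suggested workaround concrete, here is a minimal Java sketch using only the standard library. It is not Metafacture code: the class `SourceTrackingSketch`, its methods, the `rec-N` IDs and the example.org URLs are all made up for illustration. It simulates an upstream stage that emits every URL before the downstream stage reads anything, first with a single global variable and then with a map keyed by a unique record ID.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SourceTrackingSketch {

    // Hypothetical global variable, as a string-to-variable module might set it.
    static volatile String globalInputUrl;

    // Hypothetical registry that associates each unique record ID with its source.
    static final Map<String, String> sourceByRecordId = new ConcurrentHashMap<>();

    /** Upstream overwrites one global variable; downstream reads it afterwards. */
    static List<String> runWithGlobalVariable(List<String> urls) {
        List<String> bufferedRecords = new ArrayList<>();
        for (String url : urls) {
            globalInputUrl = url; // overwritten for every incoming string
            bufferedRecords.add("record-from-" + url);
        }
        List<String> seenDownstream = new ArrayList<>();
        for (String ignored : bufferedRecords) {
            // Downstream only ever sees the most recent value.
            seenDownstream.add(globalInputUrl);
        }
        return seenDownstream;
    }

    /** Upstream stores the source per record ID; downstream looks it up by ID. */
    static List<String> runWithRegistry(List<String> urls) {
        List<String> recordIds = new ArrayList<>();
        int counter = 0;
        for (String url : urls) {
            String recordId = "rec-" + (++counter); // assumed to be unique
            sourceByRecordId.put(recordId, url);
            recordIds.add(recordId);
        }
        List<String> seenDownstream = new ArrayList<>();
        for (String recordId : recordIds) {
            seenDownstream.add(sourceByRecordId.get(recordId));
        }
        return seenDownstream;
    }

    public static void main(String[] args) {
        List<String> urls = List.of("https://example.org/a", "https://example.org/b");
        System.out.println(runWithGlobalVariable(urls));
        System.out.println(runWithRegistry(urls));
    }
}
```

Running `main` prints `[https://example.org/b, https://example.org/b]` for the global variable and `[https://example.org/a, https://example.org/b]` for the registry. A real implementation would also have to remove entries once a record is done, otherwise the map keeps growing.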
Yes, the intent is to set a global variable that can be reused at a later stage, e.g. in a use case we had in oersi-marc:
opening a folder with files, manipulating them, and later reusing the file names from the incoming string.
"folderPath"
| open-dir
| string-to-variable
| open-file
| decode-json
| batch-reset("1")
| fix("add_field('fileName', '$[inputString]')", *)
| encode-json
| write("output/$[inputString]")
;
The other scenario comes from oersi: when using oersi.SitemapReader or a text file with multiple URLs, one cannot get the URL of each subpage or of each entry in the text file.
Maybe I am thinking about this in an oversimplified way. By the way, while writing this I realized that you are right: my solution would not be good enough and would not solve my scenario, since the incoming string in open-file is a relative path, not the file name...
Then I go back to my old idea: open-file and open-http should provide the file name/file path or the URL as a variable for later use. But making this thread-safe seems difficult.
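On the relative path vs. file name point: if the path were available as a string, deriving the file name from it is the easy part. A small Java sketch with `java.nio.file`, where the class `FileNameSketch`, its methods and the sample path are only assumptions for illustration and not part of Metafacture:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class FileNameSketch {

    /** Returns the last path element, e.g. the file name of a relative path. */
    static String fileNameOf(String relativePath) {
        Path fileName = Paths.get(relativePath).getFileName();
        // getFileName() returns null for a path with no elements (e.g. a root).
        return fileName == null ? "" : fileName.toString();
    }

    /** Builds an output location that keeps the original file name. */
    static String outputPathFor(String relativePath) {
        return "output/" + fileNameOf(relativePath);
    }

    public static void main(String[] args) {
        // Hypothetical path, as a directory opener might emit it.
        String incoming = "folderPath/records-01.json";
        System.out.println(fileNameOf(incoming));
        System.out.println(outputPathFor(incoming));
    }
}
```

This only covers the string handling; how the path would reach a later module in a thread-safe way is still the open question.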
The use case came up again in the context of https://github.com/hbz/eli-sa-mapping-onix3.0-to-marcXml
I have a folder with different files. I want to transform these files but keep the file names for the newly generated files, as described here: https://github.com/metafacture/metafacture-core/issues/533#issuecomment-2360466579
> opening a folder with files, manipulating them, and later reusing the file names from the incoming string.
>
> "folderPath"
> | open-dir
> | string-to-variable
> | open-file
> | decode-json
> | batch-reset("1")
> | fix("add_field('fileName', '$[inputString]')", *)
> | encode-json
> | write("output/$[inputString]")
> ;