digdag
[New feature] Digdag users want to share files between tasks on server mode.
Use cases
- Download a file from a database and use it.
- ~~Download a file from a database and execute embulk with the sh> operator. Twitter (Japanese)~~
- Download a file from a database and compress it. #735
- A user runs sh>: command && digdag run wf01.dig && digdag run wf02.dig as a workaround. (Twitter)
- TBW..
In the current server mode, another task can't access the downloaded file (see also #735).
I think an upload_s3 option may solve those use cases.
It would upload pg>, td>, and redshift> results to S3 instead of writing them locally.
For example,
+step1:
  pg>: XXX.sql
  upload_s3: my-bucket/file.csv
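To illustrate how a later task might consume that object, here is a rough sketch. Note that upload_s3 is only the option proposed in this issue and does not exist in Digdag; the +step2 task, the bucket and key, and the AWS CLI being installed with credentials on the agent are all assumptions for illustration.

```yaml
+step1:
  pg>: XXX.sql
  # Proposed option from this issue (not an existing Digdag option):
  # write the query result to S3 instead of the task-local workspace.
  upload_s3: my-bucket/file.csv

# Hypothetical follow-up task: fetch the result and compress it.
# Assumes the AWS CLI is installed and has credentials on the agent.
+step2:
  sh>: aws s3 cp s3://my-bucket/file.csv file.csv && gzip file.csv
```

Because the file lives in S3 rather than in a task workspace, +step2 would not depend on which agent ran +step1.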
How can a user share other files (such as text or binary files) between tasks?
Use case (machine learning application):
- Task A:
  - download several input files
  - do some processing on them
  - output one single txt file
- Task B:
  - input txt file from task A
  - do some processing
  - output single text file
- Task C:
  - ...
Is it possible to implement such a use case in Digdag? Is there a way to define which files are "output" and which are only temporary files that don't belong to the output?
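One way the Task A / Task B chain above could be approximated with existing operators is to pass each output through S3 explicitly. This is only a sketch under assumptions: the scripts, bucket name, and file names are hypothetical, and the AWS CLI is assumed to be available to the sh> operator on every agent. ${session_uuid} is a Digdag built-in variable, used here to keep each session's files apart.

```yaml
# Sketch only: scripts, bucket name, and file names are hypothetical.
# Assumes the AWS CLI is available to the sh> operator on every agent.

+task_a:
  # Download inputs, process them, and publish only the declared output.
  # Anything else left in the task workspace is effectively temporary.
  sh>: ./task_a.sh && aws s3 cp out_a.txt s3://my-bucket/${session_uuid}/out_a.txt

+task_b:
  # Fetch Task A's output, process it, and publish Task B's output.
  sh>: aws s3 cp s3://my-bucket/${session_uuid}/out_a.txt in_b.txt && ./task_b.sh in_b.txt out_b.txt && aws s3 cp out_b.txt s3://my-bucket/${session_uuid}/out_b.txt
```

In this arrangement the "output" files are simply the ones a task uploads; everything else stays in the per-task workspace and is not visible to later tasks.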
I think (as a Digdag user) that sharing data between tasks is an expected feature for a future release.
Where is the input/output data stored? Which operator do you use?
My upload_s3 idea is to share task data between tasks using S3. Does your scenario require local storage?
These slides may help.
Machine learning examples:
- Hivemall meets Digdag @Hackertackle
- Machine Learning and Natural Language Processing on Treasure CDP
- machine-learning examples
Those examples use the Treasure Data data store, because Digdag and Hivemall are maintained by Arm Treasure Data.
As another case, some users use EFS (Amazon Elastic File System) with the sh> operator to avoid the isolated working area.
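A minimal sketch of that EFS approach might look like the following. The /mnt/efs mount point is hypothetical and would have to be mounted at the same path on every agent; the scripts are placeholders.

```yaml
# Sketch only: /mnt/efs is a hypothetical mount point that must exist
# on every agent; the scripts are placeholders.

+task_a:
  # Write the output outside the per-task workspace, under a per-session directory.
  sh>: mkdir -p /mnt/efs/${session_uuid} && ./task_a.sh /mnt/efs/${session_uuid}/out_a.txt

+task_b:
  # Read Task A's output from the shared file system.
  sh>: ./task_b.sh /mnt/efs/${session_uuid}/out_a.txt /mnt/efs/${session_uuid}/out_b.txt
```

Nothing here cleans up the shared directory, so old session directories would need to be removed separately.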
I agree too. My idea is that the workspace could be configured to be generated per session. Currently, it is generated only per task.
As a workaround, I wrote a shell script that runs multiple tasks.