
[Question] What environment did you use for fetching large datasets like dialog_mixture?

Open quq99 opened this issue 1 year ago • 5 comments

Hi, I am trying to fetch FLAN v2 by running:

PYTHONPATH=. python flan/v2/run_example.py

I could successfully run cot_submix, but I hit an out-of-memory issue when trying to fetch dialog_submix on a single AWS p4d instance. Some of the logs showed it downloading wiki_dialog data and apparently doing some processing with Apache Beam:

Warning: The dataset you're trying to generate is using Apache Beam,
yet no `beam_runner` nor `beam_options` was explicitly provided.

Some Beam datasets take weeks to generate, so are usually not suited
for single machine generation. Please have a look at the instructions
to setup distributed generation:

https://www.tensorflow.org/datasets/beam_datasets#generating_a_beam_dataset

What I did was follow the README in the flan/v2 directory: bash setup.sh, then PYTHONPATH=. python flan/v2/run_example.py. The only entry point I could find was the seqio.get_mixture_or_task('dialog_submix').get_dataset() call in that run_example script, and it is not clear to me how seqio.get_dataset() interacts with or invokes Apache Beam. Apart from pip install apache-beam, are there any other setup steps, e.g. environment settings? How can we pass the runner type to beam_runner?
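
For reference, here is roughly what I imagine pre-generating the Beam-backed dataset with an explicit runner would look like, before calling get_dataset(); the dataset name "wiki_dialog/OQ", the data_dir, and the DirectRunner settings below are just my guesses, not something taken from the repo:

import tensorflow_datasets as tfds
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.runners.direct.direct_runner import DirectRunner

# Build the Beam-backed TFDS dataset ahead of time so seqio can read the
# prepared files instead of triggering Beam generation itself.
builder = tfds.builder("wiki_dialog/OQ", data_dir="/path/to/tfds_data")

# DownloadConfig is where TFDS accepts the beam_runner / beam_options that
# the warning above refers to.
download_config = tfds.download.DownloadConfig(
    beam_runner=DirectRunner(),  # single machine; a distributed runner could go here
    beam_options=PipelineOptions(
        flags=[],
        direct_num_workers=8,
        direct_running_mode="multi_processing",
    ),
)
builder.download_and_prepare(download_config=download_config)

Is something along these lines the intended setup, or is there a different path I'm missing?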

And I assume dialog_submix is not even the largest of the five categories. Could you explain what environment you used when running the script to generate the data? For example, do you use multiple machines, e.g. Google Cloud or AWS EC2/EMR? Are there further settings or configs needed before running run_example.py? Thanks a lot!

quq99 avatar Apr 28 '23 23:04 quq99

Hi quq99,

There's nothing special about the dialog submix; it should work on a single machine, unless we missed something...

lehougoogle avatar Apr 29 '23 16:04 lehougoogle

Hi @lehougoogle, thanks so much for the reply. Could you give me more info about how much memory is needed to run this on a single machine? I noticed that in https://github.com/google-research/FLAN/issues/44 someone mentioned they could run it on a machine with 300 GB of memory.

Another question: how long does it typically take to fetch all five categories (cot, t0, dialog, flan, ...)? Thanks :)

quq99 avatar Apr 30 '23 06:04 quq99

Hi @lehougoogle, some more context: when I run the script for "dialog_submix" I hit this error:

python: malloc.c:4615: _int_realloc: Assertion `ncopies >= 3' failed.

I thought it was an out-of-memory issue, but when I switched to another machine (500 GB of memory) I still saw the error, and running free -g showed it was only using around 70 GB, so I assume it is not a memory issue. Have you seen this before? Any thoughts would be helpful, thanks a lot!

quq99 avatar May 03 '23 23:05 quq99

@quq99 it is quite memory intensive. We ran it a while ago internally on Google infrastructure, so I don't have specific numbers unfortunately, but in terms of compute it should roughly follow this order (least to most): cot, dialog, niv2, flan, t0, with fsopt using much more than zsopt.

If your compute is constrained, you can make it more efficient by splitting the task configs into smaller submixtures (e.g. splitting t0 into 10), running them separately, and then joining the outputs at the end (rough sketch below). I hope this helps!
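
Very roughly, the splitting could look something like this with seqio (the mixture name, part names, and chunk size here are purely illustrative, not something that exists in the repo):

import seqio

# Look up the full mixture and its leaf tasks.
full_mixture = seqio.get_mixture_or_task("t0_submix")
tasks = list(full_mixture.tasks)

# Register ~10 smaller sub-mixtures, keeping each task's original mixing rate.
num_parts = 10
chunk_size = max(1, len(tasks) // num_parts)
for part, start in enumerate(range(0, len(tasks), chunk_size)):
    seqio.MixtureRegistry.add(
        f"t0_submix_part_{part}",
        [(t.name, full_mixture.get_rate(t)) for t in tasks[start:start + chunk_size]],
    )

# Each part can then be materialized on its own and the outputs joined later, e.g.:
# ds = seqio.get_mixture_or_task("t0_submix_part_0").get_dataset(
#     sequence_length={"inputs": 4096, "targets": 4096}, split="train")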

shayne-longpre avatar May 08 '23 19:05 shayne-longpre

You can also now manually download the Dialog submixture (and the others) -- see the new README! :)

shayne-longpre avatar May 25 '23 14:05 shayne-longpre