fast-data-dev

How to persist data

Open jreBoAG opened this issue 7 years ago • 10 comments

Hey there,

we're currently using your project for development. Is there an easy way (since we're new to Kafka as well as docker) to persist our topics as well as the connectors?

jreBoAG avatar Sep 12 '17 09:09 jreBoAG

@jreBoAG Kafka Connect stores all its configuration, offsets, and statuses in Kafka itself, in a set of system topics: connect-offsets, connect-status, and connect-configs. These topic names are set in the connect-avro-distributed.properties file. In Docker you will lose your data in Kafka unless you mount a volume or point Connect to another Kafka cluster that has persistence.
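As a quick way to verify which system topics your Connect worker uses, you can grep the worker properties inside the container. This is a sketch, not part of the original answer: the container name `fdd` and the Confluent path `/opt/confluent-3.3.0` are assumptions (the path is mentioned later in this thread and will differ between image versions).

```shell
# Assumes a running fast-data-dev container named "fdd" and Confluent 3.3.0;
# adjust the path for your image version.
docker exec -it fdd grep -E '(offset|config|status)\.storage\.topic' \
  /opt/confluent-3.3.0/etc/schema-registry/connect-avro-distributed.properties
```

If Connect is pointed at a persistent Kafka cluster, those three topics are all it needs to recover its connectors, their configs, and their progress after a restart.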

andrewstevenson avatar Sep 12 '17 09:09 andrewstevenson

Hello, although we do not explicitly use docker volumes, there are two ways to persist data.

The first is to persist your docker container. For example, you could start fast-data-dev like this:

docker run -it -p 3030:3030 --name mykafka landoop/fast-data-dev

Once you finish working, press CTRL+C to stop the container. The container isn't deleted, just stopped. You can start it once again via:

docker start -ai mykafka

You could also run the container in the background by replacing the -it switch with -d. In that case you would stop it with docker stop.
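The lifecycle described above can be sketched as follows (a minimal example using the detached -d variant; the container name mykafka is the one from the earlier command):

```shell
# Start a named container in the background (detached).
docker run -d -p 3030:3030 --name mykafka landoop/fast-data-dev

# Stop it; the container and its filesystem are kept, only the processes stop.
docker stop mykafka

# Start the same container again, resuming with the same data.
docker start mykafka

# Only an explicit rm (or running with --rm) actually deletes the container.
docker rm mykafka
```

The key point is that stopping a container does not delete it; data is lost only when the container itself is removed.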

The second option, if you want your data to persist across containers, is to use an external directory to store the Kafka and ZooKeeper files. We store them under /tmp, so you would need to mount a volume at this path:

docker run --rm -it -v /path/to/local/directory:/tmp landoop/fast-data-dev

Now if you stop (and remove) this container and start a new one providing the same volume, it should pick up where the previous one left off. One catch is that the volume (/path/to/local/directory in the example) must be writable by all (chmod 0777), since Kafka and ZooKeeper run as user nobody.
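Putting the volume approach together, a full sketch might look like this (~/fdd-data is an arbitrary example path, not anything mandated by the image):

```shell
# Prepare a host directory that the container's "nobody" user can write to.
mkdir -p ~/fdd-data
chmod 0777 ~/fdd-data

# First run: Kafka and ZooKeeper write their state under the mounted /tmp.
docker run --rm -it -p 3030:3030 -v ~/fdd-data:/tmp landoop/fast-data-dev

# After stopping and removing that container, a fresh one
# started with the same mount resumes from the saved state.
docker run --rm -it -p 3030:3030 -v ~/fdd-data:/tmp landoop/fast-data-dev
```
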

andmarios avatar Sep 12 '17 17:09 andmarios

@andmarios, I am running a container with docker run --rm -it
-p 2181:2181 -p 3030:3030 -p 8081:8081
-p 8082:8082 -p 8083:8083 -p 9092:9092
-e ADV_HOST=127.0.0.1
landoop/fast-data-dev

While the container is running, I successfully produce and consume several messages to/from a topic that I've created. You said that data from Kafka and ZooKeeper are stored in /tmp. Could you give us the complete path under /tmp? I haven't found any subdirectories or files under the /tmp directory inside the docker container:

docker run --rm -it --net=host landoop/fast-data-dev bash

root@fast-data-dev / $ cd tmp
root@fast-data-dev tmp $ ls
root@fast-data-dev tmp $ pwd
/tmp
root@fast-data-dev tmp $

eldontc avatar Oct 28 '17 20:10 eldontc

The directories are created by the Kafka broker and ZooKeeper on startup. The way you ran the image (with bash as the command), you skipped starting these services.

The broker stores its data under /tmp/kafka-logs. Zookeeper stores its data under /tmp/zookeeper.

Try going inside a normally running container to see them. E.g., start fast-data-dev:

docker run --rm -it --net=host --name=fdd landoop/fast-data-dev

Then from a second terminal:

docker exec -it fdd bash

andmarios avatar Oct 28 '17 23:10 andmarios

Thank you for your reply @andmarios. But, sorry, I think I didn't explain my question in enough detail. I will try again. I did what you suggested, but it didn't work for me.

I will explain step by step what I've been doing:

  1. Run a landoop/fast-data-dev docker container (after the execution, I didn't kill this terminal tab): docker run --rm -it -p 2181:2181 -p 3030:3030 -p 8081:8081 -p 8082:8082 -p 8083:8083 -p 9092:9092 -e ADV_HOST=127.0.0.1 landoop/fast-data-dev

  2. In another terminal tab, I executed (after the execution, I didn't kill this terminal tab too): docker run --rm -it --net=host landoop/fast-data-dev bash Inside the container at root@fast-data-dev /, I created a topic and produced some messages to it with success.

  3. At this moment, in another terminal tab, I executed: docker run --rm -it --net=host landoop/fast-data-dev bash Inside the container at root@fast-data-dev /, I did a ls command in /tmp. This directory was empty. I expected to see kafka-logs and zookeeper directories.

I checked the configuration in /opt/confluent-3.3.0/etc/kafka: in server.properties I saw log.dirs=/tmp/kafka-logs, and in zookeeper.properties I saw dataDir=/tmp/zookeeper.

eldontc avatar Oct 29 '17 17:10 eldontc

You have to familiarize yourself with Docker a bit more. Every time you do docker run you create a new container; think of it as a new VM. If you run Kafka on one VM, you wouldn't expect to see its data on another VM, right?

The proper way to run your example would be:

  1. At this stage you indeed have to create a new container that runs Kafka. Please notice the --name=fdd parameter: docker run --rm -it -p 2181:2181 -p 3030:3030 -p 8081:8081 -p 8082:8082 -p 8083:8083 -p 9092:9092 -e ADV_HOST=127.0.0.1 --name=fdd landoop/fast-data-dev
  2. Now you don't have to create a new container, you can go into the one running Kafka: docker exec -it fdd bash
  3. Same as before, you need to connect to the container running Kafka: docker exec -it fdd ls /tmp
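The three steps above can be condensed into one worked session (requires a running Docker daemon; the container name fdd matches the commands above):

```shell
# Terminal 1: start Kafka and friends in a single named container.
docker run --rm -it -p 2181:2181 -p 3030:3030 -p 8081:8081 \
  -p 8082:8082 -p 8083:8083 -p 9092:9092 \
  -e ADV_HOST=127.0.0.1 --name=fdd landoop/fast-data-dev

# Terminal 2: attach to the SAME container instead of creating a new one.
docker exec -it fdd bash

# Once the broker and ZooKeeper have started, their data
# directories exist inside that container:
docker exec -it fdd ls /tmp
# expected to include: kafka-logs  zookeeper
```

The difference from the earlier attempt is docker exec versus docker run: exec runs a command inside an existing container, run always creates a fresh one with its own empty /tmp.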

Hope this helps!

andmarios avatar Oct 29 '17 18:10 andmarios

@andmarios, sorry about that. I'm a newbie in Docker and now I realize what you were trying to tell me before. Of course, when I execute docker run, I actually create another container.

eldontc avatar Oct 30 '17 02:10 eldontc

No worries, we all passed through this stage (and are still learning and making mistakes). :)

andmarios avatar Oct 30 '17 13:10 andmarios

Hey @andmarios, does your answer from 2017-09-12 still hold true? I see that the Dockerfile currently declares a volume at the /data folder. If I fill up some topics and want to back up this state as a starting point, so that I can revert to it later, what should I do? I tried mounting it via -v /path/to/my/local/folder:/data, but it didn't work.

AlexVPopov avatar Jun 28 '19 12:06 AlexVPopov

Now the data directory is "/data" (it moved from /tmp in newer images).
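For newer images where the state lives under /data, the same volume approach as earlier in the thread should apply, with the mount point changed. This is an untested sketch; the exact directory layout may differ between image versions, and ~/fdd-data is just an example path:

```shell
# Host directory must be writable by the container's unprivileged user.
mkdir -p ~/fdd-data
chmod 0777 ~/fdd-data

# Mount it over /data instead of /tmp.
docker run --rm -it -p 3030:3030 -v ~/fdd-data:/data landoop/fast-data-dev
```
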

muyu66 avatar Jul 28 '21 08:07 muyu66