bdutil
How to configure defaultFS for Hadoop on a single-node/YARN setup
Hi - I have built a GCE cluster using ./bdutil deploy --bucket anintelclustergen1-m-disk -n 2 -P anintelcluster -e extensions/spark/spark_on_yarn_env.sh.
In the bucket parameters, both on the command line and in bdutil_env.sh, I have specified a non-boot bucket. In core-site.xml (under hadoop/etc) on the master, the XML shows the correct bucket value under defaultFS. However, the Hadoop console (port 50070) does not show the non-boot bucket attached; it shows the boot disk attached to the NameNode.
Node: anintelcluster.c.anintelcluster.internal:50010 (10.240.0.2:50010)
Last contact: 0
Admin State: In Service
Capacity: 98.4 GB
Used: 28 KB
Non DFS Used: 6.49 GB
Remaining: 91.91 GB
Blocks: 0
Block pool used: 28 KB (0%)
Failed Volumes: 0
Version: 2.7.1
Is it possible to specify a non-boot bucket with the single-node setup? If not, what needs to be done to specify a non-boot disk that both gets attached to the instance as read/write and is also used by Hadoop for storage?
So, the GCS connector actually isn't able to be mounted as a local filesystem; it simply plugs into Hadoop at Hadoop's FileSystem.java layer. This means it gets used as the FileSystem for Hadoop jobs, but it doesn't change the way the local filesystem uses a real disk as a block device.
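A minimal way to see that from the master is to ask the running configuration which filesystem is the default and which class handles the gs:// scheme. The property names below are the standard Hadoop/GCS-connector ones; the expected values are assumptions based on the bucket in your deploy command:

# Run on the master; these only read the loaded Hadoop configuration.
hdfs getconf -confKey fs.defaultFS
# expected (assuming your bucket setting took effect): gs://anintelclustergen1-m-disk
hdfs getconf -confKey fs.gs.impl
# expected: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem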
The GCS connector also lives independently alongside Hadoop's HDFS. So when you're looking at port 50070, you're seeing the actual HDFS setup, which writes blocks out to the local disk and not to GCS, and which would be accessible to Hadoop jobs as an "hdfs:///" path. In general, if you've configured defaultFS to use a GCS path, you can just ignore whatever the NameNode on 50070 is reporting, since in that case your typical Hadoop jobs simply won't interact with the HDFS setup at all.
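As a rough sketch of that split (the bucket name is the one from your deploy command; the NameNode host and port 8020 are placeholders to substitute for your master):

# Listings through the default filesystem go to the GCS bucket:
hadoop fs -ls /
hadoop fs -ls gs://anintelclustergen1-m-disk/
# The HDFS instance that the 50070 UI reports on still exists alongside it,
# but jobs only touch it if they address it explicitly:
hadoop fs -ls hdfs://<master-hostname>:8020/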