fluid icon indicating copy to clipboard operation
fluid copied to clipboard

[BUG] Read file slow with alluxioruntime

Open robin1900 opened this issue 4 years ago • 6 comments

What is your environment(Kubernetes version, Fluid version, etc.)

Fluid 0.60 alluxioruntime , use lustre as backend storage

Describe the bug

run in pod with dataset bound

time cp alluxio-2.6.2-bin.tar.gz /dev/null

real 1m36.357s user 0m0.014s sys 0m1.047s

the file is about 1.8GB , why read file is so slow for the first time


apiVersion: data.fluid.io/v1alpha1 kind: Dataset metadata: name: lustre spec: mounts:

  • mountPoint: local:///mnt/lustre name: lustre accessModes:
  • ReadWriteMany

apiVersion: data.fluid.io/v1alpha1 kind: AlluxioRuntime metadata: name: lustre spec:

Add fields here

replicas: 3 data: replicas: 2

tieredstore: levels: - mediumtype: MEM path: /dev/shm quota: 30Gi high: "0.9" low: "0.6" - mediumtype: SSD path: /data/alluxio quota: 150Gi high: "0.9" low: "0.7"

properties: # jni-fuse related configurations alluxio.fuse.jnifuse.enabled: "true" alluxio.user.client.cache.enabled: "false" alluxio.user.client.cache.store.type: MEMORY alluxio.user.client.cache.dir: /alluxio/ram alluxio.user.client.cache.page.size: 2MB alluxio.user.client.cache.size: 1800MB # alluxio master alluxio.master.metastore: ROCKS alluxio.master.metastore.inode.cache.max.size: "10000000" alluxio.master.journal.log.size.bytes.max: 500MB alluxio.master.metadata.sync.concurrency.level: "128" alluxio.master.metadata.sync.executor.pool.size: "128" alluxio.master.metadata.sync.ufs.prefetch.pool.size: "128" # alluxio configurations alluxio.user.block.worker.client.pool.min: "512" alluxio.fuse.debug.enabled: "false" alluxio.web.ui.enabled: "false" alluxio.user.file.writetype.default: THROUGH alluxio.user.ufs.block.read.location.policy: alluxio.client.block.policy.LocalFirstPolicy alluxio.user.block.write.location.policy.class: alluxio.client.block.policy.LocalFirstAvoidEvictionPolicy alluxio.worker.allocator.class: alluxio.worker.block.allocator.GreedyAllocator alluxio.user.block.size.bytes.default: 16MB alluxio.user.streaming.reader.chunk.size.bytes: 32MB alluxio.user.local.reader.chunk.size.bytes: 32MB alluxio.worker.network.reader.buffer.size: 32MB alluxio.worker.file.buffer.size: 320MB alluxio.user.metrics.collection.enabled: "false" alluxio.master.rpc.executor.max.pool.size: "1024" alluxio.master.rpc.executor.core.pool.size: "128" #alluxio.master.mount.table.root.readonly: "true" alluxio.user.update.file.accesstime.disabled: "true" alluxio.user.file.passive.cache.enabled: "false" alluxio.user.block.avoid.eviction.policy.reserved.size.bytes: 2GB alluxio.master.journal.folder: /journal alluxio.master.journal.type: UFS alluxio.user.block.master.client.pool.gc.threshold: 2day alluxio.user.file.master.client.threads: "1024" alluxio.user.block.master.client.threads: "1024" alluxio.user.file.readtype.default: CACHE alluxio.security.stale.channel.purge.interval: 365d alluxio.user.metadata.cache.enabled: "true" alluxio.user.metadata.cache.expiration.time: 2day alluxio.user.metadata.cache.max.size: "1000000" alluxio.user.direct.memory.io.enabled: "true" alluxio.fuse.cached.paths.max: "1000000" alluxio.job.worker.threadpool.size: "164" alluxio.user.worker.list.refresh.interval: 2min alluxio.user.logging.threshold: 1000ms alluxio.fuse.logging.threshold: 1000ms alluxio.worker.block.master.client.pool.size: "1024" master: jvmOptions: - "-Xmx6G" - "-XX:+UnlockExperimentalVMOptions" - "-XX:ActiveProcessorCount=8" worker: jvmOptions: - "-Xmx12G" - "-XX:+UnlockExperimentalVMOptions" - "-XX:MaxDirectMemorySize=32g" - "-XX:ActiveProcessorCount=8" resources: limits: cpu: 8 fuse: env: MAX_IDLE_THREADS: "32" jvmOptions: - "-Xmx16G" - "-Xms16G" - "-XX:+UseG1GC" - "-XX:MaxDirectMemorySize=32g" - "-XX:+UnlockExperimentalVMOptions" - "-XX:ActiveProcessorCount=24" resources: limits: cpu: 16 args: - fuse - --fuse-opts=kernel_cache,rw,max_read=131072,attr_timeout=7200,entry_timeout=7200,nonempty

robin1900 avatar Nov 17 '21 02:11 robin1900

Do you have tried to copy this file from lustre to local file system? How many time does it take?

cheyang avatar Nov 17 '21 03:11 cheyang

Do you have tried to copy this file from lustre to local file system? How many time does it take?

yes . only take 2 seconds

robin1900 avatar Nov 17 '21 03:11 robin1900

You can check throughput with https://docs.alluxio.io/os/user/stable/en/operation/Admin-CLI.html#metrics. Are they using the same machine to copy file, and did you clear page cache. Actually , If it took 2 seconds to copy 1.8 GiB file, it means nearly 1 GiB/second. Can you confirm this?

cheyang avatar Nov 17 '21 03:11 cheyang

You can check throughput with https://docs.alluxio.io/os/user/stable/en/operation/Admin-CLI.html#metrics. Are they using the same machine to copy file, and did you clear page cache. Actually , If it took 2 seconds to copy 1.8 GiB file, it means nearly 1 GiB/second. Can you confirm this?

yes, I cleaned zhe page cache. it takes 4 seconds from lustre. the lustre is small cluster with 3 nodes , and base on ssd. so it is fast

robin1900 avatar Nov 17 '21 03:11 robin1900

You can check throughput with https://docs.alluxio.io/os/user/stable/en/operation/Admin-CLI.html#metrics. Are they using the same machine to copy file, and did you clear page cache. Actually , If it took 2 seconds to copy 1.8 GiB file, it means nearly 1 GiB/second. Can you confirm this?

I got the metrics for copy the file(before copy I cleared metrics already):

Cluster.BytesReadDirect (Type: COUNTER, Value: 0B) Cluster.BytesReadDirectThroughput (Type: GAUGE, Value: 0B/MIN) Cluster.BytesReadDomain (Type: COUNTER, Value: 0B) Cluster.BytesReadDomainThroughput (Type: GAUGE, Value: 0B/MIN) Cluster.BytesReadLocal (Type: COUNTER, Value: 0B) Cluster.BytesReadLocalThroughput (Type: GAUGE, Value: 0B/MIN) Cluster.BytesReadRemote (Type: COUNTER, Value: 84.17GB) Cluster.BytesReadRemoteThroughput (Type: GAUGE, Value: 42.09GB/MIN) Cluster.BytesReadUfsAll (Type: COUNTER, Value: 0B) Cluster.BytesReadUfsThroughput (Type: GAUGE, Value: 0B/MIN) Cluster.BytesWrittenDomain (Type: COUNTER, Value: 0B) Cluster.BytesWrittenDomainThroughput (Type: GAUGE, Value: 0B/MIN) Cluster.BytesWrittenLocal (Type: COUNTER, Value: 0B) Cluster.BytesWrittenLocalThroughput (Type: GAUGE, Value: 0B/MIN) Cluster.BytesWrittenRemote (Type: COUNTER, Value: 0B) Cluster.BytesWrittenRemoteThroughput (Type: GAUGE, Value: 0B/MIN) Cluster.BytesWrittenUfsAll (Type: COUNTER, Value: 0B) Cluster.BytesWrittenUfsThroughput (Type: GAUGE, Value: 0B/MIN) Cluster.CapacityFree (Type: GAUGE, Value: 571,518,496,412) Cluster.CapacityFreeTierHDD (Type: GAUGE, Value: 0) Cluster.CapacityFreeTierMEM (Type: GAUGE, Value: 91,555,901,084) Cluster.CapacityFreeTierSSD (Type: GAUGE, Value: 479,962,595,328) Cluster.CapacityTotal (Type: GAUGE, Value: 579,820,584,960) Cluster.CapacityTotalTierHDD (Type: GAUGE, Value: 0) Cluster.CapacityTotalTierMEM (Type: GAUGE, Value: 96,636,764,160) Cluster.CapacityTotalTierSSD (Type: GAUGE, Value: 483,183,820,800) Cluster.CapacityUsed (Type: GAUGE, Value: 8,302,088,548) Cluster.CapacityUsedTierHDD (Type: GAUGE, Value: 0) Cluster.CapacityUsedTierMEM (Type: GAUGE, Value: 5,080,863,076) Cluster.CapacityUsedTierSSD (Type: GAUGE, Value: 3,221,225,472) Cluster.RootUfsCapacityFree (Type: GAUGE, Value: 193,345,646,592) Cluster.RootUfsCapacityTotal (Type: GAUGE, Value: 211,243,999,232) Cluster.RootUfsCapacityUsed (Type: GAUGE, Value: 17,898,352,640) Cluster.Workers (Type: GAUGE, Value: 3) Master.AbsentCacheHits (Type: COUNTER, Value: 0) Master.AbsentCacheInvalidations (Type: COUNTER, Value: 0) Master.AbsentCacheMisses (Type: COUNTER, Value: 0) Master.AbsentCacheSize (Type: GAUGE, Value: 0) Master.CompleteFileOps (Type: COUNTER, Value: 0) Master.ConnectFromMaster.UFS:%2Fjournal%2FMetricsMaster.UFS_TYPE:local (Type: TIMER, Value: 0) Master.Create.UFS:%2Fjournal%2FMetricsMaster.UFS_TYPE:local (Type: TIMER, Value: 0) Master.CreateDirectoryOps (Type: COUNTER, Value: 0) Master.CreateFileOps (Type: COUNTER, Value: 0) Master.DeletePathOps (Type: COUNTER, Value: 0) Master.DirectoriesCreated (Type: COUNTER, Value: 0) Master.EdgeCacheEvictions (Type: COUNTER, Value: 0) Master.EdgeCacheHits (Type: COUNTER, Value: 2,136,970) Master.EdgeCacheLoadTimes (Type: COUNTER, Value: 0) Master.EdgeCacheMisses (Type: COUNTER, Value: 0) Master.EdgeCacheSize (Type: GAUGE, Value: 123,438) Master.EdgeLockPoolSize (Type: GAUGE, Value: 123,439) Master.FileBlockInfosGot (Type: COUNTER, Value: 0) Master.FileInfosGot (Type: COUNTER, Value: 15) Master.FilesCompleted (Type: COUNTER, Value: 0) Master.FilesCreated (Type: COUNTER, Value: 0) Master.FilesFreed (Type: COUNTER, Value: 0) Master.FilesPersisted (Type: COUNTER, Value: 0) Master.FilesPinned (Type: GAUGE, Value: 0) Master.FreeFileOps (Type: COUNTER, Value: 0) Master.GetAcl.UFS:%2Fjournal%2FMetricsMaster.UFS_TYPE:local (Type: TIMER, Value: 0) Master.GetAcl.User:root.UFS:%2Fjournal%2FMetricsMaster.UFS_TYPE:local (Type: TIMER, Value: 0) Master.GetBlockMasterInfo.User:root (Type: TIMER, Value: 88) Master.GetFileBlockInfoOps (Type: COUNTER, Value: 0) Master.GetFileInfoOps (Type: COUNTER, Value: 4) Master.GetFileLocations.User:root.UFS:%2Fjournal%2FMetricsMaster.UFS_TYPE:local (Type: TIMER, Value: 0) Master.GetNewBlockOps (Type: COUNTER, Value: 0) Master.GetSpace.UFS:%2Fjournal%2FMetricsMaster.UFS_TYPE:local (Type: TIMER, Value: 0) Master.GetSpace.User:root.UFS:%2Fjournal%2FMetricsMaster.UFS_TYPE:local (Type: TIMER, Value: 9) Master.GetStatus.User:root (Type: TIMER, Value: 2) Master.GetStatus.User:root.UFS:%2Fjournal%2FMetricsMaster.UFS_TYPE:local (Type: TIMER, Value: 0) Master.GetWorkerInfoList.User:root (Type: TIMER, Value: 2) Master.InodeCacheEvictions (Type: COUNTER, Value: 0) Master.InodeCacheHits (Type: COUNTER, Value: 13,135,456) Master.InodeCacheLoadTimes (Type: COUNTER, Value: 0) Master.InodeCacheMisses (Type: COUNTER, Value: 0) Master.InodeCacheSize (Type: GAUGE, Value: 123,439) Master.InodeLockPoolSize (Type: GAUGE, Value: 123,439) Master.JournalFlushTimer (Type: TIMER, Value: 0) Master.LastBackupEntriesCount (Type: GAUGE, Value: -1) Master.LastBackupRestoreCount (Type: GAUGE, Value: -1) Master.LastBackupRestoreTimeMs (Type: GAUGE, Value: -1) Master.LastBackupTimeMs (Type: GAUGE, Value: -1) Master.ListStatus.UFS:%2Fjournal%2FMetricsMaster.UFS_TYPE:local (Type: TIMER, Value: 15) Master.ListStatus.User:root (Type: TIMER, Value: 2) Master.ListStatus.User:root.UFS:%2Fjournal%2FMetricsMaster.UFS_TYPE:local (Type: TIMER, Value: 0) Master.ListingCacheEvictions (Type: COUNTER, Value: 0) Master.ListingCacheHits (Type: COUNTER, Value: 2) Master.ListingCacheLoadTimes (Type: COUNTER, Value: 0) Master.ListingCacheMisses (Type: COUNTER, Value: 0) Master.ListingCacheSize (Type: GAUGE, Value: 124,005) Master.MountOps (Type: COUNTER, Value: 0) Master.NewBlocksGot (Type: COUNTER, Value: 0) Master.PathsDeleted (Type: COUNTER, Value: 0) Master.PathsMounted (Type: COUNTER, Value: 0) Master.PathsRenamed (Type: COUNTER, Value: 0) Master.PathsUnmounted (Type: COUNTER, Value: 0) Master.PerUfsOpConnectFromMaster.UFS:%2Fjournal%2FMetricsMaster (Type: GAUGE, Value: 0) Master.PerUfsOpCreate.UFS:%2Fjournal%2FMetricsMaster (Type: GAUGE, Value: 0) Master.PerUfsOpGetFileLocations.UFS:%2Fjournal%2FMetricsMaster (Type: GAUGE, Value: 0) Master.PerUfsOpGetSpace.UFS:%2Fjournal%2FMetricsMaster (Type: GAUGE, Value: 9) Master.PerUfsOpGetStatus.UFS:%2Fjournal%2FMetricsMaster (Type: GAUGE, Value: 0) Master.PerUfsOpListStatus.UFS:%2Fjournal%2FMetricsMaster (Type: GAUGE, Value: 15) Master.PerUfsSavedOpGET_FILE_INFO.UFS:%2FunderFSStorage (Type: COUNTER, Value: 13) Master.PerUfsSavedOpLIST_STATUS.UFS:%2FunderFSStorage (Type: COUNTER, Value: 2) Master.RenamePathOps (Type: COUNTER, Value: 0) Master.SetAclOps (Type: COUNTER, Value: 0) Master.SetAttributeOps (Type: COUNTER, Value: 0) Master.TotalPaths (Type: GAUGE, Value: 123,439) Master.UfsJournalInitialReplayTimeMs (Type: GAUGE, Value: -1) Master.UfsSessionCount-Ufs:%2Fjournal%2FMetricsMaster (Type: COUNTER, Value: 0) Master.UfsSessionCount-Ufs:%2FunderFSStorage (Type: COUNTER, Value: 0) Master.UnmountOps (Type: COUNTER, Value: 0) Master.blockHeartbeat.User:root (Type: TIMER, Value: 468) Master.clearMetrics.User:root (Type: TIMER, Value: 0) Master.close.UFS:%2Fjournal%2FMetricsMaster.UFS_TYPE:local (Type: TIMER, Value: 0) Master.commitBlock.User:root (Type: TIMER, Value: 111) Master.getConfigHash (Type: TIMER, Value: 12) Master.getConfigHash.User:root (Type: TIMER, Value: 33) Master.getConfiguration (Type: TIMER, Value: 2) Master.getConfiguration.User:root (Type: TIMER, Value: 103) Master.getMasterInfo.User:root (Type: TIMER, Value: 88) Master.getMetrics.User:root (Type: TIMER, Value: 3) Master.getPinnedFileIds.User:root (Type: TIMER, Value: 468) Master.getUfsInfo.User:root (Type: TIMER, Value: 1) Master.getWorkerId.User:root (Type: TIMER, Value: 0) Master.registerWorker.User:root (Type: TIMER, Value: 0)

robin1900 avatar Nov 17 '21 04:11 robin1900

I suggest you to repeat the experiment and check Cluster.BytesReadUfsThroughput (Type: GAUGE, Value: 0B/MIN). Thanks. @apc999 Any sugggestions.

cheyang avatar Nov 17 '21 05:11 cheyang