[BUG] Read file slow with alluxioruntime
What is your environment(Kubernetes version, Fluid version, etc.)
Fluid 0.60 alluxioruntime , use lustre as backend storage
Describe the bug
run in pod with dataset bound
time cp alluxio-2.6.2-bin.tar.gz /dev/null
real 1m36.357s user 0m0.014s sys 0m1.047s
the file is about 1.8GB , why read file is so slow for the first time
apiVersion: data.fluid.io/v1alpha1 kind: Dataset metadata: name: lustre spec: mounts:
- mountPoint: local:///mnt/lustre name: lustre accessModes:
- ReadWriteMany
apiVersion: data.fluid.io/v1alpha1 kind: AlluxioRuntime metadata: name: lustre spec:
Add fields here
replicas: 3 data: replicas: 2
tieredstore: levels: - mediumtype: MEM path: /dev/shm quota: 30Gi high: "0.9" low: "0.6" - mediumtype: SSD path: /data/alluxio quota: 150Gi high: "0.9" low: "0.7"
properties: # jni-fuse related configurations alluxio.fuse.jnifuse.enabled: "true" alluxio.user.client.cache.enabled: "false" alluxio.user.client.cache.store.type: MEMORY alluxio.user.client.cache.dir: /alluxio/ram alluxio.user.client.cache.page.size: 2MB alluxio.user.client.cache.size: 1800MB # alluxio master alluxio.master.metastore: ROCKS alluxio.master.metastore.inode.cache.max.size: "10000000" alluxio.master.journal.log.size.bytes.max: 500MB alluxio.master.metadata.sync.concurrency.level: "128" alluxio.master.metadata.sync.executor.pool.size: "128" alluxio.master.metadata.sync.ufs.prefetch.pool.size: "128" # alluxio configurations alluxio.user.block.worker.client.pool.min: "512" alluxio.fuse.debug.enabled: "false" alluxio.web.ui.enabled: "false" alluxio.user.file.writetype.default: THROUGH alluxio.user.ufs.block.read.location.policy: alluxio.client.block.policy.LocalFirstPolicy alluxio.user.block.write.location.policy.class: alluxio.client.block.policy.LocalFirstAvoidEvictionPolicy alluxio.worker.allocator.class: alluxio.worker.block.allocator.GreedyAllocator alluxio.user.block.size.bytes.default: 16MB alluxio.user.streaming.reader.chunk.size.bytes: 32MB alluxio.user.local.reader.chunk.size.bytes: 32MB alluxio.worker.network.reader.buffer.size: 32MB alluxio.worker.file.buffer.size: 320MB alluxio.user.metrics.collection.enabled: "false" alluxio.master.rpc.executor.max.pool.size: "1024" alluxio.master.rpc.executor.core.pool.size: "128" #alluxio.master.mount.table.root.readonly: "true" alluxio.user.update.file.accesstime.disabled: "true" alluxio.user.file.passive.cache.enabled: "false" alluxio.user.block.avoid.eviction.policy.reserved.size.bytes: 2GB alluxio.master.journal.folder: /journal alluxio.master.journal.type: UFS alluxio.user.block.master.client.pool.gc.threshold: 2day alluxio.user.file.master.client.threads: "1024" alluxio.user.block.master.client.threads: "1024" alluxio.user.file.readtype.default: CACHE alluxio.security.stale.channel.purge.interval: 365d alluxio.user.metadata.cache.enabled: "true" alluxio.user.metadata.cache.expiration.time: 2day alluxio.user.metadata.cache.max.size: "1000000" alluxio.user.direct.memory.io.enabled: "true" alluxio.fuse.cached.paths.max: "1000000" alluxio.job.worker.threadpool.size: "164" alluxio.user.worker.list.refresh.interval: 2min alluxio.user.logging.threshold: 1000ms alluxio.fuse.logging.threshold: 1000ms alluxio.worker.block.master.client.pool.size: "1024" master: jvmOptions: - "-Xmx6G" - "-XX:+UnlockExperimentalVMOptions" - "-XX:ActiveProcessorCount=8" worker: jvmOptions: - "-Xmx12G" - "-XX:+UnlockExperimentalVMOptions" - "-XX:MaxDirectMemorySize=32g" - "-XX:ActiveProcessorCount=8" resources: limits: cpu: 8 fuse: env: MAX_IDLE_THREADS: "32" jvmOptions: - "-Xmx16G" - "-Xms16G" - "-XX:+UseG1GC" - "-XX:MaxDirectMemorySize=32g" - "-XX:+UnlockExperimentalVMOptions" - "-XX:ActiveProcessorCount=24" resources: limits: cpu: 16 args: - fuse - --fuse-opts=kernel_cache,rw,max_read=131072,attr_timeout=7200,entry_timeout=7200,nonempty
Do you have tried to copy this file from lustre to local file system? How many time does it take?
Do you have tried to copy this file from lustre to local file system? How many time does it take?
yes . only take 2 seconds
You can check throughput with https://docs.alluxio.io/os/user/stable/en/operation/Admin-CLI.html#metrics. Are they using the same machine to copy file, and did you clear page cache. Actually , If it took 2 seconds to copy 1.8 GiB file, it means nearly 1 GiB/second. Can you confirm this?
You can check throughput with
https://docs.alluxio.io/os/user/stable/en/operation/Admin-CLI.html#metrics. Are they using the same machine to copy file, and did you clear page cache. Actually , If it took 2 seconds to copy 1.8 GiB file, it means nearly 1 GiB/second. Can you confirm this?
yes, I cleaned zhe page cache. it takes 4 seconds from lustre. the lustre is small cluster with 3 nodes , and base on ssd. so it is fast
You can check throughput with
https://docs.alluxio.io/os/user/stable/en/operation/Admin-CLI.html#metrics. Are they using the same machine to copy file, and did you clear page cache. Actually , If it took 2 seconds to copy 1.8 GiB file, it means nearly 1 GiB/second. Can you confirm this?
I got the metrics for copy the file(before copy I cleared metrics already):
Cluster.BytesReadDirect (Type: COUNTER, Value: 0B) Cluster.BytesReadDirectThroughput (Type: GAUGE, Value: 0B/MIN) Cluster.BytesReadDomain (Type: COUNTER, Value: 0B) Cluster.BytesReadDomainThroughput (Type: GAUGE, Value: 0B/MIN) Cluster.BytesReadLocal (Type: COUNTER, Value: 0B) Cluster.BytesReadLocalThroughput (Type: GAUGE, Value: 0B/MIN) Cluster.BytesReadRemote (Type: COUNTER, Value: 84.17GB) Cluster.BytesReadRemoteThroughput (Type: GAUGE, Value: 42.09GB/MIN) Cluster.BytesReadUfsAll (Type: COUNTER, Value: 0B) Cluster.BytesReadUfsThroughput (Type: GAUGE, Value: 0B/MIN) Cluster.BytesWrittenDomain (Type: COUNTER, Value: 0B) Cluster.BytesWrittenDomainThroughput (Type: GAUGE, Value: 0B/MIN) Cluster.BytesWrittenLocal (Type: COUNTER, Value: 0B) Cluster.BytesWrittenLocalThroughput (Type: GAUGE, Value: 0B/MIN) Cluster.BytesWrittenRemote (Type: COUNTER, Value: 0B) Cluster.BytesWrittenRemoteThroughput (Type: GAUGE, Value: 0B/MIN) Cluster.BytesWrittenUfsAll (Type: COUNTER, Value: 0B) Cluster.BytesWrittenUfsThroughput (Type: GAUGE, Value: 0B/MIN) Cluster.CapacityFree (Type: GAUGE, Value: 571,518,496,412) Cluster.CapacityFreeTierHDD (Type: GAUGE, Value: 0) Cluster.CapacityFreeTierMEM (Type: GAUGE, Value: 91,555,901,084) Cluster.CapacityFreeTierSSD (Type: GAUGE, Value: 479,962,595,328) Cluster.CapacityTotal (Type: GAUGE, Value: 579,820,584,960) Cluster.CapacityTotalTierHDD (Type: GAUGE, Value: 0) Cluster.CapacityTotalTierMEM (Type: GAUGE, Value: 96,636,764,160) Cluster.CapacityTotalTierSSD (Type: GAUGE, Value: 483,183,820,800) Cluster.CapacityUsed (Type: GAUGE, Value: 8,302,088,548) Cluster.CapacityUsedTierHDD (Type: GAUGE, Value: 0) Cluster.CapacityUsedTierMEM (Type: GAUGE, Value: 5,080,863,076) Cluster.CapacityUsedTierSSD (Type: GAUGE, Value: 3,221,225,472) Cluster.RootUfsCapacityFree (Type: GAUGE, Value: 193,345,646,592) Cluster.RootUfsCapacityTotal (Type: GAUGE, Value: 211,243,999,232) Cluster.RootUfsCapacityUsed (Type: GAUGE, Value: 17,898,352,640) Cluster.Workers (Type: GAUGE, Value: 3) Master.AbsentCacheHits (Type: COUNTER, Value: 0) Master.AbsentCacheInvalidations (Type: COUNTER, Value: 0) Master.AbsentCacheMisses (Type: COUNTER, Value: 0) Master.AbsentCacheSize (Type: GAUGE, Value: 0) Master.CompleteFileOps (Type: COUNTER, Value: 0) Master.ConnectFromMaster.UFS:%2Fjournal%2FMetricsMaster.UFS_TYPE:local (Type: TIMER, Value: 0) Master.Create.UFS:%2Fjournal%2FMetricsMaster.UFS_TYPE:local (Type: TIMER, Value: 0) Master.CreateDirectoryOps (Type: COUNTER, Value: 0) Master.CreateFileOps (Type: COUNTER, Value: 0) Master.DeletePathOps (Type: COUNTER, Value: 0) Master.DirectoriesCreated (Type: COUNTER, Value: 0) Master.EdgeCacheEvictions (Type: COUNTER, Value: 0) Master.EdgeCacheHits (Type: COUNTER, Value: 2,136,970) Master.EdgeCacheLoadTimes (Type: COUNTER, Value: 0) Master.EdgeCacheMisses (Type: COUNTER, Value: 0) Master.EdgeCacheSize (Type: GAUGE, Value: 123,438) Master.EdgeLockPoolSize (Type: GAUGE, Value: 123,439) Master.FileBlockInfosGot (Type: COUNTER, Value: 0) Master.FileInfosGot (Type: COUNTER, Value: 15) Master.FilesCompleted (Type: COUNTER, Value: 0) Master.FilesCreated (Type: COUNTER, Value: 0) Master.FilesFreed (Type: COUNTER, Value: 0) Master.FilesPersisted (Type: COUNTER, Value: 0) Master.FilesPinned (Type: GAUGE, Value: 0) Master.FreeFileOps (Type: COUNTER, Value: 0) Master.GetAcl.UFS:%2Fjournal%2FMetricsMaster.UFS_TYPE:local (Type: TIMER, Value: 0) Master.GetAcl.User:root.UFS:%2Fjournal%2FMetricsMaster.UFS_TYPE:local (Type: TIMER, Value: 0) Master.GetBlockMasterInfo.User:root (Type: TIMER, Value: 88) Master.GetFileBlockInfoOps (Type: COUNTER, Value: 0) Master.GetFileInfoOps (Type: COUNTER, Value: 4) Master.GetFileLocations.User:root.UFS:%2Fjournal%2FMetricsMaster.UFS_TYPE:local (Type: TIMER, Value: 0) Master.GetNewBlockOps (Type: COUNTER, Value: 0) Master.GetSpace.UFS:%2Fjournal%2FMetricsMaster.UFS_TYPE:local (Type: TIMER, Value: 0) Master.GetSpace.User:root.UFS:%2Fjournal%2FMetricsMaster.UFS_TYPE:local (Type: TIMER, Value: 9) Master.GetStatus.User:root (Type: TIMER, Value: 2) Master.GetStatus.User:root.UFS:%2Fjournal%2FMetricsMaster.UFS_TYPE:local (Type: TIMER, Value: 0) Master.GetWorkerInfoList.User:root (Type: TIMER, Value: 2) Master.InodeCacheEvictions (Type: COUNTER, Value: 0) Master.InodeCacheHits (Type: COUNTER, Value: 13,135,456) Master.InodeCacheLoadTimes (Type: COUNTER, Value: 0) Master.InodeCacheMisses (Type: COUNTER, Value: 0) Master.InodeCacheSize (Type: GAUGE, Value: 123,439) Master.InodeLockPoolSize (Type: GAUGE, Value: 123,439) Master.JournalFlushTimer (Type: TIMER, Value: 0) Master.LastBackupEntriesCount (Type: GAUGE, Value: -1) Master.LastBackupRestoreCount (Type: GAUGE, Value: -1) Master.LastBackupRestoreTimeMs (Type: GAUGE, Value: -1) Master.LastBackupTimeMs (Type: GAUGE, Value: -1) Master.ListStatus.UFS:%2Fjournal%2FMetricsMaster.UFS_TYPE:local (Type: TIMER, Value: 15) Master.ListStatus.User:root (Type: TIMER, Value: 2) Master.ListStatus.User:root.UFS:%2Fjournal%2FMetricsMaster.UFS_TYPE:local (Type: TIMER, Value: 0) Master.ListingCacheEvictions (Type: COUNTER, Value: 0) Master.ListingCacheHits (Type: COUNTER, Value: 2) Master.ListingCacheLoadTimes (Type: COUNTER, Value: 0) Master.ListingCacheMisses (Type: COUNTER, Value: 0) Master.ListingCacheSize (Type: GAUGE, Value: 124,005) Master.MountOps (Type: COUNTER, Value: 0) Master.NewBlocksGot (Type: COUNTER, Value: 0) Master.PathsDeleted (Type: COUNTER, Value: 0) Master.PathsMounted (Type: COUNTER, Value: 0) Master.PathsRenamed (Type: COUNTER, Value: 0) Master.PathsUnmounted (Type: COUNTER, Value: 0) Master.PerUfsOpConnectFromMaster.UFS:%2Fjournal%2FMetricsMaster (Type: GAUGE, Value: 0) Master.PerUfsOpCreate.UFS:%2Fjournal%2FMetricsMaster (Type: GAUGE, Value: 0) Master.PerUfsOpGetFileLocations.UFS:%2Fjournal%2FMetricsMaster (Type: GAUGE, Value: 0) Master.PerUfsOpGetSpace.UFS:%2Fjournal%2FMetricsMaster (Type: GAUGE, Value: 9) Master.PerUfsOpGetStatus.UFS:%2Fjournal%2FMetricsMaster (Type: GAUGE, Value: 0) Master.PerUfsOpListStatus.UFS:%2Fjournal%2FMetricsMaster (Type: GAUGE, Value: 15) Master.PerUfsSavedOpGET_FILE_INFO.UFS:%2FunderFSStorage (Type: COUNTER, Value: 13) Master.PerUfsSavedOpLIST_STATUS.UFS:%2FunderFSStorage (Type: COUNTER, Value: 2) Master.RenamePathOps (Type: COUNTER, Value: 0) Master.SetAclOps (Type: COUNTER, Value: 0) Master.SetAttributeOps (Type: COUNTER, Value: 0) Master.TotalPaths (Type: GAUGE, Value: 123,439) Master.UfsJournalInitialReplayTimeMs (Type: GAUGE, Value: -1) Master.UfsSessionCount-Ufs:%2Fjournal%2FMetricsMaster (Type: COUNTER, Value: 0) Master.UfsSessionCount-Ufs:%2FunderFSStorage (Type: COUNTER, Value: 0) Master.UnmountOps (Type: COUNTER, Value: 0) Master.blockHeartbeat.User:root (Type: TIMER, Value: 468) Master.clearMetrics.User:root (Type: TIMER, Value: 0) Master.close.UFS:%2Fjournal%2FMetricsMaster.UFS_TYPE:local (Type: TIMER, Value: 0) Master.commitBlock.User:root (Type: TIMER, Value: 111) Master.getConfigHash (Type: TIMER, Value: 12) Master.getConfigHash.User:root (Type: TIMER, Value: 33) Master.getConfiguration (Type: TIMER, Value: 2) Master.getConfiguration.User:root (Type: TIMER, Value: 103) Master.getMasterInfo.User:root (Type: TIMER, Value: 88) Master.getMetrics.User:root (Type: TIMER, Value: 3) Master.getPinnedFileIds.User:root (Type: TIMER, Value: 468) Master.getUfsInfo.User:root (Type: TIMER, Value: 1) Master.getWorkerId.User:root (Type: TIMER, Value: 0) Master.registerWorker.User:root (Type: TIMER, Value: 0)
I suggest you to repeat the experiment and check Cluster.BytesReadUfsThroughput (Type: GAUGE, Value: 0B/MIN). Thanks. @apc999 Any sugggestions.