libyami

Intermittent failed case for VPP of CSC+Sharpness by using yamitranscode on Fedora and ubuntu::yakkety

Open FocusLuo opened this issue 7 years ago • 22 comments

Using the latest commit on master of yami and libva/intel-driver.

Test CMD:
yamivpp .//1920x1080.nv12 -s 59 ./1920x1080.yv12
yamivpp .//1920x1080.yv12 -s 59 ./1280x720.i420

FocusLuo avatar Jul 20 '17 02:07 FocusLuo

vpp_clips.zip

FocusLuo avatar Jul 20 '17 03:07 FocusLuo

We have set up the Fedora 25 env and are trying to reproduce the issue.

xuguangxin avatar Jul 28 '17 00:07 xuguangxin

I have set up the Fedora 25 env on different APL machines. I built yami with the configure options found in the build log at http://media-ci.ostc.intel.com:8810/dashboard, and I ran the Test CMD above thousands of times. However, the issue did not reproduce. I will try to reproduce it with Docker next.

Zhziyao avatar Aug 01 '17 05:08 Zhziyao

What result are you expecting? I don't think this reported issue description tells the whole story.

The actual issue is that the output of the above test command is not always the same: the md5sum of the output intermittently changes from run to run (i.e. md5sum ./1920x1080.yv12 does not always give the same result).

I don't know what yamitranscode (mentioned in the issue title) has to do with this, either.

uartie avatar Aug 01 '17 05:08 uartie

Also, when the md5sum result is not as expected, I've seen an associated GPU hang on 4.10 and 4.11 kernels:

[23010.721025] drm/i915: Resetting chip after gpu hang
[23010.723370] [drm] RC6 on
[23010.724143] [drm] GuC firmware load skipped

uartie avatar Aug 01 '17 05:08 uartie

I'm able to reproduce at least once every ~200-300 sequential runs.

uartie avatar Aug 01 '17 06:08 uartie

md5sum of ./1920x1080.yv12 output should be f15e2b55a786fcf691f8e9d79e91653d
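For anyone trying to reproduce this, the check can be scripted. The following is only a minimal sketch, not the harness used in this report: it reruns a command, hashes the output file after each run, and records every run whose digest differs from the expected one. The helper names `md5_of_file` and `find_mismatches` are hypothetical; the command and the expected digest in the comment are the ones quoted in this thread.

```python
import hashlib
import subprocess


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(command, output_path, expected_md5, runs):
    """Run `command` `runs` times and return (run_index, md5) for each
    run whose output file does not hash to `expected_md5`."""
    mismatches = []
    for index in range(runs):
        # check=True raises CalledProcessError if the command itself fails
        subprocess.run(command, check=True)
        actual = md5_of_file(output_path)
        if actual != expected_md5:
            mismatches.append((index, actual))
    return mismatches


# Example invocation with the command and digest from this thread
# (requires yamivpp and the input clip, so it is left commented out):
# find_mismatches(
#     ["yamivpp", ".//1920x1080.nv12", "-s", "59", "./1920x1080.yv12"],
#     "./1920x1080.yv12",
#     "f15e2b55a786fcf691f8e9d79e91653d",
#     runs=300,
# )
```

An empty list after a few hundred runs would suggest the machine does not show the problem; any entry gives the run index and the unexpected digest to report.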

uartie avatar Aug 01 '17 06:08 uartie

@uartie Thank you for your detailed explanation. I understand the issue much more clearly now.

Zhziyao avatar Aug 01 '17 06:08 Zhziyao

@uartie, ziyao used md5sum to check the command result. He can't reproduce the issue on an APL machine. Is it possible that it is related to the CPU stepping? Could you share your CPU stepping with ziyao by mail, so he can compare the CPU info?

xuguangxin avatar Aug 01 '17 06:08 xuguangxin

Use "lspci -nn | grep VGA".

xuguangxin avatar Aug 01 '17 06:08 xuguangxin

00:02.0 VGA compatible controller [0300]: Intel Corporation Device [8086:5a85] (rev 0b)
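To compare steppings across machines without reading the output by eye, the vendor ID, device ID and revision can be pulled out of a line like the one above. This is a rough sketch that assumes the `lspci -nn` line format shown in this thread; `parse_lspci_vga` is a hypothetical helper name.

```python
import re

# Matches "[vendor:device]" optionally followed by "(rev xx)".
# The class code "[0300]" has no colon inside, so it is skipped.
_PCI_ID_RE = re.compile(
    r"\[(?P<vendor>[0-9a-fA-F]{4}):(?P<device>[0-9a-fA-F]{4})\]"
    r"(?:\s+\(rev (?P<rev>[0-9a-fA-F]{2})\))?"
)


def parse_lspci_vga(line):
    """Return a dict with vendor, device and rev from one `lspci -nn` line,
    or None if the line contains no [vendor:device] pair.
    `rev` is None when the line carries no revision."""
    match = _PCI_ID_RE.search(line)
    if match is None:
        return None
    return {
        "vendor": match.group("vendor").lower(),
        "device": match.group("device").lower(),
        "rev": match.group("rev").lower() if match.group("rev") else None,
    }


line = (
    "00:02.0 VGA compatible controller [0300]: "
    "Intel Corporation Device [8086:5a85] (rev 0b)"
)
print(parse_lspci_vga(line))
```

Two machines with the same device ID but a different `rev` value differ in stepping, which is the comparison being asked for here.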

uartie avatar Aug 01 '17 06:08 uartie

OK, just checked: we do not have rev 0b. @uartie, do you have another stepping? We also checked the kernel version. We use Fedora 25 with 4.8.6-300.fc25.x86_64, which differs from your kernel version. What OS version are you using?

xuguangxin avatar Aug 01 '17 08:08 xuguangxin

@xuguangxin, no I don't have another stepping locally. We use Fedora 25 host with updated kernel (via dnf package manager) and Ubuntu Xenial (16.04) host with updated kernel (via apt package manager).

Please try to update your Fedora 25 packages (including kernel) via dnf update and see if that can reproduce afterwards.

uartie avatar Aug 01 '17 15:08 uartie

Sorry for not explaining my former work clearly.

  1. I updated the kernel to the latest version and ran the test on an APL machine.
  2. Besides, I installed Docker and pulled the Fedora 25 image from the Intel repo. I set up the env with RETOOL. Then I ran the test in the Fedora 25 container.

However, the issue did not reproduce under either condition. I also saved the output of dmesg | grep -i gpu after each loop, but found no "GPU HANG" message.
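The per-loop dmesg check can also be done in a script rather than by grepping saved files. The sketch below is an illustration only: it assumes the kernel log text has already been captured as a string (for example from `dmesg`) and that lines carry the `[seconds.micros]` timestamp prefix seen in the log excerpt earlier in this thread. `find_gpu_hangs` is a hypothetical name.

```python
import re

# Optional "[  123.456789] " timestamp prefix, then the message text.
_LINE_RE = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]\s*(?P<msg>.*)$")
_HANG_RE = re.compile(r"gpu hang", re.IGNORECASE)


def find_gpu_hangs(dmesg_text):
    """Return (timestamp, message) for every kernel log line that mentions
    a GPU hang. The timestamp is None if the line has no prefix."""
    hits = []
    for line in dmesg_text.splitlines():
        if not _HANG_RE.search(line):
            continue
        match = _LINE_RE.match(line)
        if match:
            hits.append((float(match.group("ts")), match.group("msg")))
        else:
            hits.append((None, line.strip()))
    return hits


sample = """\
[23010.721025] drm/i915: Resetting chip after gpu hang
[23010.723370] [drm] RC6 on
[23010.724143] [drm] GuC firmware load skipped
"""
for timestamp, message in find_gpu_hangs(sample):
    print(timestamp, message)
```

Keeping the timestamp makes it possible to tell whether a hang was logged during the current loop iteration or is left over from an earlier one.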

Zhziyao avatar Aug 04 '17 02:08 Zhziyao

It seems to be a stepping-specific issue. Sadly, U.Artie's stepping is higher than Ziyao's. Let us find a rev 0b stepping.

xuguangxin avatar Aug 04 '17 02:08 xuguangxin

Surely, it is a stepping issue. I can reproduce the issue on the machine supplied by uartie.

Zhziyao avatar Aug 10 '17 04:08 Zhziyao

Ok, please continue root-causing on the APL I've supplied you.

uartie avatar Aug 10 '17 22:08 uartie

@Zhziyao, @xuguangxin this issue shows up on BSW, too. It's strange that both APL and BSW would be affected by a stepping issue.

00:02.0 VGA compatible controller [0300]: Intel Corporation Device [8086:22b1] (rev 21)

uartie avatar Aug 14 '17 15:08 uartie

@Zhziyao , could you find a bsw to reproduce this issue?

xuguangxin avatar Aug 15 '17 03:08 xuguangxin

@Zhziyao , any update on this?

xuguangxin avatar Aug 17 '17 07:08 xuguangxin

I can't reproduce this issue on BSW either. I have just finished setting up the test env on another machine. I wonder whether there is some difference between uartie's test env and mine that explains my failure to reproduce the issue. I will provide my host machine address to uartie on Slack for checking.

Zhziyao avatar Aug 17 '17 07:08 Zhziyao

Any progress with identifying/reproducing this issue on your end? I am attaching the i915_error_state generated when the GPU hang occurs.

i915_error_state.gz

uartie avatar Feb 01 '18 18:02 uartie