Welcome! Log In Create A New Profile

Advanced

[Solved] EHCI timed out on TD - token=XXXX

Posted by rayvt 
[Solved] EHCI timed out on TD - token=XXXX
July 05, 2017 05:45PM
Many people have run into the problem on a Pogoplug with a USB disk that refuses to boot. If you can see the U-Boot console, either with a serial cable or netconsole, the error message is "EHCI timed out on TD - token=0x{something}" repeated over and over again. A related problem is a drive which randomly fails to be detected by uboot, but without any error message.

Frustratingly, once the Pogoplug is running Linux, the same disk works perfectly ok, it is only U-boot that has problems with it. Ugh

I have found the problem and fixed it.

It has sometimes been suggested that two environment variables, "usb_pgood_delay" and "usb_ready_retry" be set. These can be removed. They had nothing to do with the problem and did nothing to fix or resolve it.

Pending the time until the fix gets pulled in to the "official" U-Boot code, I have created u-boot firmware and made it available for download.

This is a fixed version of the bodhi "U-Boot 2016.05 (May 20 2017 - 12:42:17 -0500)"
filename: "uboot.2016.05-tld-1.pogo_e02.mtd0.kwb"

The versions I created are for Pogoplug_E02 (pink pogoplug) and Pogoplug_V4. I can build versions for other pogoplugs, just ask.

Link to the folder containing the files:
https://www.dropbox.com/sh/nwjwxzdy0ifd1aq/AADcduZlnF9psWjRBpMB-cMja?dl=0

Link to the patch file. (It is also in the above folder.)
https://www.dropbox.com/s/nrkrd1no63viuu8/uboot-bodhi-2016.05-timeoutTD.patch?dl=0

For your convenience, there is also a script to install the firmware.
Usage: install-new-uboot.sh {filename.kwb}
Re: [Solved] EHCI timed out on TD - token=XXXX
July 05, 2017 05:47PM
[next post]

For those who are curious about the error message:
EHCI timed out on TD - token=0x80008d88
and
EHCI timed out on TD - token=0x8d88

TD stands for "Queue Element Transfer Descriptor (qTD)", which is a micro-code instruction for the EHCI USB chip. A TD controls the I/O transfer of up to 20480 bytes of data. One I/O operation consists of one or more TDs. (Ref: Enhanced Host Controller Interface Specification for Universal Serial Bus Ch. 3.5)

The "token" field is detailed information, control, and status of the TD's data transfer. In particular, the rightmost byte is the status field. 0x80 bit means the TD is active and the data transfer has not yet completed. 0x08 bit means there was some sort of data transfer error (XactErr).

U-boot allows 5 seconds for the last TD to complete, before it declares a time-out error. The largest possible I/O operation in U-boot is 65535 blocks or about 33 MB. Depending on the disk, this takes about 0.9s to 1.8s.
Re: [Solved] EHCI timed out on TD - token=XXXX
July 05, 2017 05:52PM
[next post]

Analysis of the U-Boot bug. And the fix.

There were two problems that caused this bug, which is what made it so difficult to reproduce and fix. The first problem laid a land-mine for the next I/O operation -- the innocent bystander -- to step on and blow up (timed out).

Most USB disks worked okay. The disks that had trouble generally worked okay on the 2012-era versions of U-boot, but would not work reliably on the 2016-era versions. Linux is okay with these drives because it has a quirks table which identifies USB chips that require special handling. It seems that USB drives with Genesys Logic chips are a major culprit.

The root problem is the USB chip in these drives have sporadic trouble when doing large I/O read transfers. The larger the number of blocks in the I/O read, the more probable that it will have a transfer error. This is why the older versions of U-boot worked. The older versions have a maximum block count of 20, whereas the new versions have 65535.

If the drive gets a transfer error, it refuses to do any other I/O transfer until the error is properly cleared and reset. U-boot did not do this, so every subsequent disk read would get a timeout error because the drive would not send any data. The next two I/O operations are to read the status and reset the drive -- both of which got a timeout. The only way to clear it was to do a U-boot USB RESET command.

The fix has two parts. The first part is to properly clear and reset the USB disk when it gets a transfer error. This gets rid of the timeout errors.

The second part is to dynamically reduce the maximum block count at every read I/O that throws a transfer error. We don't want to permanently lower the max block count, because even with a 33MB blocksize fewer than 20% of the reads get an error. Over 80% are successful. And reducing the block count greatly slows down the read operation, so we don't want to do that. For a 25MB read, a successful read with the default max block count (65535) takes about 1.0 seconds but with a block count of 64 it takes 22 seconds. With the dynamic adjustment, the troublesome drives took average 1.2 seconds. Much better than 22 seconds.

Every read operation starts at the maximum block size. The dynamic reduction begins anew with each read.

There is new environment variable so that the maximum block count is a parameter instead of hard-coded. This variable is "usb_max_blk". The default is still 65535. However, with the dynamic reduction I never had a need to set this variable -- test runs with thousands of reads were all successful.

A related problem is that some drives are slow to come up. Linux handles this by issuing a spinup command and allowing more time for the drive to respond. The fix was to do the same (in two places in the U-boot code).

Patch files are in my dropbox folder:
https://www.dropbox.com/sh/nwjwxzdy0ifd1aq/AADcduZlnF9psWjRBpMB-cMja?dl=0

one is for the bodhi "u-boot-kirkwood-2016.05-kirkwood-tld-2"
The other is for the "v2017.05" U-boot from denx.de's github.

Also there is an excel file containing the results & statistics of my
testing one such troublesome drive on a Pogoplug E02 (pink pogoplug).

--------
To execute a test run of 256 large reads, do these u-boot commands:

setenv rd 'mw.b 0xf0000 52 c00000 ; usb read 0x80000 0 c000 && crc32 0x80000 c00000 '
setenv cnt 100 ; while itest $cnt > 0 ; do echo '######' loop $cnt ; if run rd ; then ; else echo '!!!!!!!' fail at cnt $cnt; setenv cnt -1;fi ; setexpr cnt $cnt - 1; done
Re: [Solved] EHCI timed out on TD - token=XXXX
July 05, 2017 09:01PM
Awesome work! Thank you for your work and sharing it!
Re: [Solved] EHCI timed out on TD - token=XXXX
July 06, 2017 06:14PM
@ray,

static int usb_test_unit_ready(ccb *srb, struct us_data *ss)
 {
        int retries = 10;
+       int gave_extra_time = 0;
 
        do {
                memset(&srb->cmd[0], 0, 12);
@@ -1019,6 +1052,13 @@
                if ((srb->sense_buf[2] == 0x02) &&
                    (srb->sense_buf[12] == 0x3a))
                        return -1;
+               /* If the status is "Not Ready - becoming ready", give it
+                * more time.  Linux issues a spinup command (once) and gives
+                * it 100 seconds. */
+               if (srb->sense_buf[2] == 0x02 && srb->sense_buf[12] == 0x04 &&
+                   gave_extra_time == 0)
+                  gave_extra_time = retries = 100; /* Allow 10 seconds. */
+               
                mdelay(100);
        } while (retries--);

For USB HDDs such as WD brand, 10 seconds might not be enough to spin up.

-bodhi
===========================
Wiki
latest Kirkwood kernel builds and rootfs
latest u-boot-kirkwood builds
latest Oxnas kernel builds and rootfs
latest u-boot-oxnas builds
latest MVEBU Armada kernel builds and rootfs
U-Boot & Kernel Booting process
bodhi's u-boot GitHub
bodhi's corner
Re: [Solved] EHCI timed out on TD - token=XXXX
July 06, 2017 09:47PM
> For USB HDDs such as WD brand, 10 seconds might not be enough to spin up.

Could be. It was enough for the one WD drive I had. I thought that Linux's 100 seconds might be excessive -- BWDIK? It would be hard to argue that uboot shouldn't also do 100 seconds. It's not like we'd be waiting 100 seconds for a bad or missing drive -- the drive is responding.

So change it to:
gave_extra_time = retries = 1000; /* Allow 100 seconds. */
Re: [Solved] EHCI timed out on TD - token=XXXX
July 06, 2017 09:53PM
ray,

> gave_extra_time = retries = 1000; /* Allow 100 sec
> onds. */

It's probably OK for mainline. Adding a u-boot env is hard to get accepted in mainline. But the max block env is a good one, so I think they will be OK with it.

For my build, I would keep the usb_ready_retry env for total spin up time.

-bodhi
===========================
Wiki
latest Kirkwood kernel builds and rootfs
latest u-boot-kirkwood builds
latest Oxnas kernel builds and rootfs
latest u-boot-oxnas builds
latest MVEBU Armada kernel builds and rootfs
U-Boot & Kernel Booting process
bodhi's u-boot GitHub
bodhi's corner



Edited 1 time(s). Last edit at 07/07/2017 03:09AM by bodhi.
Re: [Solved] EHCI timed out on TD - token=XXXX
July 07, 2017 03:18AM
Hi ray,

Consider the scenarios:

1. For boxes that boot with a HDD or MMC rootfs, we don't care much about the spin up time of the USB drive. Actually we want to ignore the failure rather quickly.
2. For boxes that boot with USB rootfs, we want u-boot to wait for a long period until the drive spun up.

I understood you would take the approach that is acceptable to upstream. But I think there has been a shortage of configurable parameters to do what we want for certain use cases.

-bodhi
===========================
Wiki
latest Kirkwood kernel builds and rootfs
latest u-boot-kirkwood builds
latest Oxnas kernel builds and rootfs
latest u-boot-oxnas builds
latest MVEBU Armada kernel builds and rootfs
U-Boot & Kernel Booting process
bodhi's u-boot GitHub
bodhi's corner
Re: [Solved] EHCI timed out on TD - token=XXXX
July 07, 2017 05:39PM
Re: [Solved] EHCI timed out on TD - token=XXXX
July 07, 2017 06:21PM
Quote

For boxes that boot with a HDD or MMC rootfs, we don't care much about the spin up time of the USB drive. Actually we want to ignore the failure rather quickly.

That's the maximum time we will allow the drive to keep saying "Not Ready". As soon as the drive says "Ready" we are good to go. If the drive doesn't respond at all, we don't retry, we fail immediately.

I only have a pogoplug, so I could not test real HDD or MMC. I have 3 WD USB drives, two became ready immediately and one took about 3 seconds to say "ready". Before that, it kept returning "Not Ready - becoming ready". I happen to know it was doing a spinup because I could feel it with my hand.

The original uboot code only waited 1 second (10 tries at 100 msec each). Not that if the drive returns anything exept "Not Ready", we do not give any extra retries.
Re: [Solved] EHCI timed out on TD - token=XXXX
July 07, 2017 07:42PM
>
> That's the maximum time we will allow the d
> rive to keep saying "Not Ready". As soon as the d
> rive says "Ready" we are good to go. If the drive
> doesn't respond at all, we don't retry, we fail im
> mediately.

I knew. The max spin time is just that. So it could be successful, like in 1/2 second, and we are done.

> The original uboot code only waited 1 second (10 t
> ries at 100 msec each). Not that if the drive ret
> urns anything exept "Not Ready", we do not give an
> y extra retries.

True. It's always been like this in mainline code.

So I basically don't disagree with what you said. But to have the most flexible implementation, the max spin time should be adjustable by user. Linux have quirks table, but u-boot does not. Therefore I'll merge your patch, but keeping the code about factoring usb_ready_retry env into the max spin time. In a multi-drive configuration, this env will be useful.

-bodhi
===========================
Wiki
latest Kirkwood kernel builds and rootfs
latest u-boot-kirkwood builds
latest Oxnas kernel builds and rootfs
latest u-boot-oxnas builds
latest MVEBU Armada kernel builds and rootfs
U-Boot & Kernel Booting process
bodhi's u-boot GitHub
bodhi's corner
Re: [Solved] EHCI timed out on TD - token=XXXX
July 07, 2017 08:51PM
Quote

But to have the most flexible implementation, the max spin time should be adjustable by user. Linux have quirks table, but u-boot does not. Therefore I'll merge your patch, but keeping the code about factoring usb_ready_retry env into the max spin time. In a multi-drive configuration, this env will be useful.

Sounds good.

Yeah, the quirks thing...Linux has a lot of flexibility because it is continually updated, and it can to clever retries, etc.. And it's got logging, a console, etc. Uboot code is basically static, and you don't want to put a lot of special purpose code in it -- like a quirks table. And for most people, no logging and no console, so if it doesn't work all you know is that the dang thing won't boot.
So from that aspect, it's better to lean toward robustness and reliability. Better to sometimes take 10-15 seconds to boot -- and boot everything, than to take 5 seconds to boot most disks but utterly fail on many disks -- or worse yet, fail 50% of the time and succeed 50% of the time.

Anyway, uboot code is just to get Linux booted up, so speed is not as important as robustness.



I don't disagree about leaving usb_ready_retry env in, even though with the patched code it doesn't seem to be needed. Most people will not be aware that there is even such an environment variable. And when people try googling their problem they get a huge number of hits, mostly other people having the same problem. But also a lot of suggestions like "try this, try that". Nobody really knew what the problem was, it was a lot of guessing and hoping. Even my worse drive worked 80% of the time, so holding your mouth funny and replugging it would generally work. And, of course, all the things that people tried didn't actually do any good, it just randomly worked or failed.

It took me quite a bit of work to isolate the problem, and it was really tricky to find & fix it. I bet I flashed 100-150 builds. Clearly nobody had tracked it down before me, otherwise it would have been fixed.

Better if the uboot code just works without having to configure special things.

----------
And now I just got exposed to the long-time programmer rule: Do not make any changes right before a vacation or holiday. 'cause I just tried to do a new build and it wouldn't build, something about "__bswaps12 not found". No time to track it down. I guess something updated and screwed up the cross-compile tools. ;-(
Re: [Solved] EHCI timed out on TD - token=XXXX
July 08, 2017 05:46AM
Quote

It took me quite a bit of work to isolate the problem, and it was really tricky to find & fix it. I bet I flashed 100-150 builds. Clearly nobody had tracked it down before me, otherwise it would have been fixed.

:)) you should be booting with kwboot, before final testing with u-boot flashed in NAND.

In any case, mucho kudos for perserverance and technical expertise in fixing this problem :) Nobody wanted to get to the bottom of this, they just used a different flash drive,

Quote

Better if the uboot code just works without having to configure special things.

Only in the last 3-4 years or so, people have started getting serious about making u-boot robust enough. I think it has a lot to do with there have been developers who are actually getting paid in their day job to roll new u-boot in mainline.

-bodhi
===========================
Wiki
latest Kirkwood kernel builds and rootfs
latest u-boot-kirkwood builds
latest Oxnas kernel builds and rootfs
latest u-boot-oxnas builds
latest MVEBU Armada kernel builds and rootfs
U-Boot & Kernel Booting process
bodhi's u-boot GitHub
bodhi's corner



Edited 1 time(s). Last edit at 07/08/2017 05:47AM by bodhi.
Author:

Your Email:


Subject:


Spam prevention:
Please, enter the code that you see below in the input field. This is for blocking bots that try to post this form automatically. If the code is hard to read, then just try to guess it right. If you enter the wrong code, a new image is created and you get another chance to enter it right.
Message: