Can't install 5.8, can install 5.7

i3luefire · #1 · 25th December 2015

I get the below error message when I try to install 5.8.
I have completely installed 5.7 on the exact same hardware before and after the failed install of 5.8.
I tried to install from the install58.fs and from miniroot58.fs and I tried to upgrade a 5.7 install to 5.8.
Every time I get the same error.
Also, if I try to ping from the command prompt on the 5.8 install image, the first ping goes through and then I get the "Illegal instruction" error. I assume this would be true for a number of commands, since it seems to happen both when the installer tries to unzip something and when I try to ping something.

Code:

zip: stdin: Input/output error
tar: End of archive volume 1 reached
Illegal instruction
ftp: Can't open file ///mnt/usr/share/sysmerge/etc.tgz: No such file or directory
gzip: stdin: unrecognized file format
tar: End of archive volume 1 reached
tar: Sorry, unable to determine archive format.
Installation of base58.tgz failed. Continue anyway? [no]

Below is a log of what some people in the freenode #openbsd channel had me try, plus some things I tried along the way,
then a link to some pictures of the problem happening,
then a link to my dmesg output running 5.7.
If I could figure out how to copy and paste from the install image to the internet, I would post the dmesg output from the 5.8 install image.
https://gist.github.com/i3luefire/9fd73e1b7f284bc6ca16
https://imgur.com/a/1HJgP
https://gist.github.com/i3luefire/623b62a44affdc47ad44

jggimi · #2 · 25th December 2015

Hello, and welcome!

The illegal instruction error is likely indicative of the root cause. Are you able to install an i386 system successfully?

i3luefire · #3 · 25th December 2015

Yes. Actually, I just tried the 5.9 snapshot of amd64 and the problem still exists. Also, a
weirder thing: when I use ping, it gets one result before getting the error "Illegal instruction".
If I do ping -c 1 google.com it has no error,
but with ping -c 2 google.com it has the error.

These are some pictures and a dmesg from the i386 install:
https://drive.google.com/folderview?...U0&usp=sharing

jggimi · #4 · 25th December 2015

This is very strange. The Celeron G1610 is 64-bit capable, and should be able to run either architecture. Both fail with similar issues, and it appears you've attempted to install from local media as well as from a nearby mirror.

I recommend taking this to the Project for analysis and review. A -current dmesg, and links to photos should be sent to the bugs@ mailing list.

(Now is a good time to do so, if the source of the problem happens to be a software bug. The Project just entered beta testing for 5.9, and they're asking for -current bug reports to be sent in.)

i3luefire · #5 · 26th December 2015

Okay, I reported it.
But now I have a new bit of info.
I used my Intel i5 laptop to install 5.9 amd64 to an external hard drive, and when I put it on the Celeron computer it boots fine, like the install media... but I still get the same illegal instruction message when I try to ping with more than -c 1.

jggimi · #6 · 26th December 2015

When you run ping(8) from the installed system, does it create a .core file from the illegal instruction error?

i3luefire · #7 · 26th December 2015

No. Strangely it does not, or at least I can't find it. But I have *.core files from ftp, ntpd, and tmux. I sent this along with my last update on the mailing list.
Here are some core dumps related to this problem:
https://github.com/i3luefire/openbsd...ive/master.zip
And here is the gdb output from those core dumps:
https://gist.github.com/i3luefire/3b1177deef1ef473735b

jggimi · #8 · 27th December 2015

I built ntpd (since that was your first core file) with debugging symbols, and ran gdb against your core file, hoping for a match. If this is correct, the failure is in line 262 of ntpd.c:

Code:

if ((nfds = poll(pfd, i, timeout)) == -1)

This is the syscall poll(2).

jggimi · #9 · 27th December 2015

OK, that syscall is defined in /usr/src/sys/kern/syscalls.master as sys_poll. That function is in /usr/src/sys/kern/sys_generic.c.

The revision of sys_generic.c with OpenBSD 5.7 was 1.96. Looking through the syscall and the subfunction that does the work, doppoll(), I can see the addition of a POLLNOHUP loop at 1.98. I don't know if that's applicable to the problem or not, but it's the only apparent change since 5.7 to my unskilled eyes.

The commit log says:

Code:

revision 1.98
date: 2015/05/10 22:35:38;  author: millert;  state: Exp;  lines: +5 -3;  commitid: rtX5Mpzd4CgHtDmM;
Set POLLHUP even if no valid events were specified as per POSIX.
Since we use the poll backend for select(2), care must be taken not
to set the fd's bit in writefds in this case.  A kernel-only flag,
POLLNOHUP, is used by selscan() to tell the poll backend not to
return POLLHUP on EOF.  This is currently only used by fifo_poll().
The fifofs regress now passes.  OK guenther@

Here's an excerpt of the diff between 1.96 and 1.98, just within the doppoll() function:

Code:

@@ -953,8 +940,10 @@ doppoll(struct proc *p, struct pollfd *f
    if ((error = copyin(fds, pl, sz)) != 0)
        goto bad;

-    for (i = 0; i < nfds; i++)
+    for (i = 0; i < nfds; i++) {
+        pl[i].events &= ~POLLNOHUP;
        pl[i].revents = 0;
+    }

    if (tsp != NULL) {
        getnanouptime(&rts);
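
Read as plain C, the patched loop does two things to each entry: it masks one flag bit out of events and zeroes revents. Here is a minimal userland sketch of that pattern. It is not the kernel code: POLLNOHUP is kernel-only per the commit log, so EX_POLLNOHUP and its value are hypothetical stand-ins, and the function names are made up for illustration.

```c
#include <poll.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel-only POLLNOHUP flag; the value
 * is illustrative and not taken from the OpenBSD headers. */
#define EX_POLLNOHUP 0x1000

/* Same shape as the patched loop: strip the internal-only bit from each
 * caller-supplied events field and clear the result field. */
void
scrub_pollfds(struct pollfd *pl, size_t nfds)
{
    size_t i;

    for (i = 0; i < nfds; i++) {
        pl[i].events &= ~EX_POLLNOHUP;  /* drop the internal-only bit */
        pl[i].revents = 0;              /* start with no results */
    }
}

/* Self-check: returns 1 if the stand-in bit was removed, the other
 * requested events were kept, and revents was cleared. */
int
scrub_demo(void)
{
    struct pollfd pl[2] = {
        { .fd = 3, .events = POLLIN | EX_POLLNOHUP, .revents = POLLIN },
        { .fd = 4, .events = POLLIN, .revents = 0 },
    };

    scrub_pollfds(pl, 2);
    return pl[0].events == POLLIN && pl[0].revents == 0 &&
        pl[1].events == POLLIN && pl[1].revents == 0;
}
```

The mask only touches one bit, so whatever else the caller asked for (POLLIN here) passes through unchanged.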

i3luefire · #10 · 27th December 2015

That does not help me, because it is a bit over my head, but thank you for your response. If you think it may help the people on the mailing list solve the problem, I hope you will send that reply to bugs@.

jggimi · #11 · 27th December 2015

No, I don't think it will help -- the .core file needs to match the symbols in the source code exactly, and it doesn't. There's nothing in that section of the poll() syscall code that indicates to me anything very special -- the change which touched the code only runs through the array of pl structures, setting variables.

So this morning (my time, just now) I ran a backtrace against your tmux core file, and can see that it's out-of-sync with the source code more clearly. It indicated a library error with event management, but the function noted in the stack was at a different location in source code -- so the symbols were misaligned.

---

The problems are occurring due to an illegal instruction, but I cannot locate the source of it with the information I have. There have been illegal instructions previously reported with virtual Celeron G1610s, as the Xen hypervisor can indicate this model to guest virtual machines...but I didn't find any reported with real Celeron hardware.
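
As background on what the shell is reporting: "Illegal instruction" is the message shown when a process has been terminated by SIGILL, the signal the kernel delivers when the CPU refuses to execute an instruction. The sketch below only simulates that with raise(3) in a child process and reads the termination status back with waitpid(2). It does not reproduce the fault seen on this hardware, and the helper name fatal_signal_of_child is made up for illustration.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that sends itself `sig` with the default action in
 * place, then report how the child ended.  Returns the number of the
 * terminating signal, 0 if the child exited normally, or -1 on error.
 */
int
fatal_signal_of_child(int sig)
{
    pid_t pid;
    int status;

    pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0) {
        signal(sig, SIG_DFL);   /* make sure the default action applies */
        raise(sig);             /* stand-in for the CPU trap */
        _exit(0);               /* only reached if the signal was not fatal */
    }
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

A shell does the same check on its children: when the terminating signal is SIGILL, it prints the "Illegal instruction" text seen in the installer and after ping.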

I can build you a system from -current source code, and then we'd know that any .core file you create will match that source code exactly. You'd have to install it from your working hardware, and then test again from the non-working hardware, capturing .core files once more.

But you'd have to trust some random guy on the Internet to provide kernels and filesets. Let me know if you'd like to give that a try -- and I'll build a system from source, and retain that source for debugging.

i3luefire · #12 · 27th December 2015

I will do it. Just let me know when the img is ready to install.

jggimi · #13 · 27th December 2015

Building begins. I'll have a matching source tarball available to you as well as the release(8). It'll be a few hours.

jggimi · #14 · 27th December 2015

Build of kernels, userland, and xenocara complete. Links provided via PM.

i3luefire · #15 · 28th December 2015

Ohhhhh kayyyy. Well, I am starting to notice a pattern. At least in some circumstances the core dump can be brought on by attempting an exit. E.g., if I type tmux and then try "exit", tmux core dumps; if I exit from my ssh session, sshd core dumps; and if I ^C out of top, I get a core dump. I have been trying to get info, but I had to keep rebooting the machine, because if I ssh'd in and exited the session, sshd would core dump and I could not get back into the machine.

jggimi · #16 · 28th December 2015

i3luefire has sent me a lot of core files, and I have matching source code. I started with ntpd, as it was discussed earlier. Frame #0 is the failure in the poll(2) syscall, and Frame #1 is the call to poll() at line 262 of ntpd.c:

Code:

if ((nfds = poll(pfd, i, timeout)) == -1)

The arguments passed to poll(2) are a valid pointer to the pollfd structure array pfd, and two variables: i = 3 and timeout = -1.

The variable i defines the number of structures in the pollfd array. The core file shows them:

pfd[0]: fd = 3, events = 1, revents = 0
pfd[1]: fd = 4, events = 1, revents = 0
pfd[2]: fd = 7, events = 1, revents = 0

events = 1 is POLLIN per /usr/include/sys/poll.h, which is defined in the man page as "Data other than high-priority data may be read without blocking."

If the timeout argument is set to -1, poll() blocks until the condition is met.
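
For anyone following along, here is a minimal sketch of a call shaped like the one in the core file: a few descriptors, each with events set to POLLIN, handed to poll(2) with a timeout. It is not ntpd's code. The names wait_readable, poll_demo, and MAX_FDS are made up for illustration, and it assumes a POSIX system with <poll.h>.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

#define MAX_FDS 3   /* the core file showed three pollfd entries */

/*
 * Hypothetical helper, not taken from ntpd: ask poll(2) whether any of
 * the given descriptors has data to read.  Returns what poll(2) returns:
 * the number of ready descriptors, 0 on timeout, or -1 on error.
 * A timeout_ms of -1 means "block until something is ready".
 */
int
wait_readable(const int *fds, int nfds, int timeout_ms, short *revents_out)
{
    struct pollfd pfd[MAX_FDS];
    int i, n;

    assert(nfds >= 0 && nfds <= MAX_FDS);
    for (i = 0; i < nfds; i++) {
        pfd[i].fd = fds[i];
        pfd[i].events = POLLIN;     /* shows up as events = 1 in the core */
        pfd[i].revents = 0;
    }
    n = poll(pfd, (nfds_t)nfds, timeout_ms);
    for (i = 0; i < nfds; i++)
        revents_out[i] = pfd[i].revents;
    return n;
}

/*
 * Self-check using two pipes: nothing is readable at first, then one
 * byte is written to the first pipe and a blocking poll reports it.
 * Returns 1 if poll(2) behaved as described, 0 otherwise.
 */
int
poll_demo(void)
{
    int a[2], b[2], fds[2], ok;
    short rev[2];

    if (pipe(a) == -1 || pipe(b) == -1)
        return 0;
    fds[0] = a[0];
    fds[1] = b[0];

    ok = wait_readable(fds, 2, 0, rev) == 0;        /* nothing pending yet */
    ok = ok && write(a[1], "x", 1) == 1;
    ok = ok && wait_readable(fds, 2, -1, rev) == 1; /* blocks, one ready */
    ok = ok && (rev[0] & POLLIN) && !(rev[1] & POLLIN);

    close(a[0]); close(a[1]); close(b[0]); close(b[1]);
    return ok;
}
```

With a timeout of -1, as in the ntpd call, wait_readable() does not return until at least one descriptor is readable; with a timeout of 0 it returns immediately, giving 0 when nothing is pending.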

This syscall looks valid to me. The failing frame only provides an address ... and as I have the kernel source to match, I should be able to find it with a kernel built with makeoptions DEBUG="-g".

jggimi · #17 · 28th December 2015

OK, that failed. The backtrace has only two frames:

Code:

(gdb) bt
#0  0x00000ee8802c4dda in poll () at <stdin>:2
#1  0x00000ee64bf05e8f in main (argc=<optimized out>, argv=<optimized out>) at /usr/src/usr.sbin/ntpd/ntpd.c:262

and the doppoll function is located in the kernel much further away.

Code:

(gdb) file bsd.gdb
Reading symbols from bsd.gdb...done.
(gdb) info address doppoll
Symbol "doppoll" is a function at address 0xffffffff811a97f0.

I don't know how to debug syscalls, obviously.

All I know of them is on page 15 of this presentation.

I'm going to look through the other core files today, and see if I can find other types of errors.

jggimi · #18 · 28th December 2015

I've looked at these core files. All are failing inside of syscalls, though the syscalls vary: poll(2) twice, kevent(2), read(2), waitpid(2).

I'll post findings to bugs@ later today, and ask for assistance. I'm sure there's something easy and obvious which I'm missing regarding syscall debugging.

i3luefire · #19 · 28th December 2015

Thanks for all your help so far.

jggimi · #20 · 28th December 2015

I've posted to bugs@. Hopefully, we'll get some direction to narrow this down.
