Great report from whitslack on this crash at startup:
```
2020-10-07T13:03:21.419Z **BROKEN** lightningd: FATAL SIGNAL 6 (version 0.9.1)
2020-10-07T13:03:21.419Z **BROKEN** lightningd: backtrace: common/daemon.c:51 (crashdump) 0x559fb67bcc76
2020-10-07T13:03:21.419Z **BROKEN** lightningd: backtrace: /var/tmp/portage/sys-libs/glibc-2.32-r2/work/glibc-2.32/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0 ((null)) 0x7f61cdca8baf
2020-10-07T13:03:21.419Z **BROKEN** lightningd: backtrace: ../sysdeps/unix/sysv/linux/raise.c:50 (__GI_raise) 0x7f61cdca8b31
2020-10-07T13:03:21.419Z **BROKEN** lightningd: backtrace: /var/tmp/portage/sys-libs/glibc-2.32-r2/work/glibc-2.32/stdlib/abort.c:79 (__GI_abort) 0x7f61cdc92535
2020-10-07T13:03:21.419Z **BROKEN** lightningd: backtrace: /var/tmp/portage/sys-libs/glibc-2.32-r2/work/glibc-2.32/assert/assert.c:92 (__assert_fail_base) 0x7f61cdc9241e
2020-10-07T13:03:21.419Z **BROKEN** lightningd: backtrace: /var/tmp/portage/sys-libs/glibc-2.32-r2/work/glibc-2.32/assert/assert.c:101 (__GI___assert_fail) 0x7f61cdca1241
2020-10-07T13:03:21.419Z **BROKEN** lightningd: backtrace: lightningd/subd.c:750 (subd_send_msg) 0x559fb67a1c31
2020-10-07T13:03:21.419Z **BROKEN** lightningd: backtrace: lightningd/subd.c:745 (subd_send_msg) 0x559fb67a1c31
2020-10-07T13:03:21.419Z **BROKEN** lightningd: backtrace: lightningd/peer_htlcs.c:252 (local_fail_in_htlc) 0x559fb6798f77
2020-10-07T13:03:21.419Z **BROKEN** lightningd: backtrace: lightningd/peer_htlcs.c:1441 (onchain_failed_our_htlc) 0x559fb6798f77
2020-10-07T13:03:21.419Z **BROKEN** lightningd: backtrace: lightningd/onchain_control.c:339 (handle_missing_htlc_output) 0x559fb6786b9d
2020-10-07T13:03:21.419Z **BROKEN** lightningd: backtrace: lightningd/onchain_control.c:455 (onchain_msg) 0x559fb6786b9d
```
The problem is a channel with an onchaind can be in state FUNDING_STATE_SEEN,
because onchaind has started but not responded to init yet (which it does once it
has analyzed the commitment tx).
Channel B is onchain, and its onchaind fails the HTLC, and we try to send a msg
to channel A's onchaind as if it were channeld.
Explicitly check if it's channeld, rather than trying to see if it's onchaind.
Fixes: #4114
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Fixed: crash: assertion fail at restart when source and destination channels of an HTLC are both onchain.
We're rarely in a hurry here, and bitcoind is aggressive with fees.
You can always spend this output if you really have to, using CPFP.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Changed: Protocol: mutual closing feerate reduced to "slow" to avoid overpaying.
We had a couple of instances where a plugin would be killed by `lightningd`
because we were returning a result of an exception twice, and it was hard to
trace down the logic error in the user plugin that caused that. This patch
adds a traceback the first time we return a result/exception, and raise an
exception with a stacktrace of the first termination when a second one comes
in.
This can still terminate the plugin, but the programmer gets a clear
indication where the result was set, and can potentially even recover from it.
Changelog-Added: pyln: Plugin method and hook requests prevent the plugin developer from accidentally setting the result multiple times, and will raise an exception detailing where the result was first set.
It is often pretty usefuk to use the builtin logging module to debug things,
including libraries that a plugin may use. This adds a simple
`PluginLogHandler` that maps the python logging levels to the `lightningd`
logging levels, and formats the record in a way that it doesn't clutter up the
`lightningd` logs (no duplicate timestamps and levels).
This allow us to tweak the log level that is reported to `lightningd` simply
using the following
```python3
import logging
logging.basicConfig(level=logging.DEBUG)
```
Notice that in order for the logs to be displayed on the terminal or the
logfile, both the logging level in the plugin _and_ the `--log-level`
`lightningd` is running need to be adjusted (the python logging level only
controls which messages get forwarded to `lightningd`, it does not have the
power to overrule `lightningd` about what to actually display).
I chose `logging.INFO` as the default, since libraries have a tendency to spew
out everything in `logging.DEBUG` mode
Changelog-Added: pyln: Plugins have been integrated with the `logging` module for easier debugging and error reporting.
I discovered this accidentally when using the `tests/plugins/dblog.py`
plugin on another testcase: tests/test_connection.py::test_fail_unconfirmed
There the plugin/hook crashes because it can't execute the statement:
```json
{
"jsonrpc": "2.0",
"id": 34,
"error": {
"code": -32600,
"message": "Error while processing db_write: unrecognized token: \"174WHERE\"",
"traceback": "Traceback (most recent call last):\n File \"/home/will/projects/lightning.git/contrib/pyln-client/pyln/client/plugin.py\", line 535, in _dispatch_request\n result = self._exec_func(method.func, request)\n File \"/home/will/projects/lightning.git/contrib/pyln-client/pyln/client/plugin.py\", line 520, in _exec_func\n return func(*ba.args, **ba.kwargs)\n File \"/home/will/projects/lightning.git/tests/plugins/dblog.py\", line 45, in db_write\n plugin.conn.execute(c)\nsqlite3.OperationalError: unrecognized token: \"174WHERE\"\n"
}
}
```
Changelog-Fixed: plugin: Regression with SQL statement expansion that could result in invalid statements being passed to the `db_write` hook.
Since we start a new instance of postgres for each test we may end up swamped
and the startup can take a bit longer. So let's loop until we get a success.
lightning-5 can sometimes see itself sweeping the unilateral output resulting
in this weird line:
```log
HTLC already resolved by SELF when we found preimage
```
In both cases the flakyness arises from the destination not knowing about the
modified fees of the forwarding node, thus including the outdated details in
the routehint, and the sender being unlucky and always trying with the
routehint anyway.
The long-term solutions to this is going to be #4111, this commit just reduces
the flakyness to get back to business.
It was really flaky, especially under `test_mpp_interference_2`, most likely
due to multiple calls to `fund_channel`. This commit looks for the specific
txids in the `listfunds` output and the `getrawmempool` output, avoiding
strange artifacts from multiple calls.
We really are just interested in their on-chain footprint, so actually
starting the nodes is pointless overhead, and caused a lot of flakyness due to
the output ordering sometimes not matching up. We now just use the `bitcoind`
API to fund, sign and send a raw transaction that matches the stashed gossip
messages.
Kinda uses an antipattern of `cd`ing into the directory and calling `make`
there, but this keeps the contrib dirs self-contained so we can split them out
eventually.
The 100MB log buffer has been the biggest memory footprint for the daemon.
Keeping 10MB for emergency log dumps seems sufficient.
This has been mentioned in the last developer meeting.
Changelog-Changed: In-memory log buffer reduced from 100MB to 10MB
For performance reasons we were starting one for each session, which caused
the same postgres DB to be re-used for multiple tests (all test run in the
same worker process), but this could lead to interactions if there is a
timeout or a test happens to touch the `db_provider`. It turns out that we
were only saving about 15 seconds on a 1250 second run anyway, which is a
small cost for increased test isolation.
We were not removing the base test directory if we had other files in there,
which was the case for postgres runs. This now explicitly check for `test_*`
directories which are an indicator of a failed test.
We had a couple of issues with workers dying and attempting to re-initialize
the database while it was already initialized. This will look for a free
directory and just start the DB in there, allowing workers to be better
isolated.
While continuing to test all tests is good, it also means we potentially wait
much longer to get the final verdict, and it can truncate logs if there are
too many failures.
I got a corrupt file, which looked like multiple concurrent attempts
to build it. So instead, build it in one command, but also use
VERBOSE so we print correctly with V=1 (and --quiet).
Also move into plugins/ where it logically belongs.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We added a conversion of failcodes that do not have sufficient information in
faac4b28ad. That means that a failcode that'd require additional information
in order to be a correct error to return in an onion is mapped to a generic
one since we can't backfill the information.
This tests that the mapping is performed correctly and replicates the
situation in #4070