Sanitizing illegal UTF-8 sequences

I have incoming logs with invalid encodings. This is normal and expected: logs contain illegal input from users verbatim, for example a mail log with an extremely bogus username, or a server log quoting a Postgres error message that rejects invalid UTF-8.

However, this causes a lot of pain for fluentd: mainly it gets unfriendly errors from Elasticsearch and keeps trying to resend forever (or, maybe in newer versions, actually loses the log line!).

I have tried, and failed, to sanitize them like this:

  @type record_modifier
  # try to replace invalid encoding
  char_encoding utf-8:utf-8

but it doesn’t like my idea at all.

Is there anyone with an idea how to [forcibly] replace illegal encoding with U+FFFD (�) or even a question mark?


Dear Myself!

I am not a helpful community (seems it’s gone far), but there is a solution.

  # requires gem
    @type string_scrub
    replace_char \uffef

It is dangerously underdocumented, but if you provide a replace_char (unlike yours truly) and it is a sound one, it replaces invalid UTF-8 sequences with that character, or at least prevents chunks from getting stuck. I am not sure the value in the example above is sound (\uffef is not even U+FFFD, the replacement character I asked about), but “?” seems to work.
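For anyone wondering what "replace invalid sequences" means at the Ruby level, here is a minimal sketch using String#scrub from the standard library. It is not the plugin's code, and `sanitize_utf8` is a made-up helper name; it only illustrates the idea on a plain string.

```ruby
# Minimal sketch: replace invalid UTF-8 byte sequences in a string.
# `sanitize_utf8` is a hypothetical helper, not part of fluentd or any plugin.
def sanitize_utf8(value, replacement = "\uFFFD")
  # Work on a copy tagged as UTF-8 so the caller's string is left untouched.
  text = value.dup.force_encoding(Encoding::UTF_8)
  # String#scrub (Ruby 2.1+) swaps each invalid sequence for the replacement.
  text.valid_encoding? ? text : text.scrub(replacement)
end

bogus = "user\xFFname".b            # raw bytes containing an illegal 0xFF
puts sanitize_utf8(bogus).valid_encoding?
puts sanitize_utf8(bogus, "?")
```

Passing "?" as the second argument gives the question-mark behaviour; the default is U+FFFD.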

We are sorry that you’re on your own, but that’s the way (a-ha a-ha) I like it (a-ha a-ha).

It doesn’t work well though.

I had to ruby it, eventually, and do

but nobody uses this forum anyway, so who cares.

I also wrote a filter that converts escaped and encoded email subjects to UTF-8 and validates the result, so fluentd doesn’t freeze the stream daily and Elasticsearch doesn’t reject it all the time, but, again, nobody cares about that either.
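In case somebody does care: this is not that filter, just a rough sketch of the idea in plain Ruby, assuming the subjects use RFC 2047 encoded-words (=?charset?B?...?= or =?charset?Q?...?=). `decode_subject` is a made-up name, and the sketch ignores the RFC rule about dropping whitespace between adjacent encoded-words.

```ruby
# Rough sketch: decode RFC 2047 encoded-words and force the result to valid UTF-8.
# `decode_subject` is a hypothetical helper, not an existing fluentd API.
ENCODED_WORD = /=\?([^?]+)\?([BbQq])\?([^?]*)\?=/

def decode_subject(raw)
  # Scrub first, so the regexp never runs over invalid bytes.
  text = raw.dup.force_encoding(Encoding::UTF_8).scrub
  text.gsub(ENCODED_WORD) do
    charset, kind, payload = $1, $2.upcase, $3
    bytes = if kind == "B"
              payload.unpack1("m")                  # base64
            else
              payload.tr("_", " ").unpack1("M")     # quoted-printable, "_" means space
            end
    begin
      bytes.force_encoding(charset)
           .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
           .scrub
    rescue ArgumentError, EncodingError
      # Unknown or unconvertible charset: treat the bytes as UTF-8 and scrub them.
      bytes.force_encoding(Encoding::UTF_8).scrub
    end
  end
end

puts decode_subject("=?ISO-8859-1?Q?caf=E9?=")
```

The output is always valid UTF-8, which is the part that matters for keeping the stream from getting stuck.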