Will FluentD work for my use case?

Hello!

Apologies in advance for my general lack of knowledge; I’m fairly new to the observability world. I’d like to know whether Fluentd would be a good fit for my use case. Any feedback or guidance would be appreciated.

My company has tons of varying log types that are mostly stored in S3. These range from common AWS log types, such as CloudTrail and VPC flow logs, to standard Linux logs such as /var/log/secure, to application-specific logs, Kubernetes logs, etc.

Part of my team’s responsibility is to take these logs and move them into our data lake: Google BigQuery.

So far what my team has been doing is writing one Dataflow template per log type… so we’d have a specific Dataflow template for CloudTrail, one for VPC flow logs, and so on. The Dataflow jobs basically read objects out of a bucket and stream them into BigQuery. The reason we keep creating new templates is to accommodate the format of each log type… since CloudTrail’s JSON needs to be parsed out, whereas VPC flow logs are flat, space-delimited text, etc.

What I’d like is a solution that doesn’t force me to create a new parser for each log type out there… it sort of feels like we’re reinventing the wheel. I’m sure we’ll have to create unique parsers for stuff like application logs, and hopefully that’s not too difficult either.

Another thing I’d like is some kind of transformation ability to make the schema more consistent between log types, so that it’s easier to correlate data across different log sources.

We also stream a lot of data into our data lake, 20 petabytes per year and growing.

I know that there are plugins for things like S3 and BigQuery… I’d just like some advice on whether what I’m trying to do is feasible. We have dozens of different S3 buckets with multiple log source types, all outputting to BigQuery tables whose schemas differ depending on the log source. I’d love some insight from folks with experience as to whether or not it makes sense to use Fluentd to help streamline this.

Thanks in advance for any advice you can offer 🙂

Hey Keefer,

Thanks for the question - I think the scenario you describe could be accommodated by Fluentd, though like any integration it will need some work. If I understand correctly, the main use case is: collect logs from various sources, transform and parse them, and output them into Google BigQuery.

From the collection side you have the following:

  • AWS S3 buckets
  • AWS CloudTrail
  • AWS VPC flow logs
  • Linux syslog + /var/log/secure
  • Application logs
  • Kubernetes logs

Of these, the ones Fluentd can help with most are the last three (Kubernetes, app, and Linux syslog), and that is an amazingly common use case. The pull sources (S3 / CloudTrail / VPC flow logs) can also be done, though I’m personally not sure what the performance will be, and you may need to measure to ensure it can keep up. If you were only looking at the last three (k8s, app, syslog), I would also recommend looking at Fluent Bit instead.
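To give a flavor of the pull side, here’s a minimal sketch of an S3 input. It assumes a recent fluent-plugin-s3 and S3 event notifications delivered to an SQS queue, which is how the in_s3 input discovers new objects; all bucket/queue/region names here are hypothetical:

```
<source>
  @type s3
  tag aws.vpcflow
  # Hypothetical bucket and region
  s3_bucket my-vpc-flow-bucket
  s3_region us-east-1
  <sqs>
    # SQS queue receiving the bucket's new-object event notifications
    queue_name my-s3-new-object-queue
  </sqs>
  <parse>
    # Keep the raw line; parse downstream per log type
    @type none
  </parse>
</source>
```

You’d repeat a source like this per bucket (or per prefix), tagging each one so downstream parsing can be routed per log type.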

On the transformation side you have a few options. You can use the parsers that ship with Fluentd, such as syslog (RFC 3164 and RFC 5424), plus parser plugins for other popular formats like logfmt. For the app logs, though, you may need custom parsing, which can be achieved with the regular expression parser.
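As an example of the regexp route, here’s a rough sketch of re-parsing the raw flow log lines from the S3 source above with the built-in parser filter. The field list follows the default VPC flow log (v2) format, so treat it as a starting point; the header line each flow log file begins with would also need dropping, e.g. with a grep filter:

```
<filter aws.vpcflow>
  @type parser
  # The "none" parser stored the raw line under "message"
  key_name message
  <parse>
    @type regexp
    expression /^(?<version>\d+) (?<account_id>\S+) (?<interface_id>\S+) (?<srcaddr>\S+) (?<dstaddr>\S+) (?<srcport>\S+) (?<dstport>\S+) (?<protocol>\S+) (?<packets>\S+) (?<bytes>\S+) (?<start>\d+) (?<end>\d+) (?<action>\S+) (?<log_status>\S+)$/
  </parse>
</filter>
```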

It is also worth noting that you can enrich or redact logs in flight, depending on the use case, with the record_transformer filter.
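Putting those pieces together, here’s a sketch of a record_transformer that stamps a couple of normalized columns onto every record, followed by a BigQuery load output via fluent-plugin-bigquery. Project/dataset/table names are hypothetical and the field mapping is purely illustrative:

```
<filter aws.**>
  @type record_transformer
  enable_ruby true
  <record>
    # Record which pipeline produced the row
    log_source ${tag}
    # Unify per-type field names into one column (illustrative)
    src_ip ${record["srcaddr"] || record["sourceIPAddress"]}
  </record>
</filter>

<match aws.**>
  # Batch load jobs tend to suit your volume better than streaming inserts
  @type bigquery_load
  auth_method application_default
  # Hypothetical project/dataset/table
  project my-gcp-project
  dataset security_logs
  table vpc_flow_logs
  # Pull column definitions from the existing BigQuery table
  fetch_schema true
  <buffer>
    @type file
    path /var/log/fluentd/buffer/bq
    flush_interval 60s
  </buffer>
</match>
```

At 20 PB/year, the buffering and BigQuery load-job quota behavior is exactly the part I would prototype and measure first.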

Hope this helps, and if you need more info I’m @Anurag in the Fluent Slack channel.

Appreciate the insight! Thank you.