So this blog will be dedicated to reviewing the path I followed in my learning - I hope there are a few "gold nuggets of wisdom" that can be gleaned along the way.
Frustration & Misinformation
After looking at a few different products, their capabilities and extensibility, I became frustrated with the lack of information the vendor "technical leads" really had on how the products worked, what their limitations were, and the fact that they all said their products were "infinitely scalable". Does anyone actually believe that a product is "infinitely scalable"? I hope it comes with a silver bullet.
If the "technical leads" can't explain, in terms I can understand, what differentiates their product from another, how can I ever make an educated decision? After all, they're the ones who understand this log management space, right?
Lack of Understanding
I took this on more as a frustrated challenge than actual "fun". I wanted to prove, to myself at least, that I could probably do as much (or nearly so) as any of these vendors could with the $1 Million dollar price-tag software they were pushing. After all, Syslog has been in use for a couple of decades, right? How tough could this be?
As it turns out, you CAN do a lot with free tools - but not everything - or at least not yet. Even if I can't do everything, I've learned a lot (and am still learning). With all the work I've put into this so far, I hope this info is worth sharing - so if you are buying $1Million dollar piece of software, hopefully this will help you know what to look for, determine what you need, and be able to tell the vendor when you think they're full of crap.
As I said, this posting is going to delve into the complexities of Log Management. I'll explain, briefly, what log correlation is, but I won't get into that just yet - that'll be another post. There's a lot of info to share already.
Correlation vs Consolidation
Log correlation is the act of taking logs from disparate sources, and combining the information together in some meaningful way to help determine the underlying problem. For example, a simple failure event on a firewall may be commonplace, and something you'd typically ignore as background noise; but if it's coming from a UNIX box that just had someone successfully logon remotely, it may be more meaningful - perhaps even a compromised box. That's rather simplified, but I hope you get the idea.
Log consolidation is simply that - getting all logs into a common repository.
They are complementary technologies, but the primary goals are vastly different since. With correlation, not all of the informatin is necessarily maintained, just the correlated result. Most IDS/IPS systems work in a similar manner - entire packets aren't maintained, just the fact that the packed matched some signature.
A common misconception is that log correlation is a superset of log consolidation. After all, if correlating logs, the system doing the correlation has to get data from all the sources it's correlating, so it's very similar to a repository of data. Unfortunately, it's not that simple - but I'll have to address that in my future correlation blog. I will say that I believe correlation to be much more complex than collection and leave it at that for now.
Consolidating the Logs
My journey along the the log management path started with the relatively simple (though non-trivial) task of log consolidation. In it's simplest terms, all that's necessary for log consolidation is to get the data from a remote source and write it to disk.
From a performance standpoint, a ridiculous amount of data can be handled by pushing all of this data to a flat file. I can easily handle tens of gigs per day, even on a relatively slow server using this simplified architecture. The primary bottleneck in this scenario is Disk I/O - assuming the disk is not big$$ disk arrays. Typical server RAID arrays can handle Megs of data per second. From a performance standpoint, no big problems so far.
It's when trying to READ this data, however, things start to go downhill; and, let's face it, the data is being collected for a reason. With this volume of data in a flat file, though, it's like looking for a needle in a haystack if to answer something as simple as "who caused this account to get locked out", or "are all these SSH sessions coming from one (misbehaving) host, or is someone trying to break in", etc.
So, being the rational sort, I ask "Why not just push all the data into a database to allow for easy querying and reporting?". This should resolve most of my READ issues by limiting the amount of data I need to pour through.
Now, with databases, a few more issues came up:
- transaction times
- indexes
- parsing
- data archival
- maintenance
As I already mentioned, relative to the rest of the system, disk is slow - databases are even slower. Why? Well, this gets even more convoluted to understand, so please bear with me as I explain some technical details while I go through the description of my trials.
Transactions
Because of the transactional nature of databases, all events involved in a single "transaction" against the database either all succeed or all fail. So, to keep it simple, if I'm INSERTing a new record into the database, the database needs to first "stage" the data to a temporary location - called the transaction log - in case the insert fails (cancelled, out of disk space, database not available, etc.). Then, the database software will periodically say "all the transactions that have successfully completed in the transaction log will now be written to the actual database and expired from the transaction log". That's a lot of disk management and data movement, especially since most event records will be less than 1024 bytes; at least 3 disk writes take place, one to the transaction log, one to the actual database, and one to expire the transaction log data once written to the database. The nature of the beast means that this is SLOW, so databases reduce the number of events per second the system can handle.
Indexes
The whole point of writing the data to a database was so that we could search through it quickly rather than scanning Gigs of log files for individual events and then trying to pick out the relevant ones. That means indexes are needed for fast search, which translates to analyzing the data for sort-order, and writing the index. Index updates are also transactional, and all of this takes up CPU time and disk I/O.
Parsing
Most events found in logs are full-text items and look similar to the following:
"June 10 2006 3:02pm Event ID 32: User XXX logged on from Y.Y.Y.Y IP address" or
"July 11 2006 4:04am Event ID 108: Firewall allowed TCP port 80 to destination A.A.A.A from source B.B.B.B".
Short of performing substring searches for every query run, or creating a full-text index (not recommended when dealing with millions or billions of records), the relevant pieces of information need to be parsed out and store in the database.
So I'd want to parse out the following values for the first fictitious event above:
June 10 2006, 3:02pm, XXX, Y.Y.Y.Y
For the second event, I'd want to parse out:
July11 2006, 4:04am, 108, TCP, 80, A.A.A.A, B.B.B.B
This takes up CPU time for each and every event, and also requires architecing a database format that allows for multiple strings associated with a single event - not a flat database table. And I'd have to parse these out FAST, because there are more events coming in all the time.
There are some trade-offs that can be made, like parsing out just the Event ID number, and date/time stamps, and indexing just those. Then, when I need to look something up, I can at least narrow the criteria, and then do substring searches. In fact, I do exactly this all the time - it works very well, but does have limits when I'm creating a report making use of a very large number of data records.
Data Archival
Archival caused me quite a few problems with database management - mostly because I was using MS SQL Server 2000. I didn't understand, at the time, how important a function called "partitioning" could really be - something many database systems have, but not introduced into MSSQL until SQL 2005.
Obviously, I couldn't just let my database grow forever. I had to delete old data after a while, and I wanted to be able to put it to backup tape so I could restore it later if needed. Restoring database data without restoring an entire database is a whole new art-form that DBAs and backup software companies have been dealing with for a long time, but as it turns out, that's not something I needed to get involved with. I simply wrote the data to a flat file (remember, there's very little overhead to doing this) as I wrote it to the database. No more need for archival since I'm creating the archive as I insert data into the database.
I still had the problem of expiring/deleting the old data, though. Remember, databases are transactional - more disk writes, more CPU, more memory used - even for delete activities. What's worse, I'm trying to delete a LOT of data all at once, not just simple transactions like writing individual events to the database - I'm talking 20 million records per day.
Whenever this task ran, all the events coming in from my remote devices would get queued up (because the database is busy with a huge delete event) and I'd eventually run out of physical memory due to all the queued events and the collection process would crash. There was also a disk space issue since my transaction log was storing millions of records during this delete process. So, breaking this down into smaller chunks (let's say, one hour at a time) helped, but this could still take several minutes - causing more event queueing. I couldn't guarantee memory wouldn't run low especially if a system was really under attack - and I really didn't want to lose the data during those instances.
Here's the advantage of "partitioning". Partitioning allows chunks of data to be stored into separate database files rather than one large file like traditional databases. Then, an entire partition file can be dropped rather than trying to delete chunks of data from a single file. So, if by creating daily partitions, it's pretty easy to drop any single day of data - there's still some performance considerations, but not nearly as bad.
SQL 2005 wasn't available at the time, and I didn't have the luxury of moving to Oracle. So, I created a batch job that was scheduled to run every hour, which deleted all the records older than 30 days in chunks of 100,000 records. Even though it specified data older than 30 days, since this runs hourly, it's really only deleting 1 hour worth of data during any typical run. Further, since I'm deleting it in 100,000 event chunks, then pausing for a few seconds, it gives other processes time to access the database (less queueing). There are a few other optimizations to this technique, but I've already documented most of that here on the Adiscon site (I was using WinSyslog at the time I developed the T-SQL Script).
So far, so good - but every now and again, maintenance still needs to be run (update indexes, shrink files, make sure there's no corruption, etc).
Maintenance
Maintenance is a brutal time consumer of SQL databases. MSSQL 2000 promises being able to run maintenance on a live database, which is true, but the same maintenance run against a live database runs 18 hours vs. 2.5 hours if the same database is off-line. These times were what I experiences on my hardward - these time will vary depending on how many and which maintenance tasks are being run. I'll be honest that I haven't revisited this aspect of MSSQL 2005 yet to see if there are any significant improvements, but they would have to be very significant to improve by 15.5 hours.
The best I could do
The best I could ever get was a sustained rate of about 1,800 events/second with everything else happening. I would get periodic spikes which would cause backlog, but these would finish up during low-usage times and the system would continue running.
But this wasn't enough for me. Trying to insert data, delete data, and run maintenance with the volume of events, all in real time, did me in. I simply couldn't live with just 1,800 events/second on average. I would get hours at a time where I'd see 2,000+ events/second.
Time to Re-evaluate my Priorities
Big question: Do I need to have all of this data into the database real-time?
For me, the answer was "no". I was doing next-day reporting and forensic analysis with the data. I know I could easily pull all of this to a flat file. So, for me, the solution was to do a nightly batch INSERT of all the events in my flat file. As long as maintenance and delete scripts weren't running, this worked really well, and allowed me to run all the reports I wanted to. I even created a modified version of my import script which would allow me to do a bulk-import of the files restored from tape during an investigation. This method also allowed me to focus on database optimization so queries would run faster, rather than just keeping the database running at all.
So, I had 4 scheduled tasks on my server:
1. Hourly data delete (impact of doing this throughout the day is low, and prevents any nighttime bottlenecks). This would check to ensure none of the other 3 tasks are running first, so sometimes it would actually delete 2-3 hours worth during some cycles.
2. Nightly import script - at the end of which I update the datetime index used regularly by Step 1 - I found this make Step 1 run better each day. The script also makes sure that the next 2 scripts aren't running first.
3. Saturday mornings I do a full reindex (not just the datetime index).
4. Sunday mornings I perform a database defrag & shrink operations. This has a significant impact on improving performance.
There was no need for integrity checking since the database could always be rebuilt from the flat files, so I also set the database to a "simple" recovery model rather than the "full" recovery model when I was doing only direct-to-database logging.
But I still had a really big question...
How do the vendors manage to resolve all of these problems?
I started this article off by highlighting 3 problems I was experiencing: Misinformation, Frustration, and Lack of Understanding.
They say "forewarned is forearmed" - and after all my work - I was ready to talk to vendors again.
At this point, I knew exactly what to ask to ensure I received clear answers; I knew what I needed the product to be capable of (including data volumes which is important); I knew how to get an answer that made sense to me rather than "smoke and mirrors" answers that I'd received before. My belief, at this point, is that any database-driven solution won't handle my 10+GB of data per day (and growing) without some VERY serious hardware on the back end which would make the solution cost prohibitive.
But some vendors claim they can collect tens-of-thousands of events per second - and still allow for fast querying & reporting - even though they're just and appliance. How could that be?
They maintain flat files, with database indexes.
This makes perfect sense to me. Speed of flat files for storage, and using database-like indexes for fast searches. This also requires very little maintenance since, in a worst-case scenario, the index can simply be rebuilt. I'm sure though, that there would need to be some special handling routines for detection of corrupted files, etc, but overall a sound solution. This also makes restoring archived data very fast since it means just re-creating an index once the flat files are restored.
Some vendors - probably all of them handling large volumes - have been down the same path as I have, and learned the same mistakes I have, and have developed this hybrid approach. To my knowledge, there is no open-source solution performing in this manner, but unless very high volume collection is required, one of the above approaches may already work.
I really hope my sojourn down memory lane helps others to avoid some of the same pitfalls that I've experienced - and hope you've learned something that will help you. Keep in mind, I've only talked about log collection this time around.
Don't think, though, that this is the only thing that differentiates the vendors from home-grown, or even from each other. Very few logs are standardized, and definitely no standardization between competing vendor products, so event parsing can be a lot of work. There's also correlation, reporting & analysis, which are all hefty topics.
But, there's plenty of internet whitespace left to discuss these topics in the future...