Details
-
Type: Improvement
-
Status: Open
-
Priority: Major
-
Resolution: Unresolved
-
Affects Version/s: None
-
Fix Version/s: None
-
Component/s: Data Summarization
-
Labels: None
Description
One problem with processing things hourly (by filename) is that if we miss an hour, even by 3 minutes, we have to wait an entire hour before we can pull the logs from the recording servers and process them. We then pull and process 2 hours of data, which takes longer and could put us behind by yet another hour (again, even if by just 3 minutes). The business logic for that hourly filename convention is also duplicated in the pixel recording code, the v2pixsync.sh code, and the Pixel Playback Java code. It would be nice if we only had that logic in one place.
We need to be able to process an arbitrary date range of data. Pixel Playback has a sweet spot for optimal performance. If the data set is too small, proportionally more time goes to overhead: creating temporary tables, running database queries and joins, and reading in and moving files. If the data set is too large, memory usage or the database server becomes a bottleneck. It would be nice if we could batch the pixel hits to be loaded into the database by volume (so we ensure we're in the sweet spot) or by time (preferably an arbitrary interval, even as long as once every 24 hours). Example: we could take a batch of 500,000 hits covering an arbitrary date range (however long it takes to accumulate that many hits, whether 18 minutes, 43 minutes, or 6 hours) and process a batch every time the data accumulates to that many hits. For lower volume sites we might only need to process their data once a day. If there are low volume sites that need hourly data for the current day, we could still enable that feature and process their data at least once an hour.
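A minimal sketch of the volume-or-time batching described above, in Python for brevity. The class name `HitBatcher`, its parameters, and the dictionary shape of a hit are all hypothetical and not taken from the existing Pixel Playback code. The 500,000-hit and 24-hour defaults come from the example in this ticket.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class HitBatcher:
    """Hypothetical batcher: releases a batch by volume or by age, whichever comes first."""
    max_hits: int = 500_000                    # volume threshold (the "sweet spot")
    max_age: timedelta = timedelta(hours=24)   # time threshold for low volume sites
    _hits: List[dict] = field(default_factory=list)
    _opened_at: Optional[datetime] = None

    def add(self, hit: dict, now: datetime) -> Optional[List[dict]]:
        """Buffer one hit; return the batch if a threshold was reached, else None."""
        if self._opened_at is None:
            self._opened_at = now              # the batch opens with its first hit
        self._hits.append(hit)
        too_many = len(self._hits) >= self.max_hits
        too_old = now - self._opened_at >= self.max_age
        return self.flush() if (too_many or too_old) else None

    def flush(self) -> List[dict]:
        """Return whatever is buffered and start a new, empty batch."""
        batch, self._hits, self._opened_at = self._hits, [], None
        return batch
```

A site that needs hourly data could be given `max_age=timedelta(hours=1)`, while a busy site would normally hit `max_hits` first. In this sketch the age check only runs when a hit arrives, so a real service would also need a timer that calls `flush()` for sites that go quiet.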
One solution could be to remove all of that hourly logic from the process and stream the pixel hits to the processing server (currently xml-07) on demand, using something like syslog or Flume. Pulling the message queuing logic out of Pixel Playback into a separate app (queue listener, web service, whatever) would separate the concern of loading the data into the database from the question of when we run the pixel summarizer, and would allow more flexibility.
We still need to make sure we have all data for a day (or for an hour) from all servers before summarizing that day into the warehouse, of course.
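The completeness requirement above could be checked along these lines. This is only a sketch: the function name, the server names in the usage note, and the idea of tracking "which servers have delivered data for which hour" as a mapping are assumptions, not a description of how the current system records this.

```python
from datetime import datetime
from typing import Dict, Iterable, List, Set


def hours_ready_to_summarize(
    expected_servers: Iterable[str],
    received: Dict[datetime, Set[str]],
) -> List[datetime]:
    """Return, in chronological order, the hours for which every expected
    recording server has delivered data.

    `received` maps the start of an hour to the set of servers whose data
    for that hour has already been loaded (hypothetical bookkeeping).
    """
    expected = set(expected_servers)
    # An hour is ready only if the expected servers are a subset of those seen.
    return sorted(hour for hour, servers in received.items() if expected <= servers)
```

For example, with expected servers `rec-01` and `rec-02` (made-up names), an hour where only `rec-01` has reported is held back, and only hours where both have reported are returned for summarization into the warehouse. The same check works at day granularity by keying `received` on the start of the day.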