Wednesday, August 14, 2013

DataTurbine Reading Notes 3: Sink, Real-Time

http://www.dataturbine.org/content/sink
http://www.dataturbine.org/content/real-time

---------------------------------------------------------------
DataTurbine Sink
Introduction

A DataTurbine Sink (also referred to as an 'off-ramp') is simply a program that takes data from a DataTurbine server and uses it: for example, it displays the data in MATLAB or Real-time Data Viewer, or writes it to a relational database or file for permanent storage.

Just like a source, a sink runs independently from the server as a separate application and uses the network to communicate. It can run on the same machine as the server or on a machine across the world.
The Sink's Perspective

A sink does not need to know where the data came from or how it got there. It can query all the sources and channels to find out what is available, or specify a single channel by its name and the name of its source.

The data is heterogeneous, and a sink can access any type of data seamlessly. It decides how to display and interpret the data based on its data type (byte array, 32-bit float, 32-bit int, etc.) as well as the MIME type specified by the source.

A sink can issue a request to pull data from the server for a given time frame. A sink can also subscribe to a specific set of channels, receiving data as it becomes available.

Example: A sink could get a listing of all the sources available on a server, pick only the temperature channels, perform some analysis, and based on the result bring up the images for the corresponding channels at significant time indexes.
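
The access patterns described above (discovering channels, pulling a time frame, subscribing) can be illustrated with a small in-memory model. This is only a sketch: `ToyServer` and all of its methods are hypothetical names invented for illustration and are not the DataTurbine API, and the readings are made up.

```python
from collections import defaultdict


class ToyServer:
    """In-memory stand-in for a server (hypothetical, not the real DataTurbine API)."""

    def __init__(self):
        self._data = {}                         # "source/channel" -> [(time, value), ...]
        self._subscribers = defaultdict(list)   # "source/channel" -> [callback, ...]

    def put(self, source, channel, time, value):
        """Called by a source: store a sample and push it to any subscribers."""
        name = f"{source}/{channel}"
        self._data.setdefault(name, []).append((time, value))
        for callback in self._subscribers[name]:
            callback(name, time, value)

    def list_channels(self):
        """Let a sink discover which channels are available."""
        return sorted(self._data)

    def request(self, name, start, duration):
        """Pull: return samples with start <= time < start + duration."""
        return [(t, v) for t, v in self._data.get(name, []) if start <= t < start + duration]

    def subscribe(self, name, callback):
        """Push: call callback(name, time, value) for each new sample."""
        self._subscribers[name].append(callback)


server = ToyServer()

# A subscribing sink receives samples as they arrive.
received = []
server.subscribe("tower/temperature", lambda name, t, v: received.append((t, v)))

# A source pushes made-up readings into the server.
for t, temperature in enumerate([20.0, 21.0, 35.0, 22.0]):
    server.put("tower", "temperature", t, temperature)
    server.put("tower", "humidity", t, 50.0)

# A pulling sink discovers the channels, keeps only the temperature ones,
# requests a time frame, and finds the "significant" time indexes
# (here, arbitrarily, readings above 30).
temperature_channels = [c for c in server.list_channels() if c.endswith("/temperature")]
window = server.request(temperature_channels[0], start=1, duration=3)
significant_times = [t for t, v in window if v > 30.0]
```

The sink never learns how the readings reached the server; it only deals with channel names, times, and values.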
Common Types of Sinks

    Viewer: An application used to access and interact with the streaming data
    Ex: Real-time Data Viewer (RDV), Google Earth, etc.
    Web Server: An application that serves the data as web content for public display
    Ex: Graphs on a public web site
    Analysis: Takes the data and performs some kind of manual or automated analysis
    Ex: MATLAB, R, ESPER, etc.
    Export: Exports the data into a file or set of files for distribution or integration
    Ex: CSV files, Excel, etc.
    Storage: Permanent storage in a database or as a series of files
    Ex: Storage in a relational database
    Other: It is easy to code any kind of sink that utilizes the data

Practical Example (Continued):

Going back to the example used for sources, imagine a simple meteorological tower on top of a hill that measures temperature and humidity. Nearby is a field station that also measures temperature. We put this data into DataTurbine on a laptop at the field station and now want to view it and make sure that it is placed in permanent storage.

    Start a DataTurbine server on the laptop (rbnb.jar)
    Start a source on the laptop reading data from the meteorological tower
    Start a source on the laptop reading data from the field station
    Start a sink to view the data as it is collected in real-time. In this case we will use Real-time Data Viewer (RDV)
    Start a sink to put the data into permanent storage in a MySQL database.

Our laptop would now have five independent lightweight programs running (1 server, 2 sources, 2 sinks). We will probably keep the server, sources, and the permanent storage sink running at all times. But we will start and stop the viewer sink as we need it.
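
As a rough illustration of what the permanent-storage sink does with the samples it receives, the sketch below writes them to a relational table. It makes several assumptions: Python's built-in `sqlite3` stands in for MySQL so that the example is self-contained, the table layout and the `store_samples` helper are hypothetical, and the readings are invented.

```python
import sqlite3


def store_samples(connection, samples):
    """Write (time, source, channel, value) samples to a relational table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS samples ("
        "time REAL, source TEXT, channel TEXT, value REAL)"
    )
    connection.executemany("INSERT INTO samples VALUES (?, ?, ?, ?)", samples)
    connection.commit()


# Invented readings from the two sources in the example.
samples = [
    (0.0, "tower", "temperature", 18.5),
    (0.0, "tower", "humidity", 61.0),
    (0.0, "station", "temperature", 19.2),
]

# ":memory:" keeps the sketch self-contained; a real storage sink would
# connect to a persistent database instead.
connection = sqlite3.connect(":memory:")
store_samples(connection, samples)

rows = connection.execute(
    "SELECT source, value FROM samples WHERE channel = 'temperature' ORDER BY source"
).fetchall()
```

Because the storage sink is its own program, it can keep running while the viewer sink is started and stopped.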

Now we have a very basic but complete deployment running. However, we are not sharing the data and not really utilizing the power of a real-time system (aside from viewing the data as it is collected). Fear not, this will be discussed in later sections as we build on our example.

Power of Real-time
DataTurbine as a Real-time Data System

If you read through the previous sections you can see some of the benefits of DataTurbine as a "black box" system, separating the sources from the sinks and handling heterogeneous data types in a unified system. However, the primary reason to use DataTurbine is the ability to interact with data in real-time or near real-time.

DataTurbine is built around this goal, and its limitations for historical data are a direct consequence of its strength and speed at working with streaming real-time data.

In addition to working with live data, DataTurbine can stream archived data as if it were live, reusing common data viewers and infrastructure for post-test data analysis and review.
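
The idea of streaming archived data as if it were live can be sketched as a replay loop that preserves the original spacing between samples. This is a toy model, not DataTurbine's mechanism: the `replay` function, its `speed` parameter, and the sample values are assumptions made for illustration.

```python
import time


def replay(archived, speed=1.0, sleep=time.sleep):
    """Yield archived (time, value) samples, waiting between them.

    The wait is the original gap between samples divided by `speed`,
    so speed=60.0 replays one minute of data per second.
    """
    previous_time = None
    for sample_time, value in archived:
        if previous_time is not None:
            sleep((sample_time - previous_time) / speed)
        previous_time = sample_time
        yield sample_time, value


archived = [(0.0, 18.5), (60.0, 18.7), (180.0, 19.1)]

# Record the waits instead of actually sleeping, to show the pacing.
waits = []
replayed = list(replay(archived, speed=60.0, sleep=waits.append))
```

A viewer consuming `replay(...)` sees samples arrive one by one, just as it would from a live stream, so the same viewer code can serve both cases.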
What is Real-Time Data

Real-time data refers to delivering data as soon as it is collected, so the information is provided without delay. This is in contrast to an archival system that stores data until a later date.

DataTurbine can handle data sampled millions of times a second or as infrequently as once a century. In practice many uses are somewhere in between with data sampling every second, minute or hour.

As many remote sites can have drastic communication delays and do not require a strict time constraint, it would be more correct to refer to those systems as providing near real-time data, but for the sake of simplicity they are often also grouped into the real-time category.

Also note that when we talk about real-time we are focusing on the availability of data. This is not to be confused with real-time computing, which focuses on guaranteed response within strict time constraints.
Benefits of Real-time Data

    Interactive:
        Failure: The most direct benefit of real-time data is the ability to respond to problems on the fly. If a sensor goes bad, the system registers it immediately and the sensor can be fixed (before potentially months of data are ruined).
        Important Event: If an important event occurs, a team can be dispatched immediately to gather additional samples and observe the occurrence firsthand.
        Sampling: With a real-time system it is possible to change sampling rates and to activate and deactivate sensors based on the data they collect.
        Example: If one sensor detects an important event, perhaps the sensors in that region need to increase their sampling rate temporarily or a camera needs to be activated.
    Analysis: There is a lot of analysis that can be performed on real-time data, and in certain cases this is actually the more efficient route. Averages, correlations, and mathematical operations can be performed in real-time with ease. The derived data can be put back into DataTurbine and further utilized. The end result is that summary and analytic data is available on the fly, giving an overview of the health of the system and the experiment.
    Public Consumption: Real-time also gives added value to the data. Data can be published publicly as it is gathered. The same sensor network that is monitoring an ecosystem for scientific research can display the tides and temperature of the water, the wind speed and direction, even a video feed showing the view of the forest.
    Portable: Streaming data is very portable. Adding destinations or applications is easy and transparent. Since data is contained as tuples (time, value, source), it is easy for any system to accept it, and doing so requires significantly less overhead than trying to read from a rigid structure such as a database. Once a streaming system is set up, raw data as well as automated analysis and quality assurance and quality control (QA/QC) results are available to any application and destination that the provider specifies, the second they are available. Any additional analysis (which could take weeks or months) can then be added later.
    Funding Compliance: There is increasing pressure from funding agencies for data providers to publish data publicly in a timely manner. A real-time system can help satisfy that requirement.
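
Two of the benefits above, on-the-fly analysis and data-driven sampling, can be sketched over a stream of (time, value, source) tuples. The function names, the `/mean` channel suffix, the threshold, and the readings are all hypothetical choices made for illustration.

```python
from collections import deque


def rolling_mean(samples, window):
    """Yield a derived (time, mean, source) tuple as each sample arrives.

    The mean covers the most recent `window` values; the derived tuples
    could be fed back into the system as a new channel.
    """
    recent = deque(maxlen=window)
    for time, value, source in samples:
        recent.append(value)
        yield (time, sum(recent) / len(recent), source + "/mean")


def choose_interval(value, threshold, normal=60.0, fast=1.0):
    """Pick a sampling interval in seconds: sample faster during an event."""
    return fast if value > threshold else normal


stream = [
    (0, 10.0, "tower/temperature"),
    (60, 12.0, "tower/temperature"),
    (120, 20.0, "tower/temperature"),
]

derived = list(rolling_mean(stream, window=2))
intervals = [choose_interval(value, threshold=15.0) for _, value, _ in stream]
```

Both functions only look at samples as they arrive, which is what allows this kind of analysis to run alongside collection rather than after it.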

Limitations of Real-Time Data

    Not a Replacement: A real-time data system would ideally be an addition to, not a replacement for, an archival system. It makes a poor replacement for operations that are best suited to an archive such as a relational database.
    Data Quality: Data coming directly from sensors will have inherent imperfections which have to be cleaned away before consumption. Unlike an archival system, which often provides only the cleanest, most annotated data, a real-time system would ideally have multiple data levels of progressively cleaner data.
        Automated Cleaning: Automated QA/QC can be performed on a real-time stream to identify obvious inconsistencies and potentially problematic parts of the data.
        Levels of Assurance: Different applications require a different level of assurance. For example, a local weather site could use nearly raw data, while an intricate carbon dioxide absorption experiment would use manually cleaned and validated data.
    Different Paradigm: While traditional analysis would still work on archived data, utilizing the real-time aspect of data often requires a different approach than analysis on archived data.
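
A minimal sketch of automated QA/QC on a stream, producing a flagged level and a cleaner level from raw data. The checks (a valid range and a maximum step between consecutive good values), the flag names, the limits, and the readings are assumptions chosen for illustration; real QA/QC rules depend on the sensor.

```python
def qc_flags(samples, low, high, max_step):
    """Return (time, value, flag) tuples; flag is 'ok', 'range', or 'spike'."""
    flagged = []
    last_good = None  # most recent value that passed both checks
    for time, value in samples:
        if not (low <= value <= high):
            flag = "range"   # physically implausible reading
        elif last_good is not None and abs(value - last_good) > max_step:
            flag = "spike"   # jumps too far from the last good value
        else:
            flag = "ok"
            last_good = value
        flagged.append((time, value, flag))
    return flagged


# Level 0: raw readings, including a sentinel error value and a spike.
raw = [(0, 20.0), (1, 20.5), (2, -999.0), (3, 20.7), (4, 31.0), (5, 21.0)]

# Level 1: every reading kept, but annotated with a quality flag.
level1 = qc_flags(raw, low=-40.0, high=60.0, max_step=5.0)

# Level 2: only the readings that passed the automated checks.
level2 = [(t, v) for t, v, flag in level1 if flag == "ok"]
```

Keeping all three levels lets a weather display read the nearly raw data while a stricter application waits for a cleaner level.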
